Cites: 1240 (according to Google Scholar, updated 03.05.2018)
A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the dcase 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018.
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.
Acoustics;Event detection;Hidden Markov models;Speech;Speech processing;Tagging;Acoustic scene classification;audio datasets;pattern recognition;sound event detection
Toni Heittola, Emre Çakır, and Tuomas Virtanen. The Machine Learning Approach for Analysis of Sound Scenes and Events, pages 13–40. Springer International Publishing, Cham, 2018.
The Machine Learning Approach for Analysis of Sound Scenes and Events
This chapter explains the basic concepts in computational methods used for analysis of sound scenes and events. Even though the analysis tasks in many applications seem different, the underlying computational methods are typically based on the same principles. We explain the commonalities between analysis tasks such as sound event detection, sound scene classification, or audio tagging. We focus on the machine learning approach, where the sound categories (i.e., classes) to be analyzed are defined in advance. We explain the typical components of an analysis system, including signal pre-processing, feature extraction, and pattern classification. We also present an example system based on multi-label deep neural networks, which has been found to be applicable in many analysis tasks discussed in this book. Finally, we explain the whole processing chain that involves developing computational audio analysis systems.
Annamaria Mesaros, Toni Heittola, and Dan Ellis. Datasets and Evaluation, pages 147–179. Springer International Publishing, Cham, 2018.
Datasets and Evaluation
Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.
Environmental noise monitoring using source classification in sensors
Environmental noise monitoring systems continuously measure sound levels without assigning these measurements to different noise sources in the acoustic scenes, and are therefore incapable of identifying the main noise source. In this paper, a feasibility study is presented on a new monitoring concept in which an acoustic pattern classification algorithm running in a wireless sensor is used to automatically assign the measured sound level to different noise sources. A supervised noise source classifier is learned from a small amount of manually annotated recordings, and the learned classifier is used to automatically detect the activity of the target noise source in the presence of interfering noise sources. The sensor is based on an inexpensive credit-card-sized single-board computer with a microphone and associated electronics and wireless connectivity. The measurement results and the noise source information are transferred from the sensors scattered around the measurement site to a cloud service, and a noise portal is used to visualise the measurements to users. The proposed noise monitoring concept was piloted on a rock crushing site. The system ran reliably over 50 days on site, during which it was able to recognise more than 90% of the noise sources correctly. The pilot study shows that the proposed noise monitoring system can reduce the amount of required human validation of the sound level measurements when the target noise source is clearly defined.
Environmental noise monitoring, Acoustic pattern classification, Wireless sensor network, Cloud service
Cites: 4 (see at Google Scholar)
Tuomas Virtanen, Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Emmanuel Vincent, Emmanouil Benetos, and Benjamin Martinez Elizalde. (Eds.) Proceedings of the detection and classification of acoustic scenes and events 2017 workshop (DCASE2017). 2017. ISBN: 978-952-15-4042-4. 1 cite
Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp 85–92. November 2017. 61 cites
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events
Cites: 61 (see at Google Scholar)
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Assessment of human and machine performance in acoustic scene classification: DCASE 2016 case study. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp 319–323. IEEE Computer Society, 2017. 2 cites
Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study
Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.
Cites: 2 (see at Google Scholar)
Zhao Shuyang, Toni Heittola, and Tuomas Virtanen. Learning vocal mode classifiers from heterogeneous data sources. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp 16–20. IEEE Computer Society, 2017.
Learning Vocal Mode Classifiers from Heterogeneous Data Sources
This paper targets a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. Previous studies on sound classification are commonly based on cross-validation using a single dataset, without considering cases in which training and testing data are recorded in mismatched conditions. Experiments using a new dataset, TUT-vocal-2016, revealed a big difference between the homogeneous and the heterogeneous recognition scenarios. In the homogeneous recognition scenario, the classification accuracy using cross-validation on TUT-vocal-2016 was 95.5%. In the heterogeneous recognition scenario, where seven existing datasets were used as training material and TUT-vocal-2016 was used for testing, the classification accuracy was only 69.6%. Several feature normalization methods were tested to improve the performance in the heterogeneous recognition scenario. The best performance (96.8%) was obtained using the proposed subdataset-wise normalization.
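The abstract above does not spell out how subdataset-wise normalization is computed. The following is a minimal sketch, assuming it means standardizing each feature dimension to zero mean and unit variance separately within every source dataset; the function name and the toy data are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def normalize_per_subdataset(features, sources):
    """Z-score each feature dimension using statistics of its own subdataset.

    features: list of feature vectors (lists of floats)
    sources:  list of subdataset identifiers, one per feature vector
    """
    normalized = [None] * len(features)
    for source in set(sources):
        idx = [i for i, s in enumerate(sources) if s == source]
        dims = len(features[idx[0]])
        mus = [mean([features[i][d] for i in idx]) for d in range(dims)]
        # Guard against zero variance by falling back to a scale of 1.0.
        sigmas = [pstdev([features[i][d] for i in idx]) or 1.0 for d in range(dims)]
        for i in idx:
            normalized[i] = [(features[i][d] - mus[d]) / sigmas[d] for d in range(dims)]
    return normalized

# Two toy subdatasets recorded at very different levels (assumed data).
features = [[1.0], [3.0], [10.0], [30.0]]
sources = ["a", "a", "b", "b"]
normalized = normalize_per_subdataset(features, sources)
```

After this step both toy subdatasets occupy the same value range, which illustrates why such a normalization could reduce the mismatch between heterogeneous recording conditions.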
Emre Cakir, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. Convolutional recurrent neural networks for polyphonic sound event detection. Transactions on Audio, Speech and Language Processing: Special issue on Sound Scene and Event Analysis, 25(6):1291–1303, June 2017. 19 cites
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.
Cites: 19 (see at Google Scholar)
Zhao Shuyang, Toni Heittola, and Tuomas Virtanen. Active learning for sound event classification by clustering unlabeled data. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp 751-755. New Orleans, USA, 2017. IEEE Computer Society. 1 cite
Active Learning for Sound Event Classification by Clustering Unlabeled Data
This paper proposes a novel active learning method to save annotation effort when preparing material to train sound event classifiers. K-medoids clustering is performed on unlabeled sound segments, and medoids of clusters are presented to annotators for labeling. The annotated label for a medoid is used to derive predicted labels for other cluster members. The obtained labels are used to build a classifier using supervised training. The accuracy of the resulting classifier is used to evaluate the performance of the proposed method. The evaluation made on a public environmental sound dataset shows that the proposed method outperforms reference methods (random sampling, certainty-based active learning and semi-supervised learning) with all simulated labeling budgets (the number of available labeling responses). Through all the experiments, the proposed method saves 50%-60% of the labeling budget to achieve the same accuracy, with respect to the best reference method.
active learning, sound event classification, K-medoids clustering
Cites: 1 (see at Google Scholar)
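The labeling scheme described in the active learning abstract above (cluster the unlabeled segments, annotate only the medoids, copy each medoid's label to its cluster members) can be sketched as follows. This is an illustration under assumptions: the simple alternating k-medoids update, the fixed initial medoids, the one-dimensional toy features, and the names `k_medoids`, `propagate_labels` and `ask_annotator` are all hypothetical and not the authors' implementation.

```python
def k_medoids(dist, medoids, n_iter=10):
    """Alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assign[i] == m]
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assign

def propagate_labels(medoids, assign, ask_annotator):
    """Query a label for each medoid only and copy it to all cluster members."""
    medoid_label = {m: ask_annotator(m) for m in medoids}
    return [medoid_label[a] for a in assign]

# Toy one-dimensional "features" standing in for sound segments (assumed data).
points = [0.0, 0.2, 0.1, 5.0, 5.1, 5.3]
dist = [[abs(a - b) for b in points] for a in points]
true_labels = ["dog", "dog", "dog", "car", "car", "car"]

medoids, assign = k_medoids(dist, medoids=[0, 3])
predicted = propagate_labels(medoids, assign, lambda i: true_labels[i])
```

In this toy case two annotation queries label all six segments; with real data the propagated labels are only predictions and may be wrong for cluster members that differ from their medoid.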
Tuomas Virtanen, Annamaria Mesaros, Toni Heittola, Mark D. Plumbley, Peter Foster, Emmanouil Benetos, and Mathieu Lagrange. (Eds.) Proceedings of the detection and classification of acoustic scenes and events 2016 workshop (DCASE2016). 2016. ISBN: 978-952-15-3807-0.
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)
Emre Cakir, Toni Heittola, and Tuomas Virtanen. Domestic audio tagging with convolutional neural networks. Technical Report, DCASE2016 Challenge, September 2016. 7 cites
Domestic Audio Tagging with Convolutional Neural Networks
In this paper, the method used in our submission for DCASE2016 challenge task 4 (domestic audio tagging) is described. The use of convolutional neural networks (CNN) to label the audio signals recorded in a domestic (home) environment is investigated. A relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed over the development dataset for the challenge.
Cites: 7 (see at Google Scholar)
Sharath Adavanne, Giambattista Parascandolo, Pasi Pertila, Toni Heittola, and Tuomas Virtanen. Sound event detection in multichannel audio using spatial and harmonic features. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pp 6–10. September 2016. 39 cites
Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features
In this paper, we propose the use of spatial and harmonic features in combination with a long short term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making them hard to recognize with just mono channel audio. Human listeners successfully recognize mixtures of overlapping sound events using pitch cues and by exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally, SED systems have only used mono channel audio; motivated by the human listener, we propose to extend them to use multichannel audio. The proposed SED system is compared against the state of the art mono channel method on the development subset of TUT sound events detection 2016 database. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.
Sound event detection, multichannel, time difference of arrival, pitch, recurrent neural networks, long short term memory
Cites: 39 (see at Google Scholar)
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016), pp 1128-1132. Budapest, Hungary, Aug 2016. 154 cites
TUT Database for Acoustic Scene Classification and Sound Event Detection
We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
audio recording;audio signal processing;Gaussian mixture models;TUT database;acoustic scene classification;binaural recordings;environmental sound research;mel frequency cepstral coefficients;sound event detection;Automobiles;Databases;Europe;Event detection;Mel frequency cepstral coefficient;Signal processing
Cites: 154 (see at Google Scholar)
Metrics for Polyphonic Sound Event Detection
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
Cites: 64 (see at Google Scholar)
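To make the segment-based evaluation mentioned in the metrics abstract above concrete, here is a minimal sketch. It assumes the commonly used definitions: F-score computed from true positives, false positives and false negatives accumulated over segments, and error rate computed from per-segment substitutions, deletions and insertions divided by the number of active reference events. The function name and the toy annotations are hypothetical; this is not the toolbox released with the paper.

```python
def segment_based_metrics(reference, estimated):
    """Compute segment-based F-score and error rate.

    Both inputs are lists of sets: one set of active event labels per segment.
    """
    tp = fp = fn = 0
    subs = dels = ins = 0
    n_ref = 0
    for ref, est in zip(reference, estimated):
        seg_fp = len(est - ref)
        seg_fn = len(ref - est)
        tp += len(ref & est)
        fp += seg_fp
        fn += seg_fn
        # A missed event paired with a false alarm in the same segment
        # counts as one substitution.
        seg_sub = min(seg_fp, seg_fn)
        subs += seg_sub
        dels += seg_fn - seg_sub
        ins += seg_fp - seg_sub
        n_ref += len(ref)
    f_score = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    error_rate = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f_score, error_rate

# Four segments of a toy polyphonic annotation (assumed data).
reference = [{"speech"}, {"speech", "car"}, {"car"}, set()]
estimated = [{"speech"}, {"speech"}, {"dog"}, {"car"}]
f_score, error_rate = segment_based_metrics(reference, estimated)
```

In the toy example the second segment contributes a deletion, the third a substitution and the fourth an insertion, so the two metrics penalize the same errors in different ways.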
Aleksandr Diment, Emre Cakir, Toni Heittola, and Tuomas Virtanen. Automatic recognition of environmental sound events using all-pole group delay features. In 23rd European Signal Processing Conference 2015 (EUSIPCO 2015). Nice, France, 2015. 7 cites
Automatic recognition of environmental sound events using all-pole group delay features
A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account merely the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. Additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information, which has shown applicability for analysis of speech and music signals and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. It shows performance improvement compared to the baseline log mel-band energy case. In combination with the magnitude-based features, APGD demonstrates further improvement.
Cites: 7 (see at Google Scholar)
Emre Cakir, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. Multi-label vs. combined single-label sound event detection with deep neural networks. In 23rd European Signal Processing Conference 2015 (EUSIPCO 2015). Nice, France, 2015. 16 cites
Multi-Label vs. Combined Single-Label Sound Event Detection With Deep Neural Networks
In real-life audio scenes, many sound events from different sources are simultaneously active, which makes the automatic sound event detection challenging. In this paper, we compare two different deep learning methods for the detection of environmental sound events: combined single-label classification and multi-label classification. We investigate the accuracy of both methods on the audio with different levels of polyphony. Multi-label classification achieves an overall 62.8% accuracy, whereas combined single-label classification achieves a very close 61.9% accuracy. The latter approach offers more flexibility on real-world applications by gathering the relevant group of sound events in a single classifier with various combinations.
Cites: 16 (see at Google Scholar)
Emre Cakir, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. Polyphonic sound event detection using multi label deep neural networks. In The International Joint Conference on Neural Networks 2015 (IJCNN 2015). Cill Airne, Eire, 2015. 94 cites
Polyphonic Sound Event Detection Using Multi Label Deep Neural Networks
In this paper, the use of multi label neural networks is proposed for detection of temporally overlapping sound events in realistic environments. Real-life sound recordings typically have many overlapping sound events, making it hard to recognize each event with the standard sound event detection methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work. The model is evaluated with recordings from realistic everyday environments and the obtained overall accuracy is 58.9%. The method is compared against a state-of-the-art method using non-negative matrix factorization as a pre-processing stage and hidden Markov models as a classifier. The proposed method improves the accuracy by 19 percentage points overall.
Cites: 94 (see at Google Scholar)
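The multi label output stage described in the abstract above can be illustrated with a short sketch: each class gets an independent sigmoid output, so several events may be active in the same frame. The network itself is omitted; the logits, the class names and the 0.5 threshold are assumptions made for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def active_events(frame_logits, class_names, threshold=0.5):
    """Return, for each frame, the set of classes whose probability exceeds the threshold."""
    result = []
    for logits in frame_logits:
        probs = [sigmoid(z) for z in logits]
        result.append({name for name, p in zip(class_names, probs) if p > threshold})
    return result

# Hypothetical network outputs for two frames and three event classes.
class_names = ["speech", "car", "footsteps"]
frame_logits = [[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]]
detected = active_events(frame_logits, class_names)
```

Because the classes are thresholded independently, the first frame yields two simultaneous events, which a single-label (softmax) output could not represent.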
Annamaria Mesaros, Onur Dikmen, Toni Heittola, and Tuomas Virtanen. Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp 151-155. Brisbane, Australia, 2015. IEEE Computer Society. 49 cites
Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations
Methods for detection of overlapping sound events in audio involve matrix factorization approaches, often assigning separated components to event classes. We present a method that bypasses the supervised construction of class models. The method learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix. In testing, the dictionaries are used to estimate directly the class activations. For dealing with large amount of training data, two methods are proposed for reducing the size of the dictionary. The methods were tested on a database of real life recordings, and outperformed previous approaches by over 10%.
coupled non-negative matrix factorization, non-negative dictionaries, sound event detection
Cites: 49 (see at Google Scholar)
Aleksandr Diment, Rajan Padmanabhan, Toni Heittola, and Tuomas Virtanen. Group delay function from all-pole models for musical instrument recognition. Sound, Music, and Motion, Lecture Notes in Computer Science, pp 606-618, 2014. 3 cites
Group Delay Function from All-Pole Models for Musical Instrument Recognition
In this work, the feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information, which lacks the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown applicability for speech-related problems, it is now explored in terms of instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline with all the evaluated sets.
Musical instrument recognition, music information retrieval, all-pole group delay feature, phase spectrum
Cites: 3 (see at Google Scholar)
Method for creating location-specific audio textures
An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need of collecting large amount of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that the consecutive segments will have timbral similarity near the segment boundaries. Results obtained in listening experiments show that there is no statistically significant difference in the audio quality or location-specificity of audio when the created audio textures are compared to excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and audio textures created from them are available at www.cs.tut.fi/~heittolt/audiotexture.
Cites: 5 (see at Google Scholar)
Aleksandr Diment, Toni Heittola, and Tuomas Virtanen. Sound event detection for office live and office synthetic aasp challenge. Technical Report, Tampere University of Technology, 2013. 22 cites
Sound Event Detection for Office Live and Office Synthetic AASP Challenge
We present a sound event detection system based on hidden Markov models. The system is evaluated with development material provided in the AASP Challenge on Detection and Classification of Acoustic Scenes and Events. Two approaches using the same basic detection scheme are presented. First one, developed for acoustic scenes with non-overlapping sound events is evaluated with Office Live development dataset. Second one, developed for acoustic scenes with some degree of overlapping sound events is evaluated with Office Synthetic development dataset.
Sound event detection
Cites: 22 (see at Google Scholar)
Aleksandr Diment, Rajan Padmanabhan, Toni Heittola, and Tuomas Virtanen. Modified group delay feature for musical instrument recognition. In 10th International Symposium on Computer Music Multidisciplinary Research (CMMR). Marseille, France, October 2013. 6 cites
Modified Group Delay Feature for Musical Instrument Recognition
In this work, the modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features used in instrument recognition take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The MODGDF is a method of incorporating phase information, which lacks the issues related to phase unwrapping. Having shown its applicability for speech-related problems, it is now explored in terms of musical instrument recognition. The evaluation is performed on separate note recordings in various instrument sets, and combined with the conventional mel frequency cepstral coefficients (MFCCs), MODGDF shows noteworthy absolute accuracy gains of up to 5.1% compared to the baseline MFCCs case.
Musical instrument recognition; music information retrieval; modified group delay feature; phase spectrum
Cites: 6 (see at Google Scholar)
Aleksandr Diment, Toni Heittola, and Tuomas Virtanen. Semi-supervised learning for musical instrument recognition. In 21st European Signal Processing Conference 2013 (EUSIPCO 2013). Marrakech, Morocco, September 2013. 8 cites
Semi-supervised Learning for Musical Instrument Recognition
In this work, semi-supervised learning (SSL) techniques are explored in the context of musical instrument recognition. The conventional supervised approaches normally rely on annotated data to train the classifier. This implies performing costly manual annotations of the training data. The SSL methods enable utilising additional unannotated data, which is significantly easier to obtain, allowing the overall development cost to be maintained at the same level while notably improving the performance. The implemented classifier incorporates the Gaussian mixture model-based SSL scheme utilising the iterative EM-based algorithm, as well as extensions facilitating a simpler convergence criterion. The evaluation is performed on a set of nine instruments while training on a dataset in which the relative size of the labelled data is as little as 15%. It yields a noteworthy absolute performance gain of 16% compared to the performance of the initial supervised models.
Music information retrieval; musical instrument recognition; semi-supervised learning
Cites: 8 (see at Google Scholar)
Annamaria Mesaros, Toni Heittola, and Kalle Palomäki. Query-by-example retrieval of sound events using an integrated similarity measure of content and labels. In Image Analysis for Multimedia Interactive Services (WIAMIS), 2013 14th International Workshop on, pp 1-4. Paris, France, 2013. 3 cites
Query-by-example retrieval of sound events using an integrated similarity measure of content and labels
This paper presents a method for combining audio similarity and semantic similarity into a single similarity measure for query-by-example retrieval. The integrated similarity measure is used to retrieve sound events that are similar in content to the given query and have labels containing similar words. Through the semantic component, the method is able to handle variability in labels of sound events. Through the acoustic component, the method retrieves acoustically similar examples. On a test database of over 3000 sound event examples, the proposed method obtains a better retrieval performance than audio-based retrieval, and returns results closer acoustically to the query than a label-based retrieval.
audio signal processing;content-based retrieval;semantic networks;acoustic component;audio similarity;integrated similarity measure;label-based retrieval;query-by-example retrieval;semantic similarity;sound events;
Cites: 3 (see at Google Scholar)
Toni Heittola, Annamaria Mesaros, Tuomas Virtanen, and Moncef Gabbouj. Supervised model training for overlapping sound events based on unsupervised source separation. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp 8677-8681. Vancouver, Canada, 2013. IEEE Computer Society. 27 cites
Supervised Model Training for Overlapping Sound Events Based on Unsupervised Source Separation
Sound event detection is addressed in the presence of overlapping sounds. Unsupervised sound source separation into streams is used as a preprocessing step to minimize the interference of overlapping events. This poses a problem in supervised model training, since there is no knowledge about which separated stream contains the targeted sound source. We propose two iterative approaches based on the EM algorithm to select the most likely stream to contain the target sound: one by always selecting the most likely stream and another one by gradually eliminating the most unlikely streams from the training. The approaches were evaluated with a database containing recordings from various contexts, against the baseline system trained without applying stream selection. Both proposed approaches were found to give a reasonable increase of 8 percentage units in the detection accuracy.
acoustic event detection;acoustic pattern recognition;sound source separation;supervised model training
Cites: 27 (see at Google Scholar)
Annamaria Mesaros, Toni Heittola, and Kalle Palomäki. Analysis of acoustic-semantic relationship for diversely annotated real-world audio data. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp 813-817. Vancouver, Canada, 2013. IEEE Computer Society. 5 cites
Analysis of Acoustic-Semantic Relationship for Diversely Annotated Real-World Audio Data
A common problem of freely annotated or user contributed audio databases is the high variability of the labels, related to homonyms, synonyms, plurals, etc. Automatically re-labeling audio data based on audio similarity could offer a solution to this problem. This paper studies the relationship between audio and labels in a sound event database, by evaluating semantic similarity of labels of acoustically similar sound event instances. The assumption behind the study is that acoustically similar events are annotated with semantically similar labels. Indeed, for 43% of the tested data, there was at least one in ten acoustically nearest neighbors having a synonym as label, while the closest related term is on average one level higher or lower in the semantic hierarchy.
audio similarity;semantic similarity;sound events
Cites: 5 (see at Google Scholar)
Toni Heittola, Annamaria Mesaros, Antti Eronen, and Tuomas Virtanen. Context-dependent sound event detection. EURASIP Journal on Audio, Speech and Music Processing, 2013. 92 cites
Context-Dependent Sound Event Detection
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance with various levels of polyphony. This combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of the performance of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
Cites: 92 (see at Google Scholar)
Dani Korpi, Toni Heittola, Timo Partala, Antti Eronen, Annamaria Mesaros, and Tuomas Virtanen. On the human ability to discriminate audio ambiances from similar locations of an urban environment. Personal and Ubiquitous Computing, 17(4):761–769, 2013. 1 cite
On the human ability to discriminate audio ambiances from similar locations of an urban environment
When developing advanced location-based systems augmented with audio ambiances, it would be cost-effective to use a few representative samples from typical environments for describing a larger number of similar locations. The aim of this experiment was to study the human ability to discriminate audio ambiances recorded in similar locations of the same urban environment. A listening experiment consisting of material from three different environments and nine different locations was carried out with nineteen subjects to study the credibility of audio representations for certain environments, which would diminish the need for collecting huge audio databases. The first goal was to study to what degree humans are able to recognize whether the recording has been made in an indicated location or in another similar location, when presented with the name of the place, location on a map, and the associated audio ambiance. The second goal was to study whether the ability to discriminate audio ambiances from different locations is affected by a visual cue, by presenting additional information in the form of a photograph of the suggested location. The results indicate that audio ambiances from similar urban areas of the same city differ enough that it is not acceptable to use a single recording as ambiance to represent different yet similar locations. Including an image was found to increase the perceived credibility of all the audio samples in representing a certain location. The results suggest that developers of audio-augmented location-based systems should aim at using audio samples recorded on-site for each location in order to achieve a credible impression.
Listening experiment; Location recognition; Audio-visual perception; Audio ambiance
Cites: 1 (see at Google Scholar)
Antti Eronen, Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Method and apparatus for providing media event suggestions. US 20130232412, 09 2012.
Method and apparatus for providing media event suggestions
Various methods are described for providing media event suggestions based at least in part on a co-occurrence model. One example method may comprise receiving a selection of at least one media event to include in a media composition. Additionally, the method may comprise determining at least one suggested media event based at least in part on the at least one media events. The method may further comprise causing display of the at least one suggested media event. Similar and related methods, apparatuses, and computer program products are also provided.
Antti Eronen, Miska Hannuksela, Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Method and apparatus for generating an audio summary of a location. US 2013128064, 09 2012. 3 cites
Method and apparatus for generating an audio summary of a location
Various methods are described for generating an audio summary representing a location on a place exploration service. One example method may comprise receiving at least one audio file. The method may further comprise dividing the at least one audio file into one or more audio segments. Additionally, the method may comprise determining a representative audio segment for each of the one or more audio segments. The method may further comprise generating an audio summary of the at least one audio file by combining one or more of the representative audio segments of the one or more audio segments. Similar and related methods, apparatuses, and computer program products are also provided.
Cites: 3 (see at Google Scholar)
Fawad Mazhar, Toni Heittola, Tuomas Virtanen, and Jukka Holm. Automatic scoring of guitar chords. In Audio Engineering Society Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio. 3 2012. 1 cite
Automatic Scoring of Guitar Chords
This paper describes a novel approach for detecting the correctness of musical chords played on a guitar. The approach is based on a pattern matching technique applied to a database of chords and their typical mistakes played with multiple guitars. The spectrum of the chord is whitened and a certain region is selected as a feature vector. The cosine distance is calculated between the chord to be tested and a reference chord database, and chord detection is done based on the minimum distance. The proposed system is evaluated with isolated chords in different noise conditions. The system shows approximately 77% accuracy in scoring the correctness of played chords with a medium-sized database.
Cites: 1 (see at Google Scholar)
Sound Event Detection in Multisource Environments Using Source Separation
This paper proposes a sound event detection system for natural multisource environments, using a sound source separation front-end. The recognizer aims at detecting sound events from various everyday contexts. The audio is preprocessed using non-negative matrix factorization and separated into four individual signals. Each sound event class is represented by a Hidden Markov Model trained using mel frequency cepstral coefficients extracted from the audio. Each separated signal is used individually for feature extraction and then segmentation and classification of sound events using the Viterbi algorithm. The separation allows detection of a maximum of four overlapping events. The proposed system shows a significant increase in event detection accuracy compared to a system able to output a single sequence of events.
Cites: 73 (see at Google Scholar)
Annamaria Mesaros, Toni Heittola, and Anssi Klapuri. Latent semantic analysis in sound event detection. In 19th European Signal Processing Conference (EUSIPCO 2011), pp 1307–1311. 2011. 33 cites
Latent Semantic Analysis in Sound Event Detection
This paper presents the use of probabilistic latent semantic analysis (PLSA) for modeling co-occurrence of overlapping sound events in audio recordings from everyday audio environments such as office, street or shop. Co-occurrence of events is represented as the degree of their overlapping in a fixed length segment of polyphonic audio. In the training stage, PLSA is used to learn the relationships between individual events. In detection, the PLSA model continuously adjusts the probabilities of events according to the history of events detected so far. The event probabilities provided by the model are integrated into a sound event detection system that outputs a monophonic sequence of events. The model offers a very good representation of the data, having low perplexity on test recordings. Using PLSA for estimating prior probabilities of events provides an increase of event detection accuracy to 35%, compared to 30% for using uniform priors for the events. There are different levels of performance increase in different audio contexts, with few contexts showing significant improvement.
sound event detection, latent semantic analysis
Cites: 33 (see at Google Scholar)
Toni Heittola, Annamaria Mesaros, Tuomas Virtanen, and Antti Eronen. Sound event detection and context recognition. In Proceedings of Akustiikkapäivät 2011, pp 51–56. Tampere, Finland, 2011. 3 cites
Toni Heittola, Annamaria Mesaros, Antti Eronen, and Tuomas Virtanen. Audio context recognition using audio event histograms. In 18th European Signal Processing Conference (EUSIPCO 2010), pp 1272–1276. Aalborg, Denmark, 2010. 57 cites
Audio Context Recognition Using Audio Event Histograms
This paper presents a method for audio context recognition, meaning classification between everyday environments. The method is based on representing each audio context using a histogram of audio events which are detected using a supervised classifier. In the training stage, each context is modeled with a histogram estimated from annotated training data. In the testing stage, individual sound events are detected in the unknown recording and a histogram of the sound event occurrences is built. Context recognition is performed by computing the cosine distance between this histogram and event histograms of each context from the training database. Term frequency-inverse document frequency weighting is studied for controlling the importance of different events in the histogram distance calculation. An average classification accuracy of 89% is obtained in the recognition between ten everyday contexts. Combining the event based context recognition system with more conventional audio based recognition increases the recognition rate to 92%.
Cites: 57 (see at Google Scholar)
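The histogram matching described in the Audio Context Recognition abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy event classes, and the particular IDF formula (log of the number of contexts divided by the number of contexts containing the event) are assumptions made for illustration only.

```python
import numpy as np

def idf_weights(context_histograms):
    """Assumed IDF variant: log(N / number of contexts in which the event occurs)."""
    h = np.asarray(context_histograms, dtype=float)
    n_contexts = h.shape[0]
    doc_freq = np.count_nonzero(h > 0, axis=0)
    # Guard against events that never occur in training (avoids division by zero).
    return np.log(n_contexts / np.maximum(doc_freq, 1))

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0
    return 1.0 - float(np.dot(a, b) / denom)

def recognize_context(test_histogram, context_histograms, labels, weights=None):
    """Return the label of the training context whose histogram is closest."""
    h = np.asarray(context_histograms, dtype=float)
    t = np.asarray(test_histogram, dtype=float)
    if weights is not None:
        # Re-weight every event count before measuring the distance.
        h = h * weights
        t = t * weights
    distances = [cosine_distance(t, row) for row in h]
    return labels[int(np.argmin(distances))]

# Toy data: event classes are (car, keyboard, speech); counts are made up.
labels = ["street", "office"]
histograms = [[10, 0, 5], [0, 8, 5]]
weights = idf_weights(histograms)
print(recognize_context([3, 0, 4], histograms, labels, weights))
```

In this toy example the speech event occurs in both contexts, so its IDF weight is zero and only the context-specific events influence the decision.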
Annamaria Mesaros, Toni Heittola, Antti Eronen, and Tuomas Virtanen. Acoustic event detection in real-life recordings. In 18th European Signal Processing Conference (EUSIPCO 2010), pp 1267-1271. Aalborg, Denmark, 2010. 154 cites
Acoustic Event Detection in Real-life Recordings
This paper presents a system for acoustic event detection in recordings from real life environments. The events are modeled using a network of hidden Markov models; their size and topology are chosen based on a study of isolated events recognition. We also studied the effect of ambient background noise on event classification performance. On real life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0 dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct for 84% of the time.
Cites: 154 (see at Google Scholar)
Anssi Klapuri, Tuomas Virtanen, and Toni Heittola. Sound source separation in monaural music signals using excitation-filter model and EM algorithm. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2010), pp 5510-5513. Dallas, Texas, USA, 2010. 33 cites
Sound source separation in monaural music signals using excitation-filter model and EM algorithm
This paper proposes a method for separating the signals of individual musical instruments from monaural musical audio. The mixture signal is modeled as a sum of the spectra of individual musical sounds which are further represented as a product of excitations and filters. The excitations are restricted to harmonic spectra and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on Mel-frequency scale. A novel expectation-maximization (EM) algorithm is proposed which jointly learns the filter responses and organizes the excitations (musical notes) to filters (instruments). In simulations, the method achieved over 5 dB SNR improvement compared to the mixture signals when separating two or three musical instruments from each other. A slight further improvement was achieved by utilizing musical properties in the initialization of the algorithm.
Sound source separation, excitation-filter model, maximum likelihood estimation, expectation maximization
Cites: 33 (see at Google Scholar)
Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In International Conference on Music Information Retrieval (ISMIR), pp 327-332. Kobe, Japan, 2009. Best paper award 125 cites
Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation
This paper proposes a novel approach to musical instrument recognition in polyphonic audio signals by using a source-filter model and an augmented non-negative matrix factorization algorithm for sound separation. The mixture signal is decomposed into a sum of spectral bases modeled as a product of excitations and filters. The excitations are restricted to harmonic spectra and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on the Mel-frequency scale. The pitch and timbre information are used in organizing individual notes into sound sources. In the recognition, Mel-frequency cepstral coefficients are used to represent the coarse shape of the power spectrum of sound sources and Gaussian mixture models are used to model instrument-conditional densities of the extracted features. The method is evaluated with polyphonic signals, randomly generated from 19 instrument classes. The recognition rate for signals having six note polyphony reaches 59%.
Sound source separation, excitation-filter model
Awards: Best paper award
Cites: 125 (see at Google Scholar)
Tuomas Virtanen and Toni Heittola. Interpolating hidden markov model and its application to automatic instrument recognition. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), pp 49–52. Washington, DC, USA, 2009. IEEE Computer Society. 4 cites
Interpolating hidden Markov model and its application to automatic instrument recognition
This paper proposes an interpolating extension to hidden Markov models (HMMs), which allows more accurate modeling of natural sound sources. The model is able to produce observations from distributions which are interpolated between discrete HMM states. The model uses Gaussian mixture state emission densities, and the interpolation is implemented by introducing interpolating states in which the mixture weights, means, and variances are interpolated from the discrete HMM state densities. We propose an algorithm extended from the Baum-Welch algorithm for estimating the parameters of the interpolating model. The model was evaluated in an automatic instrument classification task, where it produced systematically better recognition accuracy than a baseline HMM recognition algorithm.
Hidden Markov models, acoustic signal processing, musical instruments, pattern classification
Cites: 4 (see at Google Scholar)
Toni Heittola. Azimuth estimation in polyphonic music. In Pertti Koivisto, editor, Digest of TISE Seminar 2009, volume 8, pp 14–16. Kangasala, Finland, 2009.
Azimuth Estimation in Polyphonic Music
Most of the research in music information retrieval (MIR) has used monophonic source signals, i.e. ignored stereo information. However, commercially available music recordings typically consist of a two-track stereo mix. The type of mixing process used in the recordings can be roughly categorized into live recordings and studio recordings. In live recordings, all musical instruments are usually recorded on a single stereo track using a stereophonic microphone setup. The listeners localize sounds mainly based on time differences between the left and right channels, using the interaural time difference (ITD). In studio recordings, each musical instrument is recorded on a separate mono or stereo track. In the final mixing stage, audio effects (e.g. reverberation) can be added artificially. The virtual sound localization at any point between the left and right channels is achieved by using proper amplitudes for the left and right channels while mixing down tracks to a two-track stereo mix. The amplitude difference between channels is used to simulate the interaural intensity difference (IID) by attenuating one channel and causing the sound to be localized more in the opposite channel. The phase of a source is coherent between the channels. By assuming this mixing model, we can perform horizontal angle (azimuth) estimation for music signals. Azimuth information can be utilized in different applications of music information retrieval, such as musical instrument recognition and note streaming. In musical instrument recognition with polyphonic notes, the signal-to-noise ratio can be improved with beamforming in the feature extraction stage. Azimuth information can also be utilized in the note streaming of polyphonic audio, where notes can be grouped together based on pitch, timbre and azimuth.
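The amplitude-panning mixing model in the abstract above (one source scaled by different gains in the two channels, with coherent phase) suggests a simple way to read a pan position back from a stereo signal. The sketch below is only an illustration under that assumption and is not the method of the seminar paper: the function name `pan_angle`, the mapping of the magnitude ratio to an angle via the arctangent (0 degrees for hard left, 45 for center, 90 for hard right), and the energy weighting across frequency bins are all hypothetical choices.

```python
import numpy as np

def pan_angle(left, right, eps=1e-12):
    """Estimate a pan angle in degrees from the inter-channel magnitude ratio.

    Assumes a single amplitude-panned source: left = gL * s, right = gR * s.
    """
    mag_l = np.abs(np.fft.rfft(left))
    mag_r = np.abs(np.fft.rfft(right))
    # Per-bin angle: 0 rad when only the left channel has energy,
    # pi/2 rad when only the right channel has energy.
    angles = np.arctan2(mag_r, mag_l)
    # Energy weighting so that near-silent bins do not affect the estimate.
    energy = mag_l ** 2 + mag_r ** 2
    return float(np.degrees(np.sum(angles * energy) / (np.sum(energy) + eps)))

# Synthetic test tone panned with gains cos(30 deg) and sin(30 deg).
t = np.arange(1024) / 8000.0
s = np.sin(2 * np.pi * 440.0 * t)
g_left, g_right = np.cos(np.radians(30)), np.sin(np.radians(30))
print(round(pan_angle(g_left * s, g_right * s), 3))
```

With several simultaneous sources, a per-bin or per-note estimate would be needed instead of a single global angle; this sketch covers only the single-source case.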
Toni Heittola. Musical instrument recognition in polyphonic music. In Pertti Koivisto, editor, Digest of TISE Seminar 2008, volume 7, pp 48–50. Kangasala, Finland, 2008.
Musical Instrument Recognition in Polyphonic Music
Understanding the timbre and pitch of musical instruments is an important issue for automatic music transcription, music information retrieval and computational auditory scene analysis. In particular, the recent worldwide popularization of online music distribution services and portable digital music players makes musical instrument recognition even more important. Musical instruments are one of the main criteria (besides musical genre) which can be used to search for certain types of music in music databases. Some classical music is even characterized by the musical instruments used (e.g. piano sonata and string quartet). The purpose of the research is to develop mathematical models for sound sources and apply these in the automatic analysis and coding of polyphonic music. Target signals are musical signals and, in limited cases, also speech signals. The redundant frequency information of harmonic sounds will be used in the newly developed models. The developed modeling schemes will be tested in two applications: musical instrument recognition in polyphonic music and music transcription.
Toni Heittola and Anssi Klapuri. TUT acoustic event detection system 2007. In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, pp 364–370. Berlin, Heidelberg, 2008. Springer-Verlag. 12 cites
TUT Acoustic Event Detection System 2007
This paper describes a system used in the acoustic event detection task of the CLEAR 2007 evaluation. The objective of the task is to detect acoustic events (door slam, steps, paper wrapping etc.) using acoustic data from a multiple-microphone setup in a meeting room environment. A system based on hidden Markov models and multi-channel audio data was implemented. Mel-Frequency Cepstral Coefficients are used to represent the power spectrum of the acoustic signal. Fully-connected three-state hidden Markov models are trained for 12 acoustic events and one-state models are trained for speech, silence, and unknown events.
Hidden Markov models, acoustic signal processing, musical instruments, pattern classification
Cites: 12 (see at Google Scholar)
Toni Heittola. Automatic classification of music signals. Master's thesis, Department of Information Technology, Tampere University Of Technology, 2004. 29 cites
Automatic Classification of Music Signals
Collections of digital music have become increasingly common over recent years. As the amount of data increases, digital content management is becoming more important. In this thesis, we study content-based classification of acoustic musical signals according to their musical genre (e.g., classical, rock) and the instruments used. A listening experiment is conducted to study human abilities to recognise musical genres. This thesis covers a literature review on human musical genre recognition, state-of-the-art musical genre recognition systems, and related fields of research. In addition, a general-purpose music database consisting of recordings and their manual annotations is introduced. The theory behind the used features and classifiers is reviewed and the results from the simulations are presented. The developed musical genre recognition system uses mel-frequency cepstral coefficients to represent the time-varying magnitude spectrum of a music signal. The class-conditional feature densities are modelled with hidden Markov models. Musical instrument detection for a few pitched instruments from music signals is also studied using the same structure. Furthermore, this thesis proposes a method for the detection of drum instruments. The presence of drums is determined based on the periodicity of the amplitude envelopes of the signal at subbands. The conducted listening experiment shows that the recognition of musical genres is not a trivial task even for humans. On average, humans are able to recognise the correct genre in 75 % of cases (given five-second samples). Results also indicate that humans can do rather accurate musical genre recognition without long-term temporal features, such as rhythm. For the developed automatic recognition system, the obtained recognition accuracy for six musical genres was around 60 %, which is comparable to the state-of-the-art systems. Detection accuracy of 81 % was obtained with the proposed drum instrument detection method.
Cites: 29 (see at Google Scholar)
Antti Eronen and Toni Heittola. Discriminative training of unsupervised acoustic models for non-speech audio. In Proceedings of the 2003 Finnish Signal Processing Symposium, pp 54–58. Tampere, Finland, 2003. 1 cite
Discriminative Training of Unsupervised Acoustic Models for Non-speech Audio
This paper studies acoustic modeling of non-speech audio using hidden Markov models. Simulation results are presented in two different application areas: audio-based context awareness and music classification, the latter focusing on recognition of musical genres and instruments. Two training methods are evaluated: conventional maximum likelihood estimation using the Baum-Welch algorithm, and discriminative training, which is expected to improve the recognition accuracy on models with a small number of component densities in state distributions. Our approach is unsupervised in the sense that we do not know what are the underlying acoustic classes that are modeled with different HMM states. In addition to reporting the achieved recognition results, analyses are made to study what properties of sound signals are captured by the states.
Cites: 1 (see at Google Scholar)
Toni Heittola and Anssi Klapuri. Locating segments with drums in music signals. In 3rd International Conference on Music Information Retrieval (ISMIR), pp 271–272. Paris, France, 2002. 22 cites
Locating Segments with Drums in Music Signals
A system is described which segments musical signals according to the presence or absence of drum instruments. Two different yet approximately equally accurate approaches were taken to solve the problem. The first is based on periodicity detection in the amplitude envelopes of the signal at subbands. The band-wise periodicity estimates are aggregated into a summary autocorrelation function, the characteristics of which reveal the drums. The other mechanism applies straightforward acoustic pattern recognition with mel-frequency cepstrum coefficients as features and a Gaussian mixture model classifier. The integrated system achieves 88 % correct segmentation over a database of 28 hours of music from different musical genres. For both methods, errors occur for borderline cases with soft percussive-like drum accompaniment, or transient-like instrumentation without drums.
Cites: 22 (see at Google Scholar)
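The first approach in the drum segmentation abstract above (periodicity of subband amplitude envelopes, aggregated into a summary autocorrelation function) can be sketched roughly as follows. This is an illustration under stated assumptions, not the published system: the equal-width FFT band split, the frame and hop sizes, the per-band normalization, and the scalar `periodicity_strength` summary are all hypothetical choices, and the paper's actual decision rule is not reproduced here.

```python
import numpy as np

def band_envelopes(signal, frame_len=256, hop=128, n_bands=4):
    """Frame-wise energy envelope in each of n_bands equal-width FFT bands."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_bins)
    bands = np.array_split(power, n_bands, axis=1)
    return np.stack([b.sum(axis=1) for b in bands])    # (n_bands, n_frames)

def summary_autocorrelation(envelopes):
    """Average of the normalized autocorrelations of the band envelopes."""
    sacf = np.zeros(envelopes.shape[1])
    for env in envelopes:
        env = env - env.mean()
        acf = np.correlate(env, env, mode="full")[len(env) - 1:]
        if acf[0] > 0:
            sacf += acf / acf[0]                       # lag 0 normalized to 1
    return sacf / len(envelopes)

def periodicity_strength(signal, min_lag=2, **kwargs):
    """Largest summary-autocorrelation value between min_lag and half the length."""
    sacf = summary_autocorrelation(band_envelopes(signal, **kwargs))
    return float(sacf[min_lag:len(sacf) // 2].max())

# Synthetic comparison: a repeated noise burst (drum-like) versus steady noise.
rng = np.random.default_rng(0)
burst = rng.standard_normal(256)
drum_like = np.zeros(32768)
for start in range(0, 32768, 2048):
    drum_like[start:start + 256] = burst
steady_noise = rng.standard_normal(32768)
print(periodicity_strength(drum_like), periodicity_strength(steady_noise))
```

A regularly repeating burst produces a strong peak at the lag of its repetition period, whereas the envelope of steady noise shows no such peak.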
Note regarding IEEE copyrighted material on this page
The material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.