Publications

Publications: 63 (Books: 3, Journal articles: 9, Book chapters: 1, Conference papers: 47, Patents: 2)
Cites: 3054 (according to Google Scholar, updated 11.12.2020)

2021

Audio-Visual Scene Classification: Analysis of DCASE 2021 Challenge Submissions

Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. The task attracted 43 submissions from 13 different teams around the world, and more than half of the submitted systems performed better than the baseline. The techniques common among the top systems are the use of large pretrained models such as ResNet or EfficientNet, trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system achieved a log loss of 0.195 and an accuracy of 93.8%, compared to the baseline system's log loss of 0.662 and accuracy of 77.1%.
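
For reference, the log loss quoted above is the standard multiclass cross-entropy averaged over clips, and accuracy is the fraction of correctly predicted scene classes. A minimal sketch of both metrics (the clip count, class count, and values below are illustrative, not challenge data):

    import numpy as np

    def multiclass_logloss(y_true, y_prob, eps=1e-15):
        """Average cross-entropy; y_true holds class indices, y_prob holds
        per-clip class probabilities (each row sums to 1)."""
        y_prob = np.clip(y_prob, eps, 1.0)
        return -np.mean(np.log(y_prob[np.arange(len(y_true)), y_true]))

    def accuracy(y_true, y_prob):
        return np.mean(np.argmax(y_prob, axis=1) == y_true)

    # Illustrative data: 3 clips, 10 scene classes
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(10), size=3)
    labels = np.array([2, 5, 7])
    print(multiclass_logloss(labels, probs), accuracy(labels, probs))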

PDF

Diversity and Bias in Audio Captioning Datasets

Abstract

Describing soundscapes in sentences allows better understanding of the acoustic scene than a single label indicating the acoustic scene class or a set of audio tags indicating the sound events active in the audio clip. In addition, the richness of natural language allows a range of possible descriptions for the same acoustic scene. In this work, we address the diversity obtained when collecting descriptions of soundscapes using crowdsourcing. We study how much the collection of audio captions can be guided by the instructions given in the annotation task, by analysing the possible bias introduced by auxiliary information provided in the annotation process. Our study shows that even when given hints on the audio content, different annotators describe the same soundscape using different vocabulary. In automatic captioning, hints provided as audio tags represent grounding textual information that facilitates guiding the captioning output towards specific concepts. We also release a new dataset of audio captions and audio tags produced by multiple annotators for a subset of the TAU Urban Acoustic Scenes 2018 dataset, suitable for studying guided captioning.

PDF

Low-Complexity Acoustic Scene Classification for Multi-Device Audio: Analysis of DCASE 2021 Challenge Systems

Abstract

This paper presents the details of Task 1A, Low-Complexity Acoustic Scene Classification with Multiple Devices, in the DCASE 2021 Challenge. The task targeted development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. The system is trained using all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7%, with a log loss of 1.473. The task received 99 submissions from 30 teams, and most of the submitted systems outperformed the baseline. The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy and a log loss under 0.8. The acoustic scene classification task remained a popular task in the challenge, despite the increasing difficulty of the setup.

PDF

Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence

PDF

Crowdsourcing Strong Labels for Sound Event Detection

Abstract

Strong labels are a necessity for evaluation of sound event detection methods, but often scarcely available due to the high resources required by the annotation task. We present a method for estimating strong labels using crowdsourced weak labels, through a process that divides the annotation task into simple unit tasks. Based on estimations of annotators' competence, aggregation and processing of the weak labels results in a set of objective strong labels. The experiment uses synthetic audio in order to verify the quality of the resulting annotations through comparison with ground truth. The proposed method produces labels with high precision, though not all event instances are recalled. Detection metrics comparing the produced annotations with the ground truth show 80% F-score in 1 s segments, and up to 89.5% intersection-based F1-score calculated according to the polyphonic sound detection score metrics.
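
To illustrate the weak-to-strong aggregation step described above, a small sketch is given below: per-segment yes/no answers from several annotators are combined into segment-level activity (here with a plain majority vote, a simplification of the competence-based estimation used in the paper), and contiguous active segments are merged into onset/offset pairs. Segment length, number of annotators, and answers are hypothetical.

    import numpy as np

    def weak_to_strong(votes, seg_len=1.0, threshold=0.5):
        """votes: (n_annotators, n_segments) binary answers for one event class.
        Returns strong labels as a list of (onset, offset) pairs in seconds."""
        active = votes.mean(axis=0) >= threshold   # majority vote per segment
        events, start = [], None
        for i, is_active in enumerate(active):
            if is_active and start is None:
                start = i * seg_len
            elif not is_active and start is not None:
                events.append((start, i * seg_len))
                start = None
        if start is not None:
            events.append((start, len(active) * seg_len))
        return events

    # Hypothetical answers from 5 annotators over 8 one-second segments
    votes = np.array([[0, 1, 1, 1, 0, 0, 1, 1],
                      [0, 1, 1, 0, 0, 0, 1, 1],
                      [0, 0, 1, 1, 0, 0, 1, 0],
                      [0, 1, 1, 1, 0, 1, 1, 1],
                      [0, 1, 0, 1, 0, 0, 1, 1]])
    print(weak_to_strong(votes))   # [(1.0, 4.0), (6.0, 8.0)]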

Keywords

Strong labels, Sound event detection, Crowdsourcing, Multi-annotator data

PDF

Joint Direction and Proximity Classification of Overlapping Sound Events from Binaural Audio

Abstract

Sound source proximity and distance estimation are of great interest in many practical applications, since they provide significant information for acoustic scene analysis. As both tasks share complementary qualities, ensuring efficient interaction between these two is crucial for a complete picture of an aural environment. In this paper, we aim to investigate several ways of performing joint proximity and direction estimation from binaural recordings, both defined as coarse classification problems based on Deep Neural Networks (DNNs). Considering the limitations of binaural audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes. For each method we study different model types to acquire information about the direction-of-arrival (DoA). Finally, we propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of the appearing sources. Experiments are performed for a synthetic reverberant binaural dataset consisting of up to two overlapping sound events.

Keywords

binaural audio, binaural localization, distance estimation

PDF

Sound Event Detection: A Tutorial

Abstract

The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, the goal is to recognize at what temporal instances different sounds are active within an audio signal. This paper gives a tutorial presentation of sound event detection, including its definition, signal processing and machine learning approaches, evaluation, and future perspectives.

PDF

What is the ground truth? Reliability of multi-annotator data for audio tagging

Abstract

Crowdsourcing has become a common approach for annotating large amounts of data. It has the advantage of harnessing a large workforce to produce large amounts of data in a short time, but comes with the disadvantage of employing non-expert annotators with different backgrounds. This raises the problem of data reliability, in addition to the general question of how to combine the opinions of multiple annotators in order to estimate the ground truth. This paper presents a study of the annotations and annotators' reliability for audio tagging. We adapt the use of Krippendorff's alpha and multi-annotator competence estimation (MACE) for a multi-labeled data scenario, and present how MACE can be used to estimate a candidate ground truth based on annotations from non-expert users with different levels of expertise and competence.

Keywords

crowdsourcing, audio tagging, inter-annotator agreement

PDF

A curated dataset of urban acoustic scenes for audio-visual scene analysis

Abstract

This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state-of-the-art uni-modal systems. Our approach obtained an 84.8% accuracy compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

Keywords

Audio-visual data, Scene analysis, Acoustic scene, Pattern recognition, Transfer learning

PDF

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

Abstract

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

PDF

2020

Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1, Acoustic Scene Classification, in the DCASE 2020 Challenge. The task consisted of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Each subtask received around 90 submissions, and most of them outperformed the baseline system. The most used techniques among the submissions were data augmentation in Subtask A, to compensate for the device mismatch, and post-training quantization of neural network weights in Subtask B, to bring the model size under the required limit. The maximum classification accuracy on the evaluation set in Subtask A was 76.5%, compared to the baseline performance of 51.4%. In Subtask B, many systems were just below the size limit, and the maximum classification accuracy was 96.5%, compared to the baseline performance of 89.5%.

Keywords

Acoustic Scene Classification, DCASE 2020 Challenge

Cites: 20 (see at Google Scholar)

PDF

2019

Acoustic Scene Classification in DCASE 2019 challenge: closed and open set classification and data mismatch setups

Abstract

Acoustic Scene Classification has been a regular task in every edition of the DCASE Challenge. Throughout the years, modifications to the task have mostly involved changing the dataset and increasing its size, but recently more realistic setups have also been introduced. In the DCASE 2019 Challenge, the Acoustic Scene Classification task includes three subtasks: Subtask A, a closed-set typical supervised classification where all data is recorded with the same device; Subtask B, a closed-set classification setup with mismatched recording devices between training and evaluation data; and Subtask C, an open-set classification setup in which evaluation data could contain acoustic scenes not encountered in training. In all subtasks, the provided baseline system was significantly outperformed, with top performance being 85.2% for Subtask A, 75.5% for Subtask B, and 67.4% for Subtask C. This paper presents the outcome of DCASE 2019 Challenge Task 1 in terms of submitted systems' performance and analysis.

Keywords

Acoustic Scene Classification, DCASE 2019 Challenge, open set classification

Cites: 21 (see at Google Scholar)

PDF

City classification from multiple real-world sound scenes

Abstract

The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like 'park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification, asking whether a city can be recognized from a set of sound scenes. In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve an accuracy of 50%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. With a simple adaptation of the class labels, pairing city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally, we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.

Keywords

Acoustic scene classification, location identification, city classification, computational sound scene analysis.

Cites: 2 (see at Google Scholar)

PDF

Joint Measurement of Localization and Detection of Sound Events

Abstract

Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent developments in sound event detection methods also approach the localization side, but lack a consistent way of measuring the joint performance of the system; instead, they measure the separate abilities for detection and for localization. This paper proposes augmentation of the localization metrics with a condition related to the detection, and conversely, use of location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. The comparison to the detection-only and localization-only performance shows that the proposed joint metrics operate in a consistent and logical manner, and adequately characterize both aspects.
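
As an assumed illustration of the location-conditioned true positives mentioned above (the 20-degree threshold and the greedy matching are hypothetical choices, not taken from the paper), a detection is counted as correct only if its class matches a reference event and its estimated direction lies within a given angular distance of the reference direction:

    import numpy as np

    def angular_distance(a, b):
        """Great-circle angle in degrees between two (azimuth, elevation) pairs given in degrees."""
        az1, el1, az2, el2 = np.radians([a[0], a[1], b[0], b[1]])
        cos_d = (np.sin(el1) * np.sin(el2)
                 + np.cos(el1) * np.cos(el2) * np.cos(az1 - az2))
        return np.degrees(np.arccos(np.clip(cos_d, -1.0, 1.0)))

    def location_aware_tp(reference, estimated, max_angle=20.0):
        """Count true positives requiring both correct class and close direction.
        Each event is (class_label, (azimuth, elevation))."""
        used, tp = set(), 0
        for cls, ref_doa in reference:
            for j, (est_cls, est_doa) in enumerate(estimated):
                if j in used or est_cls != cls:
                    continue
                if angular_distance(ref_doa, est_doa) <= max_angle:
                    tp += 1
                    used.add(j)
                    break
        return tp

    # Hypothetical reference and system output for one time frame
    ref = [("speech", (30, 0)), ("car", (-90, 10))]
    est = [("speech", (35, -5)), ("car", (120, 0)), ("dog", (0, 0))]
    print(location_aware_tp(ref, est))   # 1: speech matches, car is too far off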

Keywords

Sound event detection and localization, performance evaluation

Cites: 18 (see at Google Scholar)

PDF

Audio-Based Epileptic Seizure Detection

Abstract

This paper investigates automatic epileptic seizure detection from audio recordings using convolutional neural networks. The labeling and analysis of seizure events are necessary in the medical field for patient monitoring, but the manual annotation by expert annotators is time-consuming and extremely monotonous. The proposed method treats all seizure vocalizations as a single target event class, and models the seizure detection problem in terms of detecting the target vs non-target classes. For detection, the method employs a convolutional neural network trained to detect the seizure events in short time segments, based on mel-energies as feature representation. Experiments carried out with different seizure types on 900 hours of audio recordings from 40 patients show that the proposed approach can detect seizures with over 80% accuracy, with a 13% false positive rate and a 22.8% false negative rate.

Keywords

Epileptic seizure detection, convolutional neural network (CNN), sound event detection, audio processing and analysis

Cites: 1 (see at Google Scholar)

PDF

Detection and Classification of Acoustic Scenes and Events

Cites: 5 (see at Google Scholar)

PDF

Sound event detection in the DCASE 2017 Challenge

Abstract

Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) has contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data; Task 3, having overlapping events annotated in real-life audio; and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency based representations, with only a few approaches standing out as radically different. Analysis of the systems' behavior reveals that task-specific optimization has a big role in producing good performance; however, this optimization often closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.

Keywords

Sound event detection, weak labels, pattern recognition, jackknife estimates, confidence intervals

Cites: 42 (see at Google Scholar)

PDF

Sound event envelope estimation in polyphonic mixtures

Abstract

Sound event detection is the task of automatically identifying the presence and temporal boundaries of sound events within an input audio stream. In recent years, deep learning methods have established themselves as the state-of-the-art approach for the task, using binary indicators during training to denote whether an event is active or inactive. However, such binary activity indicators do not fully describe the events, and estimating the envelope of the sounds could provide more precise modeling of their activity. This paper proposes to estimate the amplitude envelopes of target sound event classes in polyphonic mixtures. For training, we use the amplitude envelopes of the target sounds, calculated from mixture signals and, for comparison, from their isolated counterparts. The model is then used to perform envelope estimation and sound event detection. Results show that the envelope estimation allows good modeling of the sound activity, with detection results comparable to the current state of the art.

Keywords

Sound event detection, Envelope estimation, Deep Neural Networks

PDF

2018

A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data

Cites: 173 (see at Google Scholar)

PDF

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Abstract

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

Keywords

Acoustic scene classification, audio datasets, pattern recognition, sound event detection

Cites: 180 (see at Google Scholar)

PDF

Acoustic Scene Classification: An Overview of DCASE 2017 Challenge Entries

Keywords

Acoustic scene classification, audio classification, DCASE challenge

Cites: 38 (see at Google Scholar)

PDF

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)

Cites: 1 (see at Google Scholar)

PDF

Datasets and Evaluation

Abstract

Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.

Cites: 12 (see at Google Scholar)

2017

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)

PDF

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

The DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as in the evaluation of system output using task-specific metrics.

Keywords

Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events

Cites: 345 (see at Google Scholar)

PDF

Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study

Abstract

Human and machine performance in acoustic scene classification is examined through a parallel experiment using the TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.

Cites: 8 (see at Google Scholar)

PDF

2016

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)

Cites: 4 (see at Google Scholar)

PDF

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

Cites: 450 (see at Google Scholar)

PDF

Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
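
As a rough illustration of the segment-based evaluation discussed above (the toolbox mentioned in the abstract implements the full metric definitions; the sketch below covers a single event class, with hypothetical event times and a 1-second segment length):

    import numpy as np

    def segment_f1(reference, estimated, duration, seg=1.0):
        """Segment-based F1 for one event class.
        reference / estimated: lists of (onset, offset) pairs in seconds."""
        n_seg = int(np.ceil(duration / seg))
        def activity(events):
            act = np.zeros(n_seg, dtype=bool)
            for onset, offset in events:
                act[int(onset // seg):int(np.ceil(offset / seg))] = True
            return act
        ref, est = activity(reference), activity(estimated)
        tp = np.sum(ref & est)
        fp = np.sum(~ref & est)
        fn = np.sum(ref & ~est)
        return 2 * tp / (2 * tp + fp + fn)

    # Hypothetical annotations for a 10-second clip
    print(segment_f1([(1.2, 3.5), (6.0, 8.0)], [(1.0, 3.0), (6.5, 9.0)], duration=10.0))  # 0.8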

Cites: 333 (see at Google Scholar)

2015

Acoustic context recognition for mobile devices using a reduced complexity SVM

Abstract

Automatic context recognition enables mobile devices to react to changes in the environment and different situations. While many different sensors can be used for context recognition, the use of acoustic cues is among the most popular and successful. Current approaches to acoustic context recognition (ACR) are too costly in terms of computation and memory requirements to support an always-listening mode. This paper describes our work to develop a reduced complexity, efficient approach to ACR involving support vector machine classifiers. The principal hypothesis is that a significant fraction of training data contains information redundant to classification. Through clustering, training data can thus be selectively decimated in order to reduce the number of support vectors needed to represent discriminative hyperplanes. This represents a significant saving in terms of computational and memory efficiency, with only modest degradations in classification accuracy.

Keywords

Acoustic context recognition, LDA, SVM, k-means, mobile device contextualization

Cites: 5 (see at Google Scholar)

PDF

Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations

Abstract

Methods for detection of overlapping sound events in audio involve matrix factorization approaches, often assigning separated components to event classes. We present a method that bypasses the supervised construction of class models. The method learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix. In testing, the dictionaries are used to estimate directly the class activations. For dealing with large amounts of training data, two methods are proposed for reducing the size of the dictionary. The methods were tested on a database of real life recordings, and outperformed previous approaches by over 10%.
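
A compact sketch of the kind of coupled factorization described above is given below: the spectrogram and the class activity annotation are stacked and factorized with a shared activation matrix, and at test time the spectral part of the dictionary is fixed to estimate class activations directly. Dimensions, iteration counts, and the Euclidean multiplicative updates are illustrative assumptions rather than the exact configuration of the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def coupled_nmf(S, A, k=20, n_iter=200, eps=1e-9):
        """Learn a spectral dictionary W_s and an annotation dictionary W_a with
        a shared activation matrix H from spectrogram S (F x T) and class
        activity annotation A (C x T)."""
        V = np.vstack([S, A])                        # stacked representation
        F = S.shape[0]
        W = rng.random((V.shape[0], k))
        H = rng.random((k, V.shape[1]))
        for _ in range(n_iter):                      # Euclidean multiplicative updates
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W[:F], W[F:], H

    def estimate_class_activations(S_test, W_s, W_a, n_iter=200, eps=1e-9):
        """Fix the spectral dictionary, estimate activations from the test
        spectrogram only, and map them to class activities."""
        H = rng.random((W_s.shape[1], S_test.shape[1]))
        for _ in range(n_iter):
            H *= (W_s.T @ S_test) / (W_s.T @ W_s @ H + eps)
        return W_a @ H

    # Toy data: 40 mel bands, 100 frames, 5 event classes (all hypothetical)
    S = rng.random((40, 100))
    A = (rng.random((5, 100)) > 0.7).astype(float)
    W_s, W_a, _ = coupled_nmf(S, A)
    print(estimate_class_activations(rng.random((40, 30)), W_s, W_a).shape)   # (5, 30)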

Keywords

coupled non-negative matrix factorization, non-negative dictionaries, sound event detection

Cites: 95 (see at Google Scholar)

PDF

2014

Method for creating location-specific audio textures

Abstract

An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need to collect a large amount of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that the consecutive segments will have timbral similarity near the segment boundaries. Results obtained in listening experiments show that there is no statistically significant difference in the audio quality or location-specificity of audio when the created audio textures are compared to excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and audio textures created from them are available at www.cs.tut.fi/~heittolt/audiotexture.

Cites: 6 (see at Google Scholar)

Unsupervised feature extraction for multimedia event detection and ranking using audio content

Abstract

In this paper, we propose a new approach to classify and rank multimedia events based purely on audio content using video data from the TRECVID-2013 multimedia event detection (MED) challenge. We perform several layers of nonlinear mappings to extract a set of unsupervised features from an initial set of temporal and spectral features to obtain a superior representation of the atomic audio units. Additionally, we propose a novel weighted divergence measure for kernel based classifiers. The extensive set of experiments confirms that augmentation of the proposed steps results in an improved accuracy for most of the event classes.

Keywords

Bag of words, multimedia event detection, stacked denoising autoencoders, term weighting, unsupervised feature extraction, weighted Jensen-Shannon divergence

Cites: 8 (see at Google Scholar)

PDF

2013

Sound event detection using non-negative dictionaries learned from annotated overlapping events

Abstract

Detection of overlapping sound events generally requires training class models either from separate data for each class or by making assumptions about the dominating events in the mixed signals. Methods based on sound source separation are currently used in this task, but involve the problem of assigning separated components to sources. In this paper, we propose a method which bypasses the need to build separate sound models. Instead, non-negative dictionaries for the sound content and their annotations are learned in a coupled sense. In the testing stage, time activations of the sound dictionary columns are estimated and used to reconstruct annotations using the annotation dictionary. The method requires no separate training data for classes and in general very promising results are obtained using only a small amount of data.

Keywords

Non-negative matrix factorization, sound event detection

Cites: 38 (see at Google Scholar)

PDF

Query-by-example retrieval of sound events using an integrated similarity measure of content and labels

Abstract

This paper presents a method for combining audio similarity and semantic similarity into a single similarity measure for query-by-example retrieval. The integrated similarity measure is used to retrieve sound events that are similar in content to the given query and have labels containing similar words. Through the semantic component, the method is able to handle variability in labels of sound events. Through the acoustic component, the method retrieves acoustically similar examples. On a test database of over 3000 sound event examples, the proposed method obtains a better retrieval performance than audio-based retrieval, and returns results closer acoustically to the query than a label-based retrieval.
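
The abstract does not spell out how the two components are combined, so the sketch below uses a simple weighted sum of an acoustic similarity (cosine between feature vectors) and a label similarity (word overlap) purely as an assumed illustration of an integrated measure; the mixing weight, features, and labels are hypothetical.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def label_similarity(words_a, words_b):
        # Word overlap (Jaccard) as a stand-in for semantic similarity of labels
        a, b = set(words_a), set(words_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def integrated_similarity(feat_q, words_q, feat_x, words_x, alpha=0.5):
        # alpha is a hypothetical mixing weight, not a value from the paper
        return (alpha * cosine(feat_q, feat_x)
                + (1 - alpha) * label_similarity(words_q, words_x))

    # Hypothetical query and database item: feature vectors plus free-form labels
    rng = np.random.default_rng(1)
    query_feat, item_feat = rng.random(13), rng.random(13)
    print(integrated_similarity(query_feat, ["dog", "barking"],
                                item_feat, ["dog", "bark", "yard"]))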

Keywords

Audio similarity, content-based retrieval, integrated similarity measure, label-based retrieval, query-by-example retrieval, semantic similarity, sound events

Cites: 3 (see at Google Scholar)

PDF

Supervised Model Training for Overlapping Sound Events Based on Unsupervised Source Separation

Abstract

Sound event detection is addressed in the presence of overlapping sounds. Unsupervised sound source separation into streams is used as a preprocessing step to minimize the interference of overlapping events. This poses a problem in supervised model training, since there is no knowledge about which separated stream contains the targeted sound source. We propose two iterative approaches based on EM algorithm to select the most likely stream to contain the target sound: one by selecting always the most likely stream and another one by gradually eliminating the most unlikely streams from the training. The approaches were evaluated with a database containing recordings from various contexts, against the baseline system trained without applying stream selection. Both proposed approaches were found to give a reasonable increase of 8 percentage units in the detection accuracy.

Keywords

Acoustic event detection, acoustic pattern recognition, sound source separation, supervised model training

Cites: 64 (see at Google Scholar)

PDF

Analysis of Acoustic-Semantic Relationship for Diversely Annotated Real-World Audio Data

Abstract

A common problem of freely annotated or user-contributed audio databases is the high variability of the labels, related to homonyms, synonyms, plurals, etc. Automatically re-labeling audio data based on audio similarity could offer a solution to this problem. This paper studies the relationship between audio and labels in a sound event database, by evaluating the semantic similarity of labels of acoustically similar sound event instances. The assumption behind the study is that acoustically similar events are annotated with semantically similar labels. Indeed, for 43% of the tested data, at least one of the ten acoustically nearest neighbors had a synonym as its label, while the closest related term is on average one level higher or lower in the semantic hierarchy.

Keywords

Audio similarity, semantic similarity, sound events

Cites: 8 (see at Google Scholar)

PDF

Context-Dependent Sound Event Detection

Abstract

The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance with various levels of polyphony. This combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of detection algorithms simpler. The two-stage approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.

Cites: 195 (see at Google Scholar)

Singing voice identification and lyrics transcription for music information retrieval (invited paper)

Abstract

This paper presents an overview of methods and applications dealing with analysis of singing voice audio signals, related to singer identity and lyrics content of the singing. Singer identification in polyphonic music is based on general audio classification methods. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The results show that classification of singing voices can be done robustly in polyphonic music when using source separation. Lyrics transcription is approached as a speech recognition problem, with specific elements for dealing with singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The word recognition accuracy of the lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, for performing a textual search based on the words recognized from the query. A system for automatic alignment of lyrics and audio is also presented, with sufficient performance for facilitating applications such as automatic karaoke annotation or song browsing.

Keywords

Singing voice identification, lyrics transcription, music information retrieval, query-by-singing, polyphonic music, source separation, speech recognition

Cites: 13 (see at Google Scholar)

PicSOM Experiments in TRECVID 2013

Abstract

Our experiments in TRECVID 2013 include participation in the Semantic Indexing (SIN), Multimedia Event Detection (MED), and Multimedia Event Recounting (MER) tasks.

Cites: 12 (see at Google Scholar)

PDF

On the human ability to discriminate audio ambiances from similar locations of an urban environment

Abstract

When developing advanced location-based systems augmented with audio ambiances, it would be cost-effective to use a few representative samples from typical environments for describing a larger number of similar locations. The aim of this experiment was to study the human ability to discriminate audio ambiances recorded in similar locations of the same urban environment. A listening experiment consisting of material from three different environments and nine different locations was carried out with nineteen subjects, to study the credibility of audio representations for certain environments, which would diminish the need for collecting huge audio databases. The first goal was to study to what degree humans are able to recognize whether the recording has been made in an indicated location or in another similar location, when presented with the name of the place, location on a map, and the associated audio ambiance. The second goal was to study whether the ability to discriminate audio ambiances from different locations is affected by a visual cue, by presenting additional information in the form of a photograph of the suggested location. The results indicate that audio ambiances from similar urban areas of the same city differ enough so that it is not acceptable to use a single recording as ambience to represent different yet similar locations. Including an image was found to increase the perceived credibility of all the audio samples in representing a certain location. The results suggest that developers of audio-augmented location-based systems should aim at using audio samples recorded on-site for each location in order to achieve a credible impression.

Keywords

Listening experiment, location recognition, audio-visual perception, audio ambiance

Cites: 2 (see at Google Scholar)

PDF

2012

Method and apparatus for providing media event suggestions

Abstract

Various methods are described for providing media event suggestions based at least in part on a co-occurrence model. One example method may comprise receiving a selection of at least one media event to include in a media composition. Additionally, the method may comprise determining at least one suggested media event based at least in part on the at least one media events. The method may further comprise causing display of the at least one suggested media event. Similar and related methods, apparatuses, and computer program products are also provided.

Method and apparatus for generating an audio summary of a location

Abstract

Various methods are described for generating an audio summary representing a location on a place exploration service. One example method may comprise receiving at least one audio file. The method may further comprise dividing the at least one audio file into one or more audio segments. Additionally, the method may comprise determining a representative audio segment for each of the one or more audio segments. The method may further comprise generating an audio summary of the at least one audio file by combining one or more of the representative audio segments of the one or more audio segments. Similar and related methods, apparatuses, and computer program products are also provided.

Cites: 10 (see at Google Scholar)

2011

Sound Event Detection in Multisource Environments Using Source Separation

Abstract

This paper proposes a sound event detection system for natural multisource environments, using a sound source separation front-end. The recognizer aims at detecting sound events from various everyday contexts. The audio is preprocessed using non-negative matrix factorization and separated into four individual signals. Each sound event class is represented by a Hidden Markov Model trained using mel frequency cepstral coefficients extracted from the audio. Each separated signal is used individually for feature extraction and then segmentation and classification of sound events using the Viterbi algorithm. The separation allows detection of a maximum of four overlapping events. The proposed system shows a significant increase in event detection accuracy compared to a system able to output a single sequence of events.

Cites: 120 (see at Google Scholar)

PDF Slides

Latent Semantic Analysis in Sound Event Detection

Abstract

This paper presents the use of probabilistic latent semantic analysis (PLSA) for modeling co-occurrence of overlapping sound events in audio recordings from everyday audio environments such as office, street or shop. Co-occurrence of events is represented as the degree of their overlapping in a fixed length segment of polyphonic audio. In the training stage, PLSA is used to learn the relationships between individual events. In detection, the PLSA model continuously adjusts the probabilities of events according to the history of events detected so far. The event probabilities provided by the model are integrated into a sound event detection system that outputs a monophonic sequence of events. The model offers a very good representation of the data, having low perplexity on test recordings. Using PLSA for estimating prior probabilities of events provides an increase of event detection accuracy to 35%, compared to 30% for using uniform priors for the events. There are different levels of performance increase in different audio contexts, with few contexts showing significant improvement.

Keywords

sound event detection, latent semantic analysis

Cites: 47 (see at Google Scholar)

PDF

Sound Event Detection and Context Recognition

Keywords

sound event detection, context recognition

Cites: 2 (see at Google Scholar)

PDF

Automatic understanding of lyrics from singing

Cites: 2 (see at Google Scholar)

PDF

2010

Automatic Recognition of Lyrics in Singing

Abstract

The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer. The system is targeted to both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained using speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance. Further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity. We test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text. The large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.

Cites: 96 (see at Google Scholar)

Audio Context Recognition Using Audio Event Histograms

Abstract

This paper presents a method for audio context recognition, meaning classification between everyday environments. The method is based on representing each audio context using a histogram of audio events which are detected using a supervised classifier. In the training stage, each context is modeled with a histogram estimated from annotated training data. In the testing stage, individual sound events are detected in the unknown recording and a histogram of the sound event occurrences is built. Context recognition is performed by computing the cosine distance between this histogram and the event histograms of each context from the training database. Term frequency-inverse document frequency weighting is studied for controlling the importance of different events in the histogram distance calculation. An average classification accuracy of 89% is obtained in the recognition between ten everyday contexts. Combining the event based context recognition system with more conventional audio based recognition increases the recognition rate to 92%.
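
A minimal sketch of the histogram-matching idea described above, assuming made-up event counts: event-occurrence histograms are weighted with term frequency-inverse document frequency and compared with cosine similarity to pick the closest context model.

    import numpy as np

    def tfidf(counts, idf):
        return counts / (counts.sum() + 1e-12) * idf

    def classify(test_counts, context_histograms):
        """context_histograms: (n_contexts, n_event_classes) event counts per
        training context; test_counts: event counts for the unknown recording."""
        df = np.maximum((context_histograms > 0).sum(axis=0), 1)
        idf = np.log(context_histograms.shape[0] / df)
        contexts = np.array([tfidf(h, idf) for h in context_histograms])
        test = tfidf(test_counts, idf)
        sims = contexts @ test / (np.linalg.norm(contexts, axis=1)
                                  * np.linalg.norm(test) + 1e-12)
        return int(np.argmax(sims))        # index of the closest context model

    # Hypothetical counts of four event classes in three training contexts
    train = np.array([[30.0,  2.0,  0.0,  5.0],
                      [ 0.0, 25.0,  3.0,  0.0],
                      [ 4.0,  0.0, 20.0,  2.0]])
    print(classify(np.array([2.0, 0.0, 18.0, 3.0]), train))   # 2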

Cites: 80 (see at Google Scholar)

PDF

Acoustic Event Detection in Real-life Recordings

Abstract

This paper presents a system for acoustic event detection in recordings from real life environments. The events are modeled using a network of hidden Markov models; their size and topology are chosen based on a study of isolated events recognition. We also studied the effect of ambient background noise on event classification performance. On real life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0 dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct for 84% of the time.

Cites: 267 (see at Google Scholar)

PDF

Recognition of phonemes and words in singing

Abstract

This paper studies the influence of n-gram language models in the recognition of sung phonemes and words. We train uni-, bi-, and trigram language models for phonemes and bi- and trigrams for words. The word-level language model is estimated from a textual lyrics database. In the recognition we use a hidden Markov model based phonetic recognizer adapted to singing voice. The models were tested on monophonic singing and on vocal lines separated from polyphonic music. On clean singing the phoneme recognition accuracies varied from 20% (no language model) to 39% (bigram) and on polyphonic music from 6% (no language model) to 20% (bigram). In word recognition, one fifth of the words were recognized in clean singing, the performance being lower on polyphonic music. We study the use of the recognition results in a query-by-singing application. Using the recognized words, we retrieve the songs by searching for the text in a text lyrics database. For the word recognition system having only 24% correct recognition rate, the first retrieved song is correct in 57% of the test cases.

Keywords

Query-by-singing, singing recognition, speech recognition

Cites: 26 (see at Google Scholar)

PDF

2009

Adaptation of a speech recognizer for singing voice

Abstract

This paper studies the speaker adaptation techniques that can be applied for adapting a speech recognizer to singing voice. Maximum likelihood linear regression (MLLR) techniques are studied, with specific details in choosing the number and types of transforms. The recognition performance of the different methods is measured in terms of phoneme recognition rate and singing-to-lyrics alignment errors of the adapted recognizers. Different methods improve the correct recognition rate by up to 10 percentage units, compared to the non-adapted system. In singing-to-lyrics alignment we obtain at best a mean absolute alignment error of 0.94 seconds, compared to 1.26 seconds for the non-adapted system. Global adaptation was found to provide the most improvement in performance, but a small further improvement was obtained with regression tree adaptation.

Cites: 11 (see at Google Scholar)

PDF

2008

Combining Pitch-Based Inference and Non-Negative Spectrogram Factorization in Separating Vocals from Polyphonic Music

Abstract

This paper proposes a novel algorithm for separating vocals from polyphonic music accompaniment. Based on pitch estimation, the method first creates a binary mask indicating time-frequency segments in the magnitude spectrogram where harmonic content of the vocal signal is present. Second, non-negative matrix factorization (NMF) is applied on the non-vocal segments of the spectrogram in order to learn a model for the accompaniment. NMF predicts the amount of noise in the vocal segments, which allows separating vocals and noise even when they overlap in time and frequency. Simulations with commercial and synthesized acoustic material show an average improvement of 1.3 dB and 1.8 dB, respectively, in comparison with a reference algorithm based on sinusoidal modeling, and also the perceptual quality of the separated vocals is clearly improved. The method was also tested in aligning separated vocals and textual lyrics, where it produced better results than the reference method.

Cites: 94 (see at Google Scholar)

PDF

Automatic alignment of music audio and lyrics

Abstract

This paper proposes an algorithm for aligning singing in polyphonic music audio with textual lyrics. As preprocessing, the system uses a voice separation algorithm based on melody transcription and sinusoidal modeling. The alignment is based on a hidden Markov model speech recognizer where the acoustic model is adapted to singing voice. The textual input is preprocessed to create a language model consisting of a sequence of phonemes, pauses and possible instrumental breaks. The Viterbi algorithm is used to align the audio features with the text. On a test set consisting of 17 commercial recordings, the system achieves an average absolute error of 1.40 seconds in aligning lines of the lyrics.

Cites: 42 (see at Google Scholar)

PDF

2007

Singer identification in polyphonic music using vocal separation and pattern recognition methods

Abstract

This paper evaluates methods for singer identification in polyphonic music, based on pattern classification together with an algorithm for vocal separation. Classification strategies include discriminant functions, a Gaussian mixture model (GMM)-based maximum likelihood classifier, and nearest neighbour classifiers using Kullback-Leibler divergence between the GMMs. A novel method of estimating the symmetric Kullback-Leibler distance between two GMMs is proposed. Two different approaches to singer identification were studied: one where the acoustic features were extracted directly from the polyphonic signal and one where the vocal line was first separated from the mixture using a predominant melody transcription system. The methods are evaluated using a database of songs where the level difference between the singing and the accompaniment varies. It was found that vocal line separation enables robust singer identification down to 0 dB and -5 dB singer-to-accompaniment ratios.

Cites: 117 (see at Google Scholar)

PDF Poster

2006

Band Decomposition of Voice Signals Using Wavelets Defined from Fractional B-spline Functions

Abstract

The B-spline functions constitute a good mathematical background for defining new wavelets. The fractional B-splines provide the only wavelets that have an explicit analytical form, making the mathematical manipulations easier. This paper proposes a decomposition of the voice signal into hierarchical octave bands, using a fractional B-spline based wavelet decomposition. The proposed decomposition scheme takes into account the location of the formants, a particular feature of voice signals. Each band is represented by its mean energy, and the resulting energy representation is tested for singing voice identification. Discriminant analysis is used, and the testing is done on both known and unknown data.

Keywords

B-spline functions, time-frequency analysis, wavelet decomposition

Cites: 1 (see at Google Scholar)

Methods for singing voice identification using energy coefficients as features

Abstract

This paper describes two energy representations of the voice signal and tests their efficiency in singing voice identification. The first set of energy features consists of the Mel-scale energies of 14 frequency bands, covering the whole frequency spectrum of the signal. The second energy representation is obtained by wavelet decomposition of the voice signal. The wavelet and scaling filters for the decomposition are derived from fractional B-spline functions. The wavelet decomposition is done hierarchically, into 14 bands, with octave-band filters, taking into account the specific frequencies of the formants. Both energy representations are tested for singing voice identification on the training set and on unknown data.

Keywords

Singing voice identification, energy coefficients, Mel-scale energy, fractional B-spline functions, wavelet decomposition, octave-band filtering

Cites: 4 (see at Google Scholar)

Spectrum characteristics of singing voice signals and their usefulness in singer identification

Estimation of Closed Glottis Phase in Professional Singing Voice Using the Frobenius Norm

2005

Closed Phase Detection in the Singing Voice Using Information About Formant Frequencies During One Glottal Cycle

Poster

The Mel-Frequency Cepstral Coefficients in the Context of Singing Voice

Abstract

The singing voice is the oldest and most complex musical instrument. A familiar singer's voice is easily recognizable for humans, even when hearing a song for the first time. For automatic identification, on the other hand, this is a difficult task among sound source identification applications. Signal processing techniques aim to extract features that are related to identity characteristics. The research presented in this paper considers 32 Mel-Frequency Cepstral Coefficients in two subsets: the low order MFCCs characterizing the vocal tract resonances and the high order MFCCs related to the glottal wave shape. We explore possibilities to identify and discriminate singers using the two sets. Based on the results we can affirm that both subsets have their contribution in defining the identity of the voice, but the high order subset is more robust to changes in singing style.

Cites: 26 (see at Google Scholar)

PDF

Inter-dependence of Spectral Measures for the Singing Voice

Cites: 6 (see at Google Scholar)

2004

An exploration of singing voice individuality, Analysis of Biomedical Signals and Images

Cites: 1 (see at Google Scholar)

Note regarding IEEE copyrighted material on this page
The material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.