Publications


Academic
Publications: 69 (Books: 2, PhD thesis: 1, Journal articles: 11, Book chapters: 2, Conference papers: 47, Patents: 3, Master's thesis: 1, Other publications: 2)
Cites: 7076 (according to Google Scholar, updated 10.04.2024)

2023

Privacy-preserving sound representation

Abstract

According to an example embodiment, a method (200) for audio-based monitoring is provided, the method (200) comprising: deriving (202), via usage of a predefined conversion model (M), based on audio data that represents sounds captured in a monitored space, one or more audio features that are descriptive of at least one characteristic of said sounds; identifying (204) respective occurrences of one or more predefined acoustic events in said space based on the one or more audio features; and carrying out (206), in response to identifying an occurrence of at least one of said one or more predefined acoustic events, one or more predefined actions associated with said at least one of said one or more predefined acoustic events, wherein said conversion model (M) is trained to provide said one or more audio features such that they include information that facilitates identification of respective occurrences of said one or more predefined acoustic events while preventing identification of speech characteristics.

2022

Low-Complexity Acoustic Scene Classification in DCASE 2022 Challenge

Abstract

This paper presents an analysis of the Low-Complexity Acoustic Scene Classification task in DCASE 2022 Challenge. The task was a continuation from the previous years, but the low-complexity requirements were changed to the following: the maximum number of allowed parameters, including the zero-valued ones, was 128 K, with parameters being represented using INT8 numerical format; and the maximum number of multiply-accumulate operations at inference time was 30 million. While the dataset was the same as in the previous year, the audio samples were shortened from 10 seconds to 1 second for this year's challenge. The provided baseline system is a convolutional neural network which employs post-training quantization of parameters, resulting in 46.5 K parameters and 29.23 million multiply-and-accumulate operations (MMACs). Its performance on the evaluation data is 44.2% accuracy and 1.532 log-loss. In comparison, the top system in the challenge obtained an accuracy of 59.6% and a log loss of 1.091, having 121 K parameters and 28 MMACs. The task received 48 submissions from 19 different teams, most of which outperformed the baseline system.
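
To make the complexity limits above concrete, here is a small illustrative sketch (my own, not the challenge baseline code) that counts parameters against the 128 K limit and applies symmetric per-tensor INT8 post-training quantization to a set of weight arrays; the layer shapes are hypothetical.

import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor post-training quantization to INT8
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def check_complexity(layer_weights, max_params=128_000):
    # Count all parameters (including zero-valued ones) against the limit
    total = sum(w.size for w in layer_weights)
    return total, total <= max_params

# hypothetical small model: two convolution kernels and a dense layer
layers = [np.random.randn(16, 1, 7, 7), np.random.randn(32, 16, 7, 7), np.random.randn(10, 32)]
params, ok = check_complexity(layers)
quantized = [quantize_int8(w) for w in layers]
print(params, "parameters, within limit:", ok)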

Cites: 38 (see at Google Scholar)

PDF

Zero-Shot Audio Classification using Image Embeddings

Abstract

Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling which is expensive and time-consuming. Zero-shot learning models are capable of classifying the unseen concepts by utilizing their semantic information. The present study introduces image embeddings as side information for zero-shot audio classification by using a nonlinear acoustic-semantic projection. We extract the semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information in different domains: image, audio, and textual. We demonstrate that the image embeddings can be used as semantic information to perform zero-shot audio classification. The experimental results show that the image and textual embeddings display similar performance both individually and together. We additionally calculate the semantic acoustic embeddings from the test samples to provide an upper limit to the performance. The results show that the classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach the performance of the semantic acoustic embeddings when the seen and unseen classes are semantically similar.
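
A minimal sketch of the zero-shot setup described above, using a hypothetical linear acoustic-to-semantic projection and cosine similarity to class embeddings in place of the nonlinear projection evaluated in the paper; dimensions and data are placeholders.

import numpy as np

rng = np.random.default_rng(0)

# assumed dimensions: 128-d acoustic embeddings, 512-d semantic (image/text) embeddings
W = rng.standard_normal((512, 128)) * 0.01          # placeholder for a learned acoustic-to-semantic projection
class_embeddings = rng.standard_normal((5, 512))     # semantic embeddings of the unseen classes

def classify_zero_shot(acoustic_embedding):
    # Project the audio embedding into the semantic space and pick the nearest unseen class
    projected = W @ acoustic_embedding
    sims = class_embeddings @ projected / (
        np.linalg.norm(class_embeddings, axis=1) * np.linalg.norm(projected) + 1e-9)
    return int(np.argmax(sims))                       # index of the predicted unseen class

print(classify_zero_shot(rng.standard_normal(128)))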

Keywords

zero-shot learning, audio classification, semantic embeddings, image embeddings

Cites: 4 (see at Google Scholar)

PDF

2021

Low-Complexity Acoustic Scene Classification for Multi-Device Audio: Analysis of DCASE 2021 Challenge Systems

Abstract

This paper presents the details of Task 1A Low-Complexity Acoustic Scene Classification with Multiple Devices in the DCASE 2021 Challenge. The task targeted development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. The system is trained using all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7%, with a log loss of 1.473. The task received 99 submissions from 30 teams, and most of the submitted systems outperformed the baseline. The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy, and logloss under 0.8. The acoustic scene classification task remained a popular task in the challenge, despite the increasing difficulty of the setup.

Cites: 61 (see at Google Scholar)

PDF

Audio-Visual Scene Classification: Analysis of DCASE 2021 Challenge Submissions

Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have better performance than the baseline. The common techniques among the top systems are the usage of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost the performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system among all achieved a logloss of 0.195 and accuracy of 93.8%, compared to the baseline system with logloss of 0.662 and accuracy of 77.1%.

Cites: 18 (see at Google Scholar)

PDF

Sound Event Detection: A Tutorial

Abstract

The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, the goal is to recognize at what temporal instances different sounds are active within an audio signal. This paper gives a tutorial presentation of sound event detection, including its definition, signal processing and machine learning approaches, evaluation, and future perspectives.
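
As a rough illustration of the "what and when" formulation (my own sketch, not from the tutorial), the following turns frame-wise class probabilities into event onsets and offsets by thresholding; the threshold and hop length are assumptions.

import numpy as np

def probabilities_to_events(probs, class_names, threshold=0.5, hop=0.02):
    # Convert frame-wise probabilities (frames x classes) into (class, onset, offset) tuples in seconds
    events = []
    active = probs > threshold
    for c, name in enumerate(class_names):
        padded = np.concatenate(([False], active[:, c], [False]))
        changes = np.flatnonzero(padded[1:] != padded[:-1])   # alternating onset/offset frame indices
        for onset, offset in zip(changes[::2], changes[1::2]):
            events.append((name, onset * hop, offset * hop))
    return events

probs = np.random.rand(100, 3)   # hypothetical network output for 100 frames, 3 classes
print(probabilities_to_events(probs, ["dog bark", "car", "speech"]))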

Cites: 168 (see at Google Scholar)

Computational Audio Content Analysis in Everyday Environments

Abstract

Our everyday environments are full of sounds that have a vital role in providing us information and allowing us to understand what is happening around us. Humans have formed strong associations between physical events in their environment and the sounds that these events produce. Such associations are described using textual labels, sound events, and they allow us to understand, recognize, and interpret the concepts behind sounds. Examples of such sound events are dog barking, person shouting or car passing by.

This thesis deals with computational methods for audio content analysis of everyday environments. Along with the increased usage of digital audio in our everyday life, automatic audio content analysis has become a more and more pursued ability. Content analysis enables an in-depth understanding of what was happening in the environment when the audio was captured, and this further facilitates applications that can accurately react to the events in the environment. The methods proposed in this thesis focus on sound event detection, the task of recognizing and temporally locating sound events within an audio signal, and include aspects related to development of methods dealing with a large set of sound classes, detection of multiple sounds, and evaluation of such methods.

The work presented in this thesis focuses on developing methods that allow the detection of multiple overlapping sound events and robust acoustic model training based on mixture audio containing overlapping sounds. Starting with an HMM-based approach for prominent sound event detection, the work advanced by extending it into polyphonic detection using multiple Viterbi iterations or sound source separation. These polyphonic sound event detection systems were based on a collection of generative classifiers to produce multiple labels for the same time instance, which doubled or in some cases tripled the detection performance. As an alternative approach, polyphonic detection was implemented using class-wise activity detectors in which the activity of each event class was detected independently and class-wise event sequences were merged to produce the polyphonic system output. The polyphonic detection increased applicability of the methods in everyday environments substantially.

For evaluation of methods, the work proposed a new metric for polyphonic sound event detection which takes into account the polyphony. The new metric, a segment-based F-score, provides rigorous definitions for the correct and erroneous detections, besides being more suitable for comparing polyphonic annotation and polyphonic system output than the previously used metrics, and has since become one of the standard metrics in the research field.

Part of this thesis includes studying sound events as a constituent part of the acoustic scene based on contextual information provided by their co-occurrence. This information was used for both sound event detection and acoustic scene classification. In sound event detection, context information was used to identify the acoustic scene in order to narrow down the selection of possible sound event classes based on this information, which allowed use of context-dependent acoustic models and event priors. This approach provided moderate yet consistent performance increase across all tested acoustic scene types, and enabled the detection system to be easily expanded to new scenes. In acoustic scene classification, the scenes were identified based on the distinctive and scene-specific sound events detected, with performance comparable to traditional approaches, while the fusion of these two approaches showed a significant further increase in the performance.

The thesis also includes significant contribution to the development of tools for open research in the field, such as standardized evaluation protocols, and release of open datasets, benchmark systems, and open-source tools.

Cites: 3 (see at Google Scholar)

PDF

A curated dataset of urban acoustic scenes for audio-visual scene analysis

Abstract

This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state of the art uni-modal systems. Our approach obtained an 84.8% accuracy compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

Keywords

Audio-visual data, Scene analysis, Acoustic scene, Pattern recognition, Transfer learning

Cites: 38 (see at Google Scholar)

PDF

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

Abstract

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

Cites: 122 (see at Google Scholar)

2020

Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1, Acoustic Scene Classification, in the DCASE 2020 Challenge. The task consisted of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Each subtask received around 90 submissions, and most of them outperformed the baseline system. The most used techniques among the submissions were data augmentation in Subtask A, to compensate for the device mismatch, and post-training quantization of neural network weights in Subtask B, to bring the model size under the required limit. The maximum classification accuracy on the evaluation set in Subtask A was 76.5%, compared to the baseline performance of 51.4%. In Subtask B, many systems were just below the size limit, and the maximum classification accuracy was 96.5%, compared to the baseline performance of 89.5%.

Keywords

Acoustic Scene Classification, DCASE 2020 Challenge

Cites: 188 (see at Google Scholar)

PDF

Active Learning for Sound Event Detection

Abstract

This paper proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the selection is based on the principle of mismatch-first farthest-traversal. During the training of SED models, recordings are used as training inputs, preserving the long-term context for annotated segments. The proposed system clearly outperforms reference methods in the two datasets used for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with recordings as context outperforms training with only annotated segments. Mismatch-first farthest-traversal outperforms reference sample selection methods based on random sampling and uncertainty sampling. Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare: by annotating only 2% of the training data, the achieved SED performance is similar to annotating all the training data.
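
The selection principle named above can be sketched roughly as follows, under simplifying assumptions of mine: two classifiers' hard predictions stand in for the mismatch criterion, and Euclidean distance in a generic feature space stands in for the traversal.

import numpy as np

def mismatch_first_farthest_traversal(features, pred_a, pred_b, labeled_idx, budget):
    # Pick segments where two classifiers disagree, farthest from already-labeled data first
    selected = []
    labeled = list(labeled_idx)
    mismatched = [i for i in range(len(features)) if pred_a[i] != pred_b[i] and i not in labeled]
    for _ in range(min(budget, len(mismatched))):
        # distance of each remaining candidate to its nearest labeled sample
        dists = [min(np.linalg.norm(features[i] - features[j]) for j in labeled) for i in mismatched]
        pick = mismatched.pop(int(np.argmax(dists)))
        selected.append(pick)
        labeled.append(pick)
    return selected

X = np.random.rand(20, 8)                                   # hypothetical segment features
pa, pb = np.random.randint(0, 2, 20), np.random.randint(0, 2, 20)
print(mismatch_first_farthest_traversal(X, pa, pb, labeled_idx=[0], budget=3))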

Cites: 32 (see at Google Scholar)

2019

Ympäristömelun mittaaminen käyttäen automaattista lähteiden tunnistusta (Measuring environmental noise using automatic source recognition)

Abstract

A major challenge in continuous environmental noise measurement has been separating the contributions of different sound sources from the overall sound level prevailing during the measurement. Usually the measurement is carried out because of a clearly identifiable noise source, and the aim is to remove the influence of other noise sources from the final result. In a temporally short measurement session, the person performing the measurement can verify the situation either by listening at the time of measurement or afterwards from the recorded audio. We have implemented a pattern-recognition-based classification algorithm in a sound level meter and tested it at several different sites. In this paper we describe the new sound level meter concept, the trained classifier implemented in it, and the classifier's performance during our pilot study. We also discuss the challenges we encountered and our solutions to them.

PDF

Acoustic Scene Classification in DCASE 2019 challenge: closed and open set classification and data mismatch setups

Abstract

Acoustic Scene Classification has been a task in every edition of the DCASE Challenge. Throughout the years, modifications to the task have mostly involved changing the dataset and increasing its size, but recently more realistic setups have also been introduced. In the DCASE 2019 Challenge, the Acoustic Scene Classification task includes three subtasks: Subtask A, a closed-set typical supervised classification where all data is recorded with the same device; Subtask B, a closed-set classification setup with mismatched recording devices between training and evaluation data; and Subtask C, an open-set classification setup in which evaluation data could contain acoustic scenes not encountered in the training. In all subtasks, the provided baseline system was significantly outperformed, with top performance being 85.2% for Subtask A, 75.5% for Subtask B, and 67.4% for Subtask C. This paper presents the outcome of DCASE 2019 Challenge Task 1 in terms of submitted system performance and analysis.

Keywords

Acoustic Scene Classification, DCASE 2019 Challenge, open set classification

Cites: 82 (see at Google Scholar)

PDF

City classification from multiple real-world sound scenes

Abstract

The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like 'park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification to ask whether we can recognize a city from a set of sound scenes. In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve an accuracy of 50%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. With a simple adaptation of the class labels, pairing city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally, we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.

Keywords

Acoustic scene classification, location identification, city classification, computational sound scene analysis.

Cites: 9 (see at Google Scholar)

PDF

Joint Measurement of Localization and Detection of Sound Events

Abstract

Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent developments in sound event detection methods also approach the localization side, but lack a consistent way of measuring the joint performance of the system; instead, they measure the separate abilities for detection and for localization. This paper proposes augmentation of the localization metrics with a condition related to the detection, and conversely, use of location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. The comparison to the detection-only and localization-only performance shows that the proposed joint metrics operate in a consistent and logical manner, and adequately characterize both aspects.
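
A rough sketch of a location-aware detection count in the spirit of this proposal: a prediction is a true positive only when its class matches a reference and its direction of arrival is within an angular threshold. The greedy matching rule and the 20-degree threshold are illustrative assumptions, not the paper's exact definition.

import numpy as np

def angular_distance(doa1, doa2):
    # Great-circle angle in degrees between two direction-of-arrival vectors
    cos = np.clip(np.dot(doa1, doa2) / (np.linalg.norm(doa1) * np.linalg.norm(doa2)), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def location_aware_f_score(references, predictions, threshold_deg=20.0):
    # references/predictions: lists of (class_label, doa_vector) for one time frame
    used, tp = set(), 0
    for ref_label, ref_doa in references:
        for k, (pred_label, pred_doa) in enumerate(predictions):
            if k not in used and pred_label == ref_label and angular_distance(ref_doa, pred_doa) <= threshold_deg:
                used.add(k)
                tp += 1
                break
    fp = len(predictions) - tp
    fn = len(references) - tp
    return 2 * tp / max(2 * tp + fp + fn, 1)

refs = [("speech", np.array([1.0, 0.0, 0.0]))]
preds = [("speech", np.array([0.95, 0.1, 0.0]))]
print(location_aware_f_score(refs, preds))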

Keywords

Sound event detection and localization, performance evaluation

Cites: 72 (see at Google Scholar)

PDF

Audio-based Epileptic seizure detection

Abstract

This paper investigates automatic epileptic seizure detection from audio recordings using convolutional neural networks. The labeling and analysis of seizure events are necessary in the medical field for patient monitoring, but the manual annotation by expert annotators is time-consuming and extremely monotonous. The proposed method treats all seizure vocalizations as a single target event class, and models the seizure detection problem in terms of detecting the target vs non-target classes. For detection, the method employs a convolutional neural network trained to detect the seizure events in short time segments, based on mel-energies as feature representation. Experiments carried out with different seizure types on 900 hours of audio recordings from 40 patients show that the proposed approach can detect seizures with over 80% accuracy, with a 13% false positive rate and a 22.8% false negative rate.

Keywords

Epileptic seizure detection, convolutional neural network (CNN), sound event detection, audio processing and analysis

Cites: 3 (see at Google Scholar)

PDF

Sound event detection in the DCASE 2017 Challenge

Abstract

Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3 having overlapping events annotated in real-life audio, and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency based representations, with only a few approaches standing out as radically different. Analysis of the systems' behavior reveals that task-specific optimization has a big role in producing good performance; however, this optimization often closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.
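
As an illustration of jackknife-based confidence intervals of the kind mentioned above, the sketch below applies a generic leave-one-out jackknife to per-item scores; this is my own generic formulation, not the procedure as specified in the paper.

import numpy as np

def jackknife_confidence_interval(per_item_scores, z=1.96):
    # Leave-one-out jackknife estimate of the mean and an approximate 95% confidence interval
    scores = np.asarray(per_item_scores, dtype=float)
    n = len(scores)
    loo_means = np.array([np.delete(scores, i).mean() for i in range(n)])
    estimate = scores.mean()
    std_err = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
    return estimate, (estimate - z * std_err, estimate + z * std_err)

scores = np.random.rand(30)   # hypothetical per-file evaluation scores
print(jackknife_confidence_interval(scores))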

Keywords

Sound event detection, weak labels, pattern recognition, jackknife estimates, confidence intervals

Cites: 144 (see at Google Scholar)

Sound event envelope estimation in polyphonic mixtures

Abstract

Sound event detection is the task of automatically identifying the presence and temporal boundaries of sound events within an input audio stream. In recent years, deep learning methods have established themselves as the state-of-the-art approach for the task, using binary indicators during training to denote whether an event is active or inactive. However, such binary activity indicators do not fully describe the events, and estimating the envelope of the sounds could provide more precise modeling of their activity. This paper proposes to estimate the amplitude envelopes of target sound event classes in polyphonic mixtures. For training, we use the amplitude envelopes of the target sounds, calculated from mixture signals and, for comparison, from their isolated counterparts. The model is then used to perform envelope estimation and sound event detection. Results show that envelope estimation allows good modeling of the sounds' activity, with detection results comparable to the current state of the art.

Keywords

Sound event detection, Envelope estimation, Deep Neural Networks

Cites: 16 (see at Google Scholar)

PDF

2018

Acoustic Scene Classification: a competition review

Abstract

In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course; both by the students and external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.

Keywords

Acoustic Scene Classification, Data Augmentation, Kaggle, DCASE

Cites: 24 (see at Google Scholar)

PDF

A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data

Cites: 439 (see at Google Scholar)

PDF

Using Sequential Information in Polyphonic Sound Event Detection

Abstract

Detecting the class and the start and end times of sound events in real-world recordings is a challenging task. Current computer systems often show relatively high frame-wise accuracy but low event-wise accuracy. In this paper, we attempt to bridge this gap by explicitly including sequential information to improve the performance of a state-of-the-art polyphonic sound event detection system. We propose to 1) use delayed predictions of event activities as additional input features that are fed back to the neural network; 2) build N-grams to model the co-occurrence probabilities of different events; 3) use sequential loss to train neural networks. Our experiments on a corpus of real world recordings show that the N-grams could smooth the spiky output of a state-of-the-art neural network system, and improve both the frame-wise and the event-wise metrics.

Keywords

Polyphonic sound event detection, language modelling, sequential information

Cites: 8 (see at Google Scholar)

PDF

An Active Learning Method Using Clustering and Committee-Based Sample Selection for Sound Event Classification

Abstract

This paper proposes an active learning method to control a labeling process for efficient annotation of acoustic training material, which is used for training sound event classifiers. The proposed method performs K-medoids clustering over an initially unlabeled dataset, and the medoids, as local representatives, are presented to an annotator for manual annotation. The annotated label on a medoid propagates to other samples in its cluster for label prediction. After annotating the medoids, the annotation continues to the unexamined sounds with mismatched prediction results from two classifiers, a nearest-neighbor classifier and a model-based classifier, both trained with annotated data. The annotation of the segments with mismatched predictions is ordered by the distance to the nearest annotated sample, farthest first. The evaluation is made on a public environmental sound dataset. The labels obtained through a labeling process controlled by the proposed method are used to train a classifier, using supervised learning. Only 20% of the data needs to be manually annotated with the proposed method to achieve the accuracy obtained with all of the data annotated. In addition, the proposed method clearly outperforms other active learning algorithms proposed for sound event classification through all the experiments, simulating varying fractions of manually labeled data.

Keywords

active learning, K-medoids clustering, committee-based sample selection, sound event classification

Cites: 12 (see at Google Scholar)

PDF

Acoustic Scene Classification: An Overview of DCASE 2017 Challenge Entries

Abstract

We present an overview of the challenge entries for the Acoustic Scene Classification task of DCASE 2017 Challenge. Being the most popular task of the challenge, acoustic scene classification entries provide a wide variety of approaches for comparison, with a wide performance gap from top to bottom. Analysis of the submissions confirms once more the popularity of deep-learning approaches and mel-frequency representations. Statistical analysis indicates that the top ranked system performed significantly better than the others, and that combinations of top systems are capable of reaching close to perfect performance on the given data.

Keywords

acoustic scene classification, audio classification, DCASE challenge

Cites: 92 (see at Google Scholar)

PDF

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Abstract

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

Keywords

Acoustics, Event detection, Hidden Markov models, Speech, Speech processing, Tagging, Acoustic scene classification, audio datasets, pattern recognition, sound event detection

Cites: 353 (see at Google Scholar)

The Machine Learning Approach for Analysis of Sound Scenes and Events

Abstract

This chapter explains the basic concepts in computational methods used for analysis of sound scenes and events. Even though the analysis tasks in many applications seem different, the underlying computational methods are typically based on the same principles. We explain the commonalities between analysis tasks such as sound event detection, sound scene classification, or audio tagging. We focus on the machine learning approach, where the sound categories (i.e., classes) to be analyzed are defined in advance. We explain the typical components of an analysis system, including signal pre-processing, feature extraction, and pattern classification. We also present an example system based on multi-label deep neural networks, which has been found to be applicable in many analysis tasks discussed in this book. Finally, we explain the whole processing chain involved in developing computational audio analysis systems.

Cites: 36 (see at Google Scholar)

Datasets and Evaluation

Abstract

Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems' outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.

Cites: 27 (see at Google Scholar)

Environmental noise monitoring using source classification in sensors

Abstract

Environmental noise monitoring systems continuously measure sound levels without assigning these measurements to different noise sources in the acoustic scene, and are therefore incapable of identifying the main noise source. In this paper a feasibility study is presented on a new monitoring concept in which an acoustic pattern classification algorithm running in a wireless sensor is used to automatically assign the measured sound level to different noise sources. A supervised noise source classifier is learned from a small amount of manually annotated recordings, and the learned classifier is used to automatically detect the activity of the target noise source in the presence of interfering noise sources. The sensor is based on an inexpensive credit-card-sized single-board computer with a microphone, associated electronics, and wireless connectivity. The measurement results and the noise source information are transferred from the sensors scattered around the measurement site to a cloud service, and a noise portal is used to visualise the measurements to users. The proposed noise monitoring concept was piloted on a rock crushing site. The system ran reliably over 50 days on site, during which it was able to recognise more than 90% of the noise sources correctly. The pilot study shows that the proposed noise monitoring system can reduce the amount of required human validation of the sound level measurements when the target noise source is clearly defined.

Keywords

Environmental noise monitoring, Acoustic pattern classification, Wireless sensor network, Cloud service

Cites: 135 (see at Google Scholar)

2017

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)

Cites: 10 (see at Google Scholar)

PDF

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

Keywords

Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events

Cites: 566 (see at Google Scholar)

PDF

Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study

Abstract

Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.

Cites: 26 (see at Google Scholar)

PDF

Learning Vocal Mode Classifiers from Heterogeneous Data Sources

Abstract

This paper targets a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. Previous studies on sound classification are commonly based on cross-validation using a single dataset, without considering cases where training and testing data are recorded in mismatched conditions. Experiments using a new dataset, TUT-vocal-2016, revealed a large difference between the homogeneous and heterogeneous recognition scenarios. In the homogeneous recognition scenario, the classification accuracy using cross-validation on TUT-vocal-2016 was 95.5%. In the heterogeneous recognition scenario, where seven existing datasets were used as training material and TUT-vocal-2016 was used for testing, the classification accuracy was only 69.6%. Several feature normalization methods were tested to improve the performance in the heterogeneous recognition scenario. The best performance (96.8%) was obtained using the proposed subdataset-wise normalization.

Cites: 1 (see at Google Scholar)

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Abstract

Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.
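
A minimal CRNN sketch in the spirit of this approach, written in PyTorch; the layer sizes, pooling scheme, and number of classes are assumptions for illustration rather than the configuration evaluated in the paper.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    # Convolutional layers over a mel spectrogram, a recurrent layer over time,
    # and frame-wise sigmoid outputs for multi-label (polyphonic) sound event detection
    def __init__(self, n_mels=40, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                      # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                              # x: (batch, time, n_mels)
        x = self.conv(x.unsqueeze(1))                  # (batch, 64, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)           # (batch, time, 64 * n_mels // 4)
        x, _ = self.gru(x)
        return torch.sigmoid(self.out(x))              # frame-wise class activities

probs = CRNN()(torch.randn(2, 100, 40))               # two clips, 100 frames each
print(probs.shape)                                     # torch.Size([2, 100, 6])

The sigmoid outputs give independent per-class activities for each frame, which is what allows several events to be reported as active simultaneously.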

Cites: 668 (see at Google Scholar)

Active Learning for Sound Event Classification by Clustering Unlabeled Data

Abstract

This paper proposes a novel active learning method to save annotation effort when preparing material to train sound event classifiers. K-medoids clustering is performed on unlabeled sound segments, and the medoids of the clusters are presented to annotators for labeling. The annotated label for a medoid is used to derive predicted labels for the other cluster members. The obtained labels are used to build a classifier using supervised training. The accuracy of the resulting classifier is used to evaluate the performance of the proposed method. The evaluation made on a public environmental sound dataset shows that the proposed method outperforms reference methods (random sampling, certainty-based active learning and semi-supervised learning) with all simulated labeling budgets, i.e., the number of available labeling responses. Through all the experiments, the proposed method saves 50-60% of the labeling budget to achieve the same accuracy as the best reference method.
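
The clustering-and-propagation idea can be sketched roughly as below. Using scikit-learn's KMeans centers with the nearest member as a stand-in for proper K-medoids is my simplification, and the cluster count and labels are placeholders.

import numpy as np
from sklearn.cluster import KMeans

def propose_annotations(features, k=5):
    # Cluster unlabeled segments and return one representative per cluster for manual labeling
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    representatives = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        # the sample closest to the cluster center stands in for the medoid
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        representatives.append(members[np.argmin(dists)])
    return representatives, km.labels_

def propagate_labels(annotated, cluster_labels, representatives):
    # Spread each annotated representative's label to all members of its cluster
    predicted = np.empty(len(cluster_labels), dtype=object)
    for rep, label in zip(representatives, annotated):
        predicted[cluster_labels == cluster_labels[rep]] = label
    return predicted

X = np.random.rand(100, 16)                 # hypothetical segment features
reps, clusters = propose_annotations(X)
labels = propagate_labels(["dog", "car", "rain", "speech", "wind"], clusters, reps)
print(labels[:10])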

Keywords

active learning, sound event classification, K-medoids clustering

Cites: 49 (see at Google Scholar)

2016

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)

Cites: 10 (see at Google Scholar)

PDF

Domestic Audio Tagging with Convolutional Neural Networks

Abstract

In this paper, the method used in our submission for DCASE2016 challenge task 4 (domestic audio tagging) is described. The use of convolutional neural networks (CNN) to label the audio signals recorded in a domestic (home) environment is investigated. A relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed over the development dataset for the challenge.

Cites: 30 (see at Google Scholar)

PDF

Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

Abstract

In this paper, we propose the use of spatial and harmonic features in combination with long short term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making them hard to recognize with just mono channel audio. Human listeners have been successfully recognizing the mixture of overlapping sound events using pitch cues and exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally, SED systems have only been using mono channel audio; motivated by the human listener, we propose to extend them to use multichannel audio. The proposed SED system is compared against the state-of-the-art mono channel method on the development subset of the TUT sound events detection 2016 database. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.

Keywords

Sound event detection, multichannel, time difference of arrival, pitch, recurrent neural networks, long short term memory

Cites: 138 (see at Google Scholar)

PDF

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark the onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and a common ground for comparison of different techniques.

Keywords

audio recording, audio signal processing, Gaussian mixture models, TUT database, acoustic scene classification, binaural recordings, environmental sound research, mel frequency cepstral coefficients, sound event detection, Automobiles, Databases, Europe, Event detection, Mel frequency cepstral coefficient, Signal processing

Cites: 701 (see at Google Scholar)

PDF

Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
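
A compact sketch of segment-based intermediate statistics of the kind discussed here, assuming binary activity matrices per segment; it mirrors the general idea of a segment-based F-score and error rate rather than the exact implementation in the accompanying toolbox.

import numpy as np

def segment_based_metrics(reference, estimated):
    # reference/estimated: binary activity matrices (segments x classes)
    tp = np.sum(reference * estimated)
    fp = np.sum((1 - reference) * estimated)
    fn = np.sum(reference * (1 - estimated))
    # per-segment substitutions, deletions and insertions for the error rate
    seg_fn = np.sum(reference * (1 - estimated), axis=1)
    seg_fp = np.sum((1 - reference) * estimated, axis=1)
    substitutions = np.minimum(seg_fn, seg_fp).sum()
    deletions = np.maximum(0, seg_fn - seg_fp).sum()
    insertions = np.maximum(0, seg_fp - seg_fn).sum()
    n_ref = reference.sum()
    f_score = 2 * tp / max(2 * tp + fp + fn, 1)
    error_rate = (substitutions + deletions + insertions) / max(n_ref, 1)
    return f_score, error_rate

ref = np.array([[1, 0, 1], [0, 1, 0]])   # two one-second segments, three classes
est = np.array([[1, 1, 0], [0, 1, 0]])
print(segment_based_metrics(ref, est))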

Cites: 647 (see at Google Scholar)

2015

Automatic recognition of environmental sound events using all-pole group delay features

Abstract

A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account merely the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. Additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information, which has shown applicability for analysis of speech and music signals and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. It shows performance improvement compared to the baseline log mel-band energy case. In combination with the magnitude-based features, APGD demonstrates further improvement.
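
A rough sketch of deriving a group-delay representation from an all-pole model of a single frame, using librosa's LPC estimate and SciPy's group delay; the window, model order, and test signal are illustrative assumptions, not the feature extraction used in the paper.

import numpy as np
import librosa
import scipy.signal

def all_pole_group_delay(frame, lpc_order=20, n_points=256):
    # Group delay of the all-pole (LPC) model fitted to one windowed audio frame
    a = librosa.lpc(frame.astype(float), order=lpc_order)     # coefficients [1, a1, ..., ap]
    _, gd = scipy.signal.group_delay(([1.0], a), w=n_points)
    return gd                                                  # group delay sampled at n_points frequencies

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(1024) / sr
frame = (np.sin(2 * np.pi * 440 * t) + 0.05 * rng.standard_normal(1024)) * np.hanning(1024)
print(all_pole_group_delay(frame).shape)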

Cites: 15 (see at Google Scholar)

PDF

Multi-Label vs. Combined Single-Label Sound Event Detection With Deep Neural Networks

Abstract

In real-life audio scenes, many sound events from different sources are simultaneously active, which makes the automatic sound event detection challenging. In this paper, we compare two different deep learning methods for the detection of environmental sound events: combined single-label classification and multi-label classification. We investigate the accuracy of both methods on the audio with different levels of polyphony. Multi-label classification achieves an overall 62.8% accuracy, whereas combined single-label classification achieves a very close 61.9% accuracy. The latter approach offers more flexibility on real-world applications by gathering the relevant group of sound events in a single classifier with various combinations.

Cites: 58 (see at Google Scholar)

Polyphonic Sound Event Detection Using Multi Label Deep Neural Networks

Abstract

In this paper, the use of multi-label neural networks is proposed for the detection of temporally overlapping sound events in realistic environments. Real-life sound recordings typically have many overlapping sound events, making it hard to recognize each event with the standard sound event detection methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi-label classification in this work. The model is evaluated with recordings from realistic everyday environments and the obtained overall accuracy is 58.9%. The method is compared against a state-of-the-art method using non-negative matrix factorization as a pre-processing stage and hidden Markov models as a classifier. The proposed method improves the accuracy by 19 percentage points overall.

Cites: 361 (see at Google Scholar)

PDF

Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations

Abstract

Methods for the detection of overlapping sound events in audio involve matrix factorization approaches, often assigning separated components to event classes. We present a method that bypasses the supervised construction of class models. The method learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix. In testing, the dictionaries are used to estimate directly the class activations. For dealing with large amounts of training data, two methods are proposed for reducing the size of the dictionary. The methods were tested on a database of real life recordings, and outperformed previous approaches by over 10%.
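
A small sketch of the coupled factorization idea, in which the spectrogram and the class-activity annotation share one activation matrix; the multiplicative updates below are plain Euclidean NMF rules applied to the stacked matrix, a simplification of mine rather than the paper's exact formulation.

import numpy as np

def coupled_nmf(spectrogram, annotations, n_components=20, n_iter=200, eps=1e-9):
    # Factorize [spectrogram; annotations] ~ [W_spec; W_class] @ H with a shared activation matrix H
    V = np.vstack([spectrogram, annotations])          # (freq + classes) x frames
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)           # multiplicative updates, Euclidean cost
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    n_freq = spectrogram.shape[0]
    return W[:n_freq], W[n_freq:], H                   # spectral dictionary, class dictionary, activations

spec = np.abs(np.random.randn(64, 200))                   # hypothetical magnitude spectrogram
annot = np.random.randint(0, 2, (5, 200)).astype(float)   # class activity annotation, 5 classes
W_spec, W_class, H = coupled_nmf(spec, annot)
estimated_activity = W_class @ H                           # class activity reconstructed from shared H

At test time one would keep the learned dictionaries fixed, estimate H from the new spectrogram alone, and read the class activity from W_class @ H.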

Keywords

coupled non-negative matrix factorization, non-negative dictionaries, sound event detection

Cites: 135 (see at Google Scholar)

PDF

2014

Group Delay Function from All-Pole Models for Musical Instrument Recognition

Abstract

In this work, the feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information, which avoids the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown applicability for speech-related problems, it is now explored in terms of instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline with all the evaluated sets.

Keywords

Musical instrument recognition, music information retrieval, all-pole group delay feature, phase spectrum

Cites: 7 (see at Google Scholar)

Method for creating location-specific audio textures

Abstract

An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need to collect a large amount of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that the consecutive segments will have timbral similarity near the segment boundaries. Results obtained in listening experiments show that there is no statistically significant difference in the audio quality or location-specificity of audio when the created audio textures are compared to excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and audio textures created from them are available at www.cs.tut.fi/~heittolt/audiotexture.

Cites: 10 (see at Google Scholar)

2013

Sound Event Detection for Office Live and Office Synthetic AASP Challenge

Abstract

We present a sound event detection system based on hidden Markov models. The system is evaluated with development material provided in the AASP Challenge on Detection and Classification of Acoustic Scenes and Events. Two approaches using the same basic detection scheme are presented. The first one, developed for acoustic scenes with non-overlapping sound events, is evaluated with the Office Live development dataset. The second one, developed for acoustic scenes with some degree of overlapping sound events, is evaluated with the Office Synthetic development dataset.

Keywords

Sound event detection

Cites: 45 (see at Google Scholar)

PDF

Modified Group Delay Feature for Musical Instrument Recognition

Abstract

In this work, the modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features used in instrument recognition take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The MODGDF is a method of incorporating phase information which avoids the issues related to phase unwrapping. Having shown its applicability for speech-related problems, it is now explored in terms of musical instrument recognition. The evaluation is performed on separate note recordings in various instrument sets, and combined with the conventional mel frequency cepstral coefficients (MFCCs), MODGDF shows noteworthy absolute accuracy gains of up to 5.1% compared to the baseline MFCCs case.

Keywords

Musical instrument recognition; music information retrieval; modified group delay feature; phase spectrum

Cites: 27 (see at Google Scholar)

PDF

Semi-supervised Learning for Musical Instrument Recognition

Abstract

In this work, semi-supervised learning (SSL) techniques are explored in the context of musical instrument recognition. Conventional supervised approaches normally rely on annotated data to train the classifier, which implies performing costly manual annotation of the training data. SSL methods enable utilising additional unannotated data, which is significantly easier to obtain, allowing the overall development cost to be maintained at the same level while notably improving the performance. The implemented classifier incorporates a Gaussian mixture model-based SSL scheme utilising the iterative EM-based algorithm, as well as extensions facilitating simpler convergence criteria. The evaluation is performed on a set of nine instruments, training on a dataset in which the relative size of the labelled data is as little as 15%. It yields a noteworthy absolute performance gain of 16% compared to the performance of the initial supervised models.

Keywords

Music information retrieval; musical instrument recognition; semi-supervised learning

Cites: 22 (see at Google Scholar)

PDF

Query-by-example retrieval of sound events using an integrated similarity measure of content and label

Abstract

This paper presents a method for combining audio similarity and semantic similarity into a single similarity measure for query-by-example retrieval. The integrated similarity measure is used to retrieve sound events that are similar in content to the given query and have labels containing similar words. Through the semantic component, the method is able to handle variability in the labels of sound events. Through the acoustic component, the method retrieves acoustically similar examples. On a test database of over 3000 sound event examples, the proposed method obtains a better retrieval performance than audio-based retrieval, and returns results acoustically closer to the query than label-based retrieval.
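
The integrated measure can be sketched as a weighted combination of an acoustic similarity and a label similarity; the distance, the toy word-similarity lexicon, and the weight alpha below are illustrative assumptions, not the paper's exact measure.

import numpy as np

def integrated_similarity(query_feat, cand_feat, query_label, cand_label, word_sim, alpha=0.5):
    # Combine acoustic similarity and label (semantic) similarity into one retrieval score
    acoustic_sim = 1.0 / (1.0 + np.linalg.norm(np.asarray(query_feat) - np.asarray(cand_feat)))
    # semantic part: best word-to-word similarity between the two labels
    semantic_sim = max(word_sim.get((qw, cw), 0.0)
                       for qw in query_label.split() for cw in cand_label.split())
    return alpha * acoustic_sim + (1 - alpha) * semantic_sim

word_sim = {("dog", "dog"): 1.0, ("dog", "bark"): 0.7, ("barking", "bark"): 0.9}   # toy lexicon
score = integrated_similarity([0.1, 0.4], [0.2, 0.3], "dog barking", "dog bark", word_sim)
print(score)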

Keywords

audio signal processing, content-based retrieval, semantic networks, acoustic component, audio similarity, integrated similarity measure, label-based retrieval, query-by-example retrieval, semantic similarity, sound events

Cites: 7 (see at Google Scholar)

PDF

Supervised Model Training for Overlapping Sound Events Based on Unsupervised Source Separation

Abstract

Sound event detection is addressed in the presence of overlapping sounds. Unsupervised sound source separation into streams is used as a preprocessing step to minimize the interference of overlapping events. This poses a problem in supervised model training, since there is no knowledge about which separated stream contains the targeted sound source. We propose two iterative approaches based on the EM algorithm to select the most likely stream to contain the target sound: one always selecting the most likely stream, and another gradually eliminating the most unlikely streams from the training. The approaches were evaluated with a database containing recordings from various contexts, against a baseline system trained without applying stream selection. Both proposed approaches were found to give a reasonable increase of 8 percentage points in the detection accuracy.

Keywords

acoustic event detection, acoustic pattern recognition, sound source separation, supervised model training

Cites: 93 (see at Google Scholar)

PDF

Analysis of Acoustic-Semantic Relationship for Diversely Annotated Real-World Audio Data

Abstract

A common problem of freely annotated or user-contributed audio databases is the high variability of the labels, related to homonyms, synonyms, plurals, etc. Automatically re-labeling audio data based on audio similarity could offer a solution to this problem. This paper studies the relationship between audio and labels in a sound event database by evaluating the semantic similarity of labels of acoustically similar sound event instances. The assumption behind the study is that acoustically similar events are annotated with semantically similar labels. Indeed, for 43% of the tested data, at least one of the ten acoustically nearest neighbors had a synonym as its label, while the closest related term is on average one level higher or lower in the semantic hierarchy.

Keywords

audio similarity;semantic similarity;sound events

Cites: 7 (see at Google Scholar)

PDF

Context-Dependent Sound Event Detection

Abstract

The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out events that are unlikely in a given context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance at various levels of polyphony. It combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
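
A minimal two-stage sketch of this idea is shown below, with frame-wise GMM classification standing in for the HMM/Viterbi decoding described in the article; the dictionaries of fitted models and the per-context event sets are assumed to exist elsewhere.

# Two-stage sketch: recognise the context, then detect events using only the
# event models allowed in that context (frame-wise GMM stand-in for Viterbi decoding).
import numpy as np
from sklearn.mixture import GaussianMixture

def recognise_context(frames, context_gmms):
    """context_gmms: dict mapping context name -> fitted GaussianMixture."""
    return max(context_gmms, key=lambda c: context_gmms[c].score(frames))

def detect_events(frames, context, event_gmms, context_event_sets):
    """Return one event label per frame, restricted to the recognised context's event set."""
    allowed = context_event_sets[context]          # list of event names for this context
    scores = np.stack([event_gmms[e].score_samples(frames) for e in allowed], axis=1)
    return [allowed[i] for i in scores.argmax(axis=1)]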

Cites: 285 (see at Google Scholar)

On the human ability to discriminate audio ambiances from similar locations of an urban environment

Abstract

When developing advanced location-based systems augmented with audio ambiances, it would be cost-effective to use a few representative samples from typical environments for describing a larger number of similar locations. The aim of this experiment was to study the human ability to discriminate audio ambiances recorded in similar locations of the same urban environment. A listening experiment consisting of material from three different environments and nine different locations was carried out with nineteen subjects to study the credibility of audio representations for certain environments, which would diminish the need for collecting huge audio databases. The first goal was to study to what degree humans are able to recognize whether the recording has been made in an indicated location or in another similar location, when presented with the name of the place, the location on a map, and the associated audio ambiance. The second goal was to study whether the ability to discriminate audio ambiances from different locations is affected by a visual cue, by presenting additional information in the form of a photograph of the suggested location. The results indicate that audio ambiances from similar urban areas of the same city differ enough that it is not acceptable to use a single recording as the ambiance representing different yet similar locations. Including an image was found to increase the perceived credibility of all the audio samples in representing a certain location. The results suggest that developers of audio-augmented location-based systems should aim at using audio samples recorded on-site for each location in order to achieve a credible impression.

Keywords

Listening experiment; Location recognition; Audio-visual perception; Audio ambiance

Cites: 2 (see at Google Scholar)

PDF

2012

Method and apparatus for providing media event suggestions

Abstract

Various methods are described for providing media event suggestions based at least in part on a co-occurrence model. One example method may comprise receiving a selection of at least one media event to include in a media composition. Additionally, the method may comprise determining at least one suggested media event based at least in part on the at least one media event. The method may further comprise causing display of the at least one suggested media event. Similar and related methods, apparatuses, and computer program products are also provided.

Method and apparatus for generating an audio summary of a location

Abstract

Various methods are described for generating an audio summary representing a location on a place exploration service. One example method may comprise receiving at least one audio file. The method may further comprise dividing the at least one audio file into one or more audio segments. Additionally, the method may comprise determining a representative audio segment for each of the one or more audio segments. The method may further comprise generating an audio summary of the at least one audio file by combining one or more of the representative audio segments of the one or more audio segments. Similar and related methods, apparatuses, and computer program products are also provided.

Cites: 12 (see at Google Scholar)

Automatic Scoring of Guitar Chords

Abstract

This paper describes a novel approach for detecting the correctness of musical chords played on guitar. The approach is based on a pattern matching technique applied to a database of chords and their typical mistakes, played with multiple guitars. The spectrum of the chord is whitened, and a certain region is selected as a feature vector. The cosine distance is calculated between the chord to be tested and a reference chord database, and chord detection is done based on the minimum distance. The proposed system is evaluated with isolated chords under different noise conditions. The system shows approximately 77% accuracy in scoring the correctness of played chords with a medium-sized database.
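
A rough Python sketch of the matching step is given below; the whitening method, frequency band and FFT size are illustrative choices rather than the parameters used in the paper.

# Whiten the magnitude spectrum, keep a fixed frequency region as the feature
# vector, and pick the reference chord with the smallest cosine distance.
import numpy as np

def chord_feature(signal, sr, f_lo=80.0, f_hi=1000.0, n_fft=8192):
    spec = np.abs(np.fft.rfft(signal, n_fft))
    # Simple spectral whitening: divide by a smoothed version of the spectrum.
    envelope = np.convolve(spec, np.ones(64) / 64, mode="same") + 1e-9
    whitened = spec / envelope
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return whitened[band]

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def score_chord(test_feat, reference_feats):
    """reference_feats: dict mapping chord/mistake label -> feature vector."""
    return min(reference_feats, key=lambda k: cosine_distance(test_feat, reference_feats[k]))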

Cites: 1 (see at Google Scholar)

2011

Sound Event Detection in Multisource Environments Using Source Separation

Abstract

This paper proposes a sound event detection system for natural multisource environments, using a sound source separation front-end. The recognizer aims at detecting sound events from various everyday contexts. The audio is preprocessed using non-negative matrix factorization and separated into four individual signals. Each sound event class is represented by a hidden Markov model trained using mel-frequency cepstral coefficients extracted from the audio. Each separated signal is used individually for feature extraction, followed by segmentation and classification of sound events using the Viterbi algorithm. The separation allows detection of a maximum of four overlapping events. The proposed system shows a significant increase in event detection accuracy compared to a system able to output a single sequence of events.
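
The separation front-end can be sketched roughly as follows, factoring the magnitude spectrogram into four components with NMF and reconstructing four time-domain streams with soft masks; the paper's exact NMF variant and the HMM back-end are not reproduced here.

# Sketch of an NMF separation front-end producing four streams.
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def separate_streams(x, sr, n_streams=4, n_fft=1024):
    f, t, X = stft(x, fs=sr, nperseg=n_fft)
    mag, phase = np.abs(X), np.angle(X)
    nmf = NMF(n_components=n_streams, init="random", random_state=0, max_iter=400)
    W = nmf.fit_transform(mag)          # spectral bases: (freq, n_streams)
    H = nmf.components_                 # activations:   (n_streams, time)
    streams = []
    for k in range(n_streams):
        mask = np.outer(W[:, k], H[k]) / (W @ H + 1e-9)   # soft Wiener-like mask
        _, xk = istft(mag * mask * np.exp(1j * phase), fs=sr, nperseg=n_fft)
        streams.append(xk)
    return streams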

Cites: 161 (see at Google Scholar)

PDF Slides

Latent Semantic Analysis in Sound Event Detection

Abstract

This paper presents the use of probabilistic latent semantic analysis (PLSA) for modeling co-occurrence of overlapping sound events in audio recordings from everyday audio environments such as office, street or shop. Co-occurrence of events is represented as the degree of their overlapping in a fixed length segment of polyphonic audio. In the training stage, PLSA is used to learn the relationships between individual events. In detection, the PLSA model continuously adjusts the probabilities of events according to the history of events detected so far. The event probabilities provided by the model are integrated into a sound event detection system that outputs a monophonic sequence of events. The model offers a very good representation of the data, having low perplexity on test recordings. Using PLSA for estimating prior probabilities of events provides an increase of event detection accuracy to 35%, compared to 30% for using uniform priors for the events. There are different levels of performance increase in different audio contexts, with few contexts showing significant improvement.
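
For reference, a compact PLSA estimator over an event co-occurrence count matrix can be sketched as below; N is a hypothetical segments-by-events count matrix, and the number of topics, iterations and initialisation are arbitrary illustrative choices.

# Compact PLSA (EM) sketch on an event co-occurrence count matrix.
import numpy as np

def plsa(N, n_topics=8, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n_seg, n_events = N.shape
    p_w_z = rng.random((n_topics, n_events)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_seg, n_topics));    p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (segments, events, topics)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(2, keepdims=True) + 1e-12)
        weighted = N[:, :, None] * resp
        # M-step: re-estimate P(w|z) and P(z|d) from the weighted counts
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d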

Keywords

sound event detection, latent semantic analysis

Cites: 66 (see at Google Scholar)

PDF

Sound Event Detection and Context Recognition

Keywords

sound event detection, context recognition

Cites: 3 (see at Google Scholar)

PDF

2010

Audio Context Recognition Using Audio Event Histograms

Abstract

This paper presents a method for audio context recognition, meaning classification between everyday environments. The method is based on representing each audio context using a histogram of audio events which are detected using a supervised classifier. In the training stage, each context is modeled with a histogram estimated from annotated training data. In the testing stage, individual sound events are detected in the unknown recording and a histogram of the sound event occurrences is built. Context recognition is performed by computing the cosine distance between this histogram and event histograms of each context from the training database. Term frequency--inverse document frequency weighting is studied for controlling the importance of different events in the histogram distance calculation. An average classification accuracy of 89% is obtained in the recognition between ten everyday contexts. Combining the event based context recognition system with more conventional audio based recognition increases the recognition rate to 92%.
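
A minimal sketch of the histogram-based recognition step is given below; the event detector producing the labels, and the construction of the per-context reference histograms, are assumed to exist elsewhere.

# Represent a recording as a TF-IDF-weighted event histogram and classify it by
# cosine distance to per-context histograms built from training data.
import numpy as np

def event_histogram(event_labels, vocab):
    """Count occurrences of each event in `vocab` in one recording's detected labels."""
    return np.array([event_labels.count(e) for e in vocab], dtype=float)

def tfidf(hist, doc_freq, n_docs):
    tf = hist / max(hist.sum(), 1.0)
    idf = np.log((n_docs + 1.0) / (doc_freq + 1.0))
    return tf * idf

def recognise_context(test_hist, context_hists):
    """context_hists: dict mapping context name -> (already weighted) histogram."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return min(context_hists, key=lambda c: cos_dist(test_hist, context_hists[c]))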

Cites: 105 (see at Google Scholar)

PDF

Acoustic Event Detection in Real-life Recordings

Abstract

This paper presents a system for acoustic event detection in recordings from real-life environments. The events are modeled using a network of hidden Markov models; their size and topology are chosen based on a study of isolated event recognition. We also studied the effect of ambient background noise on event classification performance. On real-life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0 dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct for 84% of the time.

Cites: 378 (see at Google Scholar)

PDF

Sound Source Separation in Monaural Music Signals Using Excitation-Filter Model and EM Algorithm

Abstract

This paper proposes a method for separating the signals of individual musical instruments from monaural musical audio. The mixture signal is modeled as a sum of the spectra of individual musical sounds which are further represented as a product of excitations and filters. The excitations are restricted to harmonic spectra and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on Mel-frequency scale. A novel expectation-maximization (EM) algorithm is proposed which jointly learns the filter responses and organizes the excitations (musical notes) to filters (instruments). In simulations, the method achieved over 5 dB SNR improvement compared to the mixture signals when separating two or three musical instruments from each other. A slight further improvement was achieved by utilizing musical properties in the initialization of the algorithm.

Keywords

Sound source separation, excitation-filter model, maximum likelihood estimation, expectation maximization

Cites: 41 (see at Google Scholar)

PDF

2009

Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation

Abstract

This paper proposes a novel approach to musical instrument recognition in polyphonic audio signals by using a source-filter model and an augmented non-negative matrix factorization algorithm for sound separation. The mixture signal is decomposed into a sum of spectral bases modeled as a product of excitations and filters. The excitations are restricted to harmonic spectra and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on the Mel-frequency scale. The pitch and timbre information are used in organizing individual notes into sound sources. In the recognition, Mel-frequency cepstral coefficients are used to represent the coarse shape of the power spectrum of sound sources and Gaussian mixture models are used to model instrument-conditional densities of the extracted features. The method is evaluated with polyphonic signals, randomly generated from 19 instrument classes. The recognition rate for signals having six note polyphony reaches 59%.

Keywords

Sound source separation, excitation-filter model

Awards: Best paper award

Cites: 185 (see at Google Scholar)

PDF

Interpolating hidden Markov model and its application to automatic instrument recognition

Abstract

This paper proposes an interpolating extension to hidden Markov models (HMMs), which allows more accurate modeling of natural sound sources. The model is able to produce observations from distributions which are interpolated between discrete HMM states. The model uses Gaussian mixture state emission densities, and the interpolation is implemented by introducing interpolating states in which the mixture weights, means, and variances are interpolated from the discrete HMM state densities. We propose an algorithm extended from the Baum-Welch algorithm for estimating the parameters of the interpolating model. The model was evaluated in an automatic instrument classification task, where it produced systematically better recognition accuracy than a baseline HMM recognition algorithm.

Keywords

Hidden Markov models, acoustic signal processing, musical instruments, pattern classification

Cites: 5 (see at Google Scholar)

PDF

Azimuth Estimation in Polyphonic Music

Abstract

Most of the research in music information retrieval (MIR) has used monophonic source signals, i.e. ignoring stereo information. However, commercially available music recordings typically consist of a two-track stereo mix. The type of mixing process used in the recordings can be roughly categorized into live recordings and studio recordings. In live recordings, all musical instruments are usually recorded on a single stereo track using a stereophonic microphone setup. Listeners localize sounds mainly based on time differences between the left and right channels, using the interaural time difference (ITD). In studio recordings, each musical instrument is recorded on a separate mono or stereo track. In the final mixing stage, audio effects (e.g. reverberation) can be added artificially. Virtual sound localization at any point between the left and right channels is achieved by using appropriate amplitudes for the left and right channels while mixing down tracks to a two-track stereo mix. The amplitude difference between channels is used to simulate the interaural intensity difference (IID) by attenuating one channel, causing the sound to be localized more in the opposite channel. The phase of a source is coherent between the channels. By assuming this mixing model, we can perform horizontal angle (azimuth) estimation for music signals. Azimuth information can be utilized in different applications of music information retrieval, such as musical instrument recognition and note streaming. In musical instrument recognition with polyphonic notes, the signal-to-noise ratio can be improved with beamforming in the feature extraction stage. Azimuth information can also be utilized in the note streaming of polyphonic audio, where notes can be grouped together based on pitch, timbre and azimuth.
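
Under this amplitude-panning assumption, a simple azimuth analysis can be sketched as follows: per time-frequency bin, the left/right magnitude ratio gives a panning coefficient, and an energy-weighted histogram of these coefficients shows where sources sit on the stereo stage. The linear mapping to degrees is a simplification for illustration only.

# Panning-coefficient histogram from a stereo signal under an amplitude-panning model.
import numpy as np
from scipy.signal import stft

def panning_histogram(left, right, sr, n_fft=2048, n_bins=41):
    _, _, L = stft(left, fs=sr, nperseg=n_fft)
    _, _, R = stft(right, fs=sr, nperseg=n_fft)
    mag_l, mag_r = np.abs(L), np.abs(R)
    # Panning coefficient in [-1, 1]: -1 = fully left, +1 = fully right.
    pan = (mag_r - mag_l) / (mag_r + mag_l + 1e-12)
    energy = (mag_l ** 2 + mag_r ** 2).ravel()
    hist, edges = np.histogram(pan.ravel(), bins=n_bins, range=(-1, 1), weights=energy)
    azimuth_deg = 45.0 * (edges[:-1] + edges[1:]) / 2   # crude mapping to +/-45 degrees
    return azimuth_deg, hist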

PDF

2008

Musical Instrument Recognition in Polyphonic Music

Abstract

Understanding the timbre and pitch of musical instruments is an important issue for automatic music transcription, music information retrieval and computational auditory scene analysis. In particular, the recent worldwide popularization of online music distribution services and portable digital music players makes musical instrument recognition even more important. Musical instruments are one of the main criteria (besides musical genre) that can be used to search for a certain type of music in music databases. Some classical music is even characterized by the musical instruments used (e.g. piano sonata and string quartet). The purpose of the research is to develop mathematical models for sound sources and apply these in the automatic analysis and coding of polyphonic music. Target signals are musical signals and, in limited cases, also speech signals. The redundant frequency information of harmonic sounds will be used in the newly developed models. The developed modeling schemes will be tested in two applications: musical instrument recognition in polyphonic music and music transcription.

PDF

TUT Acoustic Event Detection System 2007

Abstract

This paper describes a system used in the acoustic event detection task of the CLEAR 2007 evaluation. The objective of the task is to detect acoustic events (door slam, steps, paper wrapping, etc.) using acoustic data from a multiple-microphone setup in a meeting room environment. A system based on hidden Markov models and multi-channel audio data was implemented. Mel-frequency cepstral coefficients are used to represent the power spectrum of the acoustic signal. Fully-connected three-state hidden Markov models are trained for 12 acoustic events, and one-state models are trained for speech, silence, and unknown events.

Keywords

Hidden Markov models, acoustic signal processing, musical instruments, pattern classification

Cites: 17 (see at Google Scholar)

PDF

2004

Automatic Classification of Music Signals

Abstract

Collections of digital music have become increasingly common over recent years. As the amount of data increases, digital content management is becoming more important. In this thesis, we are studying content-based classification of acoustic musical signals according to their musical genre (e.g., classical, rock) and the instruments used. A listening experiment is conducted to study human abilities to recognise musical genres. This thesis covers a literature review on human musical genre recognition, state-of-the-art musical genre recognition systems, and related fields of research. In addition, a general-purpose music database consisting of recordings and their manual annotations is introduced. The theory behind the features and classifiers used is reviewed, and the results from the simulations are presented. The developed musical genre recognition system uses mel-frequency cepstral coefficients to represent the time-varying magnitude spectrum of a music signal. The class-conditional feature densities are modelled with hidden Markov models. Musical instrument detection for a few pitched instruments in music signals is also studied using the same structure. Furthermore, this thesis proposes a method for the detection of drum instruments. The presence of drums is determined based on the periodicity of the amplitude envelopes of the signal at subbands. The conducted listening experiment shows that the recognition of musical genres is not a trivial task even for humans. On average, humans are able to recognise the correct genre in 75% of cases (given five-second samples). Results also indicate that humans can do rather accurate musical genre recognition without long-term temporal features, such as rhythm. For the developed automatic recognition system, the obtained recognition accuracy for six musical genres was around 60%, which is comparable to state-of-the-art systems. A detection accuracy of 81% was obtained with the proposed drum instrument detection method.

Cites: 30 (see at Google Scholar)

PDF

2003

Discriminative Training of Unsupervised Acoustic Models for Non-speech Audio

Abstract

This paper studies acoustic modeling of non-speech audio using hidden Markov models. Simulation results are presented in two different application areas: audio-based context awareness and music classification, the latter focusing on the recognition of musical genres and instruments. Two training methods are evaluated: conventional maximum likelihood estimation using the Baum-Welch algorithm, and discriminative training, which is expected to improve the recognition accuracy of models with a small number of component densities in the state distributions. Our approach is unsupervised in the sense that we do not know the underlying acoustic classes that are modeled with different HMM states. In addition to reporting the achieved recognition results, analyses are made to study which properties of the sound signals are captured by the states.

Cites: 1 (see at Google Scholar)

PDF

2002

Locating Segments with Drums in Music Signals

Abstract

A system is described which segments musical signals according to the presence or absence of drum instruments. Two different yet approximately equally accurate approaches were taken to solve the problem. The first is based on periodicity detection in the amplitude envelopes of the signal at subbands. The band-wise periodicity estimates are aggregated into a summary autocorrelation function, the characteristics of which reveal the drums. The other mechanism applies straightforward acoustic pattern recognition with mel-frequency cepstral coefficients as features and a Gaussian mixture model classifier. The integrated system achieves 88% correct segmentation over a database of 28 hours of music from different musical genres. For both methods, errors occur in borderline cases with soft percussive-like drum accompaniment, or transient-like instrumentation without drums.
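
The periodicity-based branch can be sketched roughly as below: band-wise amplitude envelopes are autocorrelated and summed into a summary autocorrelation function, and a prominent non-zero-lag peak suggests the presence of drums. The band edges, envelope computation and decision threshold are illustrative assumptions, not the settings of the paper.

# Band-wise envelope periodicity and summary autocorrelation for drum detection.
import numpy as np
from scipy.signal import butter, sosfilt

def band_envelope(x, sr, f_lo, f_hi, frame=512):
    sos = butter(4, [f_lo, f_hi], btype="band", fs=sr, output="sos")
    env = np.abs(sosfilt(sos, x))
    # Downsample the envelope by taking frame-wise means.
    n = len(env) // frame
    return env[:n * frame].reshape(n, frame).mean(axis=1)

def summary_autocorrelation(x, sr, bands=((50, 200), (200, 800), (800, 3200), (3200, 8000))):
    sacf = None
    for f_lo, f_hi in bands:
        env = band_envelope(x, sr, f_lo, f_hi)
        env = env - env.mean()
        acf = np.correlate(env, env, mode="full")[len(env) - 1:]
        acf /= acf[0] + 1e-12
        sacf = acf if sacf is None else sacf + acf
    return sacf

def has_drums(x, sr, threshold=1.0):
    sacf = summary_autocorrelation(x, sr)
    # Ignore the zero-lag region; a prominent later peak indicates periodic percussive energy.
    return sacf[8:].max() > threshold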

Cites: 27 (see at Google Scholar)

PDF

Updated: 2024-02-15

Note regarding IEEE copyrighted material on this page
The material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.