Technical program

Workshop on Detection and Classification of Acoustic Scenes and Events
16 - 17 November 2017, Munich, Germany

Day 1

Thursday 16.11.2017, 9:00 - 18:00

8:45 Registration

8:45 Coffee

Welcome coffee

9:10 Welcome

Annamaria Mesaros
Tampere University of Technology, Finland

9:20 Keynote

Session chair Sacha Krstulović

General-Purpose Sound Event Recognition

Shawn Hershey
Google Research

Abstract

Inspired by the success of general-purpose object recognition in images, we have been working on automatic, real-time systems for recognizing sound events regardless of domain. Our goal is a system that can tag or describe an arbitrary soundtrack - as might be found on a media sharing site like YouTube - using terms that make sense to a human. I will cover the process of defining this task, our deep learning approach, our efforts to collect training data, and our current results. I'll discuss some factors important for accurate models, and some ideas about how to get the best return from manual labeling investment.

Biography

Shawn Hershey is a software engineer at Google Research, working in the Machine Hearing Group on machine learning for speech and audio processing. He is currently working on soundtrack classification and audio event detection. Before Google he worked as the first Software Engineer at Lyric Semiconductors, building tools to aid the development of hardware accelerators for AI. On the side, Shawn travels the world teaching Lindy Hop and blues dancing and playing in swing and blues bands. Long ago Shawn graduated from the University of Rochester with a BA in Computer Science and half of a degree from the Eastman School of Music.

10:10 Break

Coffee break

10:30 Presentations

Oral Session I

Session chair Axel Plinge

10:30

DCASE2017 Challenge Summary

Tuomas Virtanen
Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland

DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System

Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering, & Department of Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France

Abstract

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

Keywords

Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels

PDF
11:00

Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane

Seongkyu Mun1, Sangwook Park1, David Han2 and Hanseok Ko1
1Intelligent Signal Processing Laboratory, Korea University, Seoul, South Korea, 2Office of Naval Research, Arlington, VA, USA

Abstract

Although it is typically expected that using a large amount of labeled training data would lead to improved performance in deep learning, it is generally difficult to obtain such a database (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are constrained to use a relatively small DB as a rule, which is similar to the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing an additional DB, this paper proposes a Generative Adversarial Network (GAN) based method for generating additional training DB. Since it is not clear whether every sample generated by the GAN would have equal impact on classification performance, this paper proposes to use the Support Vector Machine (SVM) hyper-plane for each class as a reference for selecting samples that carry class-discriminative information. Based on cross-validated experiments on the development DB, the usage of the generated features could improve ASC performance.

Keywords

acoustic scene classification, generative adversarial networks, support vector machine, data augmentation, decision hyper-plane

PDF
11:20

Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input

Donmoon Lee1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea

Abstract

In this paper, we use an ensemble of convolutional neural network models with various analysis windows to detect audio events in the automotive environment. When detecting the presence of audio events, a global-input model that uses the entire audio clip works better. On the other hand, segmented-input models work better at finding the accurate position of the event. Experimental results for weakly-labeled audio data confirm the performance trade-off between the two tasks, depending on the length of input audio. By combining the predictions of various models, the proposed system achieved 0.4762 in the clip-based F1-score and 0.7167 in the segment-based error rate.

PDF
11:40

Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network

Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

Abstract

This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong label followed by one for the weak label. The network is trained using frame-wise log mel-band energy as the input audio feature, and weak labels provided in the dataset as labels for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature, and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by weighting the loss computed in the two prediction layers differently. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves the best error rate of 0.84 for strong labels and F-score of 43.3% for weak labels on the unseen test split.

Keywords

sound event detection, weak labels, deep neural network, CNN, GRU

PDF
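
The label replication and loss weighting described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of binary cross-entropy, and the `strong_weight` / `weak_weight` parameters are assumptions made for illustration only.

```python
import math


def replicate_weak_labels(weak_labels, num_frames):
    """Copy the clip-level (weak) label vector once per frame (pseudo strong labels)."""
    return [list(weak_labels) for _ in range(num_frames)]


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy over flat lists of targets and predictions."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def combined_loss(weak_labels, strong_pred, weak_pred,
                  strong_weight=1.0, weak_weight=1.0):
    """Weighted sum of the frame-level (strong) and clip-level (weak) losses.

    strong_pred: one list of class probabilities per frame.
    weak_pred:   one list of class probabilities for the whole clip.
    The two weights are hypothetical knobs standing in for the loss weighting
    mentioned in the abstract.
    """
    pseudo_strong = replicate_weak_labels(weak_labels, len(strong_pred))
    flat_targets = [t for frame in pseudo_strong for t in frame]
    flat_preds = [p for frame in strong_pred for p in frame]
    strong_loss = binary_cross_entropy(flat_targets, flat_preds)
    weak_loss = binary_cross_entropy(weak_labels, weak_pred)
    return strong_weight * strong_loss + weak_weight * weak_loss


if __name__ == "__main__":
    weak = [1, 0]                        # class 0 present somewhere in the clip
    frames = [[0.9, 0.1], [0.2, 0.1]]    # frame-level predictions
    clip = [0.8, 0.1]                    # clip-level prediction
    print(combined_loss(weak, frames, clip, strong_weight=0.5, weak_weight=1.0))
```

Lowering `strong_weight` in this sketch reduces the penalty for frames where a tagged event is predicted as absent, which is one way to read the abstract's remark about controlling what the network learns from each label type.
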
12:00

Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study

Christian Kroos and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Abstract

Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER error metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.

Keywords

Sound event detection, neuroevolution, NEAT, deep scattering transform, wavelets, clustering, co-evolution

PDF
12:30 Break

Lunch

14:00 Presentations

Oral Session II

Session chair Romain Serizel

14:00

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain

Abstract

This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled melspectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2%.

Keywords

acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling

PDF
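
The abstract above mentions "a simple late fusion strategy" without specifying it. One common form, shown here purely as an assumption, is a weighted average of the class probabilities produced by the two models followed by an argmax. The function name and the `weight_a` parameter are hypothetical.

```python
def late_fusion(probs_a, probs_b, weight_a=0.5):
    """Fuse two class-probability vectors by weighted averaging.

    probs_a, probs_b: per-class probabilities from two models (same class order).
    weight_a: hypothetical weight for the first model; the second gets 1 - weight_a.
    Returns (index of the winning class, fused probability vector).
    """
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


if __name__ == "__main__":
    gbm_probs = [0.6, 0.3, 0.1]   # e.g. output of the feature-engineering model
    cnn_probs = [0.2, 0.7, 0.1]   # e.g. output of the CNN
    print(late_fusion(gbm_probs, cnn_probs))
```
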
14:20

DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features

Abelino Jimenez, Benjamin Elizalde and Bhiksha Raj
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA

Abstract

Acoustic scene recordings are represented by different types of handcrafted or Neural Network features. These features, typically of thousands of dimensions, are classified in state-of-the-art approaches using kernel machines, such as Support Vector Machines (SVMs). However, the complexity of training these methods increases with the dimensionality of these input features and the size of the dataset. A solution is to map the input features to a randomized low-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features to approximate three types of kernels: Gaussian, Laplacian and Cauchy. We compared their performance using an SVM in the context of the DCASE Task 1 - Acoustic Scene Classification. Experiments show that both input and random features outperformed the DCASE baseline by an absolute 4%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance and by more than six times and still outperformed the baseline. Hence, random features could be employed by state-of-the-art approaches to compute low-storage features and perform faster kernel computations.

Keywords

Acoustic Scene Classification, Laplacian Kernel, Kernel Machines, Random Features

PDF
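
To make the idea of approximating a shift-invariant kernel with random features concrete, here is a minimal sketch of random Fourier features for the Gaussian kernel. It is a generic construction, not the authors' code; the function names, the bandwidth `sigma`, and the toy vectors are assumptions. The toy input has only three dimensions, so the example uses more random features than inputs simply to make the approximation visible, whereas the abstract describes the opposite regime.

```python
import math
import random


def make_random_features(dim, num_features, sigma, seed=0):
    """Build a random Fourier feature map for exp(-||x - y||^2 / (2 * sigma^2)).

    Frequencies are drawn from N(0, 1/sigma^2) and phases from U[0, 2*pi].
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def transform(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return transform


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel value, for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    transform = make_random_features(dim=3, num_features=2000, sigma=1.0)
    x, y = [0.1, 0.2, 0.3], [0.4, 0.1, 0.0]
    # A plain dot product of the random features approximates the kernel.
    approx = sum(a * b for a, b in zip(transform(x), transform(y)))
    print(approx, gaussian_kernel(x, y, 1.0))
```

Because the kernel is replaced by a dot product, a linear classifier trained on the transformed features can stand in for a kernel machine.
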
14:40

Nonnegative Feature Learning Methods for Acoustic Scene Classification

Victor Bisot1, Romain Serizel2,3,4, Slim Essid1 and Gaël Richard1
1Image Data and Signal, Telecom ParisTech, Paris, France, 2Université de Lorraine, Loria, Nancy, France, 3Inria, Nancy, France, 4CNRS, LORIA, Nancy, France

Abstract

This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit a simple deep neural network architecture to classify both low-level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that jointly exploits unsupervised nonnegative matrix factorization activation features and low-level time-frequency representations as inputs. Finally, we present a fusion of the proposed systems in order to further improve performance. The resulting systems are our submission for Task 1 of the DCASE 2017 challenge.

Keywords

Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks

PDF
15:00

Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio

Shahin Amiriparian1,2,3, Michael Freitag1, Nicholas Cummins1,2 and Björn Schuller2,4
1Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, 2Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, 3Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, 4Group of Language, Audio & Music, Imperial College London, London, UK

Abstract

This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, which are considered as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74.8 % to 88.0 % on the development set, and from 61.0 % to 67.5 % on the test set.

Keywords

deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing, acoustic scene classification

PDF
15:20 Coffee

Coffee served during the poster session.

15:20 Posters

Poster Session I

Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features

Sangwook Park1, Seongkyu Mun2, Younglo Lee1 and Hanseok Ko1
1School of Electrical Engineering, Korea University, Seoul, Republic of Korea, 2Department of Visual Information Processing, Korea University, Seoul, Republic of Korea

Abstract

This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, a Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. In experiments performed with the DCASE2017 challenge development dataset, the proposed method outperformed several baseline methods. Specifically, the class average accuracy is 83.6%, which is an improvement of 8.8%, 9.5%, and 8.2% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively.

Keywords

Acoustic scene classification, covariance learning, double FFT, convolutional neural network

PDF

The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification

Karol Piczak
Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland

Abstract

This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration.

Keywords

acoustic scene classification, spectrogram, frequency resolution, convolutional neural network, DCASE 2017

PDF

Wavelets Revisited for the Classification of Acoustic Scenes

Qian Kun1,2,3, Ren Zhao2,3, Pandit Vedhas2,3, Yang Zijiang1,2, Zhang Zixing2, and Schuller Björn2,3,4
1MISP group, Technische Universität München, Munich, Germany, 2Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 3Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK

Abstract

We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features behave comparably to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of trained models with wavelets and typical acoustic features reaches the best averaged 4-fold cross validation accuracy of 83.2 % and 82.6 % by SVMs and GRNNs, respectively; both significantly outperform the baseline (74.8 %) of the official development set (p < 0.001, one-tailed z-test).

Keywords

Acoustic Scene Classification, Wavelets, Support Vector Machines, Sequence Modelling, Gated Recurrent Neural Networks

PDF

Deep Sequential Image Features on Acoustic Scene Classification

Ren Zhao1,2, Pandit Vedhas1,2, Qian Kun1,2,3, Yang Zijiang1,2, Zhang Zixing2, and Schuller Björn1,2,4
1Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 3MISP group, Technische Universität München, Munich, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK

Abstract

For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from the Short-Time Fourier Transform and scalograms of the audio scenes using Convolutional Neural Networks. This is the first investigation of the performance of bump and Morse scalograms for acoustic scene classification in such a context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, ‘deep features’ are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed into Gated Recurrent Neural Networks for classification separately. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80.9%, which is an increase of 6.1% compared with the official baseline (p < .001 by one-tailed z-test).

Keywords

Audio Scene Classification, Deep Sequential Learning, Scalogram, Convolutional Neural Networks, Gated Recurrent Neural Networks

PDF
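
The abstract above combines three systems "by a margin sampling value strategy" without giving details. A minimal sketch under one plausible reading: the margin sampling value of a system is taken as the gap between its two highest class posteriors, and the prediction of the system with the largest gap is kept. Both that definition and the function names are assumptions for illustration.

```python
def margin_sampling_value(probs):
    """Gap between the highest and second-highest class posterior."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def fuse_by_margin(system_probs):
    """Return the class index predicted by the most confident system.

    system_probs: one class-probability vector per system, same class order.
    """
    most_confident = max(system_probs, key=margin_sampling_value)
    return max(range(len(most_confident)), key=most_confident.__getitem__)


if __name__ == "__main__":
    systems = [
        [0.40, 0.35, 0.25],  # e.g. spectrogram-based system, small margin
        [0.10, 0.80, 0.10],  # e.g. first scalogram system, large margin
        [0.50, 0.30, 0.20],  # e.g. second scalogram system
    ]
    print(fuse_by_margin(systems))
```
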

Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification

Alexander Schindler1, Thomas Lidy2 and Andreas Rauber2
1Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria, 2Institute for Software and Interactive Systems, Technical University of Vienna, Vienna, Austria

Abstract

In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene’s spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56% absolute improvement of the best performing single resolution model and 12.49% of the DCASE 2017 Acoustic Scenes Classification task baseline [1].

Keywords

Deep Learning, Convolutional Neural Networks, Acoustic Scene Classification, Audio Analysis

PDF

Convolutional Recurrent Neural Networks for Rare Sound Event Detection

Emre Cakir and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

Abstract

Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling.

Keywords

Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning

PDF

Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks

Hyungui Lim1, Jeongsoo Park2,3 and Yoonchang Han1
1Cochlear.ai, Seoul, Korea, 2N/A, Cochlear.ai, Seoul, Korea, 3Music and Audio Research Group, Seoul National University, Seoul, Korea

Abstract

Rare sound event detection is a newly proposed task in IEEE DCASE 2017 to identify the presence of a monophonic sound event that is classified as an emergency and to detect the onset time of the event. In this paper, we introduce a rare sound event detection system using a combination of a 1D convolutional neural network (1D ConvNet) and a recurrent neural network (RNN) with long short-term memory units (LSTM). A log-amplitude mel-spectrogram is used as an input acoustic feature and the 1D ConvNet is applied in each time-frequency frame to convert the spectral feature. Then the RNN-LSTM is utilized to incorporate the temporal dependency of the extracted features. The system is evaluated using the DCASE 2017 Challenge Task 2 Dataset. Our best result on the test set of the development dataset shows an error rate of 0.07 and an F-score of 96.26 on the event-based metric. The proposed system achieved 1st place in the challenge with an error rate of 0.13 and an F-score of 93.1 on the evaluation dataset.

Keywords

Rare sound event detection, deep learning, convolutional neural network, recurrent neural network, long short-term memory

PDF

DCASE2017 Challenge posters

DCASE2017 Challenge Results

Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering, & Department of Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France

Keywords

Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels

Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task

Bernhard Lehner, Hamid Eghbal-Zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University, Linz, Austria

Abstract

This report describes the CP-JKU team's submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2017 challenge, and discusses some observations we made about the data and the classification setup. Our approach is based on the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on raw spectrograms. The data provided for the 2017 ASC task presented some new challenges -- in particular, audio stimuli of very short duration. These will be discussed in detail, and our measures for addressing them will be described. The result of our experiments is a classification system that achieves classification accuracies of around 90% on the provided development data, as estimated via the prescribed four-fold cross-validation scheme (which, we suspect, may be rather optimistic in relation to new data).

A Report on Sound Event Detection with Different Binaural Features

Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

Abstract

In this paper, we compare the performance of using binaural audio features in place of single channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single channel and binaural features using stacked convolutional and recurrent neural network and the evaluation is reported using standard metrics of error rate and F-score. The studied binaural features are seen to consistently perform equal to or better than the single-channel features with respect to error rate metric.

Surrey-CVSSP System for DCASE2017 Challenge Task4

Yong Xu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Abstract

In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry applications. There are two subtasks, audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly-supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found effective to improve the performance. In summary, we get 61% F-value for the audio tagging subtask and 0.72 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained only 13.1% F-value for audio tagging and 1.02 ER for sound event detection.

17:00 Discussion

Open Discussion

Moderated by Mark Plumbley
University of Surrey, United Kingdom

Day 2

Friday 17.11.2017, 9:00 - 12:10

9:00 Keynote

Session chair Dan Stowell

Sound Texture Perception via Summary Statistics

Josh McDermott
Massachusetts Institute of Technology, USA

Abstract

Sound textures are produced by superpositions of large numbers of similar acoustic features (as in rain, swarms of insects, or galloping horses). Textures are noteworthy for being stationary, raising the possibility that time-averaged statistics might capture their structure. I will describe several lines of work testing this idea. I will show how the synthesis of textures from statistics of biological auditory models provides evidence for statistical texture representations. I will then describe experiments that characterize the process by which texture statistics are measured by the auditory system, and that explore their role in auditory scene analysis.

Biography

Josh McDermott is a perceptual scientist studying sound and hearing in the Department of Brain and Cognitive Sciences at MIT, where he is the Fred & Carole Middleton Career Development Assistant Professor and heads the Laboratory for Computational Audition. His research addresses human and machine audition using tools from experimental psychology, engineering, and neuroscience. McDermott obtained a BA in Brain and Cognitive Science from Harvard, an MPhil in Computational Neuroscience from University College London, a PhD in Brain and Cognitive Science from MIT, and postdoctoral training in psychoacoustics at the University of Minnesota and in computational neuroscience at NYU. He is the recipient of a Marshall Scholarship, a James S. McDonnell Foundation Scholar Award, and an NSF CAREER Award.

9:50 Presentations

Oral Session III

Session chair Dan Stowell

9:50

The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network

Gert Dekkers1,2, Steven Lauwereins2, Bart Thoen1, Mulu Weldegebreal Adhana1, Henk Brouckxon3, Bertold Van den Bergh2, Toon van Waterschoot1,2, Bart Vanrumste1,2,4, Marian Verhelst2, Peter Karsmakers1
1 KU Leuven, Department of Electrical Engineering, Engineering Technology Cluster, Geel, Belgium, 2 KU Leuven, Department of Electrical Engineering, Leuven, Belgium, 3 Vrije Universiteit Brussel, Department ETRO-DSSP, Brussels, Belgium, 4 IMEC, Leuven, Belgium

Abstract

There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.

Keywords

Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks

PDF
10:10

Acoustic Scene Classification Using Spatial Features

Marc C. Green and Damian Murphy
Audio Lab, Department of Electronic Engineering, University of York, York, UK

Abstract

Due to various factors, the vast majority of the research in the field of Acoustic Scene Classification has used monaural or binaural datasets. This paper introduces EigenScape - a new dataset of 4th-order Ambisonic acoustic scene recordings - and presents preliminary analysis of this dataset. The data is classified using a standard Mel-Frequency Cepstral Coefficient - Gaussian Mixture Model system, and the performance of this system is compared to that of a new system using spatial features extracted using Directional Audio Coding (DirAC) techniques. The DirAC features are shown to perform well in scene classification, with some subsets of these features outperforming the MFCC classification. The differences in label confusion between the two systems are especially interesting, as these suggest that certain scenes that are spectrally similar might not necessarily be spatially similar.

Keywords

Acoustic scene classification, MFCC, gaussian mixture model, ambisonics, directional audio coding, multichannel, eigenmike

PDF
10:30

Acoustic Scene Classification by Combining Autoencoder-Based Dimensionality Reduction and Convolutional Neural Networks

Jakob Abeßer, Stylianos Ioannis Mimilakis, Robert Grafe, and Hanna Lukashevich
Fraunhofer IDMT, Ilmenau, Germany

Abstract

Motivated by the recent success of deep learning techniques in various audio analysis tasks, this work presents a distributed sensor-server system for acoustic scene classification in urban environments based on deep convolutional neural networks (CNN). Stacked autoencoders are used to compress extracted spectrogram patches on the sensor side before being transmitted to and classified on the server side. In our experiments, we compare two state-of-the-art CNN architectures subject to their classification accuracy under the presence of environmental noise, the dimensionality reduction in the encoding stage, as well as a reduced number of filters in the convolution layers. Our results show that the best model configuration leads to a classification accuracy of 75% for 5 acoustic scenes. We furthermore discuss which confusions among particular classes can be ascribed to particular sound event types, which are present in multiple acoustic scene classes.

Keywords

Acoustic Scene Classification, Convolutional Neural Networks, Stacked Denoising Autoencoder, Smart City

PDF
10:50 Coffee

Coffee

Coffee served during the poster session.

10:50 Posters

Poster Session II

Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification

Yoonchang Han1 and Jeongsoo Park1,2
1Cochlear.ai, Seoul, Korea, 2Music and Audio Research Group, Seoul National University, Seoul, Korea

Abstract

In this paper, we demonstrate how we applied a convolutional neural network to DCASE 2017 task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonic-percussive source separation, and background subtraction. We also present a network structure designed for paired input to make the most of the spatial information contained in the stereo. The experimental results show that the proposed network structures and the preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development set. The proposed system achieved second place in DCASE 2017 task 1 with an accuracy of 0.804 on the evaluation set.

Keywords

DCASE 2017, acoustic scene classification, convolutional neural network, binaural representations, harmonic-percussive source separation, background subtraction

PDF

Audio Event Detection Using Multiple-Input Convolutional Neural Network

Il-Young Jeong1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea

Abstract

This paper describes the model and training framework from our submission for DCASE 2017 task 3: sound event detection in real life audio. Extending the basic convolutional neural network architecture, we use both short- and long-term audio signals simultaneously as input data. In the training stage, we calculated validation errors more frequently than once per epoch, with adaptive thresholds. We also used a class-wise early-stopping strategy to find the best model for each class. The proposed model showed meaningful improvements in cross-validation experiments compared to the baseline system.

Keywords

DCASE 2017, Sound event detection, Convolutional neural networks

PDF

DNN-Based Audio Scene Classification for DCASE2017: Dual Input Features, Balancing Cost, and Stochastic Data Duplication

Jung Jee-Weon, Heo Hee-Soo, Yang IL-Ho, Yoon Sung-Hyun, Shim Hye-Jin and Yu Ha-Jin
School of Computer Science, University of Seoul, Seoul, Republic of South Korea

Abstract

In this study, we explored DNN-based audio scene classification systems with dual input features. Dual input features take advantage of simultaneously utilizing two features with different levels of abstraction as inputs: a frame-level mel-filterbank feature and segment-level identity vector. A new fine-tune cost that solves the drawback of dual input features was developed, as well as a data duplication method that enables DNN to clearly discriminate frequently misclassified classes. Combining the proposed methods with the latest DNN techniques such as residual learning achieved a fold-wise accuracy of 95.9% for the validation set and 70.6% for the evaluation set provided by the Detection and Classification of Acoustic Scenes and Events community.

Keywords

audio scene classification, DNN, dual input feature, balancing cost, data duplication, residual learning

PDF

Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection

Jongpil Lee1, Jiyoung Park1, Sangeun Kum1, Youngho Jeong2, Juhan Nam1
1Graduate School of Culture Technology, KAIST, Korea, 2Realistic AV Research Group, ETRI, Korea

Abstract

This paper describes our method submitted to large-scale weakly supervised sound event detection for smart cars in the DCASE Challenge 2017. It is based on two deep neural network methods suggested for music auto-tagging. One is training sample-level Deep Convolutional Neural Networks (DCNN) using raw waveforms as a feature extractor. The other is aggregating features on multi-scaled models of the DCNNs and making final predictions from them. With this approach, we achieved the best results, 47.3% in F-score on subtask A (audio tagging) and 0.75 in error rate on subtask B (sound event detection) in the evaluation. These results show that the waveform-based models can be comparable to spectrogram-based models when compared to other DCASE Task 4 submissions. Finally, we visualize hierarchically learned filters from the challenge dataset in each layer of the waveform-based model to explain how they discriminate the events.

Keywords

Sound event detection, audio tagging, weakly supervised learning, multi-scale features, sample-level, convolutional neural networks, raw waveforms

PDF

Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning

Anastasios Vafeiadis1, Dimitrios Kalatzis1, Konstantinos Votis1, Dimitrios Giakoumis1, Dimitrios Tzovaras1, Liming Chen2 and Raouf Hamzaoui2
1Information Technologies Institute, Center for Research & Technology Hellas, Thessaloniki, Greece, 2Faculty of Technology, De Montfort University, Leicester, UK

Abstract

This report describes our contribution to the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. We investigated two approaches for the acoustic scene classification task. Firstly, we used a combination of features in the time and frequency domain and a hybrid Support Vector Machines - Hidden Markov Model (SVM-HMM) classifier to achieve an average accuracy over 4-folds of 80.9% on the development dataset and 61.0% on the evaluation dataset. Secondly, by exploiting data augmentation techniques and using the whole segment (as opposed to splitting into sub-sequences) as an input, the accuracy of our CNN system was boosted to 95.9%. However, due to the small number of kernels used for the CNN and a failure to capture the global information of the audio signals, it achieved an accuracy of 49.5% on the evaluation dataset. Our two approaches outperformed the DCASE baseline method, which uses log-mel band energies for feature extraction and a Multi-Layer Perceptron (MLP) to achieve an average accuracy over 4-folds of 74.8%.

Keywords

Acoustic scene classification, feature extraction, deep learning, spectral features, data augmentation

PDF

Audio Events Detection and Classification Using Extended R-FCN Approach

Wang Kaiwu, Yang Liping and Yang Bin
Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, ChongQing University, ChongQing, China

Abstract

In this study, we present a new audio event detection and classification approach based on R-FCN, a state-of-the-art fully convolutional network framework for visual object detection. Spectrogram features of audio signals are used as the input of the approach. Like the R-FCN network, the proposed approach consists of two stages. In the first stage, we detect whether there are audio events by sliding a convolutional kernel along the time axis, and then proposals which possibly contain audio events are generated by an RPN (Region Proposal Network). In the second stage, time and frequency domain information are integrated to classify these proposals and refine their boundaries. Our approach directly outputs the positions of audio events and accepts as input a two-dimensional representation of a sound of arbitrary length, without any size regularization.

Keywords

audio events detection, Convolutional Neural Network, spectrogram feature

PDF

Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion

Zheng Weiping1, Yi Jiantao1, Xing Xiaotao1, Liu Xiangtao2 and Peng Shaohu3
1School of Computer, South China Normal University, Guangzhou, China, 2Shenzhen Chinasfan Information Technology Co., Ltd., Shenzhen, China, 3School of Mechanical and Electrical Engineering, Guangzhou University, Guangzhou, China

Abstract

Making sense of the environment by sounds is an important research topic in the machine learning community. In this work, a Deep Convolutional Neural Network (DCNN) model is presented to classify acoustic scenes along with a multiple spectrograms fusion method. Firstly, the generation of the standard spectrogram and the CQT spectrogram is introduced separately. Corresponding features can then be extracted by feeding these spectrogram data into the proposed DCNN model. To fuse these multiple spectrogram features, two fusing mechanisms, namely the voting and the SVM methods, are designed. By fusing DCNN features of the standard and CQT spectrograms, the accuracy is significantly improved in our experiments, compared with the single spectrogram schemes. This demonstrates the effectiveness of the proposed multi-spectrograms fusion method.

Keywords

Deep convolutional neural network, spectrogram, feature fusion, acoustic scene classification

PDF

Robust Sound Event Detection Through Noise Estimation and Source Separation Using NMF

Qing Zhou and Zuren Feng
School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China

Abstract

This paper addresses the problem of sound event detection under non-stationary noises and various real-world acoustic scenes. An effective noise reduction strategy is proposed in this paper which can automatically adapt to background variations. The proposed method is based on supervised non-negative matrix factorization (NMF) for separating target events from noise. The event dictionary is trained offline using the training data of the target event class while the noise dictionary is learned online from the input signal by sparse and low-rank decomposition. Incorporating the estimated noise bases, this method can produce accurate source separation results by reducing noise residue and signal distortion of the reconstructed event spectrogram. Experimental results on DCASE 2017 task 2 dataset show that the proposed method outperforms the baseline system based on multi-layer perceptron classifiers and also another NMF-based method which employs a semi-supervised strategy for noise reduction.

Keywords

Sound event detection, non-negative matrix factorization, sparse and low-rank decomposition, source separation

PDF

A Hierarchic Multi-Scaled Approach for Rare Sound Event Detection

Fabio Vesperini, Diego Droghini, Daniele Ferretti, Emanuele Principi, Leonardo Gabrielli, Stefano Squartini and Francesco Piazza
Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy

Abstract

We propose a system for rare sound event detection using a hierarchical and multi-scaled approach based on Multi Layer Perceptron (MLP) and Convolutional Neural Networks (CNN). It is our contribution to the rare sound event detection task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). The task consists of detecting event onsets in artificially generated mixtures. Acoustic features are extracted from the acoustic signals; a first event detection stage is then performed by an MLP-based neural network, which proposes contiguous blocks of frames to the second stage. The CNN refines the event detection of the prior network, intrinsically operating on a multi-scaled resolution and discarding blocks that contain background wrongly classified by the MLP as event. Finally, the effective onset time of the active event is obtained. The achieved overall error rate and F-measure on the development test set are 0.18 and 90.9%, respectively.

12:00 Closing remarks

Closing remarks