The proceedings of the DCASE2017 Workshop have been published as electronic publication of Tampere University of Technology series:
Virtanen, T., Mesaros, A., Heittola, T., Diment, A., Vincent, E., Benetos, E. & Elizalde, B. (Eds.) (2017). Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017).
ISBN (Electronic): 978-952-15-4042-4
Acoustic Scene Classification by Combining Autoencoder-Based Dimensionality Reduction and Convolutional Neural Networks
Jakob Abeßer, Stylianos Ioannis Mimilakis, Robert Grafe, and Hanna Lukashevich
Fraunhofer IDMT, Ilmenau, Germany
Motivated by the recent success of deep learning techniques in various audio analysis tasks, this work presents a distributed sensor-server system for acoustic scene classification in urban environments based on deep convolutional neural networks (CNN). Stacked autoencoders are used to compress extracted spectrogram patches on the sensor side before being transmitted to and classified on the server side. In our experiments, we compare two state-of-theart CNN architectures subject to their classification accuracy under the presence of environmental noise, the dimensionality reduction in the encoding stage, as well as a reduced number of filters in the convolution layers. Our results show that the best model configuration leads to a classification accuracy of 75% for 5 acoustic scenes. We furthermore discuss which confusions among particular classes can be ascribed to particular sound event types, which are present in multiple acoustic scene classes.
Acoustic Scene Classification, Convolutional Neural Networks, Stacked Denoising Autoencoder, Smart City
Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence one for the strong followed by the weak label. The network is trained using frame-wise log melband energy as the input audio feature, and weak labels provided in the dataset as labels for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many number of times as the frames in the input audio feature, and used for strong label layer during training. We propose to control what the network learns from the weak and strong labels by different weighting for the loss computed in the two prediction layers. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves the best error rate of 0.84 for strong labels and F-score of 43.3% for weak labels on the unseen test split.
sound event detection, weak labels, deep neural network, CNN, GRU
Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio
Shahin Amiriparian1,2,3, Michael Freitag1, Nicholas Cummins1,2 and Björn Schuller2,4
1Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, 2Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, 3Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, 4Group of Language, Audio & Music, Imperial Collage London, London, UK
This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, that are considered as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74:8 % to 88:0 % on the development set, and from 61:0 % to 67:5 % on the test set.
deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing acoustic scene classification
Nonnegative Feature Learning Methods for Acoustic Scene Classification
Victor Bisot1, Romain Serizel2,3,4, Slim Essid1 and Gaël Richard1
1Image Data and Signal, Telecom ParisTech, Paris, France, 2Université de Lorraine, Loria, Nancy, France, 3Inria, Nancy, France, 4CNRS, LORIA, Nancy, France
This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit simple deep neural network architecture to classify both low level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that exploits jointly unsupervised nonnegative matrix factorization activation features and low-level time frequency representations as inputs. Finally, we present a fusion of proposed systems in order to further improve performance. The resulting systems are our submission for the task 1 of the DCASE 2017 challenge.
Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks
Convolutional Recurrent Neural Networks for Rare Sound Event Detection
Emre Cakir and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling.
Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning
The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network
Gert Dekkers1,2, Steven Lauwereins2, Bart Thoen1, Mulu Weldegebreal Adhana1, Henk Brouckxon3, Bertold Van den Bergh2, Toon van Waterschoot1,2, Bart Vanrumste1,2,4, Marian Verhelst2, Peter Karsmakers1
1 KU Leuven, Department of Electrical Engineering, Engineering Technology Cluster, Geel, Belgium, 2 KU Leuven, Department of Electrical Engineering, Leuven, Belgium, 3 Vrije Universiteit Brussel, Department ETRO-DSSP, Brussels, Belgium, 4 IMEC, Leuven, Belgium
There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.
Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks
Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled melspectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2%.
acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling
Acoustic Scene Classification Using Spatial Features
Marc C. Green and Damian Murphy
Audio Lab, Department of Electonic Engineering, University of York, York, UK
Due to various factors, the vast majority of the research in the field of Acoustic Scene Classification has used monaural or binaural datasets. This paper introduces EigenScape - a new dataset of 4th-order Ambisonic acoustic scene recordings - and presents preliminary analysis of this dataset. The data is classified using a standard Mel-Frequency Cepstral Coefficient - Gaussian Mixture Model system, and the performance of this system is compared to that of a new system using spatial features extracted using Directional Audio Coding (DirAC) techniques. The DirAC features are shown to perform well in scene classification, with some subsets of these features outperforming the MFCC classification. The differences in label confusion between the two systems are especially interesting, as these suggest that certain scenes that are spectrally similar might not necessarily be spatially similar.
Acoustic scene classification, MFCC, gaussian mixture model, ambisonics, directional audio coding, multichannel, eigenmike
Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification
Yoonchang Han1 and Jeongsoo Park1,2
1Cochlear.ai, Seoul, Korea, 2Music and Audio Research Group, Seoul National University, Seoul, Korea
In this paper, we demonstrate how we applied convolutional neural network for DCASE 2017 task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonicpercussive source separation, and background subtraction. We also present a network structure designed for paired input to make the most of the spatial information contained in the stereo. The experimental results show that the proposed network structures and the preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development. The proposed system achieved second place in DCASE 2017 task 1 with an accuracy of 0.804 on the evaluation set.
DCASE 2017, acoustic scene classification, convolutional neural network, binaural representations, harmonicpercussive source separation, background subtraction
Audio Event Detection Using Multiple-Input Convolutional Neural Network
Il-Young Jeong1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea
This paper describes the model and training framework from our submission for DCASE 2017 task 3: sound event detection in real life audio. Extending the basic convolutional neural network architecture, we use both short- and long-term audio signal simultaneously as input data. In the training stage, we calculated validation errors more frequently than one epoch with adaptive thresholds. We also used class-wise early-stopping strategy to find the best model for each class. The proposed model showed meaningful improvements in cross-validation experiments compared to the baseline system.
DCASE 2017, Sound event detection, Convolutional neural networks
DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features
Abelino Jimenez, Benjamin Elizalde and Bhiksha Raj
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
Acoustic scene recordings are represented by different types of handcrafted or Neural Network features. These features, typically of thousands of dimensions, are classified in state of the art approaches using kernel machines, such as the Support Vector Machines (SVM). However, the complexity of training these methods increases with the dimensionality of these input features and the size of the dataset. A solution is to map the input features to a randomized low-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features to approximate three types of kernels, Guassian, Laplacian and Cauchy. We compared their performance using an SVM in the context of the DCASE Task 1 - Acoustic Scene Classification. Experiments show that both, input and random features outperformed the DCASE baseline by an absolute 4%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance and by more than six times and still outperformed the baseline. Hence, random features could be employed by state of the art approaches to compute low-storage features and perform faster kernel computations.
Acoustic Scene Classification, Laplacian Kernel, Kernel Machines, Random Features
DNN-Based Audio Scene Classification for DCASE2017: Dual Input Features, Balancing Cost, and Stochastic Data Duplication
Jung Jee-Weon, Heo Hee-Soo, Yang IL-Ho, Yoon Sung-Hyun, Shim Hye-Jin and Yu Ha-Jin
School of Computer Science, University of Seoul, Seoul, Republic of South Korea
In this study, we explored DNN-based audio scene classification systems with dual input features. Dual input features take advantage of simultaneously utilizing two features with different levels of abstraction as inputs: a frame-level mel-filterbank feature and segment-level identity vector. A new fine-tune cost that solves the drawback of dual input features was developed, as well as a data duplication method that enables DNN to clearly discriminate frequently misclassified classes. Combining the proposed methods with the latest DNN techniques such as residual learning achieved a fold-wise accuracy of 95.9% for the validation set and 70.6% for the evaluation set provided by the Detection and Classification of Acoustic Scenes and Events community.
audio scene classification, DNN, dual input feature, balancing cost, data duplication, residual learning
Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study
Christian Kroos and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44:9% and achieved rank 15 out of 34 on the ER error metric with a value of 0:891. We discuss the question of evolving versus learning for supervised tasks.
Sound event detection, neuroevolution, NEAT, deep scattering transform, wavelets, clustering, co-evolution
Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection
Jongpil Lee1, Jiyoung Park1, Sangeun Kum1, Youngho Jeong2, Juhan Nam1
1Graduate School of Culture Technology, KAIST, Korea, 2Realistic AV Research Group, ETRI, Korea
This paper describes our method submitted to large-scale weakly supervised sound event detection for smart cars in the DCASE Challenge 2017. It is based on two deep neural network methods suggested for music auto-tagging. One is training sample-level Deep Convolutional Neural Networks (DCNN) using raw waveforms as a feature extractor. The other is aggregating features on multiscaled models of the DCNNs and making final predictions from them. With this approach, we achieved the best results, 47.3% in F-score on subtask A (audio tagging) and 0.75 in error rate on subtask B (sound event detection) in the evaluation. These results show that the waveform-based models can be comparable to spectrogrambased models when compared to other DCASE Task 4 submissions. Finally, we visualize hierarchically learned filters from the challenge dataset in each layer of the waveform-based model to explain how they discriminate the events.
Sound event detection, audio tagging, weakly supervised learning, multi-scale features, sample-level, convolutional neural networks, raw waveforms
Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input
Donmoon Lee1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea
In this paper, we use ensemble of convolutional neural network models that use the various analysis window to detect audio events in the automotive environment. When detecting the presence of audio events, global input based model that uses the entire audio clip works better. On the other hand, segmented input based models works better in finding the accurate position of the event. Experimental results for weakly-labeled audio data confirm the performance trade-off between the two tasks, depending on the length of input audio. By combining the predictions of various models, the proposed system achieved 0.4762 in the clip-based F1-score and 0.7167 in the segment-based error rate.
Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks
Hyungui Lim1, Jeongsoo Park2,3 and Yoonchang Han1
1Cochlear.ai, Seoul, Korea, 2N/A, Cochlear.ai, Seoul, Korea, 3Music and Audio Research Group, Seoul National University, Seoul, Korea
Rare sound event detection is a newly proposed task in IEEE DCASE 2017 to identify the presence of monophonic sound event that is classified as an emergency and to detect the onset time of the event. In this paper, we introduce a rare sound event detection system using combination of 1D convolutional neural network (1D ConvNet) and recurrent neural network (RNN) with long shortterm memory units (LSTM). A log-amplitude mel-spectrogram is used as an input acoustic feature and the 1D ConvNet is applied in each time-frequency frame to convert the spectral feature. Then the RNN-LSTM is utilized to incorporate the temporal dependency of the extracted features. The system is evaluated using DCASE 2017 Challenge Task 2 Dataset. Our best result on the test set of the development dataset shows 0.07 and 96.26 of error rate and F-score on the event-based metric, respectively. The proposed system has achieved the 1st place in the challenge with an error rate of 0.13 and an F-Score of 93.1 on the evaluation dataset.
Rare sound event detection, deep learning, convolutional neural network, recurrent neural network, long short-term memory
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen 1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering, & Department of Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels
Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane
Seongkyu Mun1, Sangwook Park1, David Han2 and Hanseok Ko1
1Intelligent Signal Processing Laboratory, Korea University, Seoul, South Korea, 2Office of Naval Research, Office of Naval Research, Arlington VA, USA
Although it is typically expected that using a large amount of labeled training data would lead to improve performance in deep learning, it is generally difficult to obtain such DataBase (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are constrained to use a relatively small DB as a rule, which is similar to the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing additional DB, this paper proposes to use Generative Adversarial Networks (GAN) based method for generating additional training DB. Since it is not clear whether every sample generated by GAN would have equal impact in classification performance, this paper proposes to use Support Vector Machine (SVM) hyper plane for each class as reference for selecting samples, which have class discriminative information. Based on the crossvalidated experiments on development DB, the usage of the generated features could improve ASC performance.
acoustic scene classification, generative adversarial networks, support vector machine, data augmentation, decision hyper-plane
Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features
Sangwook Park1, Seongkyu Mun2, Younglo Lee1 and Hanseok Ko1
1School of Electrical Engineering, Korea University, Seoul, Republic of Korea, 2Department of Visual Information Processing, Korea University, Seoul, Republic of Korea
This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to the experiments which were performed with the DCASE2017 challenge development dataset it is claimed that the proposed method outperformed several baseline methods. Specifically, the class average accuracy is shown as 83.6%, which is an improvement of 8.8%, 9.5%, 8.2% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively.
Acoustic scene classification, covariance learning, double FFT, convolutional neural network
The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification
Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration.
acoustic scene classification, spectrogram, frequency resolution, convolutional neural network, DCASE 2017
Wavelets Revisited for the Classification of Acoustic Scenes
Qian Kun1,2,3, Ren Zhao2,3, Pandit Vedhas2,3, Yang Zijiang1,2, Zhang Zixing2, and Schuller Björn2,3,4
1MISP group, Technische Universität München, Munich, Germany, 2Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 3Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK
We investigate the effectiveness of wavelet features for acoustic scene classification as contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that, the proposed wavelet features behave comparable to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of trained models with wavelets and typical acoustic features reach the best averaged 4-fold cross validation accuracy of 83.2 %, and 82.6 % by SVMs, and GRNNs, respectively; both significantly outperform the baseline (74.8 %) of the official development set (p < 0:001, one-tailed z-test).
Acoustic Scene Classification, Wavelets, Support Vector Machines, Sequence Modelling, Gated Recurrent Neural Networks
Deep Sequential Image Features on Acoustic Scene Classification
Ren Zhao1,2, Pandit Vedhas1,2, Qian Kun1,2,3, Yang Zijiang1,2, Zhang Zixing2, and Schuller Björn1,2,4
1Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 3MISP group, Technische Universität München, Munich, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from Short-Time Fourier Transform and scalogram of the audio scenes using Convolutional Neural Networks. It is the first time to investigate the performance of bump and morse scalograms for acoustic scene classification in an according context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, ‘deep features’ are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed into Gated Recurrent Neural Networks for classification separately. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80:9%, which increases by 6:1% when compared with the official baseline (p < :001 by one-tailed z-test).
Audio Scene Classification, Deep Sequential Learning, Scalogram, Convolutional Neural Networks, Gated Recurrent Neural Networks
Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification
Alexander Schindler1, Thomas Lidy2 and Andreas Rauber2
1Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria, 2Institute for Software and Interactive Systems, Technical University of Vienna, Vienna, Austria
In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene’s spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56% absolute improvement of the best performing single resolution model and 12.49% of the DCASE 2017 Acoustic Scenes Classification task baseline .
Deep Learning, Convolutional Neural Networks, Acoustic Scene Classification, Audio Analysis
Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning
Anastasios Vafeiadis1, Dimitrios Kalatzis1, Konstantinos Votis1, Dimitrios Giakoumis1, Dimitrios Tzovaras1, Liming Chen2 and Raouf Hamzaoui2
1Information Technologies Institute, Center for Research & Technology Hellas, Thessaloniki, Greece, 2Faculty of Technology, De Montfort University, Leicester, UK
This report describes our contribution to the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. We investigated two approaches for the acoustic scene classification task. Firstly, we used a combination of features in the time and frequency domain and a hybrid Support Vector Machines - Hidden Markov Model (SVM-HMM) classifier to achieve an average accuracy over 4-folds of 80.9% on the development dataset and 61.0% on the evaluation dataset. Secondly, by exploiting dataaugmentation techniques and using the whole segment (as opposed to splitting into sub-sequences) as an input, the accuracy of our CNN system was boosted to 95.9%. However, due to the small number of kernels used for the CNN and a failure of capturing the global information of the audio signals, it achieved an accuracy of 49.5% on the evaluation dataset. Our two approaches outperformed the DCASE baseline method, which uses log-mel band energies for feature extraction and a Multi-Layer Perceptron (MLP) to achieve an average accuracy over 4-folds of 74.8%.
Acoustic scene classification, feature extraction, deep learning, spectral features, data augmentation
Audio Events Detection and classification using extended R-FCN Approach
Wang Kaiwu, Yang Liping and Yang Bin
Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, ChongQing University, ChongQing, China
In this study, we present a new audio event detection and classification approach based on R-FCN—a state-of-the-art fully convolutional network framework for visual object detection. Spectrogram features of audio signals are used as the input of the approach. The proposed approach consists of two stages like R-FCN network. In the first stage, we detect whether there are audio events by sliding convolutional kernel in time axis, and then proposals which possibly contain audio events are generated by RPN (Region Proposal Networks). In the second stage, time and frequency domain information are integrated to classify these proposals and refine their boundaries. Our approach can output the positions of audio events directly which can input a two-dimensional representation of arbitrary length sound without any size regularization.
audio events detection, Convolutional Neural Network, spectrogram feature
Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion
Zheng Weiping1, Yi Jiantao1, Xing Xiaotao1, Liu Xiangtao2 and Peng Shaohu3
1School of Computer, South China Normal University, Guangzhou, China, 2Shenzhen Chinasfan Information Technology Co., Ltd., Shenzhen Chinasfan Information Technology Co., Ltd., Shenzhen, China, 3School of Mechanical and Electrical Engineering,, Guangzhou University, Guangzhou, China
Making sense of the environment by sounds is an important research in machine learning community. In this work, a Deep Convolutional Neural Network (DCNN) model is presented to classify acoustic scenes along with a multiple spectrograms fusion method. Firstly, the generations of standard spectrogram and CQT spectrogram are introduced separately. Corresponding features can then be extracted by feeding these spectrogram data into the proposed DCNN model. To fuse these multiple spectrogram features, two fusing mechanisms, namely the voting and the SVM methods, are designed. By fusing DCNN features of the standard and CQT spectrograms, the accuracy is significantly improved in our experiments, comparing with the single spectrogram schemes. This proves the effectiveness of the proposed multi-spectrograms fusion method.
Deep convolutional neural network, spectrogram, feature fusion, acoustic scene classification
Robust Sound Event Detection Through Noise Estimation and Source Separation Using NMF
Qing Zhou and Zuren Feng
School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi'an, China
This paper addresses the problem of sound event detection under non-stationary noises and various real-world acoustic scenes. An effective noise reduction strategy is proposed in this paper which can automatically adapt to background variations. The proposed method is based on supervised non-negative matrix factorization (NMF) for separating target events from noise. The event dictionary is trained offline using the training data of the target event class while the noise dictionary is learned online from the input signal by sparse and low-rank decomposition. Incorporating the estimated noise bases, this method can produce accurate source separation results by reducing noise residue and signal distortion of the reconstructed event spectrogram. Experimental results on DCASE 2017 task 2 dataset show that the proposed method outperforms the baseline system based on multi-layer perceptron classifiers and also another NMF-based method which employs a semi-supervised strategy for noise reduction.
Sound event detection, non-negative matrix factorization, sparse and low-rank decomposition, source separation