Sound event detection in real life audio


Challenge results

Task description

This task evaluated the performance of sound event detection systems in multisource conditions similar to everyday life, where sound sources are rarely heard in isolation. The participants used 1 hour and 32 minutes of audio in 24 recordings to train their systems. The challenge evaluation was done using 29 minutes of audio in 8 recordings.

A more detailed task description can be found on the task description page.

Challenge results

A detailed description of the metrics used can be found here.
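The tables below report a segment-based error rate (ER) and F1 score. The metrics page is not reproduced here, so the following is only a minimal sketch that assumes the commonly used segment-based definitions: per segment, substitutions are min(FP, FN), deletions are the remaining false negatives, insertions are the remaining false positives, and ER is their sum divided by the number of active reference events. The function name and the toy data are hypothetical.

```python
from typing import List, Set, Tuple


def segment_based_metrics(
    reference: List[Set[str]], estimated: List[Set[str]]
) -> Tuple[float, float]:
    """Return (error_rate, f1_percent) for aligned lists of per-segment label sets."""
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must cover the same segments")

    tp = fp = fn = 0
    substitutions = deletions = insertions = 0
    n_reference = 0

    for ref, est in zip(reference, estimated):
        seg_tp = len(ref & est)   # events active in both
        seg_fp = len(est - ref)   # events reported but not annotated
        seg_fn = len(ref - est)   # events annotated but not reported
        tp, fp, fn = tp + seg_tp, fp + seg_fp, fn + seg_fn

        # Assumed definitions: a false positive paired with a false negative
        # in the same segment counts as one substitution.
        substitutions += min(seg_fp, seg_fn)
        deletions += max(0, seg_fn - seg_fp)
        insertions += max(0, seg_fp - seg_fn)
        n_reference += len(ref)

    er = (substitutions + deletions + insertions) / n_reference if n_reference else 0.0
    denom = 2 * tp + fp + fn
    f1 = 100.0 * 2 * tp / denom if denom else 0.0
    return er, f1


# Toy example with four segments (labels are illustrative only).
reference = [{"car"}, {"car", "people walking"}, set(), {"children"}]
estimated = [{"car"}, {"car"}, {"car"}, {"people speaking"}]
er, f1 = segment_based_metrics(reference, estimated)
```

Because insertions are counted in the numerator, ER under these definitions is not bounded by 1, which is consistent with the values above 1.0 that appear in the tables.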

Systems ranking

Columns: submission code, name, technical report, then segment-based overall ER and F1 on the evaluation dataset, then segment-based overall ER and F1 on the development dataset.
Adavanne_TUT_task3_1 Ash_1 Adavanne2017 0.7914 41.7 0.2500 79.3
Adavanne_TUT_task3_2 Ash_2 Adavanne2017 0.8061 42.9 0.2400 79.1
Adavanne_TUT_task3_3 Ash_3 Adavanne2017 0.8544 41.4 0.2000 80.3
Adavanne_TUT_task3_4 Ash_4 Adavanne2017 0.8716 36.2 0.2400 76.9
Chen_UR_task3_1 Chen Chen2017 0.8575 30.9 0.8100 37.0
Dang_NCU_task3_1 andang3 Dang2017 0.9529 42.6 0.5900 55.4
Dang_NCU_task3_2 andang3 Dang2017 0.9468 42.8 0.5900 55.4
Dang_NCU_task3_3 andang3 Dang2017 1.0318 44.2 0.5900 55.4
Dang_NCU_task3_4 andang3 Dang2017 1.1028 43.5 0.6200 53.7
Feroze_IST_task3_1 Khizer Feroze2017 1.0942 42.6 0.7600 47.4
Feroze_IST_task3_2 Khizer Feroze2017 1.0312 39.7 0.7600 47.4
DCASE2017 baseline Baseline Heittola2017 0.9358 42.8 0.6900 56.7
Hou_BUPT_task3_1 MMS_HYB Hou2017 1.0446 29.3 0.6000 58.9
Hou_BUPT_task3_2 BGRU_HYB Hou2017 0.9248 34.1 0.6600 53.9
Kroos_CVSSP_task3_1 J-NEAT-E Kroos2017 0.8979 44.9 0.7300 49.2
Kroos_CVSSP_task3_2 J-NEAT-P Kroos2017 0.8911 41.6 0.7200 50.5
Kroos_CVSSP_task3_3 SLFFN Kroos2017 1.0141 43.8 0.6900 56.5
Lee_SNU_task3_1 MICNN_1 Jeong2017 0.9260 42.0 0.5100 67.0
Lee_SNU_task3_2 MICNN_2 Jeong2017 0.8673 27.9 0.5100 67.0
Lee_SNU_task3_3 MICNN_3 Jeong2017 0.8080 40.8 0.5100 67.0
Lee_SNU_task3_4 MICNN_4 Jeong2017 0.8985 43.6 0.5100 67.0
Li_SCUT_task3_1 LiSCUTt3_1 Li2017 0.9920 40.3 0.7100 55.5
Li_SCUT_task3_2 LiSCUTt3_2 Li2017 0.9523 41.0 0.6900 54.5
Li_SCUT_task3_3 LiSCUTt3_3 Li2017 1.0043 43.4 0.7100 55.8
Li_SCUT_task3_4 LiSCUTt3_4 Li2017 0.9878 33.9 0.7100 52.8
Lu_THU_task3_1 bigru_da Lu2017 0.8251 39.6 0.6100 56.7
Lu_THU_task3_2 bigru_da Lu2017 0.8306 39.2 0.6100 56.7
Lu_THU_task3_3 bigru_da Lu2017 0.8361 38.0 0.6100 56.7
Lu_THU_task3_4 bigru_da Lu2017 0.8373 38.3 0.6100 56.7
Wang_NTHU_task3_1 NTHU_AHG Wang2017 0.9749 40.8 0.7700 43.6
Xia_UWA_task3_1 UWA_T3_1 Xia2017 0.9523 43.5 0.6600 56.9
Xia_UWA_task3_2 UWA_T3_1 Xia2017 0.9437 41.1 0.6500 56.0
Xia_UWA_task3_3 UWA_T3_1 Xia2017 0.8740 41.7 0.6400 56.0
Yu_FZU_task3_1 DRF Yu2017 1.1963 3.9 0.8200 38.2
Zhou_PKU_task3_1 MC-LSTM-1 Zhou2017 0.8526 39.1 0.6600 54.5
Zhou_PKU_task3_2 MC-LSTM-2 Zhou2017 0.8526 37.3 0.6400 54.4

Teams ranking

Table including only the best performing system per submitting team.
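As a rough illustration of how such a per-team table can be derived from the systems ranking, the sketch below keeps, for each team, the submission with the lowest evaluation ER. The selection criterion is an assumption inferred from the tables, and deriving the team key by stripping the trailing submission index is a hypothetical convention.

```python
from typing import Dict, List, Tuple


def best_per_team(systems: List[Tuple[str, float]]) -> Dict[str, Tuple[str, float]]:
    """Map each team key to its (submission code, ER) with the lowest ER."""
    best: Dict[str, Tuple[str, float]] = {}
    for code, er in systems:
        # "Kroos_CVSSP_task3_2" -> "Kroos_CVSSP_task3" (assumed naming scheme)
        team = code.rsplit("_", 1)[0]
        # Strict comparison keeps the earlier submission when ERs are tied.
        if team not in best or er < best[team][1]:
            best[team] = (code, er)
    return best


# A few rows copied from the systems ranking (evaluation ER).
systems = [
    ("Adavanne_TUT_task3_1", 0.7914),
    ("Adavanne_TUT_task3_2", 0.8061),
    ("Kroos_CVSSP_task3_1", 0.8979),
    ("Kroos_CVSSP_task3_2", 0.8911),
    ("Kroos_CVSSP_task3_3", 1.0141),
]
teams = best_per_team(systems)
```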

Columns: submission code, name, technical report, then segment-based overall ER and F1 on the evaluation dataset, then segment-based overall ER and F1 on the development dataset.
Adavanne_TUT_task3_1 Ash_1 Adavanne2017 0.7914 41.7 0.2500 79.3
Chen_UR_task3_1 Chen Chen2017 0.8575 30.9 0.8100 37.0
Dang_NCU_task3_2 andang3 Dang2017 0.9468 42.8 0.5900 55.4
Feroze_IST_task3_2 Khizer Feroze2017 1.0312 39.7 0.7600 47.4
DCASE2017 baseline Baseline Heittola2017 0.9358 42.8 0.6900 56.7
Hou_BUPT_task3_2 BGRU_HYB Hou2017 0.9248 34.1 0.6600 53.9
Kroos_CVSSP_task3_2 J-NEAT-P Kroos2017 0.8911 41.6 0.7200 50.5
Lee_SNU_task3_3 MICNN_3 Jeong2017 0.8080 40.8 0.5100 67.0
Li_SCUT_task3_2 LiSCUTt3_2 Li2017 0.9523 41.0 0.6900 54.5
Lu_THU_task3_1 bigru_da Lu2017 0.8251 39.6 0.6100 56.7
Wang_NTHU_task3_1 NTHU_AHG Wang2017 0.9749 40.8 0.7700 43.6
Xia_UWA_task3_3 UWA_T3_1 Xia2017 0.8740 41.7 0.6400 56.0
Yu_FZU_task3_1 DRF Yu2017 1.1963 3.9 0.8200 38.2
Zhou_PKU_task3_1 MC-LSTM-1 Zhou2017 0.8526 39.1 0.6600 54.5

Class-wise performance

Columns: submission code, name, technical report, then ER and F1 for each class in this order: brakes squeaking, car, children, large vehicle, people speaking, people walking. A dash (-) marks a cell that is empty in the results table.
Adavanne_TUT_task3_1 Ash_1 Adavanne2017 1.0000 - 0.7674 54.6 1.2000 0.0 1.0678 49.3 1.0408 0.0 1.0331 38.7
Adavanne_TUT_task3_2 Ash_2 Adavanne2017 0.9773 4.4 0.7674 54.7 2.8000 0.0 1.4181 45.3 1.2367 1.9 0.8398 52.6
Adavanne_TUT_task3_3 Ash_3 Adavanne2017 1.0000 - 0.7758 52.0 3.2667 0.0 1.4576 48.0 1.4286 3.3 0.9144 52.8
Adavanne_TUT_task3_4 Ash_4 Adavanne2017 1.0000 - 0.8496 51.4 1.0000 - 1.4011 37.7 1.0000 - 1.5580 28.8
Chen_UR_task3_1 Chen Chen2017 1.0000 - 0.8538 51.8 1.0000 - 0.9887 14.6 1.0082 0.0 1.0663 1.0
Dang_NCU_task3_1 andang3 Dang2017 0.8409 27.5 0.8022 59.1 1.2667 6.6 1.8079 33.6 1.0980 21.1 1.8287 35.0
Dang_NCU_task3_2 andang3 Dang2017 0.8182 30.8 0.8036 59.0 1.2000 6.9 1.8079 32.8 1.1102 22.3 1.7928 35.2
Dang_NCU_task3_3 andang3 Dang2017 0.7045 46.6 0.8482 59.4 1.2000 18.2 2.3785 31.5 1.1265 36.4 1.9503 34.8
Dang_NCU_task3_4 andang3 Dang2017 0.9318 12.8 0.7187 65.2 2.5111 1.7 1.6836 42.0 1.9020 21.8 2.1381 34.5
Feroze_IST_task3_1 Khizer Feroze2017 0.7955 37.5 0.7479 61.8 4.0889 0.0 2.0000 43.3 1.9592 14.3 1.5166 39.1
Feroze_IST_task3_2 Khizer Feroze2017 0.8750 25.2 0.7618 58.1 3.6222 0.0 1.8023 43.7 1.8204 10.4 1.4171 35.1
DCASE2017 baseline Baseline Heittola2017 0.9205 16.5 0.7674 61.5 2.6667 0.0 1.4407 42.7 1.2980 8.6 1.4448 33.5
Hou_BUPT_task3_1 MMS_HYB Hou2017 0.9886 2.2 0.7507 50.9 4.8222 0.0 1.7571 36.9 1.5020 0.0 1.1851 13.7
Hou_BUPT_task3_2 BGRU_HYB Hou2017 1.0000 - 0.9373 52.7 2.8889 0.0 1.2712 32.8 1.1469 0.0 1.1657 16.3
Kroos_CVSSP_task3_1 J-NEAT-E Kroos2017 1.0000 - 0.8677 62.4 4.2222 0.0 1.0226 0.0 1.4163 14.7 0.8508 51.3
Kroos_CVSSP_task3_2 J-NEAT-P Kroos2017 1.0000 - 0.8621 47.1 2.7333 1.6 1.4463 43.1 1.4041 33.1 0.9558 50.0
Kroos_CVSSP_task3_3 SLFFN Kroos2017 0.9545 8.7 0.7939 58.9 4.1333 1.1 1.7458 42.0 1.6163 16.1 1.1050 49.5
Lee_SNU_task3_1 MICNN_1 Jeong2017 1.0000 - 0.9234 61.2 1.0000 - 2.5311 41.1 1.1837 6.5 1.0138 0.0
Lee_SNU_task3_2 MICNN_2 Jeong2017 1.0000 - 0.9248 45.5 1.0000 - 1.3672 25.8 1.0000 - 1.0000 -
Lee_SNU_task3_3 MICNN_3 Jeong2017 1.0000 - 0.9234 61.2 1.0000 - 1.3672 25.8 1.0000 0.8 1.0000 -
Lee_SNU_task3_4 MICNN_4 Jeong2017 1.1023 7.6 0.9234 61.2 2.7556 1.6 1.8983 45.3 1.3020 30.2 1.0000 -
Li_SCUT_task3_1 LiSCUTt3_1 Li2017 0.9432 10.8 0.7591 60.2 4.0222 0.0 1.7345 43.3 1.5224 7.4 1.3343 32.4
Li_SCUT_task3_2 LiSCUTt3_2 Li2017 1.0000 - 0.7019 62.2 3.9111 0.0 1.4520 45.4 1.4082 8.0 1.4613 31.9
Li_SCUT_task3_3 LiSCUTt3_3 Li2017 1.0568 0.0 0.6783 66.9 3.4889 0.0 1.9322 37.8 1.4531 12.7 1.6685 34.5
Li_SCUT_task3_4 LiSCUTt3_4 Li2017 1.0682 17.5 0.9109 27.5 2.5111 0.0 1.6723 43.9 1.8653 15.5 0.8646 56.3
Lu_THU_task3_1 bigru_da Lu2017 1.0000 - 0.7855 45.0 1.0444 0.0 1.5424 33.6 1.0980 8.2 1.0000 54.2
Lu_THU_task3_2 bigru_da Lu2017 1.0000 - 0.8008 44.4 1.0444 0.0 1.5876 33.6 1.0735 7.7 1.0166 53.4
Lu_THU_task3_3 bigru_da Lu2017 1.0000 - 0.8120 41.9 1.0889 0.0 1.5424 33.9 1.0612 8.5 1.0221 52.7
Lu_THU_task3_4 bigru_da Lu2017 1.0000 - 0.8273 40.2 1.0444 0.0 1.4802 34.8 1.0776 9.0 1.0083 54.5
Wang_NTHU_task3_1 NTHU_AHG Wang2017 1.0000 - 0.8315 58.7 2.4222 1.8 2.0678 22.8 1.6367 17.3 1.3094 43.0
Xia_UWA_task3_1 UWA_T3_1 Xia2017 1.0000 - 0.7604 59.1 1.1556 18.8 2.1299 41.9 1.2408 17.8 1.6022 38.0
Xia_UWA_task3_2 UWA_T3_1 Xia2017 1.0000 - 0.7214 58.1 3.8000 0.0 1.6497 42.1 1.5673 13.1 1.2265 43.1
Xia_UWA_task3_3 UWA_T3_1 Xia2017 1.0000 - 0.7242 57.7 1.0444 20.3 1.7797 40.7 1.3755 6.6 1.3011 39.5
Yu_FZU_task3_1 DRF Yu2017 1.2159 0.0 1.2925 6.3 15.6444 4.6 1.2938 1.7 1.3306 1.2 1.0304 1.1
Zhou_PKU_task3_1 MC-LSTM-1 Zhou2017 1.0455 0.0 0.7674 54.9 1.1333 0.0 1.7345 37.2 1.0694 6.4 1.2790 34.0
Zhou_PKU_task3_2 MC-LSTM-2 Zhou2017 1.0227 0.0 0.8245 47.0 1.5333 0.0 1.3220 49.8 1.0163 10.8 1.3315 32.7

System characteristics

Columns: submission code, name, technical report, segment-based overall ER and F1, then system characteristics: input, sampling rate, data augmentation (listed only where used), features, classifier, decision making.
Adavanne_TUT_task3_1 Ash_1 Adavanne2017 0.7914 41.7 mono 44.1kHz log-mel energies CRNN threshold
Adavanne_TUT_task3_2 Ash_2 Adavanne2017 0.8061 42.9 binaural 44.1kHz log-mel energies CRNN threshold
Adavanne_TUT_task3_3 Ash_3 Adavanne2017 0.8544 41.4 binaural 44.1kHz multi-scale log-mel energies CRNN threshold
Adavanne_TUT_task3_4 Ash_4 Adavanne2017 0.8716 36.2 binaural 44.1kHz spectrogram CRNN threshold
Chen_UR_task3_1 Chen Chen2017 0.8575 30.9 mono 44.1kHz log-mel energies CNN median filtering
Dang_NCU_task3_1 andang3 Dang2017 0.9529 42.6 mono 44.1kHz log-mel energies CRNN majority vote
Dang_NCU_task3_2 andang3 Dang2017 0.9468 42.8 mono 44.1kHz log-mel energies CRNN majority vote
Dang_NCU_task3_3 andang3 Dang2017 1.0318 44.2 mono 44.1kHz log-mel energies CRNN majority vote
Dang_NCU_task3_4 andang3 Dang2017 1.1028 43.5 mono 44.1kHz log-mel energies CRNN majority vote
Feroze_IST_task3_1 Khizer Feroze2017 1.0942 42.6 mixed 44.1kHz Perceptual Linear Predictive NN morphological operations
Feroze_IST_task3_2 Khizer Feroze2017 1.0312 39.7 mixed 44.1kHz Perceptual Linear Predictive NN morphological operations
DCASE2017 baseline Baseline Heittola2017 0.9358 42.8 mono 44.1kHz log-mel energies MLP median filtering
Hou_BUPT_task3_1 MMS_HYB Hou2017 1.0446 29.3 mono 44.1kHz log-mel energies combination [MLP; BGRU] median filtering
Hou_BUPT_task3_2 BGRU_HYB Hou2017 0.9248 34.1 mono 44.1kHz raw audio data BGRU median filtering
Kroos_CVSSP_task3_1 J-NEAT-E Kroos2017 0.8979 44.9 mono 44.1kHz scattering transform, clustering Neuroevolution threshold
Kroos_CVSSP_task3_2 J-NEAT-P Kroos2017 0.8911 41.6 mono 44.1kHz scattering transform, clustering Neuroevolution threshold
Kroos_CVSSP_task3_3 SLFFN Kroos2017 1.0141 43.8 mono 44.1kHz scattering transform, clustering ANN threshold
Lee_SNU_task3_1 MICNN_1 Jeong2017 0.9260 42.0 binaural 44.1kHz channel swapping log-mel energies CNN adaptive thresholding
Lee_SNU_task3_2 MICNN_2 Jeong2017 0.8673 27.9 binaural 44.1kHz channel swapping log-mel energies CNN adaptive thresholding
Lee_SNU_task3_3 MICNN_3 Jeong2017 0.8080 40.8 binaural 44.1kHz channel swapping log-mel energies CNN adaptive thresholding
Lee_SNU_task3_4 MICNN_4 Jeong2017 0.8985 43.6 binaural 44.1kHz channel swapping log-mel energies CNN adaptive thresholding
Li_SCUT_task3_1 LiSCUTt3_1 Li2017 0.9920 40.3 mono 44.1kHz DNN(MFCC) Bi-LSTM Top output probability
Li_SCUT_task3_2 LiSCUTt3_2 Li2017 0.9523 41.0 mono 44.1kHz DNN(MFCC) Bi-LSTM Top output probability
Li_SCUT_task3_3 LiSCUTt3_3 Li2017 1.0043 43.4 mono 44.1kHz DNN(MFCC) DNN Top output probability
Li_SCUT_task3_4 LiSCUTt3_4 Li2017 0.9878 33.9 mono 44.1kHz DNN(MFCC) Bi-LSTM Top output probability
Lu_THU_task3_1 bigru_da Lu2017 0.8251 39.6 mixed 44.1kHz pitch shifting, time stretching MFCC, pitch RNN, ensemble median filtering
Lu_THU_task3_2 bigru_da Lu2017 0.8306 39.2 mixed 44.1kHz pitch shifting, time stretching MFCC, pitch RNN, ensemble median filtering
Lu_THU_task3_3 bigru_da Lu2017 0.8361 38.0 mixed 44.1kHz pitch shifting, time stretching MFCC, pitch RNN, ensemble median filtering
Lu_THU_task3_4 bigru_da Lu2017 0.8373 38.3 mixed 44.1kHz pitch shifting, time stretching MFCC, pitch RNN, ensemble median filtering
Wang_NTHU_task3_1 NTHU_AHG Wang2017 0.9749 40.8 mono, binaural 44.1kHz MFCC, TDOA RNN post processing technique
Xia_UWA_task3_1 UWA_T3_1 Xia2017 0.9523 43.5 mono 44.1kHz log-mel energies MLP Class wise distance evaluation (CW)
Xia_UWA_task3_2 UWA_T3_1 Xia2017 0.9437 41.1 mono 44.1kHz log-mel energies CNN median filtering
Xia_UWA_task3_3 UWA_T3_1 Xia2017 0.8740 41.7 mono 44.1kHz log-mel energies CNN Class wise distance evaluation (CW)
Yu_FZU_task3_1 DRF Yu2017 1.1963 3.9 mono 16kHz mel energies Deep Random Forest sliding median filtering
Zhou_PKU_task3_1 MC-LSTM-1 Zhou2017 0.8526 39.1 right, diff 44.1kHz log-mel energies LSTM median filtering
Zhou_PKU_task3_2 MC-LSTM-2 Zhou2017 0.8526 37.3 right, mean, diff 44.1kHz log-mel energies LSTM median filtering

Technical reports

A Report on Sound Event Detection with Different Binaural Features

Abstract

In this paper, we compare the performance of using binaural audio features in place of single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using a stacked convolutional and recurrent neural network, and the evaluation is reported using the standard metrics of error rate and F-score. The studied binaural features are seen to consistently perform equal to or better than the single-channel features with respect to the error rate metric.

System characteristics
Input mono; binaural
Sampling rate 44.1kHz
Features log-mel energies; multi-scale log-mel energies; spectrogram
Classifier CRNN
Decision making threshold

DCASE2017 Sound Event Detection Using Convolutional Neural Network

Abstract

The DCASE2017 Challenge Task 3 is to develop a sound event detection system for real-life audio. In our setup, we merge the two channels into one, then use Mel-band energy to calculate the converted spectrum, and train the model based on a convolutional neural network (CNN). The method we use achieves a 0.81 error rate on average for the four cross-validation folds. It proves the practicability of using CNN for sound event detection.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
Decision making median filtering

Deep Learning for DCASE2017 Challenge

Abstract

This paper reports our results on all tasks of DCASE challenge 2017 which are acoustic scene classification, detection of rare sound events, sound event detection in real life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are developed based on two favorite neural networks which are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
Decision making majority vote

Comparison of Baseline System with Perceptual Linear Predictive Feature Using Neural Network for Sound Event Detection in Real Life Audio

Abstract

For sound event detection of polyphonic sounds, we compare the performance of the perceptual linear predictive (PLP) feature with Mel frequency cepstral coefficients (MFCC) using a neural network classifier. The results are further compared with the performance of the baseline system given by DCASE 2017 (task 3). Our results show that, using a PLP-based classifier, the individual error rate (ER) for each event is improved compared to the baseline system. For the car event, ER is improved by 10%, for the large vehicle event by 23%, for the people walking event by 26%, and some improvements are also observed in other events.

System characteristics
Input mixed
Sampling rate 44.1kHz
Features Perceptual Linear Predictive
Classifier NN
Decision making morphological operations

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier MLP
Decision making median filtering
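
The decision making step listed above, median filtering, smooths frame-wise detections so that isolated spurious frames are removed and short gaps are filled. The following is a minimal sketch of the idea on a binary activity sequence for a single event class. The window length and zero padding at the edges are assumptions for illustration; the baseline's actual settings are not given here.

```python
from statistics import median
from typing import List


def median_filter(activity: List[int], window: int = 3) -> List[int]:
    """Median-filter a 0/1 frame activity sequence with an odd window length."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    # Zero padding (assumed) keeps every window at full, odd length,
    # so the median of 0/1 values is always exactly 0 or 1.
    padded = [0] * half + list(activity) + [0] * half
    return [int(median(padded[i:i + window])) for i in range(len(activity))]


# Hypothetical frame-wise decisions for one class before smoothing.
raw = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
smoothed = median_filter(raw, window=3)
# smoothed == [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
```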

Sound Event Detection in Real Life Audio Using Multi-Model System

Abstract

In this paper, we present a polyphonic sound event detection (SED) system based on a multi-model system. In the proposed multi-model system, we use one model based on Deep Neural Networks (DNN) to detect sound events of car, and five models based on Bi-directional Gated Recurrent Units Recurrent Neural Networks (BGRU-RNN) to detect the other sound events: brakes squeaking, children, large vehicle, people speaking and people walking. Since different classes of sound events have different audio characteristics, we use different models to detect each class. The proposed multi-model system is trained and tested on the IEEE DCASE2017 Challenge: Sound Event Detection in Real Life Audio (Task 3) Development Dataset; the result yields up to 58.92% and 0.60 in terms of F-score and error rate on the segment-based metric, respectively.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies; raw audio data
Classifier combination [MLP; BGRU]; BGRU
Decision making median filtering

Audio Event Detection Using Multiple-Input Convolutional Neural Network

Abstract

This paper describes the model and training framework from our submission for DCASE 2017 task 3: sound event detection in real life audio. Our model basically follows a convolutional neural network architecture, yet takes two inputs: the short-term and the long-term audio signal. In the training stage, we calculated validation errors more often than once per epoch, with adaptive thresholds. We also used class-wise early stopping to find the best model for each class. The proposed model shows meaningful improvements in cross-validation experiments compared to the baseline system using the simple neural network.
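
One way to read the class-wise early stopping mentioned above is as per-class checkpoint selection: validation error is recorded for every class at each validation point, and each class keeps the checkpoint where its own error was lowest. The sketch below illustrates only that reading. The function name, data layout, and numbers are hypothetical, and the authors' actual procedure may differ.

```python
from typing import Dict, List


def classwise_best_checkpoints(history: List[Dict[str, float]]) -> Dict[str, int]:
    """Return, for each class, the index of the checkpoint with the lowest validation error.

    history[i] maps class label -> validation error measured at checkpoint i.
    """
    best_error: Dict[str, float] = {}
    best_index: Dict[str, int] = {}
    for index, errors in enumerate(history):
        for label, error in errors.items():
            # Strict comparison keeps the earliest checkpoint on ties.
            if label not in best_error or error < best_error[label]:
                best_error[label] = error
                best_index[label] = index
    return best_index


# Hypothetical validation errors at three validation points.
history = [
    {"car": 0.90, "children": 1.20},
    {"car": 0.80, "children": 1.00},
    {"car": 0.85, "children": 0.95},
]
best = classwise_best_checkpoints(history)
# best == {"car": 1, "children": 2}
```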

System characteristics
Input binaural
Sampling rate 44.1kHz
Data augmentation channel swapping
Features log-mel energies
Classifier CNN
Decision making adaptive thresholding

Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study

Abstract

Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing co-evolution, and applied it to sound event detection in real life audio (task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. Results show that J-NEAT is capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. The evolved networks were, however, narrowly outperformed by a comparable, experimenter-designed minimal single-layer feed-forward network. We discuss the question of evolving versus learning for supervised tasks.

System characteristics
Input mono
Sampling rate 44.1kHz
Features scattering transform, clustering
Classifier Neuroevolution; ANN
Decision making threshold

The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification

Abstract

In this report, we present our work on three tasks of the IEEE AASP challenge on DCASE 2017, i.e. task 1: Acoustic Scene Classification (ASC), task 2: detection of rare sound events in artificially created mixtures and task 3: sound event detection in real life recordings. Tasks 2 and 3 belong to the same problem, i.e. Sound Event Detection (SED). We adopt deep learning techniques to extract Deep Audio Feature (DAF) and classify various acoustic scenes or sound events. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bi-directional Long Short Term Memory (Bi-LSTM) fed by the DAF is built for ASC and SED. Evaluated on the development datasets of DCASE 2017, our systems are superior to the corresponding baselines for tasks 1 and 2, and our system for task 3 performs as well as the baseline in terms of the predominant metrics.

System characteristics
Input mono
Sampling rate 44.1kHz
Features DNN(MFCC)
Classifier Bi-LSTM; DNN
Decision making Top output probability

Bidirectional GRU for Sound Event Detection

Abstract

Sound event detection (SED) aims to detect temporal boundaries of sound events given audio streams. Sound recordings in real life situations typically have many overlapping events, making this detection task much more difficult than classification and non-overlapping detection. Recently, multi-label recurrent neural networks (RNNs) have become the mainstream solutions for this polyphonic sound event detection problem. However, similar to many other deep learning approaches, the relative scarcity of carefully labeled data has limited the capacity of RNNs. In this paper, we first present a multi-label bi-directional recurrent neural network to model the temporal evolution of sound events. Then we propose the use of data augmentation to overcome the problem of data scarcity and explore the appropriate augmentation strategies that achieve better performance. We evaluate our approach on the development subset of the DCASE2017 task3 dataset. Combined with data augmentation and ensemble technique, we reduce the error rate by over 11% compared to the officially published baseline system.
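
The "over 11%" figure reads as a relative reduction in error rate. As a quick check under that assumption, the sketch below uses the development-dataset ER values from the systems ranking table (0.69 for the baseline, 0.61 for this team's submissions). Whether the abstract refers to exactly these two numbers is an assumption.

```python
def relative_reduction(baseline_er: float, system_er: float) -> float:
    """Fraction by which system_er is lower than baseline_er."""
    return (baseline_er - system_er) / baseline_er


# Development-dataset ER values taken from the systems ranking table.
reduction = relative_reduction(baseline_er=0.69, system_er=0.61)
print(f"{100 * reduction:.1f}%")  # prints "11.6%"
```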

System characteristics
Input mixed
Sampling rate 44.1kHz
Data augmentation pitch shifting, time stretching
Features MFCC, pitch
Classifier RNN, ensemble
Decision making median filtering

Sound Event Detection From Real-Life Audio by Training a Long Short-Term Memory Network with Mono and Stereo Features

Abstract

In this paper, we trained and evaluated an acoustic sound event classifier that uses a combination of stereo and mono features. For stereo features, we treated the time difference of arrival (TDOA) as a random variable and calculated its probability density function. For mono features, Mel-frequency cepstral coefficients (MFCCs) and their 1st and 2nd derivatives were extracted. A recurrent neural network (RNN) with long short-term memory (LSTM) was constructed to perform multi-label classification. Using the 4-fold validation dataset provided by the DCASE2017 challenge [5] for training, we chose model parameters based on the best average performance. The proposed TDOA plus MFCC features combined with the RNN-LSTM model achieved a segment-based error rate of 0.77.
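
The abstract does not say how the TDOA or its density is estimated. Purely as an illustration, the sketch below estimates a per-frame delay with plain time-domain cross-correlation and then turns a list of per-frame delays into a normalised histogram as a simple stand-in for a probability density. Both choices, and all names, are assumptions rather than the authors' method.

```python
from typing import List


def estimate_tdoa(left: List[float], right: List[float], max_lag: int) -> int:
    """Return the lag in samples that maximises the cross-correlation.

    A positive lag means the right channel is delayed relative to the left.
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tdoa_distribution(lags: List[int], max_lag: int) -> List[float]:
    """Normalised histogram over lags -max_lag..max_lag (index 0 is -max_lag)."""
    counts = [0] * (2 * max_lag + 1)
    for lag in lags:
        counts[lag + max_lag] += 1
    total = len(lags)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Toy frame: the right channel carries the same impulse two samples later.
left = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
lag = estimate_tdoa(left, right, max_lag=3)  # lag == 2
```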

System characteristics
Input mono, binaural
Sampling rate 44.1kHz
Features MFCC, TDOA
Classifier RNN
Decision making post processing technique

Class Wise Distance Based Acoustic Event Detection

Abstract

In this paper, we propose a class wise distance based approach in a neural network based acoustic event detection system. The neural network output probabilities are updated by calculating the distance between the acoustic features of each frame and the class wise distance of each event class. The detected acoustic segments are re-evaluated segmentally using the class wise distances. Cross-validation detection results on the development set of DCASE2017 show the efficiency of the proposed method by achieving a 4% absolute reduction in segment-based error rate compared to the baseline system.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier MLP; CNN
Decision making Class wise distance evaluation (CW); median filtering

Sound Event Detection Using Deep Random Forest

Abstract

In this paper, we present our work on Task 3 Sound Event Detection in Real Life Audio [1]. The systems aim at dealing with the detection of overlapping audio events, where the detectors are based on deep random forest, a decision tree ensemble approach. Because random forests have an inherent weakness in detecting and classifying polyphonic events, the systems use a one-vs-the-rest (OvR) multiclass/multilabel strategy, fitting one deep random forest per event class. On the development dataset, the system obtained an error rate of 0.82 and an F-score of 38.2%.
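
The one-vs-the-rest strategy described above can be sketched independently of the classifier: multilabel frame annotations are split into one binary target vector per event class, a separate binary detector is fitted on each, and the per-class decisions are merged back into polyphonic output. The helper names below are hypothetical, and the deep random forest itself is not shown.

```python
from typing import Dict, List, Set


def one_vs_rest_targets(
    frame_labels: List[Set[str]], classes: List[str]
) -> Dict[str, List[int]]:
    """Build one binary target vector (1 = class active) per event class."""
    return {
        cls: [1 if cls in labels else 0 for labels in frame_labels]
        for cls in classes
    }


def combine_predictions(per_class: Dict[str, List[int]]) -> List[Set[str]]:
    """Merge per-class binary decisions back into per-frame label sets."""
    n_frames = len(next(iter(per_class.values()), []))
    return [
        {cls for cls, decisions in per_class.items() if decisions[i]}
        for i in range(n_frames)
    ]


# Toy annotations: overlapping events in the second frame.
frames = [{"car"}, {"car", "children"}, set()]
targets = one_vs_rest_targets(frames, ["car", "children"])
# targets == {"car": [1, 1, 0], "children": [0, 1, 0]}
```

Each binary target vector would then be used to train its own detector, so overlapping events are handled by several detectors firing on the same frame.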

System characteristics
Input mono
Sampling rate 16kHz
Features mel energies
Classifier Deep Random Forest
Decision making sliding median filtering

Sound Event Detection in Multichannel Audio LSTM Network

Abstract

In this paper, a polyphonic sound event detection system is proposed. This system uses log mel-band energy features with a long short-term memory (LSTM) recurrent neural network. Human listeners successfully recognize overlapping sound events using two ears. Motivated by this, we propose to extend the system to use multichannel audio data. The original stereo (multichannel) audio signal has two channels; from these we construct three different channel data and use different fusion strategies to extend our system. Experiments show that our system achieved superior performance compared with the baselines.
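
The three channel signals are named "right", "mean" and "diff" in the system characteristics for this report. A minimal sketch of how such signals could be derived from a stereo recording is shown below. Taking the mean as the average of both channels and the difference as left minus right are assumptions, since the abstract does not define them.

```python
from typing import Dict, List, Tuple


def build_channels(stereo: List[Tuple[float, float]]) -> Dict[str, List[float]]:
    """Derive right, mean and diff signals from (left, right) sample pairs."""
    return {
        "right": [r for _, r in stereo],
        "mean": [(l + r) / 2.0 for l, r in stereo],  # assumed: average of both channels
        "diff": [l - r for l, r in stereo],          # assumed sign: left minus right
    }


# Two hypothetical stereo samples.
channels = build_channels([(1.0, 0.5), (0.0, -1.0)])
# channels["mean"] == [0.75, -0.5]
```

Features such as log mel-band energies would then be computed per derived signal before fusion.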

System characteristics
Input right, diff; right, mean, diff
Sampling rate 44.1kHz
Features log-mel energies
Classifier LSTM
Decision making median filtering