Acoustic scene classification


Task results

Task description

The goal of the acoustic scene classification task was to classify a test recording into one of 15 predefined classes that characterize the environment in which it was recorded, for example "park", "home", or "office".

A more detailed task description can be found on the challenge task page.
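As an illustration, the overall and class-wise accuracies reported in the tables below can be computed as follows. This is a minimal sketch with made-up labels; the actual evaluation covers 15 scene classes and many more recordings:

```python
from collections import Counter

# Hypothetical predicted and reference scene labels for a few test recordings.
reference = ["park", "home", "office", "park", "home"]
predicted = ["park", "home", "park", "park", "office"]

# Overall accuracy: fraction of recordings whose predicted class matches the reference.
correct = sum(r == p for r, p in zip(reference, predicted))
accuracy = 100.0 * correct / len(reference)

# Class-wise accuracy, as in the "Class-wise performance" table.
per_class_total = Counter(reference)
per_class_correct = Counter(r for r, p in zip(reference, predicted) if r == p)
class_wise = {c: 100.0 * per_class_correct[c] / n for c, n in per_class_total.items()}
```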

Challenge results

Here you can find complete information on the submissions for Task 1: results on the evaluation and development sets (where reported by the authors), class-wise results, technical reports, and BibTeX citations.

System outputs are available in a separate download package.

Systems ranking

Submission code | Submission name | Technical report | Accuracy (evaluation dataset) | Accuracy (development dataset)
Aggarwal_task1_1 - Vij2016 74.4 74.1
Bae_task1_1 CLC Bae2016 84.1 79.2
Bao_task1_1 - Bao2016 83.1
Battaglino_task1_1 - Battaglino2016 80.0
Bisot_task1_1 - Bisot2016 87.7 86.2
DCASE2016 baseline DCASE2016_baseline Heittola2016 77.2 72.5
Duong_task1_1 Tec_SVM_A Sena_Mafra2016 76.4 80.0
Duong_task1_2 Tec_SVM_V Sena_Mafra2016 80.5 78.0
Duong_task1_3 Tec_MLP Sena_Mafra2016 73.1 75.0
Duong_task1_4 Tec_CNN Sena_Mafra2016 62.8 59.0
Eghbal-Zadeh_task1_1 CPJKU16_BMBI Eghbal-Zadeh2016 86.4 80.8
Eghbal-Zadeh_task1_2 CPJKU16_CBMBI Eghbal-Zadeh2016 88.7 83.9
Eghbal-Zadeh_task1_3 CPJKU16_DCNN Eghbal-Zadeh2016 83.3 79.5
Eghbal-Zadeh_task1_4 CPJKU16_LFCBI Eghbal-Zadeh2016 89.7 89.9
Foleiss_task1_1 JFTT Foleiss2016 76.2 71.8
Hertel_task1_1 All-ConvNet Hertel2016 79.5 84.5
Kim_task1_1 QRK Yun2016 82.1 84.0
Ko_task1_1 KU_ISPL1_2016 Park2016 87.2 76.3
Ko_task1_2 KU_ISPL2_2016 Mun2016 82.3 72.7
Kong_task1_1 QK Kong2016 81.0 76.4
Kumar_task1_1 Gauss Elizalde2016 85.9 78.9
Lee_task1_1 MARGNet_MWFD Han2016 84.6 83.1
Lee_task1_2 MARGNet_ZENS Kim2016 85.4 81.6
Liu_task1_1 liu-re Liu2016 83.8
Liu_task1_2 liu-pre Liu2016 83.6
Lostanlen_task1_1 LostanlenAnden_2016 Lostanlen2016 80.8 79.4
Marchi_task1_1 Marchi_2016 Marchi2016 86.4 81.4
Marques_task1_1 DRKNN_2016 Marques2016 83.1 78.2
Moritz_task1_1 - Moritz2016 79.0 76.5
Mulimani_task1_1 - Mulimani2016 65.6 66.8
Nogueira_task1_1 - Nogueira2016 81.0
Patiyal_task1_1 IITMandi_2016 Patiyal2016 78.5 97.6
Phan_task1_1 CNN-LTE Phan2016 83.3 81.2
Pugachev_task1_1 - Pugachev2016 73.1 82.9
Qu_task1_1 - Dai2016 80.5
Qu_task1_2 - Dai2016 84.1
Qu_task1_3 - Dai2016 82.3
Qu_task1_4 - Dai2016 80.5
Rakotomamonjy_task1_1 RAK_2016_1 Rakotomamonjy2016 82.1 81.2
Rakotomamonjy_task1_2 RAK_2016_2 Rakotomamonjy2016 79.2
Santoso_task1_1 SWW Santoso2016 80.8 78.8
Schindler_task1_1 CQTCNN_1 Lidy2016 81.8 80.8
Schindler_task1_2 CQTCNN_2 Lidy2016 83.3
Takahashi_task1_1 UTNII_2016 Takahashi2016 85.6 77.5
Valenti_task1_1 - Valenti2016 86.2 79.0
Vikaskumar_task1_1 ABSP_IITKGP_2016 Vikaskumar2016 81.3 80.4
Vu_task1_1 - Vu2016 80.0 82.1
Xu_task1_1 HL-DNN-ASC_2016 Xu2016 73.3 81.4
Zoehrer_task1_1 - Zoehrer2016 73.1

Teams ranking

This table includes only the best-performing system per submitting team.

Submission code | Submission name | Technical report | Accuracy (evaluation dataset) | Accuracy (development dataset)
Aggarwal_task1_1 - Vij2016 74.4 74.1
Bae_task1_1 CLC Bae2016 84.1 79.2
Bao_task1_1 - Bao2016 83.1
Battaglino_task1_1 - Battaglino2016 80.0
Bisot_task1_1 - Bisot2016 87.7 86.2
DCASE2016 baseline DCASE2016_baseline Heittola2016 77.2 72.5
Duong_task1_2 Tec_SVM_V Sena_Mafra2016 80.5 78.0
Eghbal-Zadeh_task1_4 CPJKU16_LFCBI Eghbal-Zadeh2016 89.7 89.9
Foleiss_task1_1 JFTT Foleiss2016 76.2 71.8
Hertel_task1_1 All-ConvNet Hertel2016 79.5 84.5
Kim_task1_1 QRK Yun2016 82.1 84.0
Ko_task1_1 KU_ISPL1_2016 Park2016 87.2 76.3
Ko_task1_2 KU_ISPL2_2016 Mun2016 82.3 72.7
Kong_task1_1 QK Kong2016 81.0 76.4
Kumar_task1_1 Gauss Elizalde2016 85.9 78.9
Lee_task1_1 MARGNet_MWFD Han2016 84.6 83.1
Lee_task1_2 MARGNet_ZENS Kim2016 85.4 81.6
Liu_task1_1 liu-re Liu2016 83.8
Lostanlen_task1_1 LostanlenAnden_2016 Lostanlen2016 80.8 79.4
Marchi_task1_1 Marchi_2016 Marchi2016 86.4 81.4
Marques_task1_1 DRKNN_2016 Marques2016 83.1 78.2
Moritz_task1_1 - Moritz2016 79.0 76.5
Mulimani_task1_1 - Mulimani2016 65.6 66.8
Nogueira_task1_1 - Nogueira2016 81.0
Patiyal_task1_1 IITMandi_2016 Patiyal2016 78.5 97.6
Phan_task1_1 CNN-LTE Phan2016 83.3 81.2
Pugachev_task1_1 - Pugachev2016 73.1 82.9
Qu_task1_2 - Dai2016 84.1
Rakotomamonjy_task1_1 RAK_2016_1 Rakotomamonjy2016 82.1 81.2
Santoso_task1_1 SWW Santoso2016 80.8 78.8
Schindler_task1_2 CQTCNN_2 Lidy2016 83.3
Takahashi_task1_1 UTNII_2016 Takahashi2016 85.6 77.5
Valenti_task1_1 - Valenti2016 86.2 79.0
Vikaskumar_task1_1 ABSP_IITKGP_2016 Vikaskumar2016 81.3 80.4
Vu_task1_1 - Vu2016 80.0 82.1
Xu_task1_1 HL-DNN-ASC_2016 Xu2016 73.3 81.4
Zoehrer_task1_1 - Zoehrer2016 73.1

Class-wise performance

Submission code | Submission name | Technical report | Accuracy (evaluation dataset) | Beach | Bus | Cafe/Restaurant | Car | City center | Forest path | Grocery store | Home | Library | Metro station | Office | Park | Residential area | Train | Tram
Aggarwal_task1_1 - Vij2016 74.4 80.8 84.6 69.2 88.5 80.8 84.6 84.6 92.3 38.5 96.2 92.3 65.4 42.3 34.6 80.8
Bae_task1_1 CLC Bae2016 84.1 84.6 100.0 61.5 88.5 92.3 100.0 96.2 88.5 46.2 88.5 100.0 96.2 65.4 53.8 100.0
Bao_task1_1 - Bao2016 83.1 84.6 96.2 57.7 100.0 76.9 92.3 84.6 88.5 46.2 96.2 100.0 96.2 76.9 50.0 100.0
Battaglino_task1_1 - Battaglino2016 80.0 84.6 73.1 76.9 84.6 96.2 100.0 96.2 84.6 34.6 80.8 84.6 96.2 65.4 53.8 88.5
Bisot_task1_1 - Bisot2016 87.7 88.5 100.0 76.9 100.0 100.0 88.5 88.5 96.2 50.0 100.0 96.2 80.8 76.9 73.1 100.0
DCASE2016 baseline DCASE2016_baseline Heittola2016 77.2 84.6 88.5 69.2 96.2 80.8 65.4 88.5 92.3 26.9 100.0 96.2 53.8 88.5 30.8 96.2
Duong_task1_1 Tec_SVM_A Sena_Mafra2016 76.4 88.5 100.0 69.2 88.5 84.6 100.0 96.2 38.5 46.2 80.8 100.0 61.5 34.6 57.7 100.0
Duong_task1_2 Tec_SVM_V Sena_Mafra2016 80.5 80.8 100.0 84.6 92.3 92.3 100.0 96.2 57.7 46.2 96.2 100.0 50.0 53.8 57.7 100.0
Duong_task1_3 Tec_MLP Sena_Mafra2016 73.1 73.1 92.3 50.0 84.6 88.5 100.0 80.8 34.6 26.9 92.3 100.0 84.6 46.2 50.0 92.3
Duong_task1_4 Tec_CNN Sena_Mafra2016 62.8 80.8 88.5 53.8 80.8 69.2 96.2 76.9 50.0 15.4 46.2 92.3 42.3 34.6 19.2 96.2
Eghbal-Zadeh_task1_1 CPJKU16_BMBI Eghbal-Zadeh2016 86.4 92.3 92.3 76.9 96.2 92.3 96.2 100.0 88.5 69.2 73.1 100.0 96.2 76.9 46.2 100.0
Eghbal-Zadeh_task1_2 CPJKU16_CBMBI Eghbal-Zadeh2016 88.7 96.2 100.0 84.6 100.0 92.3 96.2 100.0 92.3 69.2 69.2 100.0 96.2 84.6 50.0 100.0
Eghbal-Zadeh_task1_3 CPJKU16_DCNN Eghbal-Zadeh2016 83.3 92.3 96.2 42.3 88.5 84.6 100.0 100.0 100.0 53.8 100.0 96.2 46.2 80.8 69.2 100.0
Eghbal-Zadeh_task1_4 CPJKU16_LFCBI Eghbal-Zadeh2016 89.7 96.2 100.0 61.5 96.2 96.2 96.2 100.0 96.2 69.2 100.0 96.2 88.5 88.5 61.5 100.0
Foleiss_task1_1 JFTT Foleiss2016 76.2 84.6 84.6 61.5 80.8 96.2 84.6 96.2 88.5 46.2 57.7 84.6 65.4 42.3 80.8 88.5
Hertel_task1_1 All-ConvNet Hertel2016 79.5 84.6 92.3 53.8 100.0 80.8 80.8 76.9 76.9 69.2 100.0 100.0 84.6 46.2 53.8 92.3
Kim_task1_1 QRK Yun2016 82.1 76.9 100.0 76.9 100.0 84.6 100.0 88.5 100.0 0.0 92.3 96.2 76.9 69.2 69.2 100.0
Ko_task1_1 KU_ISPL1_2016 Park2016 87.2 88.5 96.2 84.6 96.2 100.0 96.2 96.2 88.5 53.8 80.8 100.0 57.7 80.8 88.5 100.0
Ko_task1_2 KU_ISPL2_2016 Mun2016 82.3 92.3 84.6 65.4 92.3 100.0 84.6 96.2 92.3 53.8 65.4 84.6 92.3 84.6 53.8 92.3
Kong_task1_1 QK Kong2016 81.0 84.6 100.0 57.7 92.3 88.5 96.2 92.3 76.9 34.6 80.8 100.0 96.2 69.2 46.2 100.0
Kumar_task1_1 Gauss Elizalde2016 85.9 84.6 92.3 73.1 88.5 92.3 96.2 96.2 92.3 50.0 96.2 96.2 80.8 88.5 73.1 88.5
Lee_task1_1 MARGNet_MWFD Han2016 84.6 84.6 96.2 61.5 100.0 88.5 96.2 92.3 96.2 42.3 84.6 96.2 84.6 76.9 69.2 100.0
Lee_task1_2 MARGNet_ZENS Kim2016 85.4 84.6 92.3 61.5 100.0 96.2 100.0 96.2 96.2 46.2 84.6 100.0 92.3 69.2 61.5 100.0
Liu_task1_1 liu-re Liu2016 83.8 84.6 96.2 69.2 84.6 92.3 96.2 88.5 92.3 46.2 92.3 96.2 88.5 76.9 53.8 100.0
Liu_task1_2 liu-pre Liu2016 83.6 88.5 92.3 69.2 84.6 96.2 92.3 92.3 88.5 46.2 88.5 96.2 92.3 76.9 50.0 100.0
Lostanlen_task1_1 LostanlenAnden_2016 Lostanlen2016 80.8 80.8 92.3 50.0 96.2 84.6 96.2 84.6 80.8 65.4 96.2 100.0 65.4 69.2 53.8 96.2
Marchi_task1_1 Marchi_2016 Marchi2016 86.4 88.5 92.3 80.8 100.0 96.2 100.0 100.0 76.9 50.0 96.2 100.0 92.3 84.6 42.3 96.2
Marques_task1_1 DRKNN_2016 Marques2016 83.1 88.5 96.2 65.4 84.6 84.6 96.2 80.8 84.6 69.2 84.6 92.3 96.2 65.4 57.7 100.0
Moritz_task1_1 - Moritz2016 79.0 88.5 100.0 19.2 100.0 92.3 100.0 88.5 92.3 38.5 80.8 100.0 61.5 76.9 46.2 100.0
Mulimani_task1_1 - Mulimani2016 65.6 73.1 96.2 69.2 100.0 73.1 50.0 65.4 76.9 7.7 76.9 96.2 96.2 23.1 15.4 65.4
Nogueira_task1_1 - Nogueira2016 81.0 88.5 88.5 65.4 92.3 73.1 96.2 84.6 92.3 38.5 96.2 100.0 73.1 80.8 53.8 92.3
Patiyal_task1_1 IITMandi_2016 Patiyal2016 78.5 84.6 96.2 61.5 92.3 92.3 92.3 80.8 92.3 34.6 96.2 96.2 92.3 69.2 11.5 84.6
Phan_task1_1 CNN-LTE Phan2016 83.3 84.6 96.2 53.8 100.0 100.0 96.2 84.6 88.5 46.2 84.6 100.0 88.5 84.6 46.2 96.2
Pugachev_task1_1 - Pugachev2016 73.1 84.6 69.2 61.5 92.3 80.8 96.2 92.3 80.8 26.9 96.2 88.5 57.7 42.3 34.6 92.3
Qu_task1_1 - Dai2016 80.5 84.6 100.0 73.1 88.5 96.2 84.6 100.0 88.5 23.1 76.9 96.2 73.1 76.9 46.2 100.0
Qu_task1_2 - Dai2016 84.1 88.5 100.0 80.8 92.3 96.2 84.6 100.0 88.5 42.3 76.9 96.2 76.9 80.8 57.7 100.0
Qu_task1_3 - Dai2016 82.3 88.5 100.0 76.9 92.3 96.2 84.6 92.3 88.5 30.8 88.5 96.2 76.9 76.9 46.2 100.0
Qu_task1_4 - Dai2016 80.5 80.8 100.0 84.6 88.5 92.3 84.6 92.3 92.3 42.3 76.9 96.2 76.9 76.9 23.1 100.0
Rakotomamonjy_task1_1 RAK_2016_1 Rakotomamonjy2016 82.1 80.8 96.2 46.2 92.3 84.6 100.0 96.2 88.5 42.3 80.8 96.2 88.5 73.1 65.4 100.0
Rakotomamonjy_task1_2 RAK_2016_2 Rakotomamonjy2016 79.2 92.3 92.3 69.2 84.6 80.8 96.2 84.6 88.5 38.5 96.2 100.0 73.1 57.7 34.6 100.0
Santoso_task1_1 SWW Santoso2016 80.8 84.6 84.6 61.5 96.2 84.6 100.0 80.8 100.0 42.3 92.3 100.0 80.8 65.4 42.3 96.2
Schindler_task1_1 CQTCNN_1 Lidy2016 81.8 88.5 100.0 34.6 92.3 96.2 100.0 92.3 88.5 46.2 96.2 100.0 65.4 73.1 53.8 100.0
Schindler_task1_2 CQTCNN_2 Lidy2016 83.3 88.5 100.0 34.6 92.3 96.2 100.0 92.3 92.3 46.2 96.2 100.0 65.4 76.9 69.2 100.0
Takahashi_task1_1 UTNII_2016 Takahashi2016 85.6 92.3 100.0 61.5 100.0 88.5 88.5 96.2 84.6 57.7 80.8 100.0 92.3 80.8 61.5 100.0
Valenti_task1_1 - Valenti2016 86.2 84.6 100.0 76.9 100.0 96.2 100.0 92.3 92.3 42.3 96.2 96.2 76.9 76.9 65.4 96.2
Vikaskumar_task1_1 ABSP_IITKGP_2016 Vikaskumar2016 81.3 84.6 92.3 61.5 100.0 84.6 84.6 80.8 88.5 65.4 92.3 69.2 80.8 73.1 73.1 88.5
Vu_task1_1 - Vu2016 80.0 88.5 76.9 61.5 100.0 92.3 100.0 80.8 73.1 46.2 92.3 100.0 92.3 50.0 46.2 100.0
Xu_task1_1 HL-DNN-ASC_2016 Xu2016 73.3 84.6 96.2 23.1 96.2 84.6 100.0 84.6 69.2 23.1 57.7 100.0 73.1 69.2 38.5 100.0
Zoehrer_task1_1 - Zoehrer2016 73.1 80.8 92.3 38.5 92.3 65.4 96.2 84.6 65.4 23.1 84.6 100.0 61.5 69.2 42.3 100.0

System characteristics

Submission code | Submission name | Technical report | Accuracy (evaluation dataset) | Input | Features | Classifier
Aggarwal_task1_1 - Vij2016 74.4 binaural various SVM
Bae_task1_1 CLC Bae2016 84.1 monophonic spectrogram CNN-RNN
Bao_task1_1 - Bao2016 83.1 monophonic MFCC+mel energy fusion
Battaglino_task1_1 - Battaglino2016 80.0 binaural mel energy CNN
Bisot_task1_1 - Bisot2016 87.7 monophonic spectrogram NMF
DCASE2016 baseline DCASE2016_baseline Heittola2016 77.2 monophonic MFCC GMM
Duong_task1_1 Tec_SVM_A Sena_Mafra2016 76.4 monophonic mel energy SVM
Duong_task1_2 Tec_SVM_V Sena_Mafra2016 80.5 monophonic mel energy SVM
Duong_task1_3 Tec_MLP Sena_Mafra2016 73.1 monophonic mel energy DNN
Duong_task1_4 Tec_CNN Sena_Mafra2016 62.8 monophonic mel energy DNN
Eghbal-Zadeh_task1_1 CPJKU16_BMBI Eghbal-Zadeh2016 86.4 binaural MFCC I-vector
Eghbal-Zadeh_task1_2 CPJKU16_CBMBI Eghbal-Zadeh2016 88.7 binaural MFCC I-vector
Eghbal-Zadeh_task1_3 CPJKU16_DCNN Eghbal-Zadeh2016 83.3 monophonic spectrogram CNN
Eghbal-Zadeh_task1_4 CPJKU16_LFCBI Eghbal-Zadeh2016 89.7 mono+binaural MFCC+spectrograms fusion
Foleiss_task1_1 JFTT Foleiss2016 76.2 monophonic various SVM
Hertel_task1_1 All-ConvNet Hertel2016 79.5 left spectrogram CNN
Kim_task1_1 QRK Yun2016 82.1 mono MFCC GMM
Ko_task1_1 KU_ISPL1_2016 Park2016 87.2 mono various fusion
Ko_task1_2 KU_ISPL2_2016 Mun2016 82.3 left+right+mono various DNN
Kong_task1_1 QK Kong2016 81.0 mono mel energy DNN
Kumar_task1_1 Gauss Elizalde2016 85.9 mono MFCC distribution SVM
Lee_task1_1 MARGNet_MWFD Han2016 84.6 mono mel energy CNN
Lee_task1_2 MARGNet_ZENS Kim2016 85.4 mono unsupervised CNN ensemble
Liu_task1_1 liu-re Liu2016 83.8 mono MFCC+mel energy fusion
Liu_task1_2 liu-pre Liu2016 83.6 mono MFCC+mel energy fusion
Lostanlen_task1_1 LostanlenAnden_2016 Lostanlen2016 80.8 mixed gammatone scattering SVM
Marchi_task1_1 Marchi_2016 Marchi2016 86.4 mono various fusion
Marques_task1_1 DRKNN_2016 Marques2016 83.1 mono MFCC kNN
Moritz_task1_1 - Moritz2016 79.0 left+right+mono amplitude modulation filter bank TDNN
Mulimani_task1_1 - Mulimani2016 65.6 mono MFCC+matching pursuit GMM
Nogueira_task1_1 - Nogueira2016 81.0 binaural various SVM
Patiyal_task1_1 IITMandi_2016 Patiyal2016 78.5 mono MFCC DNN
Phan_task1_1 CNN-LTE Phan2016 83.3 mono label tree embedding CNN
Pugachev_task1_1 - Pugachev2016 73.1 mono MFCC DNN
Qu_task1_1 - Dai2016 80.5 mono various ensemble
Qu_task1_2 - Dai2016 84.1 mono various ensemble
Qu_task1_3 - Dai2016 82.3 mono various ensemble
Qu_task1_4 - Dai2016 80.5 mono various ensemble
Rakotomamonjy_task1_1 RAK_2016_1 Rakotomamonjy2016 82.1 mono various SVM
Rakotomamonjy_task1_2 RAK_2016_2 Rakotomamonjy2016 79.2 mono various SVM
Santoso_task1_1 SWW Santoso2016 80.8 mono MFCC CNN
Schindler_task1_1 CQTCNN_1 Lidy2016 81.8 mono CQT CNN
Schindler_task1_2 CQTCNN_2 Lidy2016 83.3 mono CQT CNN
Takahashi_task1_1 UTNII_2016 Takahashi2016 85.6 mono MFCC DNN-GMM
Valenti_task1_1 - Valenti2016 86.2 mono mel energy CNN
Vikaskumar_task1_1 ABSP_IITKGP_2016 Vikaskumar2016 81.3 mono MFCC SVM
Vu_task1_1 - Vu2016 80.0 mono MFCC RNN
Xu_task1_1 HL-DNN-ASC_2016 Xu2016 73.3 mono mel energy DNN
Zoehrer_task1_1 - Zoehrer2016 73.1 mono spectrogram GRNN

Technical reports

Acoustic Scene Classification Using Parallel Combination of LSTM and CNN

Abstract

Deep neural networks (DNNs) have recently achieved great success in various learning tasks, and have also been used for classification of environmental sounds. While DNNs are showing their potential in the classification task, they cannot fully utilize temporal information. In this paper, we propose a neural network architecture for the purpose of using sequential information. The proposed structure is composed of two separate lower networks and one upper network, which we refer to as the LSTM layers, CNN layers and connected layers, respectively. The LSTM layers extract sequential information from consecutive audio features. The CNN layers learn the spectro-temporal locality from spectrogram images. Finally, the connected layers summarize the outputs of the two networks to take advantage of the complementary features of the LSTM and CNN by combining them. To compare the proposed method with other neural networks, we conducted a number of experiments on the TUT Acoustic Scenes 2016 dataset, which consists of recordings from various acoustic scenes. Using the proposed combination structure, we achieved higher performance compared to conventional DNN, CNN and LSTM architectures.
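The combination idea in the abstract above can be sketched as follows. This is a toy NumPy illustration, not the authors' implementation: the two branch functions merely stand in for trained LSTM and CNN layers, and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the LSTM branch: summarizes a (time, feature) sequence
# into one feature vector. A real system would use trained LSTM layers.
def lstm_branch(x_seq):
    W = rng.standard_normal((x_seq.shape[1], 32))
    return np.tanh(x_seq @ W).mean(axis=0)          # (32,) summary over time

# Stand-in for the CNN branch: crude pooled features from a (freq, time)
# spectrogram image, in place of trained convolutional layers.
def cnn_branch(spec):
    return np.maximum(spec, 0.0).mean(axis=1)[:32]  # (32,) pooled features

# The "connected layers": concatenate the complementary branch outputs
# and map them to a softmax over the 15 scene classes.
def connected_layers(h_lstm, h_cnn, n_classes=15):
    h = np.concatenate([h_lstm, h_cnn])
    logits = h @ rng.standard_normal((h.size, n_classes))
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = connected_layers(lstm_branch(rng.standard_normal((100, 40))),
                         cnn_branch(rng.standard_normal((64, 100))))
```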

System characteristics
Input: monophonic
Sampling rate: 44.1 kHz
Features: spectrogram
Classifier: CNN-RNN

Technical Report of USTC System for Acoustic Scene Classification

Abstract

This technical report describes our submission for the acoustic scene classification task of DCASE 2016. We first explore the use of Gaussian mixture models (GMMs) and ergodic hidden Markov models (HMMs). Next, we combine neural-network-based discriminative models (DNN, CNN) with generative models to build hybrid systems, including DNN-GMM, CNN-GMM, DNN-HMM and CNN-HMM. Finally, a system combination method is used to obtain the best overall performance from the multiple systems.

System characteristics
Input: monophonic
Sampling rate: 44.1 kHz
Features: MFCC+mel energy
Classifier: fusion

Acoustic Scene Classification Using Convolutional Neural Networks

Abstract

Acoustic scene classification (ASC) aims to distinguish between different acoustic environments and is a technology which can be used by smart devices for contextualization and personalization. Standard algorithms exploit hand-crafted features which are unlikely to offer the best potential for reliable classification. This paper reports the first application of convolutional neural networks (CNNs) to ASC, an approach which learns discriminant features automatically from spectral representations of raw acoustic data. A principal influence on performance comes from the specific convolutional filters which can be adjusted to capture different spectrotemporal, recurrent acoustic structure. The proposed CNN approach is shown to outperform a Gaussian mixture model baseline for the DCASE 2016 database even though training data is sparse.

System characteristics
Input: binaural
Sampling rate: 44.1 kHz
Features: mel energy
Classifier: CNN

Supervised Nonnegative Matrix Factorization for Acoustic Scene Classification

Abstract

This report describes our contribution to the 2016 IEEE AASP DCASE challenge for the acoustic scene classification task. We propose a feature learning approach following the idea of decomposing time-frequency representations with nonnegative matrix factorization. We aim at learning a common dictionary representing the data and use projections on this dictionary as features for classification. Our system is based on a novel supervised extension of nonnegative matrix factorization. In the approach we propose, the dictionary and the classifier are optimized jointly in order to find a suited representation to minimize the classification cost. The proposed method significantly outperforms the baseline and provides improved results compared to unsupervised nonnegative matrix factorization.
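For illustration, the unsupervised NMF decomposition that the report builds on can be sketched with standard multiplicative updates. This is a generic sketch, not the submission's supervised extension (which optimizes the dictionary jointly with the classifier), and the spectrogram is a random stand-in.

```python
import numpy as np

def nmf(V, k, n_iter=100, eps=1e-9, seed=0):
    """Plain NMF with multiplicative updates for V ~ W @ H.
    W holds spectral templates (the dictionary); the activations H
    serve as projection features for classification."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update dictionary
    return W, H

rng = np.random.default_rng(1)
V = np.abs(rng.standard_normal((40, 60)))      # stand-in magnitude spectrogram
W, H = nmf(V, k=8)
features = H.T                                  # one feature vector per frame
error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```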

System characteristics
Input: monophonic
Sampling rate: 44.1 kHz
Features: spectrogram
Classifier: NMF

Acoustic Scene Recognition with Deep Neural Networks (DCASE Challenge 2016)

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: various
Classifier: ensemble

CP-JKU Submissions for DCASE-2016: a Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks

Abstract

This report describes the CP-JKU team's 4 submissions for Task 1 (acoustic scene classification) of the DCASE-2016 challenge. We propose 4 different approaches for Acoustic Scene Classification (ASC). First, we propose a novel i-vector extraction scheme for ASC using both left and right audio channels. Second, we propose a Deep Convolutional Neural Network (DCNN) architecture trained on spectrograms of audio excerpts in an end-to-end fashion. Third, we use a calibration transformation to improve the performance of our binaural i-vector system. Finally, we propose a late fusion of our binaural i-vector system and the DCNN. We report the performance of our proposed methods on the provided cross-validation setup for the DCASE-2016 challenge. Using the late-fusion approach, we improve the performance of the baseline by 17% in accuracy.
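The late-fusion step can be illustrated with a minimal sketch. The posteriors below are made up, and the calibration transformation mentioned in the abstract is reduced here to a simple renormalized weighted average:

```python
import numpy as np

# Hypothetical class posteriors for one test clip from the two subsystems:
# the binaural i-vector back-end and the DCNN trained on spectrograms.
p_ivector = np.array([0.6, 0.3, 0.1])
p_dcnn    = np.array([0.2, 0.7, 0.1])

def late_fusion(p_a, p_b, weight=0.5):
    """One simple form of late fusion: weighted average of the two
    systems' class posteriors, renormalized to sum to one."""
    p = weight * p_a + (1.0 - weight) * p_b
    return p / p.sum()

p_fused = late_fusion(p_ivector, p_dcnn)
scene = int(np.argmax(p_fused))    # index of the predicted scene class
```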

System characteristics
Input: binaural; monophonic; mono+binaural
Sampling rate: 44.1 kHz
Features: MFCC; spectrogram; MFCC+spectrograms
Classifier: I-vector; CNN; fusion

Experiments on The DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording

Abstract

In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance for both tasks improved on the DCASE baseline: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.48 compared to the baseline of 0.91.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: MFCC distribution
Classifier: SVM

Mel-Band Features for DCASE 2016 Acoustic Scene Classification Task

Abstract

In this work we propose to separately calculate spectral low-level features in each frequency band, as is commonly done in the problem of beat tracking and tempo estimation [1]. We base this approach on the same auditory models that inspired the use of Mel-Frequency Cepstral Coefficients (MFCCs) [2] or energy through a filter bank [3] for audio genre classification. These rely on a model of the cochlea in which similar regions of the inner ear are stimulated by similar frequencies and are processed independently. Both the MFCC and the energy-through-filter-bank approaches only generate an energy spectrum. In our approach, we expand this idea to incorporate other perceptually inspired features.

System characteristics
Input: monophonic
Sampling rate: 44.1 kHz
Features: various
Classifier: SVM

Convolutional Neural Network with Multiple-Width Frequency-Delta Data Augmentation for Acoustic Scene Classification

Abstract

In this paper, we apply a convolutional neural network to the acoustic scene classification task of DCASE 2016. We propose multiple-width frequency-delta data augmentation, which uses the static mel-spectrogram as well as frequency-delta features as individual examples with the same labels for the network input; the experimental results show that this method significantly improves performance compared to using the static mel-spectrogram input only. In addition, we propose folded mean aggregation, which multiplies output probabilities of static and delta augmentation data from the same window prior to audio clip-wise aggregation, and we found that this method reduces the error rate further. The system exhibited a classification accuracy of 0.831 when classifying 15 acoustic scenes.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: mel energy
Classifier: CNN

DCASE2016 Baseline System

System characteristics
Input: monophonic
Sampling rate: 44.1 kHz
Features: MFCC
Classifier: GMM

Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling

Abstract

We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5%, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves the baselines by a relative amount of 17% and 19%, respectively. The network consists only of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. In particular, it contains neither fully-connected layers (besides the fully-connected output layer) nor dropout layers.
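Masked global average pooling, which lets variable-length recordings share one padded batch, can be sketched in a few lines of NumPy. This is an illustrative stand-alone function, not the authors' network code:

```python
import numpy as np

def masked_global_avg_pool(features, lengths):
    """Average feature maps over time while ignoring padded frames.
    features: (batch, time, channels); lengths: valid frames per example."""
    batch, max_t, _ = features.shape
    # mask[b, t] is 1.0 for valid frames of example b, 0.0 for padding.
    mask = (np.arange(max_t)[None, :] < np.asarray(lengths)[:, None]).astype(float)
    summed = (features * mask[:, :, None]).sum(axis=1)
    return summed / np.asarray(lengths, dtype=float)[:, None]

x = np.zeros((2, 4, 3))
x[0, :2] = 1.0          # example 0: 2 valid frames of ones, rest padding
x[1, :3] = 2.0          # example 1: 3 valid frames of twos
pooled = masked_global_avg_pool(x, lengths=[2, 3])
```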

System characteristics
Input: left
Sampling rate: 44.1 kHz
Features: spectrogram
Classifier: CNN

Empirical Study on Ensemble Method of Deep Neural Networks for Acoustic Scene Classification

Abstract

The deep neural network has shown superior classification or regression performance in a wide range of applications. In particular, ensembles of deep machines have been reported to effectively decrease test errors in many studies. In this work, we extend the scale of deep machines to include hundreds of networks, and apply the ensemble to acoustic scene classification. In doing so, several recent learning techniques are employed to accelerate the training process, and a novel stochastic feature diversification method is proposed to allow different contributions from each constituent network. Experimental results with the DCASE2016 dataset indicate that an ensemble of deep machines leads to better performance on acoustic scene classification.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: unsupervised
Classifier: CNN ensemble

Deep Neural Network Baseline for DCASE Challenge 2016

Abstract

The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 mel-filter-bank features are used. Two kinds of mel banks, same-area banks and same-height banks, are discussed. Experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel-Frequency Cepstral Coefficients (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: mel energy
Classifier: DNN

CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging

Abstract

For the DCASE 2016 audio benchmarking contest, we submitted a parallel Convolutional Neural Network architecture for the tasks of 1) classifying acoustic scenes and urban soundscapes (Task 1) and 2) domestic audio tagging (Task 4). A popular choice of input to a Convolutional Neural Network in audio classification problems is the Mel-transformed spectrogram. We, however, found that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency, and submitted it to DCASE 2016 Tasks 1 and 4, with some slight alterations described in this paper. Our approach shows a 10.7% relative improvement over the baseline system on the development set of the acoustic scene classification task (Task 1) [1].

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: CQT
Classifier: CNN

Acoustic Scene Classification by Feed Forward Neural Network with Class Dependent Attention Mechanism

Abstract

For the acoustic scene classification task, we propose a novel attention mechanism embedded in feed-forward networks. On top of a shared input layer, 15 separate attention modules are calculated, one per class, outputting 15 class-dependent feature vectors. The feature vectors are then mapped to class labels by 15 subnetworks, and a softmax layer is employed on the very top of the network. In our experiments, the default features, MFCCs and mel filterbanks with delta and acceleration, are used to represent each segment. We split each 30 s audio recording into 1 s segments, calculate a label for each segment, and then output the most frequent label for the 30 s recording. The best single neural network achieved 77.4% cross-validation accuracy without further feature engineering or any data augmentation. We train 5 models with MFCC features and 5 models with mel filterbank features, then make an ensemble with majority vote, reaching a 78.6% final cross-validation result. For submission, the 10 models are retrained on the full dataset, and the final submission is a majority-vote ensemble of the 10 models' outputs.
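The segment-to-recording aggregation described in the abstract amounts to a majority vote, which can be sketched as follows (the per-segment predictions here are hypothetical):

```python
from collections import Counter

def clip_label(segment_labels):
    """Majority vote: pick the most frequent per-segment prediction
    as the label for the whole 30 s recording."""
    return Counter(segment_labels).most_common(1)[0][0]

# Hypothetical per-second predictions for one recording.
label = clip_label(["park", "park", "home", "park", "office"])
```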

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: MFCC+mel energy
Classifier: fusion

Binaural Scene Classification with Wavelet Scattering

Abstract

This technical report describes our contribution to the scene classification task of the 2016 edition of the IEEE AASP Challenge for Detection and Classification of Acoustic Scenes and Events (DCASE). Our computational pipeline consists of a gammatone scattering transform, logarithmically compressed and coupled with a per-frame linear support vector machine. At test time, frame-level labels are aggregated over the whole recording by majority vote. During the training phase, we propose a novel data augmentation technique, where left and right channels are mixed at different proportions to introduce invariance to sound direction in the training data.
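The proposed channel-mixing augmentation can be sketched as follows. The uniform sampling of the mixing proportion is an assumption for illustration; the report does not state the exact sampling scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_channels(left, right, rng=rng):
    """Augmentation idea from the report: mix the left and right channels
    at a random proportion so the classifier becomes invariant to sound
    direction. The uniform draw for `a` is an assumed choice."""
    a = rng.uniform(0.0, 1.0)
    return a * left + (1.0 - a) * right

left = np.ones(8)       # stand-in left-channel signal
right = np.zeros(8)     # stand-in right-channel signal
augmented = mix_channels(left, right)
```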

System characteristics
Input: mixed
Sampling rate: 44.1 kHz
Features: gammatone scattering
Classifier: SVM

The Up System for The 2016 DCASE Challenge Using Deep Recurrent Neural Network and Multiscale Kernel Subspace Learning

Abstract

We propose a system for acoustic scene classification using pairwise decomposition with deep neural networks and dimensionality reduction by multiscale kernel subspace learning. It is our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016). The system classifies 15 different acoustic scenes. First, auditory spectral features are extracted and fed into 15 binary deep multilayer perceptron (MLP) neural networks. The MLPs are trained with the one-against-all paradigm to perform a pairwise decomposition. In a second stage, a large number of spectral, cepstral, energy and voicing-related audio features are extracted. Multiscale Gaussian kernels are then used to construct an optimal linear combination of Gram matrices for multiple kernel subspace learning. The reduced feature set is fed into a nearest-neighbour classifier. Predictions from the two systems are then combined by a threshold-based decision function. On the official development set of the challenge, an accuracy of 81.5% is achieved. In this technical report, we provide a description of the actual system submitted to the challenge.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: various
Classifier: fusion

TUT Acoustic Scene Classification Submission

Abstract

This technical report presents the details of our submission to the DCASE classification challenge, Task 1: Acoustic Scene Classification. The method consists of a feature extraction phase followed by two dimensionality reduction steps (PCA and LDA); classification is done using the k-nearest-neighbours algorithm.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: MFCC
Classifier: kNN

Acoustic Scene Classification Using Time-Delay Neural Networks and Amplitude Modulation Filter Bank Features

Abstract

This paper presents a system for acoustic scene classification (SC) that is applied to the data of the SC task of the DCASE'16 challenge (Task 1). The proposed method is based on extracting acoustic features that employ a relatively long temporal context, i.e., amplitude modulation filter bank (AMFB) features, prior to detection of acoustic scenes using a neural network (NN) based classification approach. Recurrent neural networks (RNNs) are well suited to model long-term acoustic dependencies that are known to encode important information for SC tasks. However, RNNs require a relatively large amount of training data in comparison to feed-forward deep neural networks (DNNs). Hence, the time-delay neural network (TDNN) approach is used in the present work, which enables analysis of long contextual information similar to RNNs but with training effort comparable to conventional DNNs. The proposed SC system attains a recognition accuracy of 76.5%, which is 4.0% higher than that of the DCASE'16 baseline system.
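The TDNN's use of long temporal context can be illustrated by the usual frame-splicing operation, in which each input frame is concatenated with its neighbours. This is a generic sketch, not the authors' exact AMFB front-end:

```python
import numpy as np

def splice_context(frames, left=2, right=2):
    """Time-delay-style input: concatenate each frame with its neighbours
    so the network sees a longer temporal context (edges are clamped)."""
    T, F = frames.shape
    idx = np.clip(np.arange(T)[:, None] + np.arange(-left, right + 1)[None, :],
                  0, T - 1)
    return frames[idx].reshape(T, (left + right + 1) * F)

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 features each
spliced = splice_context(x)                     # 6 frames, 5 * 2 = 10 features
```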

System characteristics
Input left+right+mono
Sampling rate 16kHz
Features amplitude modulation filter bank
Classifier TDNN
PDF

Acoustic Scene Classification Using MFCC and MP Features

Abstract

This paper describes our experiments on efficient acoustic scene classification as part of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE-2016) IEEE Audio and Acoustic Signal Processing (AASP) challenge. Identifying features that map given audio clips to the appropriate acoustic scene class is challenging because of the heterogeneous nature of the scenes. To identify such features, we implemented a few methods using the Matching Pursuit (MP) algorithm to extract Time-Frequency (TF) based features. The MP algorithm iteratively selects, from a set of parameterized waveforms in a dictionary, the atoms that best correlate with the original signal structure. From this selected set of atoms, the mean and standard deviation of the amplitude and frequency parameters of the first few (n) atoms are calculated separately, resulting in four MP feature sets. Combining twenty MFCCs with the four MP features enhanced the recognition accuracy of acoustic scenes using a GMM classifier.
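The MP feature extraction described above can be sketched as follows. This is a minimal illustration, not the submission's implementation: the Gabor dictionary parameters (atom length, scales, centre frequencies) and the number of atoms n are assumptions.

```python
import numpy as np

def gabor_dictionary(n=256, scales=(8, 16, 32, 64)):
    """Build a small dictionary of unit-norm Gabor-like atoms.
    Scales and centre frequencies here are illustrative choices."""
    atoms, freqs = [], []
    t = np.arange(n)
    for s in scales:
        for f in np.linspace(0.01, 0.45, 8):  # normalised frequencies
            g = np.exp(-0.5 * ((t - n / 2) / s) ** 2) * np.cos(2 * np.pi * f * t)
            g /= np.linalg.norm(g)
            atoms.append(g)
            freqs.append(f)
    return np.array(atoms), np.array(freqs)

def mp_features(x, n_atoms=5):
    """Greedy matching pursuit: repeatedly pick the atom best correlated
    with the residual and subtract its projection, then summarise the
    selected atoms by mean/std of amplitude and frequency (4 values)."""
    D, freqs = gabor_dictionary(len(x))
    residual = x.astype(float).copy()
    amps, sel_freqs = [], []
    for _ in range(n_atoms):
        corr = D @ residual                 # correlation with each atom
        k = int(np.argmax(np.abs(corr)))    # best-matching atom
        amps.append(abs(corr[k]))
        sel_freqs.append(freqs[k])
        residual -= corr[k] * D[k]          # remove its contribution
    amps, sel_freqs = np.array(amps), np.array(sel_freqs)
    return np.array([amps.mean(), amps.std(),
                     sel_freqs.mean(), sel_freqs.std()])

# Example: a noisy sinusoid yields a 4-dimensional MP feature vector.
rng = np.random.default_rng(0)
x = np.cos(2 * np.pi * 0.1 * np.arange(256)) + 0.1 * rng.standard_normal(256)
print(mp_features(x).shape)  # (4,)
```

In the submitted system these four statistics would be appended to the twenty MFCCs before GMM modelling.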

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC+matching pursuit
Classifier GMM
PDF

Deep Neural Network Bottleneck Feature for Acoustic Scene Classification

Abstract

Bottleneck features have been shown to be effective in improving the accuracy of speaker recognition, language identification, and automatic speech recognition. However, few works have focused on bottleneck features for acoustic scene classification. This report proposes a novel acoustic scene feature extraction using bottleneck features derived from a Deep Neural Network (DNN). On the official development set with our settings, a feature set that includes bottleneck features and the Perceptual Linear Prediction (PLP) feature achieves the best accuracy rate.

System characteristics
Input left+right+mono
Sampling rate 16kHz
Features various
Classifier DNN
PDF

Sound Scene Identification Based on Monaural and Binaural Features

Abstract

This submission to the acoustic scene classification sub-task of the IEEE DCASE 2016 Challenge is based on a feature extraction module that concatenates monaural and binaural features. Monaural features are based on Mel-frequency cepstra summarized using recurrence quantification analysis. Binaural features are based on the extraction of inter-aural differences (level and time) and the coherence between the two channels of the stereo recordings. These features are used in conjunction with a support vector machine for the classification of the acoustic sound scenes. In this short paper, the impact of the different features is analyzed.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features various
Classifier SVM
PDF

Score Fusion of Classification Systems for Acoustic Scene Classification

Abstract

This is a technical report on our study for acoustic scene classification, a task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. To this end, we investigated several methods in three aspects: feature extraction, generative/discriminative machine learning, and score fusion for a final decision. To find an appropriate frame-based feature, a new feature was devised after investigating several candidates. Models based on both generative and discriminative learning were then applied to classify the feature. From these studies, several systems composed of a feature and a classifier were considered, and the final result was determined by fusing the individual results. In Section 3, experiment results are summarized, and concluding remarks of this report are presented in Section 4.
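The final score-fusion step can be sketched as a weighted sum of normalised per-class scores from the individual systems. This is a generic late-fusion illustration, not the report's actual fusion rule; the equal weights and min-max normalisation are assumptions.

```python
import numpy as np

def fuse_scores(score_list, weights=None):
    """Late fusion: min-max-normalise each system's per-class score
    vector, take a weighted sum, and return the argmax class index."""
    fused = np.zeros(len(score_list[0]), dtype=float)
    weights = weights or [1.0 / len(score_list)] * len(score_list)
    for w, s in zip(weights, score_list):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        fused += w * ((s - s.min()) / rng if rng > 0 else s)
    return int(fused.argmax())

# Hypothetical per-class scores from two systems over three scenes.
gmm_scores = [0.2, 0.5, 0.3]
dnn_scores = [0.1, 0.3, 0.6]
print(fuse_scores([gmm_scores, dnn_scores]))  # 1
```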

System characteristics
Input mono
Sampling rate 44.1kHz
Features various
Classifier fusion
PDF

Acoustic Scene Classification Using Deep Learning

Abstract

Acoustic Scene Classification (ASC) is the task of classifying audio samples on the basis of their soundscapes. This is one of the tasks taken up by the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE-2016) challenge. A labeled dataset of audio samples from various scenes is provided and solutions are invited. In this paper, the use of Deep Neural Networks (DNN) is proposed for the task of ASC, and different methods for extracting features with different classification algorithms are explored. It is observed that the DNN works significantly better than other methods trained over the same set of features, performing on par with the state-of-the-art techniques presented in DCASE-2013. It is concluded that the use of MFCC features with a DNN works best, giving a 97.6% cross-validation score on the development dataset-2016 data for a particular set of DNN parameters. Moreover, training a DNN does not take longer run times than the other methods.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier DNN
PDF

CNN-LTE: a Class of 1-X Pooling Convolutional Neural Networks on Label Tree Embeddings for Audio Scene Recognition

Abstract

We describe in this report our audio scene recognition system submitted to the DCASE 2016 challenge [1]. Firstly, given the label set of the scenes, a label tree is automatically constructed. This category taxonomy is then used in the feature extraction step in which an audio scene instance is represented by a label tree embedding image. Different convolutional neural networks, which are tailored for the task at hand, are finally learned on top of the image features for scene recognition. Our system reaches an overall recognition accuracy of 81.2% and outperforms the DCASE 2016 baseline with an absolute improvement of 8.7% on the development data.

System characteristics
Input mono
Sampling rate 44.1kHz
Features label tree embedding
Classifier CNN
PDF

Deep Neural Network for Acoustic Scene Detection

Abstract

The DCASE 2016 challenge comprised the task of Acoustic Scene Classification. The goal of this task was to classify test recordings into one of the predefined classes characterizing the environment in which they were recorded.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier DNN
PDF

Enriched Supervised Feature Learning for Acoustic Scene Classification

Abstract

This paper presents the methodology we have followed for our submission at the DCASE 2016 competition on acoustic scene classification (Task 1). The approach is based on a supervised feature learning technique built upon matrix factorization of time-frequency representations of an audio scene. As an original contribution, we have introduced a non-negative supervised matrix factorization that helps in learning discriminative codes. Our experiments have shown that these supervised features perform slightly better than convolutional neural networks for this challenge. In addition, when they are coupled with hand-crafted features such as histograms of gradients, their performance is further boosted.

System characteristics
Input mono
Sampling rate 44.1kHz
Features various
Classifier SVM
PDF

Acoustic Scene Classification Using Network-In-Network Based Convolutional Neural Network

Abstract

In this paper, we present our entry to the challenge on detection and classification of acoustic scenes and events (DCASE); the submission is for the task of automatic audio scene classification. Our approach is based on a deep learning method adopted from the computer vision research field: a convolutional neural network is used to solve the problem of audio-based scene classification, with the network-in-network architecture utilized to build the classifier. For feature extraction, Mel-frequency cepstral coefficients (MFCC) are used as the input vector for the classifier. Differing from the original network-in-network architecture, in this work we perform 1-D convolution operations instead of 2-D convolutions. The classifier is trained on every frame of the MFCC feature set, and the frame-level results are then thresholded and voted on to choose the final scene label of the audio data. The proposed work shows better performance than the provided baseline system of the DCASE challenge.
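The threshold-and-vote aggregation over frame-level classifier outputs can be sketched as below. The threshold value and the fallback behaviour when no frame passes it are assumptions, as the report does not specify them.

```python
import numpy as np

def vote_scene_label(frame_probs, threshold=0.5):
    """Aggregate per-frame class posteriors into one clip-level label.

    frame_probs: (n_frames, n_classes) softmax outputs, one row per
    frame. Frames whose top posterior falls below `threshold` are
    discarded as unreliable; the surviving frame-level decisions are
    majority-voted (ties broken by lowest class index)."""
    top = frame_probs.max(axis=1)
    picks = frame_probs.argmax(axis=1)
    kept = picks[top >= threshold]
    if kept.size == 0:  # fall back to all frames if none pass
        kept = picks
    return int(np.bincount(kept, minlength=frame_probs.shape[1]).argmax())

# Example: 4 frames over 3 scene classes; the last frame is dropped.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.4, 0.3, 0.3]])
print(vote_scene_label(probs))  # 0
```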

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier CNN
PDF

Acoustic Scene Classification: an Evaluation of an Extremely Compact Feature Representation

Abstract

This paper investigates several approaches to address the acoustic scene classification (ASC) task. We start from a low-level feature representation for segmented audio frames and investigate different time granularities for feature aggregation. We study the use of the support vector machine (SVM), as a well-known classifier, together with two popular neural network (NN) architectures, namely the multilayer perceptron (MLP) and the convolutional neural network (CNN), for higher-level feature learning and classification. We evaluate the performance of these approaches on benchmark datasets provided by the 2013 and 2016 Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. We observe that a simple approach exploiting the averaged Mel-log-spectrogram, as an extremely compact feature, together with an SVM can obtain even better results than the NN-based approaches, and performance comparable to the best systems in the DCASE 2013 challenge.
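The compact feature itself — the Mel-log-spectrogram averaged over time, yielding one vector per clip — can be sketched with a standard triangular mel filterbank. The FFT size, hop, and number of mel bands below are assumptions, not the paper's settings, and the filterbank construction is the textbook one rather than the authors' exact front end.

```python
import numpy as np

def mel_filterbank(n_mels=40, n_fft=1024, sr=44100):
    """Standard triangular mel filterbank over rfft bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)   # rising slope
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)   # falling slope
    return fb

def compact_feature(x, n_fft=1024, hop=512, sr=44100, n_mels=40):
    """Average the log-mel spectrogram over time: one n_mels-dim
    vector per clip, ready to feed to an SVM."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    mel = spec @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10).mean(axis=0)

x = np.random.default_rng(1).standard_normal(44100)  # 1 s of noise
print(compact_feature(x).shape)  # (40,)
```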

System characteristics
Input mono
Sampling rate 44.1kHz
Features mel energy
Classifier SVM; DNN
PDF

Acoustic Scene Classification Using Deep Neural Network and Frame-Concatenated Acoustic Feature

Abstract

This paper describes our contribution to the task of acoustic scene classification in the DCASE2016 (Detection and Classification of Acoustic Scenes and Events 2016) Challenge set by IEEE AASP. In this work, we applied the DNN-GMM (Deep Neural Network-Gaussian Mixture Model) to acoustic scene classification. We introduced high-dimensional features that are concatenated with acoustic features in temporally adjacent frames. As a result, it was confirmed that the classification accuracy of the DNN-GMM was improved by 5.0% in comparison with that of the GMM, which was used as the baseline classifier.
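The frame concatenation step — splicing each acoustic feature frame with its temporally adjacent neighbours to form a high-dimensional input — can be sketched as below. The context width and the edge-padding strategy are assumptions; the report does not state them here.

```python
import numpy as np

def splice_frames(feats, context=2):
    """Concatenate each frame with `context` adjacent frames on either
    side, repeating the border frame at the edges, so a (T, d) feature
    matrix becomes (T, (2*context + 1) * d)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    T = feats.shape[0]
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Example: 100 frames of 13 MFCCs spliced with +/-2 frames of context.
mfcc = np.random.default_rng(2).standard_normal((100, 13))
print(splice_frames(mfcc, context=2).shape)  # (100, 65)
```

The resulting high-dimensional vectors would then be the DNN input in a DNN-GMM system of this kind.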

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier DNN-GMM
PDF

DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks

Abstract

This workshop paper presents our contribution for the task of acoustic scene classification proposed for the Detection and Classification of Acoustic Scenes and Events (D-CASE) 2016 challenge. We propose the use of a convolutional neural network trained to classify short sequences of audio, represented by their log-mel spectrogram. In addition, we use a training method that can be applied when the validation performance of the system saturates as training proceeds. The performance is evaluated on the public acoustic scene classification development dataset provided for the D-CASE challenge. The best accuracy score obtained by our configuration on a four-fold cross-validation setup is 79.0%, which constitutes an 8.8% relative improvement with respect to the baseline system, based on a Gaussian mixture model classifier.

System characteristics
Input mono
Sampling rate 44.1kHz
Features mel energy
Classifier CNN
PDF

Acoustic Scene Classification Based on Spectral Analysis and Feature-Level Channel Combination

Abstract

This paper is a submission to the sub-task Acoustic Scene Classification of the IEEE Audio and Acoustic Signal Processing challenge: Detection and Classification of Acoustic Scenes and Events 2016. The aim of the sub-task is to correctly detect 15 different acoustic scenes, which consist of indoor, outdoor, and vehicle categories. This work is based on spectral analysis, feature-level channel combination, and support vector machine classifier. In this short paper, the impact of different parameters while extracting features is analyzed. The accuracy gain obtained by feature-level channel combination is then reported.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features various
Classifier SVM
PDF

Acoustic Scene Classification Using Block Based MFCC Features

Abstract

Acoustic Scene Classification (ASC) is receiving widespread attention due to its wide variety of applications in smart wearable devices, surveillance, life-log diarization, etc. This work describes our contribution to the acoustic scene classification task of the DCASE2016 Challenge on Detection and Classification of Acoustic Scenes and Events. In this work, we apply block-based MFCCs along with a few traditional short-term audio features, with mean and standard deviation as statistics, and a Support Vector Machine (SVM) as the classifier for ASC. It is observed that the block-based MFCC feature performs better than classical MFCC. For evaluation purposes, we used three different datasets.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier SVM
PDF

Acoustic Scene and Event Recognition Using Recurrent Neural Networks

Abstract

The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning various problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using Recurrent Neural Networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier RNN
PDF

Hierarchical Learning for DNN-Based Acoustic Scene Classification

Abstract

In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints on the cross-entropy based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement on average scene classification error compared with the Gaussian Mixture Model (GMM)-based benchmark system across the four standard folds.
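A multi-level loss of the kind described above can be sketched as a fine-label cross-entropy plus a coarse-level term obtained by summing the fine posteriors within each coarse category. The class grouping, the weighting factor, and the two-level depth are illustrative assumptions, not the paper's exact taxonomy or loss.

```python
import numpy as np

# Illustrative grouping of fine scene classes into coarse categories
# (indoor / outdoor / vehicle); the paper's taxonomy may differ.
COARSE = {"home": 0, "office": 0,
          "park": 1, "beach": 1,
          "bus": 2, "train": 2}
FINE = list(COARSE)

def multilevel_loss(probs, fine_label, alpha=0.5):
    """Cross-entropy on the fine label plus an `alpha`-weighted
    cross-entropy on the corresponding coarse category, where coarse
    posteriors are sums of fine posteriors within each group."""
    groups = np.array([COARSE[c] for c in FINE])
    fine_ce = -np.log(probs[fine_label] + 1e-12)
    coarse_probs = np.array([probs[groups == g].sum() for g in range(3)])
    coarse_ce = -np.log(coarse_probs[groups[fine_label]] + 1e-12)
    return fine_ce + alpha * coarse_ce

# A prediction that confuses "home" with "office" is penalised less at
# the coarse level than one that confuses "home" with "bus".
probs = np.array([0.30, 0.40, 0.05, 0.05, 0.15, 0.05])  # true class: home
print(multilevel_loss(probs, fine_label=0))
```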

System characteristics
Input mono
Sampling rate 44.1kHz
Features mel energy
Classifier DNN
PDF

Discriminative Training of GMM Parameters for Audio Scene Classification

Abstract

This report describes our algorithms for audio scene classification and audio tagging and the results on the DCASE 2016 challenge data. We propose a discriminative training algorithm to improve on the baseline GMM performance; it updates the baseline GMM parameters by maximizing the margin between classes. For Task 1, we use a hierarchical classifier to maximize discriminative performance and achieve 84% accuracy on the given cross-validation data. For Task 4, we apply a binary classifier for each label and achieve a 16.71% EER on the given cross-validation data.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC
Classifier GMM
PDF

Gated Recurrent Networks Applied To Acoustic Scene Classification and Acoustic Event Detection

Abstract

We present two resource-efficient frameworks for acoustic scene classification and acoustic event detection. In particular, we combine gated recurrent neural networks (GRNNs) and linear discriminant analysis (LDA) for efficiently classifying environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on DCASE 2016 Task 1 development data, a relative improvement of 8.34% over the baseline GMM system. By applying GRNNs to the DCASE2016 real event detection data using an MSE objective, we obtain a segment-based error rate (ER) of 0.73, a relative improvement of 19.8% over the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis, in particular evaluating the effects of a hybrid, i.e., generative-discriminative, objective function.

System characteristics
Input mono
Sampling rate 44.1kHz
Features spectrogram
Classifier GRNN
PDF