Acoustic scene classification


Challenge results

Task description

The goal of the acoustic scene classification task was to classify each test recording into one of 15 predefined classes that characterize the environment in which it was recorded, for example park, home, or office. Participants used 4680 10-second audio excerpts (13 h of audio) to train their systems, and 1620 10-second audio excerpts (4 h 30 min of audio) were used for the challenge evaluation.

A more detailed task description can be found on the task description page.

Challenge results

Systems ranking

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset)
Abrol_IITM_task1_1 Baseline Abrol2017 65.7 (63.4 - 68.0) 88.1
Amiriparian_AU_task1_1 S2S-AE Amiriparian2017 67.5 (65.3 - 69.8) 88.0
Amiriparian_AU_task1_2 Shahin_APTI Amiriparian2017a 59.1 (56.7 - 61.5) 90.1
Biho_Sogang_task1_1 Biho1 Kim2017 56.5 (54.1 - 59.0) 75.9
Biho_Sogang_task1_2 Biho2 Kim2017 60.5 (58.1 - 62.9) 75.9
Bisot_TPT_task1_1 TPT1 Bisot2017 69.8 (67.6 - 72.1) 90.1
Bisot_TPT_task1_2 TPT2 Bisot2017 69.6 (67.3 - 71.8) 89.1
Chandrasekhar_IIITH_task1_1 - Chandrasekhar2017 45.9 (43.4 - 48.3) 77.6
Chou_SINICA_task1_1 TP_CNN_cv1 Chou2017 57.1 (54.7 - 59.5)
Chou_SINICA_task1_2 SINICA Chou2017 61.5 (59.2 - 63.9)
Chou_SINICA_task1_3 SINICA Chou2017 59.8 (57.4 - 62.1)
Chou_SINICA_task1_4 SINICA Chou2017 57.1 (54.7 - 59.5)
Dang_NCU_task1_1 andang1 Dang2017 62.7 (60.4 - 65.1) 82.0
Dang_NCU_task1_2 andang1 Dang2017 62.7 (60.4 - 65.1) 79.1
Dang_NCU_task1_3 andang1 Dang2017 63.7 (61.4 - 66.0) 81.6
Duppada_Seernet_task1_1 Seernet Duppada2017 57.0 (54.6 - 59.4) 79.9
Duppada_Seernet_task1_2 Seernet Duppada2017 59.9 (57.5 - 62.3) 81.9
Duppada_Seernet_task1_3 Seernet Duppada2017 64.1 (61.7 - 66.4) 81.6
Duppada_Seernet_task1_4 Seernet Duppada2017 63.0 (60.7 - 65.4) 84.8
Foleiss_UTFPR_task1_1 MLPFeats Foleiss2017 64.5 (62.2 - 66.8) 78.0
Foleiss_UTFPR_task1_2 MLPFeatRF Foleiss2017 66.9 (64.6 - 69.2) 80.0
Fonseca_MTG_task1_1 MTG Fonseca2017 67.3 (65.1 - 69.6) 83.0
Fraile_UPM_task1_1 GAMMA-UPM Fraile2017 58.3 (55.9 - 60.7) 79.8
Gong_MTG_task1_1 MTG_GBMVGG Gong2017 61.2 (58.8 - 63.5) 86.8
Gong_MTG_task1_2 MTG_GBM Gong2017 61.5 (59.1 - 63.9) 86.1
Gong_MTG_task1_3 MTG_VGG Gong2017 61.9 (59.5 - 64.2) 84.0
Han_COCAI_task1_1 4fEnsemSel Han2017 79.9 (78.0 - 81.9) 91.9
Han_COCAI_task1_2 4fMeanAll Han2017 79.6 (77.7 - 81.6) 91.7
Han_COCAI_task1_3 FlEnsemSel Han2017 80.4 (78.4 - 82.3) 91.9
Han_COCAI_task1_4 flMeanAll Han2017 80.3 (78.4 - 82.2) 91.7
Hasan_BUET_task1_1 BUETBOSCH1 Hyder2017 74.1 (72.0 - 76.3) 88.1
Hasan_BUET_task1_2 BUETBOSCH2 Hyder2017 72.2 (70.0 - 74.3) 83.3
Hasan_BUET_task1_3 BUETBOSCH3 Hyder2017 68.6 (66.3 - 70.8) 89.8
Hasan_BUET_task1_4 BUETBOSCH4 Hyder2017 72.0 (69.8 - 74.2) 89.6
DCASE2017 baseline Baseline Heittola2017 61.0 (58.7 - 63.4) 74.8
Huang_THU_task1_1 wjhta Huang2017 65.5 (63.2 - 67.8) 83.4
Huang_THU_task1_2 wjhta Huang2017 65.4 (63.1 - 67.7) 84.4
Hussain_NUCES_task1_1 - Hussain2017 56.7 (54.3 - 59.1) 90.7
Hussain_NUCES_task1_2 - Hussain2017 59.5 (57.1 - 61.9) 90.4
Hussain_NUCES_task1_3 - Hussain2017 59.9 (57.5 - 62.3) 90.0
Hussain_NUCES_task1_4 - Hussain2017 55.4 (52.9 - 57.8) 88.9
Jallet_TUT_task1_1 CRNN-1 Jallet2017 60.7 (58.4 - 63.1) 78.9
Jallet_TUT_task1_2 CRNN-2 Jallet2017 61.2 (58.8 - 63.5) 80.8
Jimenez_CMU_task1_1 LapKernel Jimenez2017 59.9 (57.6 - 62.3) 78.7
Kukanov_UEF_task1_1 K-CRNN Kukanov2017 71.7 (69.5 - 73.9) 85.8
Kun_TUM_UAU_UP_task1_1 Wav_SVMs Kun2017 64.2 (61.9 - 66.5) 83.2
Kun_TUM_UAU_UP_task1_2 Wav_GRUs Kun2017 64.0 (61.7 - 66.3) 82.6
Lehner_JKU_task1_1 JKU_IVEC Lehner2017 68.7 (66.4 - 71.0) 84.5
Lehner_JKU_task1_2 JKU_ALL_av Lehner2017 66.8 (64.5 - 69.1) 87.7
Lehner_JKU_task1_3 JKU_CNN Lehner2017 64.8 (62.5 - 67.1) 89.0
Lehner_JKU_task1_4 JKU_All_ca Lehner2017 73.8 (71.7 - 76.0) 91.3
Li_SCUT_task1_1 LiSCUTt1_1 Li2017 53.7 (51.3 - 56.1) 91.0
Li_SCUT_task1_2 LiSCUTt1_2 Li2017 63.6 (61.3 - 66.0) 83.9
Li_SCUT_task1_3 LiSCUTt1_3 Li2017 61.7 (59.4 - 64.1) 83.1
Li_SCUT_task1_4 LiSCUTt1_4 Li2017 57.8 (55.4 - 60.2) 87.5
Maka_ZUT_task1_1 ASAWI Maka2017 47.5 (45.1 - 50.0) 70.6
Mun_KU_task1_1 GAN_SKMUN Mun2017 83.3 (81.5 - 85.1) 87.1
Park_ISPL_task1_1 ISPL Park2017 72.6 (70.4 - 74.8) 83.6
Phan_UniLuebeck_task1_1 CNN Phan2017 59.0 (56.6 - 61.4) 83.8
Phan_UniLuebeck_task1_2 ACNN Phan2017 55.9 (53.5 - 58.3) 82.3
Phan_UniLuebeck_task1_3 CNN+ Phan2017 58.3 (55.9 - 60.7) 83.8
Phan_UniLuebeck_task1_4 ACNN+ Phan2017 58.0 (55.6 - 60.4) 82.3
Piczak_WUT_task1_1 amb200 Piczak2017 70.6 (68.4 - 72.8) 82.3
Piczak_WUT_task1_2 dishes Piczak2017 69.6 (67.3 - 71.8) 82.7
Piczak_WUT_task1_3 amb100 Piczak2017 67.7 (65.4 - 69.9) 80.2
Piczak_WUT_task1_4 amb60 Piczak2017 62.0 (59.6 - 64.3) 79.0
Rakotomamonjy_UROUEN_task1_1 HBGS CNN Rakotomamonjy2017 61.5 (59.2 - 63.9) 85.9
Rakotomamonjy_UROUEN_task1_2 HBGS CNN-4 Rakotomamonjy2017 62.7 (60.3 - 65.0) 85.3
Rakotomamonjy_UROUEN_task1_3 HBGS CNN-19 Rakotomamonjy2017 62.8 (60.4 - 65.1) 84.6
Schindler_AIT_task1_1 multires Schindler2017 61.7 (59.4 - 64.1) 87.3
Schindler_AIT_task1_2 multires-p Schindler2017 61.7 (59.4 - 64.1) 90.5
Vafeiadis_CERTH_task1_1 CERTH_1 Vafeiadis2017 61.0 (58.6 - 63.4) 80.4
Vafeiadis_CERTH_task1_2 CERTH_2 Vafeiadis2017 49.5 (47.1 - 51.9) 95.9
Vij_UIET_task1_1 Vij_UIET_1 Vij2017 61.2 (58.9 - 63.6) 77.3
Vij_UIET_task1_2 Vij_UIET_2 Vij2017 57.5 (55.1 - 59.9) 79.0
Vij_UIET_task1_3 Vij_UIET_3 Vij2017 59.6 (57.2 - 62.0) 78.0
Vij_UIET_task1_4 Vij_UIET_4 Vij2017 65.0 (62.7 - 67.3) 82.7
Waldekar_IITKGP_task1_1 IITKGP_ABSP_Fusion Waldekar2017 67.0 (64.7 - 69.3) 86.3
Waldekar_IITKGP_task1_2 IITKGP_ABSP_Hierarchical Waldekar2017 64.9 (62.6 - 67.2) 88.8
Xing_SCNU_task1_1 DCNN_vote Weiping2017 74.8 (72.6 - 76.9) 87.6
Xing_SCNU_task1_2 DCNN_SVM Weiping2017 77.7 (75.7 - 79.7) 89.9
Xu_NUDT_task1_1 XuCnnMFCC Xu2017 68.5 (66.2 - 70.7) 85.3
Xu_NUDT_task1_2 XuCnnMFCC Xu2017 67.5 (65.3 - 69.8) 87.4
Xu_PKU_task1_1 autolog1 Xu2017a 65.9 (63.6 - 68.2) 84.4
Xu_PKU_task1_2 autolog2 Xu2017a 66.7 (64.4 - 69.0) 84.4
Xu_PKU_task1_3 autolog3 Xu2017a 64.6 (62.3 - 67.0) 84.4
Yang_WHU_TASK1_1 MFS Lu2017 61.5 (59.2 - 63.9) 81.3
Yang_WHU_TASK1_2 STD Lu2017 65.2 (62.9 - 67.6) 80.3
Yang_WHU_TASK1_3 MFS+STD Lu2017 62.8 (60.5 - 65.2) 82.0
Yang_WHU_TASK1_4 Pre-training Lu2017 63.6 (61.3 - 66.0) 82.3
Yu_UOS_task1_1 UOS_DualIn Jee-Weon2017 67.0 (64.7 - 69.3) 85.5
Yu_UOS_task1_2 UOS_BalCos Jee-Weon2017 66.2 (63.9 - 68.5) 85.1
Yu_UOS_task1_3 UOS_DatDup Jee-Weon2017 67.3 (65.1 - 69.6) 95.4
Yu_UOS_task1_4 UOS_res Jee-Weon2017 70.6 (68.3 - 72.8) 95.8
Zhao_ADSC_task1_1 MResNet-34 Zhao2017 70.0 (67.8 - 72.2) 85.6
Zhao_ADSC_task1_2 Conv Zhao2017 67.9 (65.6 - 70.2) 85.4
Zhao_UAU_UP_task1_1 GRNN Zhao2017a 63.8 (61.5 - 66.2) 83.3
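
The page does not state how the 95% confidence intervals were computed. As a rough sketch, the listed intervals appear consistent with a normal-approximation (Wald) binomial interval over the 1620 evaluation excerpts; both the method and the function name `accuracy_confidence_interval` are assumptions made here for illustration. Because the table shows rounded accuracies, recomputed bounds can differ from the listed ones by 0.1.

```python
import math

def accuracy_confidence_interval(accuracy_pct, n_items=1620, z=1.96):
    """Normal-approximation 95% interval for an accuracy given in percent.

    Assumption: each of the n_items evaluation excerpts is an independent
    Bernoulli trial; the page itself does not document the method.
    """
    p = accuracy_pct / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n_items)
    low = round(100.0 * (p - half_width), 1)
    high = round(100.0 * (p + half_width), 1)
    return (low, high)

# Recompute the interval for two rows of the table above.
for accuracy in (83.3, 65.7):
    low, high = accuracy_confidence_interval(accuracy)
    print(f"{accuracy} ({low} - {high})")
```

With 1620 items the half-width is roughly 2 percentage points, so systems whose accuracies differ by less than that have overlapping intervals.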

Teams ranking

This table includes only the best-performing system from each submitting team.

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset)
Abrol_IITM_task1_1 Baseline Abrol2017 65.7 (63.4 - 68.0) 88.1
Amiriparian_AU_task1_1 S2S-AE Amiriparian2017 67.5 (65.3 - 69.8) 88.0
Amiriparian_AU_task1_2 Shahin_APTI Amiriparian2017a 59.1 (56.7 - 61.5) 90.1
Biho_Sogang_task1_2 Biho2 Kim2017 60.5 (58.1 - 62.9) 75.9
Bisot_TPT_task1_1 TPT1 Bisot2017 69.8 (67.6 - 72.1) 90.1
Chandrasekhar_IIITH_task1_1 - Chandrasekhar2017 45.9 (43.4 - 48.3) 77.6
Chou_SINICA_task1_2 SINICA Chou2017 61.5 (59.2 - 63.9)
Dang_NCU_task1_3 andang1 Dang2017 63.7 (61.4 - 66.0) 81.6
Duppada_Seernet_task1_3 Seernet Duppada2017 64.1 (61.7 - 66.4) 81.6
Foleiss_UTFPR_task1_2 MLPFeatRF Foleiss2017 66.9 (64.6 - 69.2) 80.0
Fonseca_MTG_task1_1 MTG Fonseca2017 67.3 (65.1 - 69.6) 83.0
Fraile_UPM_task1_1 GAMMA-UPM Fraile2017 58.3 (55.9 - 60.7) 79.8
Gong_MTG_task1_3 MTG_VGG Gong2017 61.9 (59.5 - 64.2) 84.0
Han_COCAI_task1_3 FlEnsemSel Han2017 80.4 (78.4 - 82.3) 91.9
Hasan_BUET_task1_1 BUETBOSCH1 Hyder2017 74.1 (72.0 - 76.3) 88.1
DCASE2017 baseline Baseline Heittola2017 61.0 (58.7 - 63.4) 74.8
Huang_THU_task1_1 wjhta Huang2017 65.5 (63.2 - 67.8) 83.4
Hussain_NUCES_task1_3 - Hussain2017 59.9 (57.5 - 62.3) 90.0
Jallet_TUT_task1_2 CRNN-2 Jallet2017 61.2 (58.8 - 63.5) 80.8
Jimenez_CMU_task1_1 LapKernel Jimenez2017 59.9 (57.6 - 62.3) 78.7
Kukanov_UEF_task1_1 K-CRNN Kukanov2017 71.7 (69.5 - 73.9) 85.8
Kun_TUM_UAU_UP_task1_1 Wav_SVMs Kun2017 64.2 (61.9 - 66.5) 83.2
Lehner_JKU_task1_4 JKU_All_ca Lehner2017 73.8 (71.7 - 76.0) 91.3
Li_SCUT_task1_2 LiSCUTt1_2 Li2017 63.6 (61.3 - 66.0) 83.9
Maka_ZUT_task1_1 ASAWI Maka2017 47.5 (45.1 - 50.0) 70.6
Mun_KU_task1_1 GAN_SKMUN Mun2017 83.3 (81.5 - 85.1) 87.1
Park_ISPL_task1_1 ISPL Park2017 72.6 (70.4 - 74.8) 83.6
Phan_UniLuebeck_task1_1 CNN Phan2017 59.0 (56.6 - 61.4) 83.8
Piczak_WUT_task1_1 amb200 Piczak2017 70.6 (68.4 - 72.8) 82.3
Rakotomamonjy_UROUEN_task1_3 HBGS CNN-19 Rakotomamonjy2017 62.8 (60.4 - 65.1) 84.6
Schindler_AIT_task1_1 multires Schindler2017 61.7 (59.4 - 64.1) 87.3
Vafeiadis_CERTH_task1_1 CERTH_1 Vafeiadis2017 61.0 (58.6 - 63.4) 80.4
Vij_UIET_task1_4 Vij_UIET_4 Vij2017 65.0 (62.7 - 67.3) 82.7
Waldekar_IITKGP_task1_1 IITKGP_ABSP_Fusion Waldekar2017 67.0 (64.7 - 69.3) 86.3
Xing_SCNU_task1_2 DCNN_SVM Weiping2017 77.7 (75.7 - 79.7) 89.9
Xu_NUDT_task1_1 XuCnnMFCC Xu2017 68.5 (66.2 - 70.7) 85.3
Xu_PKU_task1_2 autolog2 Xu2017a 66.7 (64.4 - 69.0) 84.4
Yang_WHU_TASK1_2 STD Lu2017 65.2 (62.9 - 67.6) 80.3
Yu_UOS_task1_4 UOS_res Jee-Weon2017 70.6 (68.3 - 72.8) 95.8
Zhao_ADSC_task1_1 MResNet-34 Zhao2017 70.0 (67.8 - 72.2) 85.6
Zhao_UAU_UP_task1_1 GRNN Zhao2017a 63.8 (61.5 - 66.2) 83.3
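
The selection of one system per team can be sketched as a group-and-maximize step. The sketch below assumes that a team is identified by its technical report key, which is a guess that happens to reproduce the rows above (for example, Xu2017 and Xu2017a stay separate); the helper name `best_per_team` is hypothetical, and ties keep the first-listed submission.

```python
# (submission code, technical report key, evaluation accuracy),
# copied from a few rows of the systems ranking table.
systems = [
    ("Biho_Sogang_task1_1", "Kim2017", 56.5),
    ("Biho_Sogang_task1_2", "Kim2017", 60.5),
    ("Schindler_AIT_task1_1", "Schindler2017", 61.7),
    ("Schindler_AIT_task1_2", "Schindler2017", 61.7),
    ("Xu_NUDT_task1_1", "Xu2017", 68.5),
    ("Xu_NUDT_task1_2", "Xu2017", 67.5),
    ("Xu_PKU_task1_1", "Xu2017a", 65.9),
    ("Xu_PKU_task1_2", "Xu2017a", 66.7),
    ("Xu_PKU_task1_3", "Xu2017a", 64.6),
]

def best_per_team(rows):
    """Keep the highest-accuracy submission for each report key."""
    best = {}
    for code, report, accuracy in rows:
        # Strict '>' means the earlier submission wins a tie.
        if report not in best or accuracy > best[report][1]:
            best[report] = (code, accuracy)
    return best

for report, (code, accuracy) in best_per_team(systems).items():
    print(report, code, accuracy)
```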

Class-wise performance

Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Beach | Bus | Cafe / Restaurant | Car | City center | Forest path | Grocery store | Home | Library | Metro station | Office | Park | Residential area | Train | Tram
Abrol_IITM_task1_1 Baseline Abrol2017 65.7 73.1 61.1 88.9 81.5 82.4 44.4 73.1 72.2 35.2 75.0 86.1 32.4 49.1 75.0 55.6
Amiriparian_AU_task1_1 S2S-AE Amiriparian2017 67.5 44.4 75.0 63.0 95.4 94.4 97.2 73.1 60.2 43.5 79.6 62.0 16.7 64.8 82.4 61.1
Amiriparian_AU_task1_2 Shahin_APTI Amiriparian2017a 59.1 24.1 62.0 58.3 82.4 91.7 97.2 69.4 51.9 39.8 66.7 43.5 7.4 62.0 78.7 50.9
Biho_Sogang_task1_1 Biho1 Kim2017 56.5 24.1 33.3 33.3 75.9 61.1 80.6 50.9 88.9 27.8 99.1 57.4 17.6 88.0 55.6 54.6
Biho_Sogang_task1_2 Biho2 Kim2017 60.5 37.0 41.7 30.6 74.1 74.1 88.0 50.9 86.1 39.8 96.3 57.4 41.7 83.3 55.6 50.9
Bisot_TPT_task1_1 TPT1 Bisot2017 69.8 5.6 81.5 51.9 80.6 76.9 86.1 75.0 88.0 45.4 99.1 85.2 26.9 80.6 95.4 69.4
Bisot_TPT_task1_2 TPT2 Bisot2017 69.6 23.1 75.9 54.6 75.9 78.7 84.3 75.0 88.9 39.8 100.0 87.0 27.8 75.9 94.4 62.0
Chandrasekhar_IIITH_task1_1 - Chandrasekhar2017 45.9 6.5 47.2 21.3 88.9 96.3 69.4 42.6 92.6 61.1 68.5 0.0 0.0 3.7 73.1 16.7
Chou_SINICA_task1_1 TP_CNN_cv1 Chou2017 57.1 25.9 40.7 48.1 75.0 80.6 88.9 58.3 67.6 19.4 80.6 62.0 21.3 61.1 69.4 57.4
Chou_SINICA_task1_2 SINICA Chou2017 61.5 19.4 48.1 66.7 68.5 77.8 86.1 65.7 57.4 25.0 97.2 81.5 28.7 68.5 66.7 65.7
Chou_SINICA_task1_3 SINICA Chou2017 59.8 32.4 50.0 49.1 74.1 88.9 88.9 62.0 59.3 36.1 92.6 57.4 20.4 50.0 69.4 65.7
Chou_SINICA_task1_4 SINICA Chou2017 57.1 25.9 40.7 48.1 75.0 80.6 88.9 58.3 67.6 19.4 80.6 62.0 21.3 61.1 69.4 57.4
Dang_NCU_task1_1 andang1 Dang2017 62.7 32.4 49.1 61.1 65.7 76.9 87.0 57.4 90.7 26.9 95.4 82.4 24.1 75.0 70.4 46.3
Dang_NCU_task1_2 andang1 Dang2017 62.7 24.1 38.9 68.5 66.7 76.9 71.3 65.7 67.6 20.4 99.1 95.4 30.6 77.8 69.4 68.5
Dang_NCU_task1_3 andang1 Dang2017 63.7 28.7 49.1 61.1 71.3 69.4 88.9 59.3 83.3 34.3 100.0 84.3 25.0 83.3 72.2 45.4
Duppada_Seernet_task1_1 Seernet Duppada2017 57.0 13.0 35.2 51.9 88.0 85.2 86.1 52.8 68.5 25.0 28.7 72.2 35.2 82.4 71.3 60.2
Duppada_Seernet_task1_2 Seernet Duppada2017 59.9 8.3 39.8 57.4 96.3 75.9 88.0 58.3 79.6 34.3 23.1 86.1 40.7 78.7 74.1 57.4
Duppada_Seernet_task1_3 Seernet Duppada2017 64.1 10.2 49.1 45.4 77.8 89.8 85.2 54.6 81.5 38.9 97.2 94.4 25.0 80.6 75.0 56.5
Duppada_Seernet_task1_4 Seernet Duppada2017 63.0 13.9 42.6 57.4 85.2 85.2 87.0 57.4 83.3 35.2 63.9 88.9 31.5 81.5 72.2 60.2
Foleiss_UTFPR_task1_1 MLPFeats Foleiss2017 64.5 18.5 47.2 65.7 75.0 86.1 84.3 63.9 89.8 52.8 99.1 54.6 15.7 77.8 65.7 71.3
Foleiss_UTFPR_task1_2 MLPFeatRF Foleiss2017 66.9 13.9 49.1 68.5 75.9 87.0 91.7 69.4 99.1 50.9 99.1 63.0 18.5 78.7 69.4 69.4
Fonseca_MTG_task1_1 MTG Fonseca2017 67.3 36.1 41.7 62.0 75.9 75.0 92.6 57.4 84.3 41.7 99.1 89.8 38.9 76.9 76.9 62.0
Fraile_UPM_task1_1 GAMMA-UPM Fraile2017 58.3 61.1 46.3 47.2 76.9 88.9 65.7 48.1 95.4 35.2 63.0 24.1 29.6 63.9 75.0 53.7
Gong_MTG_task1_1 MTG_GBMVGG Gong2017 61.2 50.0 45.4 66.7 67.6 66.7 89.8 62.0 81.5 27.8 85.2 35.2 34.3 68.5 80.6 56.5
Gong_MTG_task1_2 MTG_GBM Gong2017 61.5 41.7 43.5 66.7 70.4 64.8 93.5 51.9 95.4 32.4 88.9 37.0 43.5 67.6 71.3 53.7
Gong_MTG_task1_3 MTG_VGG Gong2017 61.9 64.8 46.3 66.7 71.3 68.5 84.3 71.3 76.9 24.1 55.6 84.3 22.2 57.4 76.9 57.4
Han_COCAI_task1_1 4fEnsemSel Han2017 79.9 75.9 66.7 82.4 92.6 86.1 98.1 80.6 93.5 54.6 100.0 87.0 47.2 75.0 96.3 63.0
Han_COCAI_task1_2 4fMeanAll Han2017 79.6 75.0 65.7 82.4 92.6 86.1 98.1 78.7 92.6 55.6 100.0 85.2 49.1 75.0 96.3 62.0
Han_COCAI_task1_3 FlEnsemSel Han2017 80.4 78.7 71.3 83.3 93.5 88.9 98.1 79.6 94.4 53.7 100.0 86.1 44.4 75.9 90.7 66.7
Han_COCAI_task1_4 flMeanAll Han2017 80.3 77.8 73.1 82.4 92.6 90.7 98.1 76.9 93.5 52.8 100.0 84.3 48.1 76.9 90.7 66.7
Hasan_BUET_task1_1 BUETBOSCH1 Hyder2017 74.1 87.0 59.3 91.7 92.6 94.4 91.7 81.5 97.2 47.2 76.9 49.1 38.0 58.3 81.5 65.7
Hasan_BUET_task1_2 BUETBOSCH2 Hyder2017 72.2 69.4 61.1 65.7 94.4 81.5 93.5 66.7 91.7 38.9 100.0 83.3 36.1 61.1 77.8 61.1
Hasan_BUET_task1_3 BUETBOSCH3 Hyder2017 68.6 77.8 70.4 95.4 86.1 86.1 84.3 71.3 98.1 50.0 40.7 22.2 41.7 68.5 83.3 52.8
Hasan_BUET_task1_4 BUETBOSCH4 Hyder2017 72.0 83.3 72.2 94.4 85.2 88.0 88.0 71.3 98.1 54.6 60.2 26.9 44.4 75.0 83.3 54.6
DCASE2017 baseline Baseline Heittola2017 61.0 40.7 38.9 43.5 64.8 79.6 85.2 49.1 76.9 30.6 93.5 73.1 32.4 77.8 72.2 57.4
Huang_THU_task1_1 wjhta Huang2017 65.5 22.2 50.9 57.4 60.2 77.8 96.3 65.7 90.7 46.3 99.1 77.8 21.3 75.9 73.1 67.6
Huang_THU_task1_2 wjhta Huang2017 65.4 30.6 48.1 63.9 65.7 76.9 95.4 63.9 91.7 37.0 99.1 77.8 10.2 75.9 79.6 64.8
Hussain_NUCES_task1_1 - Hussain2017 56.7 25.9 27.8 49.1 42.6 73.1 88.9 57.4 88.0 4.6 100.0 66.7 29.6 83.3 51.9 61.1
Hussain_NUCES_task1_2 - Hussain2017 59.5 28.7 37.0 37.0 73.1 67.6 79.6 55.6 84.3 27.8 100.0 67.6 24.1 85.2 59.3 65.7
Hussain_NUCES_task1_3 - Hussain2017 59.9 22.2 36.1 39.8 71.3 74.1 78.7 57.4 85.2 45.4 97.2 67.6 24.1 85.2 55.6 58.3
Hussain_NUCES_task1_4 - Hussain2017 55.4 38.9 21.3 59.3 40.7 69.4 92.6 54.6 75.0 14.8 80.6 67.6 20.4 81.5 53.7 60.2
Jallet_TUT_task1_1 CRNN-1 Jallet2017 60.7 15.7 51.9 61.1 75.0 88.0 88.9 56.5 65.7 27.8 87.0 91.7 21.3 55.6 80.6 44.4
Jallet_TUT_task1_2 CRNN-2 Jallet2017 61.2 24.1 55.6 62.0 70.4 88.9 90.7 63.9 70.4 29.6 87.0 84.3 23.1 55.6 72.2 39.8
Jimenez_CMU_task1_1 LapKernel Jimenez2017 59.9 69.4 43.5 65.7 72.2 62.0 79.6 47.2 73.1 26.9 76.9 81.5 25.9 63.0 62.0 50.0
Kukanov_UEF_task1_1 K-CRNN Kukanov2017 71.7 43.5 47.2 77.8 79.6 85.2 99.1 73.1 76.9 35.2 100.0 95.4 46.3 74.1 83.3 59.3
Kun_TUM_UAU_UP_task1_1 Wav_SVMs Kun2017 64.2 61.1 44.4 72.2 68.5 76.9 83.3 48.1 64.8 28.7 92.6 90.7 39.8 56.5 75.9 59.3
Kun_TUM_UAU_UP_task1_2 Wav_GRUs Kun2017 64.0 50.0 49.1 67.6 67.6 89.8 88.0 62.0 81.5 24.1 88.0 65.7 36.1 58.3 73.1 59.3
Lehner_JKU_task1_1 JKU_IVEC Lehner2017 68.7 91.7 65.7 79.6 76.9 70.4 90.7 65.7 88.0 58.3 76.9 50.9 22.2 75.9 71.3 46.3
Lehner_JKU_task1_2 JKU_ALL_av Lehner2017 66.8 57.4 64.8 73.1 80.6 91.7 88.9 79.6 77.8 35.2 64.8 71.3 36.1 38.0 83.3 59.3
Lehner_JKU_task1_3 JKU_CNN Lehner2017 64.8 47.2 59.3 73.1 78.7 88.0 87.0 75.0 74.1 31.5 63.0 69.4 48.1 37.0 83.3 57.4
Lehner_JKU_task1_4 JKU_All_ca Lehner2017 73.8 87.0 66.7 88.9 80.6 92.6 92.6 76.9 88.9 49.1 79.6 65.7 45.4 55.6 84.3 53.7
Li_SCUT_task1_1 LiSCUTt1_1 Li2017 53.7 14.8 38.0 50.9 55.6 83.3 68.5 60.2 95.4 20.4 80.6 34.3 17.6 70.4 54.6 61.1
Li_SCUT_task1_2 LiSCUTt1_2 Li2017 63.6 55.6 45.4 55.6 53.7 87.0 81.5 75.0 99.1 26.9 97.2 62.0 11.1 79.6 56.5 68.5
Li_SCUT_task1_3 LiSCUTt1_3 Li2017 61.7 51.9 33.3 48.1 64.8 83.3 82.4 70.4 99.1 24.1 99.1 50.0 14.8 78.7 53.7 72.2
Li_SCUT_task1_4 LiSCUTt1_4 Li2017 57.8 35.2 38.9 48.1 60.2 84.3 81.5 65.7 97.2 25.9 80.6 38.0 15.7 70.4 55.6 69.4
Maka_ZUT_task1_1 ASAWI Maka2017 47.5 60.2 40.7 61.1 57.4 31.5 65.7 44.4 78.7 16.7 33.3 45.4 0.9 69.4 59.3 48.1
Mun_KU_task1_1 GAN_SKMUN Mun2017 83.3 83.3 74.1 88.0 93.5 94.4 95.4 82.4 88.0 75.9 88.0 92.6 75.9 86.1 67.6 63.9
Park_ISPL_task1_1 ISPL Park2017 72.6 54.6 59.3 71.3 79.6 91.7 85.2 75.0 98.1 44.4 98.1 84.3 23.1 76.9 82.4 64.8
Phan_UniLuebeck_task1_1 CNN Phan2017 59.0 38.9 48.1 61.1 82.4 60.2 80.6 65.7 73.1 38.9 85.2 34.3 32.4 58.3 71.3 54.6
Phan_UniLuebeck_task1_2 ACNN Phan2017 55.9 41.7 45.4 51.9 79.6 56.5 67.6 62.0 70.4 35.2 88.9 33.3 31.5 52.8 72.2 50.0
Phan_UniLuebeck_task1_3 CNN+ Phan2017 58.3 41.7 44.4 68.5 74.1 57.4 94.4 66.7 66.7 27.8 68.5 76.9 21.3 40.7 71.3 54.6
Phan_UniLuebeck_task1_4 ACNN+ Phan2017 58.0 53.7 47.2 64.8 75.0 59.3 91.7 61.1 70.4 28.7 75.9 69.4 14.8 34.3 68.5 55.6
Piczak_WUT_task1_1 amb200 Piczak2017 70.6 29.6 66.7 71.3 71.3 91.7 80.6 46.3 88.0 56.5 99.1 69.4 49.1 75.9 81.5 82.4
Piczak_WUT_task1_2 dishes Piczak2017 69.6 32.4 63.9 65.7 77.8 91.7 84.3 49.1 76.9 67.6 99.1 56.5 56.5 67.6 82.4 72.2
Piczak_WUT_task1_3 amb100 Piczak2017 67.7 22.2 66.7 65.7 74.1 90.7 86.1 35.2 81.5 59.3 98.1 78.7 41.7 64.8 81.5 68.5
Piczak_WUT_task1_4 amb60 Piczak2017 62.0 19.4 63.9 51.9 65.7 89.8 88.9 21.3 67.6 43.5 92.6 81.5 43.5 73.1 63.9 63.0
Rakotomamonjy_UROUEN_task1_1 HBGS CNN Rakotomamonjy2017 61.5 9.3 74.1 41.7 83.3 84.3 87.0 64.8 96.3 40.7 87.0 26.9 37.0 50.9 81.5 58.3
Rakotomamonjy_UROUEN_task1_2 HBGS CNN-4 Rakotomamonjy2017 62.7 6.5 77.8 47.2 82.4 88.9 87.0 68.5 92.6 38.0 95.4 35.2 33.3 48.1 85.2 53.7
Rakotomamonjy_UROUEN_task1_3 HBGS CNN-19 Rakotomamonjy2017 62.8 5.6 78.7 48.1 83.3 88.9 84.3 65.7 93.5 38.9 93.5 40.7 29.6 49.1 87.0 54.6
Schindler_AIT_task1_1 multires Schindler2017 61.7 47.2 55.6 65.7 69.4 98.1 87.0 46.3 74.1 18.5 47.2 71.3 55.6 74.1 82.4 33.3
Schindler_AIT_task1_2 multires-p Schindler2017 61.7 56.5 56.5 62.0 66.7 99.1 91.7 45.4 75.9 25.0 37.0 79.6 40.7 63.0 88.9 38.0
Vafeiadis_CERTH_task1_1 CERTH_1 Vafeiadis2017 61.0 23.1 42.6 58.3 66.7 77.8 86.1 64.8 94.4 39.8 92.6 54.6 20.4 72.2 81.5 39.8
Vafeiadis_CERTH_task1_2 CERTH_2 Vafeiadis2017 49.5 35.2 23.1 58.3 63.0 90.7 90.7 57.4 61.1 20.4 38.0 53.7 25.9 45.4 59.3 20.4
Vij_UIET_task1_1 Vij_UIET_1 Vij2017 61.2 22.2 39.8 43.5 73.1 77.8 90.7 64.8 83.3 43.5 95.4 52.8 28.7 77.8 59.3 65.7
Vij_UIET_task1_2 Vij_UIET_2 Vij2017 57.5 21.3 32.4 36.1 64.8 73.1 79.6 50.9 71.3 35.2 99.1 66.7 30.6 83.3 54.6 63.9
Vij_UIET_task1_3 Vij_UIET_3 Vij2017 59.6 10.2 42.6 36.1 53.7 75.0 79.6 54.6 88.0 48.1 98.1 57.4 39.8 88.0 58.3 63.9
Vij_UIET_task1_4 Vij_UIET_4 Vij2017 65.0 16.7 38.9 65.7 74.1 84.3 98.1 64.8 85.2 40.7 98.1 84.3 25.9 69.4 70.4 58.3
Waldekar_IITKGP_task1_1 IITKGP_ABSP_Fusion Waldekar2017 67.0 13.9 61.1 76.9 70.4 86.1 90.7 63.0 85.2 49.1 98.1 81.5 19.4 80.6 73.1 56.5
Waldekar_IITKGP_task1_2 IITKGP_ABSP_Hierarchical Waldekar2017 64.9 15.7 58.3 78.7 63.9 82.4 84.3 63.0 88.0 50.0 97.2 84.3 15.7 70.4 70.4 50.9
Xing_SCNU_task1_1 DCNN_vote Weiping2017 74.8 77.8 88.0 71.3 81.5 78.7 73.1 76.9 67.6 49.1 95.4 82.4 57.4 73.1 88.0 61.1
Xing_SCNU_task1_2 DCNN_SVM Weiping2017 77.7 71.3 84.3 79.6 85.2 82.4 78.7 80.6 73.1 59.3 97.2 81.5 57.4 85.2 92.6 57.4
Xu_NUDT_task1_1 XuCnnMFCC Xu2017 68.5 27.8 43.5 70.4 84.3 88.0 96.3 66.7 91.7 40.7 100.0 85.2 13.9 82.4 72.2 63.9
Xu_NUDT_task1_2 XuCnnMFCC Xu2017 67.5 26.9 43.5 68.5 85.2 88.0 94.4 66.7 86.1 42.6 100.0 85.2 11.1 82.4 72.2 60.2
Xu_PKU_task1_1 autolog1 Xu2017a 65.9 29.6 42.6 58.3 80.6 79.6 98.1 67.6 51.9 53.7 100.0 90.7 32.4 70.4 75.0 58.3
Xu_PKU_task1_2 autolog2 Xu2017a 66.7 28.7 32.4 59.3 84.3 77.8 99.1 69.4 50.0 36.1 100.0 99.1 38.9 72.2 74.1 79.6
Xu_PKU_task1_3 autolog3 Xu2017a 64.6 25.0 37.0 60.2 84.3 74.1 98.1 64.8 43.5 33.3 100.0 94.4 25.0 68.5 84.3 76.9
Yang_WHU_TASK1_1 MFS Lu2017 61.5 10.2 55.6 52.8 76.9 79.6 94.4 50.0 79.6 30.6 94.4 55.6 33.3 68.5 75.9 65.7
Yang_WHU_TASK1_2 STD Lu2017 65.2 45.4 47.2 57.4 74.1 86.1 88.0 55.6 75.0 49.1 98.1 68.5 29.6 66.7 75.0 63.0
Yang_WHU_TASK1_3 MFS+STD Lu2017 62.8 53.7 42.6 54.6 78.7 88.9 88.9 61.1 75.9 47.2 90.7 48.1 15.7 61.1 71.3 63.9
Yang_WHU_TASK1_4 Pre-training Lu2017 63.6 42.6 45.4 57.4 71.3 97.2 89.8 51.9 81.5 38.0 99.1 62.0 20.4 67.6 70.4 60.2
Yu_UOS_task1_1 UOS_DualIn Jee-Weon2017 67.0 53.7 57.4 53.7 73.1 76.9 82.4 65.7 94.4 42.6 99.1 75.0 29.6 79.6 69.4 52.8
Yu_UOS_task1_2 UOS_BalCos Jee-Weon2017 66.2 55.6 57.4 47.2 72.2 75.9 83.3 65.7 92.6 43.5 99.1 75.0 27.8 77.8 69.4 50.0
Yu_UOS_task1_3 UOS_DatDup Jee-Weon2017 67.3 60.2 58.3 56.5 69.4 76.9 84.3 68.5 90.7 46.3 94.4 72.2 28.7 79.6 72.2 51.9
Yu_UOS_task1_4 UOS_res Jee-Weon2017 70.6 72.2 51.9 68.5 76.9 77.8 86.1 74.1 93.5 38.9 95.4 77.8 34.3 84.3 68.5 58.3
Zhao_ADSC_task1_1 MResNet-34 Zhao2017 70.0 41.7 69.4 69.4 93.5 63.9 98.1 71.3 79.6 32.4 100.0 81.5 37.0 84.3 68.5 59.3
Zhao_ADSC_task1_2 Conv Zhao2017 67.9 13.0 55.6 67.6 95.4 70.4 100.0 73.1 90.7 45.4 99.1 83.3 20.4 69.4 80.6 54.6
Zhao_UAU_UP_task1_1 GRNN Zhao2017a 63.8 47.2 46.3 70.4 66.7 77.8 88.9 65.7 85.2 28.7 86.1 70.4 38.0 56.5 74.1 55.6
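
The overall accuracy in each row appears to equal the unweighted mean of the 15 class-wise accuracies, which is what one would expect if the 1620 evaluation excerpts are split evenly across classes (108 per class). That even split is an assumption here, not something stated on this page. A minimal check on two rows copied from the table:

```python
# Class-wise accuracies in table order (Beach ... Tram), copied from two rows above.
class_wise = {
    "Abrol_IITM_task1_1": [73.1, 61.1, 88.9, 81.5, 82.4, 44.4, 73.1, 72.2,
                           35.2, 75.0, 86.1, 32.4, 49.1, 75.0, 55.6],
    "Mun_KU_task1_1": [83.3, 74.1, 88.0, 93.5, 94.4, 95.4, 82.4, 88.0,
                       75.9, 88.0, 92.6, 75.9, 86.1, 67.6, 63.9],
}

def mean_class_accuracy(accuracies):
    """Unweighted mean over classes, rounded to one decimal like the table."""
    return round(sum(accuracies) / len(accuracies), 1)

for code, accuracies in class_wise.items():
    print(code, mean_class_accuracy(accuracies))
```

Under the balanced-class assumption this mean coincides with plain accuracy over all excerpts; with unbalanced classes the two would differ.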

System characteristics

Rank | Code | Name | Technical Report | Accuracy (Eval) | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making
Abrol_IITM_task1_1 Baseline Abrol2017 65.7 mono 44.1kHz CQT GMM, Archetypal Analysis, SVM majority vote on audio segments of a file
Amiriparian_AU_task1_1 S2S-AE Amiriparian2017 67.5 mixed 44.1kHz log-mel energies MLP
Amiriparian_AU_task1_2 Shahin_APTI Amiriparian2017a 59.1 mixed 44.1kHz log-mel energies MLP+SVM weighted late fusion
Biho_Sogang_task1_1 Biho1 Kim2017 56.5 mono 44.1kHz log-mel energies CNN majority vote
Biho_Sogang_task1_2 Biho2 Kim2017 60.5 mono 44.1kHz log-mel energies CNN majority vote
Bisot_TPT_task1_1 TPT1 Bisot2017 69.8 left, right 44.1kHz CQT NMF, MLP average log-probability
Bisot_TPT_task1_2 TPT2 Bisot2017 69.6 left, right 44.1kHz CQT NMF average log-probability
Chandrasekhar_IIITH_task1_1 - Chandrasekhar2017 45.9 mono 44.1kHz MFCC, inverse mel-frequency cepstral coefficients DNN majority vote
Chou_SINICA_task1_1 TP_CNN_cv1 Chou2017 57.1 mono 44.1kHz spectrogram CNN majority vote
Chou_SINICA_task1_2 SINICA Chou2017 61.5 mono 44.1kHz spectrogram CNN majority vote
Chou_SINICA_task1_3 SINICA Chou2017 59.8 mono 44.1kHz spectrogram CNN majority vote
Chou_SINICA_task1_4 SINICA Chou2017 57.1 mono 44.1kHz spectrogram ensemble majority vote
Dang_NCU_task1_1 andang1 Dang2017 62.7 mono 44.1kHz MFCC CRNN majority vote
Dang_NCU_task1_2 andang1 Dang2017 62.7 mono 44.1kHz log-mel energies CNN majority vote
Dang_NCU_task1_3 andang1 Dang2017 63.7 mono 44.1kHz log-mel energies, MFCC CNN majority vote
Duppada_Seernet_task1_1 Seernet Duppada2017 57.0 mono 44.1kHz log-mel spectrogram CNN mean
Duppada_Seernet_task1_2 Seernet Duppada2017 59.9 mono 16kHz log-mel spectrogram CNN mean
Duppada_Seernet_task1_3 Seernet Duppada2017 64.1 mono 16kHz log-mel spectrogram CNN mean
Duppada_Seernet_task1_4 Seernet Duppada2017 63.0 mono 44.1kHz, 16kHz log-mel spectrogram CNN, ensemble mean
Foleiss_UTFPR_task1_1 MLPFeats Foleiss2017 64.5 mono 44.1kHz STFT MLP probability sum
Foleiss_UTFPR_task1_2 MLPFeatRF Foleiss2017 66.9 mono 44.1kHz STFT MLP, random forest majority vote
Fonseca_MTG_task1_1 MTG Fonseca2017 67.3 mono 44.1kHz various ensemble max of average score
Fraile_UPM_task1_1 GAMMA-UPM Fraile2017 58.3 binaural 44.1kHz modulation spectrum MLP a posteriori probability
Gong_MTG_task1_1 MTG_GBMVGG Gong2017 61.2 multichannel 44.1kHz various GBM CNN fusion maximum
Gong_MTG_task1_2 MTG_GBM Gong2017 61.5 multichannel 44.1kHz various GBM fusion maximum
Gong_MTG_task1_3 MTG_VGG Gong2017 61.9 multichannel 44.1kHz log-mel energies CNN fusion maximum
Han_COCAI_task1_1 4fEnsemSel Han2017 79.9 mono, binaural 44.1kHz log-mel energies CNN, ensemble mean probability
Han_COCAI_task1_2 4fMeanAll Han2017 79.6 mono, binaural 44.1kHz log-mel energies CNN, ensemble mean probability
Han_COCAI_task1_3 FlEnsemSel Han2017 80.4 mono, binaural 44.1kHz log-mel energies CNN, ensemble mean probability
Han_COCAI_task1_4 flMeanAll Han2017 80.3 mono, binaural 44.1kHz log-mel energies CNN, ensemble mean probability
Hasan_BUET_task1_1 BUETBOSCH1 Hyder2017 74.1 mono 44.1kHz MFCC, log-mel energies GMM-SV, CNN-SV, Multiband CNN-SV majority vote
Hasan_BUET_task1_2 BUETBOSCH2 Hyder2017 72.2 mono 44.1kHz log-mel energies CNN-SV majority vote
Hasan_BUET_task1_3 BUETBOSCH3 Hyder2017 68.6 mono 44.1kHz MFCC, log-mel energies GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN majority vote
Hasan_BUET_task1_4 BUETBOSCH4 Hyder2017 72.0 mono 44.1kHz MFCC, log-mel energies, different functionals of various spectral and prosodic features GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN, DNN majority vote
DCASE2017 baseline Baseline Heittola2017 61.0 mono 44.1kHz log-mel energies MLP majority vote
Huang_THU_task1_1 wjhta Huang2017 65.5 mono 44.1kHz MFCC, CQT CNN majority vote
Huang_THU_task1_2 wjhta Huang2017 65.4 mono 44.1kHz pitch shifting MFCC, CQT CNN majority vote
Hussain_NUCES_task1_1 - Hussain2017 56.7 binaural 44.1kHz log-mel energies CNN
Hussain_NUCES_task1_2 - Hussain2017 59.5 binaural 44.1kHz log-mel energies DNN
Hussain_NUCES_task1_3 - Hussain2017 59.9 binaural 44.1kHz log-mel energies DNN
Hussain_NUCES_task1_4 - Hussain2017 55.4 binaural 44.1kHz log-mel energies CNN
Jallet_TUT_task1_1 CRNN-1 Jallet2017 60.7 mono 44.1kHz log-mel energies CRNN maximum
Jallet_TUT_task1_2 CRNN-2 Jallet2017 61.2 mono 44.1kHz log-mel energies CRNN majority vote
Jimenez_CMU_task1_1 LapKernel Jimenez2017 59.9 mono 44.1kHz emo_conf (opensmile) SVM highest score
Kukanov_UEF_task1_1 K-CRNN Kukanov2017 71.7 mono 44.1kHz log-mel energies CRNN majority vote
Kun_TUM_UAU_UP_task1_1 Wav_SVMs Kun2017 64.2 mono 44.1kHz wavelets, ComParE (openSMILE) SVM margin sampling value
Kun_TUM_UAU_UP_task1_2 Wav_GRUs Kun2017 64.0 mono 44.1kHz wavelets, ComParE (openSMILE) GRNN margin sampling value
Lehner_JKU_task1_1 JKU_IVEC Lehner2017 68.7 binaural 22.05kHz pitch shifting MFCC based i-vectors i-vector min. cosine distance
Lehner_JKU_task1_2 JKU_ALL_av Lehner2017 66.8 mono, binaural 22.05kHz pitch shifting MFCC, log-scaled spectrogram CNN, i-vector, ensemble model averaging
Lehner_JKU_task1_3 JKU_CNN Lehner2017 64.8 mono 22.05kHz log-scaled spectrogram CNN, ensemble fusion w/ logistic linear regression
Lehner_JKU_task1_4 JKU_All_ca Lehner2017 73.8 mono, binaural 22.05kHz pitch shifting mel-scaled spectrograms, i-vectors i-vector, CNN, ensemble fusion w/ logistic linear regression
Li_SCUT_task1_1 LiSCUTt1_1 Li2017 53.7 mono 44.1kHz DNN(MFCC) Bi-LSTM majority vote
Li_SCUT_task1_2 LiSCUTt1_2 Li2017 63.6 mono 44.1kHz DNN(MFCC) Bi-LSTM majority vote
Li_SCUT_task1_3 LiSCUTt1_3 Li2017 61.7 mono 44.1kHz DNN(MFCC) DNN majority vote
Li_SCUT_task1_4 LiSCUTt1_4 Li2017 57.8 mono 44.1kHz DNN(MFCC) Bi-LSTM majority vote
Maka_ZUT_task1_1 ASAWI Maka2017 47.5 binaural 44.1kHz cochleagram, onset map, binaural cues, low-level feature contours random forest
Mun_KU_task1_1 GAN_SKMUN Mun2017 83.3 left, right, mixed 22.05kHz GAN log-mel energies, spectrogram MLP, RNN, CNN, SVM majority vote
Park_ISPL_task1_1 ISPL Park2017 72.6 binaural 44.1kHz block mixing covariance of gammachirp energies, double FFT of gammachirp energies CNN maximum posterior
Phan_UniLuebeck_task1_1 CNN Phan2017 59.0 binaural 44.1kHz cross-validation with different data splits generalized label tree embedding CNN entire-signal classification
Phan_UniLuebeck_task1_2 ACNN Phan2017 55.9 binaural 44.1kHz cross-validation with different data splits generalized label tree embedding Attentive CNN entire-signal classification
Phan_UniLuebeck_task1_3 CNN+ Phan2017 58.3 binaural 44.1kHz cross-validation with different data splits generalized label tree embedding CNN entire-signal classification
Phan_UniLuebeck_task1_4 ACNN+ Phan2017 58.0 binaural 44.1kHz cross-validation with different data splits generalized label tree embedding Attentive CNN entire-signal classification
Piczak_WUT_task1_1 amb200 Piczak2017 70.6 mono 44.1kHz time delay, block mixing spectrogram CNN majority vote
Piczak_WUT_task1_2 dishes Piczak2017 69.6 mono 44.1kHz time delay, block mixing spectrogram CNN majority vote
Piczak_WUT_task1_3 amb100 Piczak2017 67.7 mono 44.1kHz time delay, block mixing spectrogram CNN majority vote
Piczak_WUT_task1_4 amb60 Piczak2017 62.0 mono 44.1kHz time delay, block mixing spectrogram CNN majority vote
Rakotomamonjy_UROUEN_task1_1 HBGS CNN Rakotomamonjy2017 61.5 mono 44.1kHz CQT CNN average prediction
Rakotomamonjy_UROUEN_task1_2 HBGS CNN-4 Rakotomamonjy2017 62.7 mono 44.1kHz CQT CNN average prediction over 4 models
Rakotomamonjy_UROUEN_task1_3 HBGS CNN-19 Rakotomamonjy2017 62.8 mono 44.1kHz CQT CNN average prediction over 19 models
Schindler_AIT_task1_1 multires Schindler2017 61.7 mono 44.1kHz time stretching, block mixing, pitch shifting, mixing files of same class, gaussian noise log-mel spectrogram CNN argmax of average softmax response per file
Schindler_AIT_task1_2 multires-p Schindler2017 61.7 mono 44.1kHz time stretching, block mixing, pitch shifting, mixing files of same class, gaussian noise log-mel spectrogram CNN argmax of average softmax response per file
Vafeiadis_CERTH_task1_1 CERTH_1 Vafeiadis2017 61.0 mono 44.1kHz MFCC, MFCC delta, MFCC acceleration, centroid, rolloff, ZCR SVM-HMM majority vote
Vafeiadis_CERTH_task1_2 CERTH_2 Vafeiadis2017 49.5 mono 44.1kHz speed and pitch change (downsampling), amplitude change (dynamic), gaussian noise log-mel spectrogram CNN majority vote
Vij_UIET_task1_1 Vij_UIET_1 Vij2017 61.2 binaural 44.1kHz feature frame concatenation log mel-filter bank RNN majority vote
Vij_UIET_task1_2 Vij_UIET_2 Vij2017 57.5 binaural 44.1kHz feature frame concatenation log mel-filter bank LSTM majority vote
Vij_UIET_task1_3 Vij_UIET_3 Vij2017 59.6 binaural 44.1kHz feature frame concatenation log mel-filter bank GRU majority vote
Vij_UIET_task1_4 Vij_UIET_4 Vij2017 65.0 binaural 44.1kHz feature frame concatenation log mel-filter bank CNN majority vote
Waldekar_IITKGP_task1_1 IITKGP_ABSP_Fusion Waldekar2017 67.0 binaural 44.1kHz combination [block-based MFCC; SCFC; CQCC] SVM fusion
Waldekar_IITKGP_task1_2 IITKGP_ABSP_Hierarchical Waldekar2017 64.9 binaural 44.1kHz combination [block-based MFCC; SCFC; CQCC] SVM fusion
Xing_SCNU_task1_1 DCNN_vote Weiping2017 74.8 binaural 22.05kHz spectrogram, CQT CNN majority vote
Xing_SCNU_task1_2 DCNN_SVM Weiping2017 77.7 binaural 22.05kHz spectrogram, CQT CNN SVM
Xu_NUDT_task1_1 XuCnnMFCC Xu2017 68.5 left, right, mixed 44.1kHz pitch shifting MFCC, spectrogram CNN majority vote
Xu_NUDT_task1_2 XuCnnMFCC Xu2017 67.5 left, right, mixed 44.1kHz pitch shifting MFCC, spectrogram CNN majority vote
Xu_PKU_task1_1 autolog1 Xu2017a 65.9 binaural 44.1kHz CQT Autoencoder and Logistic Regression majority vote
Xu_PKU_task1_2 autolog2 Xu2017a 66.7 binaural 44.1kHz CQT Autoencoder and Logistic Regression majority vote
Xu_PKU_task1_3 autolog3 Xu2017a 64.6 binaural 44.1kHz CQT Autoencoder and Logistic Regression majority vote
Yang_WHU_TASK1_1 MFS Lu2017 61.5 mono 44.1kHz log-mel energies CNN logsum
Yang_WHU_TASK1_2 STD Lu2017 65.2 mono 44.1kHz log-mel energies CNN logsum
Yang_WHU_TASK1_3 MFS+STD Lu2017 62.8 mono 44.1kHz log-mel energies CNN logsum
Yang_WHU_TASK1_4 Pre-training Lu2017 63.6 mono 44.1kHz log-mel energies CNN logsum
Yu_UOS_task1_1 UOS_DualIn Jee-Weon2017 67.0 left, right, mixed 44.1kHz mel-filterbank features MLP, ensemble score sum
Yu_UOS_task1_2 UOS_BalCos Jee-Weon2017 66.2 left, right, mixed 44.1kHz mel-filterbank features MLP, ensemble score sum
Yu_UOS_task1_3 UOS_DatDup Jee-Weon2017 67.3 left, right, mixed 44.1kHz stochastic duplication mel-filterbank features MLP, ensemble score sum
Yu_UOS_task1_4 UOS_res Jee-Weon2017 70.6 left, right, mixed 44.1kHz stochastic duplication mel-filterbank features MLP, ensemble score sum
Zhao_ADSC_task1_1 MResNet-34 Zhao2017 70.0 binaural 44.1kHz log-mel spectrogram CNN majority vote
Zhao_ADSC_task1_2 Conv Zhao2017 67.9 binaural 44.1kHz log-mel spectrogram CNN majority vote
Zhao_UAU_UP_task1_1 GRNN Zhao2017a 63.8 mono 44.1kHz spectrogram, scalogram, wavelets, ComParE (openSMILE) GRNN margin sampling value

Technical reports

GMM-AA System for Acoustic Scene Classification

Abstract

In this submission we propose a system based on Gaussian mixture modelling and Archetypal Analysis for the DCASE17 acoustic scene classification task. We propose a feature learning approach via decomposing time-frequency (TF) representations with Archetypal Analysis (AA). In order to process a large number of TF frames and capture the variations efficiently, a class-specific GMM is first built on frames of TF representations, followed by AA on the GMM means to build class-specific local dictionaries. Next, the TF representations are projected on the concatenated AA dictionary to get the non-negative sparse activations. Finally, the TF frames are reconstructed using the computed activation vectors and are then used to train an SVM classifier. The proposed method significantly outperforms the baseline system.

System characteristics
Input mono
Sampling rate 44.1kHz
Features CQT
Classifier GMM, Archetypal Analysis, SVM
Decision making majority vote on audio segments of a file
PDF
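
The "majority vote on audio segments of a file" decision rule listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote` and the example labels are assumptions made for the sketch.

```python
from collections import Counter


def majority_vote(segment_labels):
    """Return the label predicted most often across the segments of one file."""
    if not segment_labels:
        raise ValueError("need at least one segment prediction")
    counts = Counter(segment_labels)
    # most_common(1) returns [(label, count)]; on a tie, the label seen first wins
    return counts.most_common(1)[0][0]


# Hypothetical per-segment predictions for a single 10-second recording
segments = ["park", "park", "home", "park", "office"]
file_label = majority_vote(segments)
print(file_label)
```

The same rule applies to any system on this page whose decision making is listed as "majority vote"; only the definition of a segment (frame, block, channel) differs.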

Sequence to Sequence Autoencoders for Unsupervised Representation Learning From Audio

Abstract

This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, which are treated as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. An accuracy of 88.0 % is achieved on the official development set of the challenge, a relative improvement of 17.7 % over the challenge baseline.
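
As a rough check of the relative improvement quoted above, the arithmetic can be sketched as below. The baseline value of 74.8 % is an assumption taken from other abstracts on this page that quote the development-set baseline; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(system_acc, baseline_acc):
    """Relative gain over the baseline, in percent of the baseline accuracy."""
    return 100.0 * (system_acc - baseline_acc) / baseline_acc


BASELINE_DEV_ACC = 74.8  # assumption: development-set baseline quoted elsewhere on this page
gain = relative_improvement(88.0, BASELINE_DEV_ACC)
print(round(gain, 1))
```

With the rounded figures this gives about 17.6 %, close to the reported 17.7 %; the small gap presumably comes from rounding of the published accuracies.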

System characteristics
Input mixed
Sampling rate 44.1kHz
Features log-mel energies
Classifier MLP
PDF

The Combined Augsburg / Passau / Tum / Icl System for DCASE 2017

Abstract

This technical report covers the fusion of two approaches towards the Acoustic Scene Classification sub-task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). The first system uses a novel recurrent sequence to sequence autoencoder approach for unsupervised representation learning. The second system is based on the late fusion of support vector machines trained on either wavelet features or an archetypal acoustic feature set. A weighted late-fusion combination of these two systems achieved an accuracy of 90.1 % on the official development set of the challenge – a relative percentage improvement of 20.2 % over the challenge baseline.

System characteristics
Input mixed
Sampling rate 44.1kHz
Features log-mel energies
Classifier MLP+SVM
Decision making weighted late fusion
PDF
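
A weighted late fusion of two systems, as listed under decision making above, could look like the following sketch. The weight value, the score dictionaries and the function name `weighted_late_fusion` are illustrative assumptions; the report's actual weights are not given on this page.

```python
def weighted_late_fusion(scores_a, scores_b, weight_a=0.5):
    """Fuse two per-class score dictionaries with a convex combination."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both systems must score the same classes")
    fused = {
        label: weight_a * scores_a[label] + (1.0 - weight_a) * scores_b[label]
        for label in scores_a
    }
    # The fused decision is the class with the highest combined score
    return max(fused, key=fused.get), fused


# Hypothetical class scores from the two subsystems for one recording
autoencoder_scores = {"park": 0.6, "home": 0.3, "office": 0.1}
svm_scores = {"park": 0.2, "home": 0.7, "office": 0.1}

label_a, _ = weighted_late_fusion(autoencoder_scores, svm_scores, weight_a=0.7)
label_b, _ = weighted_late_fusion(autoencoder_scores, svm_scores, weight_a=0.3)
print(label_a, label_b)
```

In this toy case the decision flips from one class to the other as the weight moves from the first system to the second, which is why the weight is usually tuned on held-out folds.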

Nonnegative Feature Learning Methods for Acoustic Scene Classification

Abstract

This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit a simple deep neural network architecture to classify both low-level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that jointly exploits unsupervised nonnegative matrix factorization activation features and low-level time-frequency representations as inputs. Finally, we present a fusion of the proposed systems in order to further improve performance. The resulting systems are our submission for Task 1 of the DCASE 2017 challenge.

System characteristics
Input left, right
Sampling rate 44.1kHz
Features CQT
Classifier NMF, MLP; NMF
Decision making average log-probability
PDF
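
The "average log-probability" decision rule listed above can be sketched as follows, assuming each channel (left, right) yields a class-probability dictionary. The names and example probabilities are hypothetical and only illustrate the rule.

```python
import math


def average_log_probability(prob_rows, eps=1e-12):
    """Average per-class log-probabilities over channels (or frames) and pick the best class."""
    labels = list(prob_rows[0].keys())
    averaged = {
        label: sum(math.log(max(row[label], eps)) for row in prob_rows) / len(prob_rows)
        for label in labels
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical class probabilities predicted from the left and right channels
left = {"bus": 0.5, "tram": 0.5}
right = {"bus": 0.9, "tram": 0.1}
decision, avg_scores = average_log_probability([left, right])
print(decision)
```

Averaging in the log domain is equivalent to taking a geometric mean of the probabilities, so a class that one channel considers very unlikely is penalised more strongly than with a plain arithmetic mean.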

Acoustic Scene Classification Using Deep Neural Network

Abstract

In this paper, deep neural networks (DNN) are applied to the acoustic scene classification task of the DCASE2017 challenge. We perform experiments on a dataset consisting of 15 types of acoustic scenes, using the development data and evaluation data provided for Task 1. We propose a DNN architecture for utterance-level classification. Evaluation of the models was performed on the given evaluation data of Task 1 for 4 folds using the development data. In this approach, MFCC and IMFCC feature vectors are used to train DNN models, and their DNN scores are combined to test the system. On the official development data set of Task 1, an accuracy of 81.28% is achieved.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC, Inverse Melfrequency cepstral coefficients
Classifier DNN
Decision making majority vote
PDF

FrameCNN: A Weakly-Supervised Learning Framework for Frame-Wise Acoustic Event Detection and Classification

Abstract

In this paper, we describe our contribution to the challenge of detection and classification of acoustic scenes and events (DCASE2017). We propose FrameCNN, a novel weakly supervised learning framework that improves the performance of convolutional neural networks (CNNs) for acoustic event detection by attending to details of each sound at various temporal levels. Most existing weakly supervised frameworks replace the fully connected network with global average pooling after the final convolution layer. Such a method tends to identify only a few discriminative parts, leading to sub-optimal localization and classification accuracy. The key idea of our approach is to consciously classify the sound of each frame given by the corresponding label. The idea is general and can be applied to any network for achieving sound event detection and improving the performance of sound event classification. In acoustic scene classification (Task 1), our approach obtained an average accuracy of 99.2% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 74.8%. In the large-scale weakly supervised sound event detection for smart cars (Task 4), we obtained an F-score of 53.8% for sound event audio tagging (subtask A), compared to the baseline of 19.8%, and an F-score of 32.8% for sound event detection (subtask B), compared to the baseline of 11.4%.

System characteristics
Input mono
Sampling rate 44.1kHz
Features spectrogram
Classifier CNN; ensemble
Decision making majority vote
PDF

Deep Learning for DCASE2017 Challenge

Abstract

This paper reports our results on all tasks of DCASE challenge 2017 which are acoustic scene classification, detection of rare sound events, sound event detection in real life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are developed based on two favorite neural networks which are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC; log-mel energies; log-mel energies, MFCC
Classifier CRNN; CNN
Decision making majority vote
PDF

Ensemble of Deep Neural Networks for Acoustic Scene Classification

Abstract

Deep neural networks (DNNs) have recently achieved great success in a multitude of classification tasks. Ensembles of DNNs have been shown to improve the performance. In this paper, we explore the recent state-of-the-art DNNs used for image classification. We modified these DNNs and applied them to the task of acoustic scene classification. We conducted a number of experiments on the TUT Acoustic Scenes 2017 dataset to empirically compare these methods. Finally, we show that the ensemble of these DNNs improves the baseline score for DCASE-2017 Task 1 by 10%.

System characteristics
Input mono
Sampling rate 44.1kHz; 16kHz; 44.1kHz, 16kHz
Features log-mel spectrogram
Classifier CNN; CNN, ensemble
Decision making mean
PDF

MLP-Based Feature Learning for Automatic Acoustic Scene Classification

Abstract

This paper presents an experimental setup for feature learning in the context of Automatic Acoustic Scene Classification. The setup presented in this paper has been successfully used for Automatic Music Genre Classification by Sigtia and Dixon (2014). First, an MLP is trained with audio frames calculated from a 2048-sample STFT and one-hot encoding. Then, the activations of each hidden layer of the MLP are stored as learned features for the entire dataset. Such features are then used to train Random Forests in order to increase classification performance. Our results on the DCASE 2017 development dataset reach 80% accuracy across the supplied folds.
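
Assuming the label encoding mentioned in the abstract is the standard one-hot encoding of class labels used as MLP training targets, it can be sketched as below. The class list and the helper name `one_hot` are illustrative assumptions, not taken from the report.

```python
def one_hot(labels, classes):
    """Encode each label as a vector with a single 1.0 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    return [
        [1.0 if index[label] == j else 0.0 for j in range(len(classes))]
        for label in labels
    ]


# Hypothetical subset of scene classes and frame labels
classes = ["home", "office", "park"]
targets = one_hot(["home", "park"], classes)
print(targets)
```

Each frame of a recording would share the target vector of its recording-level scene label.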

System characteristics
Input mono
Sampling rate 44.1kHz
Features STFT
Classifier MLP; MLP, random forest
Decision making probability sum; majority vote
PDF

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

Abstract

This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled mel-spectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2%.

System characteristics
Input mono
Sampling rate 44.1kHz
Features various
Classifier ensemble
Decision making max of average score
PDF

Classification of Acoustic Scenes Based on the Modulation Spectrum

Abstract

A system for the automatic classification of acoustic scenes is proposed. This system calculates the spectral distribution of energy across auditory-relevant frequency bands and obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. This parametrisation scheme achieves good separation among scene classes, since it gets good classification results with a simple classifier consisting of a multilayer perceptron with only one hidden layer.
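
The parametrisation step described above (a discrete cosine transform applied to the logarithm of the envelope modulation spectrum) can be sketched in a few lines. This is a minimal sketch using an unnormalised DCT-II; the function names, the number of retained coefficients and the input values are assumptions, and the report may use a different DCT normalisation.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II: X[k] = sum_i x[i] * cos(pi * k * (2i + 1) / (2n))."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i, v in enumerate(values))
        for k in range(n)
    ]


def modulation_descriptors(modulation_spectrum, n_coeffs):
    """Take the log of the modulation spectrum and keep the first DCT coefficients."""
    log_spectrum = [math.log(max(energy, 1e-12)) for energy in modulation_spectrum]
    return dct_ii(log_spectrum)[:n_coeffs]


# Hypothetical flat modulation spectrum: only the zeroth coefficient is non-zero
flat = [2.0] * 8
coeffs = modulation_descriptors(flat, n_coeffs=4)
print([round(c, 6) for c in coeffs])
```

Keeping only the low-order coefficients yields a compact description of the overall shape of the log modulation spectrum, in the same spirit as cepstral coefficients.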

System characteristics
Input binaural
Sampling rate 44.1kHz
Features modulation spectrum
Classifier MLP
Decision making a posteriori probability
PDF

Acoustic Scene Classification by Fusing LightGBM and VGG-Net Multichannel Predictions

Abstract

This report provides a solution for the task 1 of DCASE 2017 challenge. We build two parallel audio scene classification systems -- LightGBM and VGG-net. The prediction scores are output from the multichannel version of the TUT Acoustic Scenes 2017 dataset. Finally, we perform a linear logistic regression method to fuse the LightGBM, VGG-net and LightGBM+VGG-net scores respectively. The evaluation is done on the development set, and three outputs are submitted for the challenge.

System characteristics
Input multichannel
Sampling rate 44.1kHz
Features various; log-mel energies
Classifier GBM CNN fusion; GBM fusion; CNN fusion
Decision making maximum
PDF

Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification

Abstract

In this paper, we demonstrate how we applied a convolutional neural network to DCASE 2017 Task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonic-percussive source separation, and background subtraction. We also present a network structure that can simultaneously analyse paired input, which lets the system benefit from spatial information. The experimental results show that the proposed network structure and preprocessing method effectively learn acoustic characteristics from the audio recordings, and combining these with an ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development set using a mean ensemble.

System characteristics
Input mono, binaural
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN, ensemble
Decision making mean probability
PDF

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier MLP
Decision making majority vote
PDF

A Multi-Scale Deep Convolutional Neural Network for Acoustic Scene Classification

Abstract

Deep neural networks have shown great classification performance in a number of applications. We applied a multi-scale deep convolutional neural network to acoustic scene classification (ASC), which has been submitted to Task 1 of the DCASE-2017 challenge. In this report, we show our model for classifying short sequences of audio, represented by their Mel-Frequency Cepstral Coefficients and Constant-Q value. The system is evaluated on the public dataset provided by the organizers. The best accuracy we obtained on a 4-fold cross-validation setup is 84.4%.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation pitch shifting
Features MFCC, CQT
Classifier CNN
Decision making majority vote
PDF

Improved Acoustic Scene Classification with DNN and CNN

Abstract

This paper presents acoustic scene classification (ASC) to differentiate between different acoustic environments, corresponding to Task 1 of the DCASE 2017 challenge. In this contribution we have applied two classification techniques, i.e. Deep Neural Network (DNN) and Convolutional Neural Network (CNN). DNN and CNN are widely used in speech recognition, computer vision, and natural language processing applications. These techniques have recently achieved great success in the field of audio classification for various applications. We achieved higher accuracy than the previous work done on benchmark datasets provided in the DCASE 2016 challenge. We used frame-level randomization of the training dataset and log mel energy features to achieve higher accuracy with DNN and CNN. It is observed that the DNN achieved 90.41% and 90.03% accuracy, and the CNN achieved 90.71% and 88.86% accuracy, on randomized data based on 80 and 60 mel energy features, respectively.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN; DNN
PDF

BUET Bosch Consortium (B2C) Acoustic Scene Classification Systems for DCASE 2017

Abstract

This technical report describes the systems jointly submitted by Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, and Robert Bosch Research and Technology Center, Palo Alto, CA, USA, for the Acoustic scene classification (ASC) task of the DCASE 2017 challenge. Our sub-systems mainly consist of Convolutional Neural Network (CNN) based models trained on Spectrogram Image Features (SIF) using Mel and Log-scaled filter-banks. We also used a novel multi-band approach that learns the CNN models from different frequency bands separately using a single spectrogram. In a variant of the CNN sub-systems, high-dimensional audio segment level feature vectors are extracted from the flattening layer of a trained CNN model and later classified using a Probabilistic Linear Discriminant Analysis (PLDA) model. This sub-system is termed the CNN-SuperVector (SV) system. We also implemented a GMM SuperVector system with a PLDA classifier and a feed-forward Neural Network (NN) classifier trained on an acoustic feature ensemble. Finally, we utilized linear score-fusion to combine the class-wise scores obtained from the different sub-systems.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC, log-mel energies; log-mel energies; MFCC, log-mel energies, different functionals of various spectral and prosodic features
Classifier GMM-SV, CNN-SV, Multiband CNN-SV; CNN-SV; GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN; GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN, DNN
Decision making majority vote
PDF

Acoustic Scene Classification Using CRNN

Abstract

This paper presents an application of a convolutional recurrent neural network (CRNN) to the task of Acoustic Scene Classification (ASC). This is the first attempt, to the authors’ knowledge, to use this kind of network for the task of ASC, even though simple convolutional neural networks (CNN) have already been applied and proven for this specific task. The submitted methods have been developed for the 2017 edition of the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenge and consequently tested on the datasets provided for the task of ASC. In this paper, we use two CRNN-based methods which score an overall accuracy of 78.9% and 80.8%.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
Decision making maximum; majority vote
PDF

DNN-Based Audio Scene Classification for DCASE 2017: Dual Input Features, Balancing Cost, and Stochastic Data Duplication

Abstract

In this study, we explored DNN-based audio scene classification systems with dual input features. Dual input features take advantage of simultaneously utilizing two features with different levels of abstraction as inputs: a frame-level mel-filterbank feature and utterance-level identity vector. A new fine-tune cost that solves the drawback of dual input features was developed, as well as a data duplication method that enables DNN to clearly discriminate frequently misclassified classes. Combining the proposed methods with the latest DNN techniques such as residual learning achieved a fold-wise accuracy of 95.8% for the validation set provided by the Detection and Classification of Acoustic Scenes and Events community.

System characteristics
Input left, right, mixed
Sampling rate 44.1kHz
Data augmentation stochastic duplication
Features mel-filterbank features
Classifier MLP, ensemble
Decision making score sum
PDF

DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features

Abstract

Recordings of acoustic scenes contain information from multiple sound sources that can be captured by different types of handcrafted features. These features can be classified using kernel machines, such as Support Vector Machines, which can approximate decision boundaries arbitrarily well. However, the complexity of training these methods increases with the dimensionality of the features and the size of the dataset. A solution is to take advantage of shift-invariant kernels to map the input features to a randomized low-dimensional feature space, then use the resulting random features to approximate non-linear kernels with linear kernel computation. In this work, we compared shift-invariant kernels such as Gaussian, Laplacian and Cauchy and their corresponding random features. Experiments show that the kernels outperformed the DCASE baseline by an absolute 4%. More importantly, the dimensionality of the random features is more than three times smaller than that of the input features (2,048 versus 6,553) with minimal loss of performance, and can be made more than 10 times smaller while still outperforming the baseline. Random feature approaches provide a strong alternative for acoustic scene classification with a small or large number of instances. Moreover, they provide other benefits such as privacy preservation.
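
The idea of approximating a shift-invariant kernel with random features can be illustrated for the Gaussian kernel using random Fourier features. This is a generic sketch of that well-known construction, not the report's code: the kernel width `gamma`, the toy input vectors and all function names are assumptions; only the feature count of 2,048 echoes the abstract.

```python
import math
import random


def gaussian_kernel(x, y, gamma):
    """Exact Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_random_features(input_dim, n_features, gamma, seed=0):
    """Build a map z(x) such that z(x) . z(y) approximates the Gaussian kernel."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma * I)
    weights = [[rng.gauss(0.0, std) for _ in range(input_dim)] for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return transform


# Toy 3-dimensional inputs; real feature vectors would be much longer
x, y = [0.1, 0.2, 0.3], [0.3, 0.1, 0.0]
transform = make_random_features(input_dim=3, n_features=2048, gamma=0.5)
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = gaussian_kernel(x, y, gamma=0.5)
print(round(exact, 3), round(approx, 3))
```

Once the inputs are mapped this way, a linear classifier on the random features stands in for the non-linear kernel machine, which is what keeps training cost manageable as the dataset grows.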

System characteristics
Input mono
Sampling rate 44.1kHz
Features emo_conf (opensmile)
Classifier SVM
Decision making highest score
PDF

DCASE 2017 Acoustic Scene Classification Using Convolutional Neural Network in Time Series

Abstract

This technical paper presents our approach for the acoustic scene classification (ASC) task in the DCASE2017 challenge. We propose a combination of recent deep learning algorithms for classifying audio sequences. We stack dilated causal convolutions, which are efficient for time series signals without a recurrent structure, and use the SELU activation unit instead of batch normalization. Based on this, various experiments were evaluated on the ASC development dataset. The results were analyzed from different perspectives, and the best accuracy score obtained by our system is 75.9%.
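
The two building blocks named in the abstract, a dilated causal convolution and the SELU activation, can be sketched for a single channel as below. This is a plain-Python illustration under assumed kernel values and dilation; the report's actual layer sizes are not given on this page. The SELU constants are the standard ones from the original SELU formulation.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def dilated_causal_conv1d(signal, kernel, dilation):
    """y[t] = sum_k kernel[k] * signal[t - k * dilation], treating samples before t=0 as zero."""
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += weight * signal[idx]
        output.append(acc)
    return output


# Hypothetical input and a two-tap kernel with dilation 2: y[t] = x[t] + x[t - 2]
features = dilated_causal_conv1d([1.0, 2.0, 3.0, 4.0, 5.0], kernel=[1.0, 1.0], dilation=2)
activated = [selu(v) for v in features]
print(features)
```

Stacking such layers with growing dilation widens the receptive field quickly without recurrence, while the output at time t never depends on future samples.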

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
Decision making majority vote
PDF

Recurrent Neural Network and Maximal Figure of Merit for Acoustic Event Detection

Abstract

In this report, we describe the systems submitted to the DCASE 2017 challenge. In particular, we explored convolutional recurrent neural network (CRNN) for acoustic scene classification (Task 1). For the weakly supervised sound event detection (Task 4), we utilized CRNN by embedding maximal figure-of-merit (CRNN-MFoM) into the binary cross-entropy objective function. On the development data set, the CRNN model achieves an average 14.7% relative accuracy improvement on the classification Task 1, and the CRNN-MFoM improves the F1-score from 10.9% to 33.5% on the detection Task 4 compared to the baseline system.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
Decision making majority vote
PDF

Wavelets Revisited for the Classification of Acoustic Scenes

Abstract

We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features behave comparably to the typically used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of trained models with wavelets and typical acoustic features reaches the best averaged 4-fold cross-validation accuracy of 83.2% and 82.6% with SVMs and GRNNs, respectively; both significantly outperform the baseline (74.8%) of the official development set (p<0.001, one-tailed z-test).
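
The significance claim above can be made concrete with a one-tailed z-test on two accuracies. The sketch below is one plausible form of such a test (a two-proportion z-test with a pooled variance) and is an assumption, since the abstract does not specify the exact variant; the sample count of 4680 is the development-set size from the task description and is likewise assumed to be the relevant n. Treating the two systems as independent samples is a simplification.

```python
import math


def one_tailed_z_test(acc_system, acc_baseline, n_samples):
    """Two-proportion z-test with equal sample sizes; returns (z, one-tailed p-value)."""
    pooled = (acc_system + acc_baseline) / 2.0
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n_samples)
    z = (acc_system - acc_baseline) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


N_DEV_SAMPLES = 4680  # assumption: development-set size from the task description
z_svm, p_svm = one_tailed_z_test(0.832, 0.748, N_DEV_SAMPLES)
z_grnn, p_grnn = one_tailed_z_test(0.826, 0.748, N_DEV_SAMPLES)
print(p_svm < 0.001, p_grnn < 0.001)
```

Under these assumptions both differences are far beyond the p<0.001 threshold, in line with the abstract's statement.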

System characteristics
Input mono
Sampling rate 44.1kHz
Features wavelets, ComParE (openSMILE)
Classifier SVM; GRNN
Decision making margin sampling value
PDF

Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task

Abstract

This report describes the CP-JKU team's submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2017 challenge, and discusses some observations we made about the data and the classification setup. Our approach is based on the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on raw spectrograms. The data provided for the 2017 ASC task presented some new challenges -- in particular, audio stimuli of very short duration. These will be discussed in detail, and our measures for addressing them will be described. The result of our experiments is a classification system that achieves classification accuracies of around 90% on the provided development data, as estimated via the prescribed four-fold cross-validation scheme (which, we suspect, may be rather optimistic in relation to new data).

System characteristics
Input binaural; mono, binaural; mono
Sampling rate 22.05kHz
Data augmentation pitch shifting
Features MFCC based i-vectors; MFCC, log-scaled spectrogram; log-scaled spectrogram; mel-scaled spectrograms, i-vectors
Classifier i-vector; CNN, i-vector, ensemble; CNN, ensemble; i-vector, CNN, ensemble
Decision making min. cosine distance; model averaging; fusion w/ logistic linear regression
PDF
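
The "min. cosine distance" decision rule listed for the i-vector system can be sketched as follows. The two-dimensional vectors and the per-class model vectors are purely illustrative assumptions (real i-vectors have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def classify_by_cosine(ivector, class_models):
    """Pick the class whose model vector is closest in cosine distance."""
    return min(class_models, key=lambda label: cosine_distance(ivector, class_models[label]))


# Hypothetical class model vectors and a test i-vector
class_models = {"park": [1.0, 0.0], "home": [0.0, 1.0]}
predicted = classify_by_cosine([0.9, 0.1], class_models)
print(predicted)
```

Because cosine distance ignores vector length, the decision depends only on the direction of the i-vector relative to each class model.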

The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification

Abstract

In this report, we present our work on three tasks of the IEEE AASP challenge on DCASE 2017, i.e. task 1: Acoustic Scene Classification (ASC), task 2: detection of rare sound events in artificially created mixtures and task 3: sound event detection in real life recordings. Tasks 2 and 3 belong to the same problem, i.e. Sound Event Detection (SED). We adopt deep learning techniques to extract Deep Audio Feature (DAF) and classify various acoustic scenes or sound events. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bi-directional Long Short Term Memory (Bi-LSTM) fed by the DAF is built for ASC and SED. Evaluated on the development datasets of DCASE 2017, our systems are superior to the corresponding baselines for tasks 1 and 2, and our system for task 3 performs as well as the baseline in terms of the predominant metrics.

System characteristics
Input mono
Sampling rate 44.1kHz
Features DNN(MFCC)
Classifier Bi-LSTM; DNN
Decision making majority vote
PDF

Acoustic Scene Classifications

Abstract

In this paper, we present three approaches to Task 1, Acoustic Scene Classification (ASC): a simple CNN with low time complexity, a novel feature extraction method, and feature fusion. First, we propose a simplified CNN architecture with only two convolutional layers to avoid overfitting. The model strikes a balance between higher accuracy and lower time complexity. Second, we extract identifiable audio features by a data-driven spectrogram down-sampling. Third, we perform feature fusion by combining data-driven features with the Mel-frequency spectrogram (MFS) as the network input. All three approaches improve classification accuracy compared with the baseline on the development set.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
Decision making logsum
PDF

Auditory Scene Classification Based on the Spectro-Temporal Structure Analysis

Abstract

In this report, we present a modular system for acoustic scene classification. Our proposed system contains four modules to compute the representations describing spectro-temporal properties of audio data. The frequency components are extracted from a cochleagram and low-level audio feature contours. An onset map is used to determine the properties of temporal structure, and binaural cues are additional components in the final feature space. The computed features are formed into a vector and fed to a random forest classifier for classification. The results were submitted to the 2017 IEEE AASP DCASE challenge.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features cochleagram, onset map, binaural cues, low-level feature contours
Classifier random forest
PDF

Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane

Abstract

Although it is typically expected that using a large amount of labeled training data would lead to improved performance in deep learning, it is generally difficult to obtain such a database (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are constrained to use a relatively small DB as a rule, which is similar to the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing an additional DB, this paper proposes a Generative Adversarial Network (GAN) based method for generating additional training data. Since it is not clear whether every sample generated by the GAN would have equal impact on classification performance, this paper proposes to use the Support Vector Machine (SVM) hyperplane for each class as a reference for selecting samples that carry class-discriminative information. Based on cross-validated experiments on the development DB, the use of the generated features could improve ASC performance.

System characteristics
Input left, right, mixed
Sampling rate 22.05kHz
Data augmentation GAN
Features log-mel energies, spectrogram
Classifier MLP, RNN, CNN, SVM
Decision making majority vote
PDF

Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features

Abstract

This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to the experiments which were performed with the DCASE2017 challenge development dataset it is claimed that the proposed method outperformed several baseline methods. Specifically, the class average accuracy is shown as 83.6%, which is an improvement of 8.8%, 9.5%, 8.2% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively.

System characteristics
Input binaural
Sampling rate 44.1kHz
Data augmentation block mixing
Features covariance of gammachirp energies, double FFT of gammachirp energies
Classifier CNN
Decision making maximum posterior
PDF

Attention-Based CNN with Generalized Label Tree Embedding for Audio Scene Classification

Abstract

This report presents our audio scene classification systems submitted for Task 1 ("acoustic scene classification") of the DCASE 2017 challenge. The systems rely on combinations of a generalized label tree embedding representation, convolutional neural networks (CNNs), and an attention mechanism. Our experimental results on the development data of the challenge show that our proposed system significantly outperforms the challenge's baseline, improving the average classification accuracy from 74.8% of the baseline to 83.8%.

System characteristics
Input binaural
Sampling rate 44.1kHz
Data augmentation cross-validation with different data splits
Features generalized label tree embedding
Classifier CNN; Attentive CNN
Decision making entire-signal classification
PDF

The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification

Abstract

This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation time delay, block mixing
Features spectrogram
Classifier CNN
Decision making majority vote
PDF
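
Several systems on this page list "majority vote" as their decision making. A minimal sketch of that step follows, assuming the recording is split into segments and each segment already has a predicted label; the segmentation, the tie-breaking rule, and the name `majority_vote` are assumptions and not details taken from the reports.

```python
from collections import Counter


def majority_vote(segment_labels):
    """Return the label predicted for the most segments.

    Ties are broken in favour of the label that appears first,
    which is an arbitrary choice made for this sketch.
    """
    if not segment_labels:
        raise ValueError("no segment predictions given")
    counts = Counter(segment_labels)
    # max() returns the first maximal key in insertion order.
    return max(counts, key=counts.get)


# Hypothetical per-segment predictions for one 10-second recording.
file_label = majority_vote(["park", "home", "park", "office", "park"])
```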

Human-Based Greedy Search of CNN Architecture

Abstract

This paper presents the methodology we have followed for our submission at the DCASE 2017 competition on acoustic scene classification (Task 1). The approach is based on convolutional neural networks. There is nothing original about this contribution, as we have just applied a human-based search of the best CNN architecture and hyper-parameters using a 4-fold cross-validation for selecting the best model. We hope that this approach will not reach the top entry of the challenge and that it will be outperformed by clever and beautiful methods.

System characteristics
Input mono
Sampling rate 44.1kHz
Features CQT
Classifier CNN
Decision making average prediction; average prediction over 4 models; average prediction over 19 models
PDF

Multi-Temporal Resolution Convolutional Neural Networks for the DCASE Acoustic Scene Classification Task

Abstract

In this paper we present our contributions to the DCASE 2017 Challenge on Detection and Classification of Acoustic Scenes and Events. We propose a parallel Convolutional Neural Network architecture for the task of classifying acoustic scenes and urban soundscapes. We propose a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene's spectral texture as well as its distribution of acoustic events. The best performing variant of the proposed model scores 90.54% accuracy on the development dataset. This is a 6.81% improvement over the best performing single resolution model and 15.74% over the DCASE 2017 Acoustic Scenes Classification task baseline.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation time stretching, block mixing, pitch shifting, mixing files of same class, gaussian noise
Features log-mel spectrogram
Classifier CNN
Decision making argmax of average softmax response per file
PDF
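
The "argmax of average softmax response per file" decision listed above can be sketched as follows. The sketch assumes each segment of a file yields one softmax vector over the classes; the function name `classify_file` and the toy numbers are hypothetical.

```python
def classify_file(segment_posteriors, class_names):
    """Average the per-segment softmax outputs, then take the argmax.

    segment_posteriors: list of softmax vectors, one per segment.
    Returns (predicted class name, averaged posterior vector).
    """
    n_segments = len(segment_posteriors)
    n_classes = len(class_names)
    mean = [
        sum(p[c] for p in segment_posteriors) / n_segments
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: mean[c])
    return class_names[best], mean


# Two segments narrowly favour "park", one strongly favours "home".
posteriors = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
label, mean = classify_file(posteriors, ["park", "home"])
```

In this toy case averaging selects "home" even though two of the three segments individually favour "park", so averaging and a per-segment majority vote can disagree when segment confidences differ strongly.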

Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning

Abstract

This report provides our contribution to the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. We investigated two approaches for the acoustic scene classification task. Firstly, we used a combination of features in the time and frequency domain and a hybrid Support Vector Machines - Hidden Markov Model (SVM-HMM) classifier to achieve an average accuracy over 4 folds of 80.9%. Secondly, we used the log-mel spectrogram for feature extraction and a Convolutional Neural Network (CNN) to achieve an average accuracy over 4 folds of 83.7%. Moreover, by exploiting data-augmentation techniques and using the whole segment (as opposed to splitting into sub-sequences) as an input, the accuracy of our CNN system was boosted to 95.9%. Our two approaches outperformed the DCASE baseline method, which uses log-mel band energies for feature extraction and a MultiLayer Perceptron (MLP) to achieve an average accuracy over 4 folds of 74.8%.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation speed and pitch change (downsampling), amplitude change (dynamic), gaussian noise
Features MFCC, MFCC delta, MFCC acceleration, centroid, rolloff, ZCR; log-mel spectrogram
Classifier SVM-HMM; CNN
Decision making majority vote
PDF

Performance Evaluation of Deep Learning Architectures for Acoustic Scene Classification

Abstract

This paper is a submission to the sub-task Acoustic Scene Classification of the IEEE Audio and Acoustic Signal Processing challenge: Detection and Classification of Acoustic Scenes and Events 2017. The aim of the sub-task is to correctly detect 15 different acoustic scenes, which consist of indoor, outdoor, and vehicle categories. This work is based on log mel-filter bank features and deep learning. In this short paper, the impact of different parameters while applying a basic Deep Neural Network (DNN) architecture is first analyzed. The accuracy gains obtained by the different types of deep learning architectures such as Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) are then reported. It has been observed that the overall best scene classification accuracy was obtained with CNN.

System characteristics
Input binaural
Sampling rate 44.1kHz
Data augmentation feature frame concatenation
Features log mel-filter bank
Classifier RNN; LSTM; GRU; CNN
Decision making majority vote
PDF

IIT Kharagpur Submissions for DCASE2017 ASC Task: Audio Features in a Fusion-Based Framework

Abstract

This report describes two submissions for Acoustic Scene Classification (ASC) task of the IEEE AASP challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2017. The first system follows an approach based on a score-level fusion of some well-known spectral features of audio processing. The second system uses the first proposed system in a two-stage hierarchical classification framework. On the DCASE 2017 development dataset, the two systems respectively show 18% and 21% better performance relative to that of the MLP-based baseline system.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features combination [block-based MFCC; SCFC; CQCC]
Classifier SVM
Decision making fusion
PDF

Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion

Abstract

Making sense of the environment by sounds is an important research topic in the machine learning community. In this work, a Deep Convolutional Neural Network (DCNN) model is presented to classify acoustic scenes along with a multiple spectrograms fusion method. Firstly, the generations of raw spectrogram and CQT spectrogram are introduced separately. Corresponding features can then be extracted by feeding these spectrogram data into the proposed DCNN model. To fuse these multiple spectrogram features, two fusing mechanisms, namely the voting and the SVM methods, are designed. By fusing DCNN features of the raw and CQT spectrograms, the accuracy is significantly improved in our experiments, compared with the single spectrogram schemes. This proves the effectiveness of the proposed multi-spectrograms fusion method.

System characteristics
Input binaural
Sampling rate 22.05kHz
Features spectrogram, CQT
Classifier CNN
Decision making majority vote; SVM
PDF

Fusion Model Based on Convolutional Neural Networks with Two Features for Acoustic Scene Classification

Abstract

This report describes two submissions by the PDL team for Task 1 (audio scene classification) of the DCASE-2017 challenge. We propose two different approaches for Task 1. First, we propose a new convolutional neural network (CNN) architecture trained on frame-level features such as mel-frequency cepstral coefficients (MFCCs) of audio data. Second, we propose a late fusion of the proposed CNN trained with two different features, namely, MFCCs and spectrograms. We report the performance of our proposed methods on the cross-validation setup for Task 1 of the DCASE-2017 challenge.

System characteristics
Input left, right, mixed
Sampling rate 44.1kHz
Data augmentation pitch shifting
Features MFCC, spectrogram
Classifier CNN
Decision making majority vote
PDF

Acoustic Scene Classification Using Autoencoder

Abstract

This report describes our contribution to the Acoustic Scene Classification (ASC) task of the 2017 IEEE AASP DCASE challenge. We apply an Autoencoder to capture the discriminative information underlying the audio. Then, a Logistic Regression model is employed to recognize different scenes under the compressed representation. In order to boost the performance, we train models based on different channels from the original recordings and simply apply a majority voting method on the predictions. Our final system achieves an accuracy of 84.31% on a four-fold cross-validation setting, which outperforms the baseline system by 9.5%.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features CQT
Classifier Autoencoder and Logistic Regression
Decision making majority vote
PDF

ADSC Submission for DCASE 2017: Acoustic Scene Classification Using Deep Residual Convolutional Neural Networks

Abstract

This report describes our two submissions to the DCASE-2017 challenge for Task 1 (Acoustic scene classification). The first submission is motivated by the superior performance of the deep residual networks for both image and audio classifications. We propose a modified deep residual architecture trained on log-mel spectrogram patches in an end-to-end fashion for acoustic scene classification. We configure the number of layers and kernels for the deep residual nets and find that the modified deep residual net of 34 layers using binaural input features performs well on the DCASE-2017 development dataset. In the second submission, we implement a shallower network that consists of 3 convolutional layers and 2 fully connected layers to benchmark the performance of the residual network. Our two approaches improve the accuracy of the baseline by 10.8% and 10.6% respectively on the 4-fold cross-validation. We suggest that the size of the dataset for Task 1 is relatively small for deep networks to outperform shallower ones.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features log-mel spectrogram
Classifier CNN
Decision making majority vote
PDF

A System for 2017 DCASE Challenge Using Deep Sequential Image and Wavelet Features

Abstract

For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning for the audio scenes. First, deep representations extracted from the spectrogram and two types of scalograms using Convolutional Neural Networks, the ComParE features, and two types of wavelet features are fed into the Gated Recurrent Neural Networks for classification separately. Predictions from the six models are then combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 83.3%, an increase of 8.5% over the baseline (p<.001 by one-tailed z-test).

System characteristics
Input mono
Sampling rate 44.1kHz
Features spectrogram, scalogram, wavelets, ComParE (openSMILE)
Classifier GRNN
Decision making margin sampling value
PDF
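
This page does not define the "margin sampling value" strategy named above. One common reading, assumed here, is that the margin sampling value of a model is the gap between its two highest class posteriors, and that the prediction of the model with the largest gap is kept. The sketch below follows that assumption; the function names and the toy numbers are hypothetical and may differ from what the report does.

```python
def margin_sampling_value(posteriors):
    """Gap between the highest and second-highest class posterior."""
    top = sorted(posteriors, reverse=True)
    return top[0] - top[1]


def fuse_by_msv(model_posteriors):
    """Keep the prediction of the model with the largest margin.

    model_posteriors: one posterior vector per model for the same recording.
    Returns (index of predicted class, index of the selected model).
    """
    best_model = max(
        range(len(model_posteriors)),
        key=lambda m: margin_sampling_value(model_posteriors[m]),
    )
    probs = model_posteriors[best_model]
    best_class = max(range(len(probs)), key=lambda c: probs[c])
    return best_class, best_model


# Two hypothetical models scoring three classes for one recording.
predicted_class, chosen_model = fuse_by_msv([[0.5, 0.4, 0.1], [0.2, 0.7, 0.1]])
```

Here the second model is selected because its top class leads the runner-up by a wider margin, so its prediction (class index 1) is used.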