Large-scale weakly supervised
sound event detection for smart cars



Task description

The task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The task employs a subset of the AudioSet dataset, using 17 sound event classes from two categories ("Warning sounds" and "Vehicle sounds").

A detailed task description can be found on the task description page.

Challenge results

A detailed description of the metrics used can be found here.
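As a rough illustration of those metrics, the segment-based error rate (ER) and F1 used in the subtask B tables can be computed from binary activity matrices as sketched below. This is a simplified numpy sketch, not the official evaluation code, and the function name is illustrative.

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based ER and micro-averaged F1 for binary activity matrices.

    ref, est: (n_segments, n_classes) binary arrays of class activity.
    ER follows the DCASE convention: per-segment substitutions pair up
    a false negative with a false positive; the remainders count as
    deletions and insertions, normalized by the number of references.
    """
    tp = np.logical_and(ref == 1, est == 1).sum()
    fn_seg = np.logical_and(ref == 1, est == 0).sum(axis=1)  # misses per segment
    fp_seg = np.logical_and(ref == 0, est == 1).sum(axis=1)  # false alarms per segment

    subs = np.minimum(fn_seg, fp_seg).sum()
    dels = (fn_seg - np.minimum(fn_seg, fp_seg)).sum()
    ins = (fp_seg - np.minimum(fn_seg, fp_seg)).sum()

    er = (subs + dels + ins) / ref.sum()
    precision = tp / (tp + fp_seg.sum())
    recall = tp / (tp + fn_seg.sum())
    f1 = 2 * precision * recall / (precision + recall)
    return er, f1
```

An ER above 1.0 (as for some systems below) simply means the system made more errors than there were reference events.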

Systems ranking

Subtask A - Audio tagging

Columns: Code, Name, Technical Report, Evaluation dataset (overall) F1 / Precision / Recall, Development dataset (overall) F1 / Precision / Recall. Development results are blank where not reported.
Adavanne_TUT_task4_1 Ash_1 Adavanne2017 45.5 57.2 37.9
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 46.6 58.0 38.9 43.2 47.5 39.6
Adavanne_TUT_task4_3 Ash_3 Adavanne2017 44.5 55.8 37.1
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 26.3 33.2 21.8
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 47.6 43.8 52.2
Chou_SINICA_task4_2 FCNN_SM_2 Chou2017 49.0 51.9 46.4
Chou_SINICA_task4_3 FCNN_SM_3 Chou2017 47.9 48.4 47.4
Chou_SINICA_task4_4 FCNN_SM_3 Chou2017 49.0 53.8 45.0
DCASE2017 baseline Baseline Badlani2017 18.2 15.0 23.1 10.9 7.8 17.5
Kukanov_UEF_task4_1 K-CRNN-MFoM Kukanov2017 39.6 47.6 33.9 33.5 35.1 32.0
Lee_KAIST_task4_1 SDCNN_MAC Lee2017 40.3 31.3 56.7 35.3 25.6 56.7
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 47.3 48.0 46.6 41.2 37.6 45.7
Lee_KAIST_task4_3 MLMS3_MAC Lee2017 47.2 49.6 45.0 38.7 37.3 40.2
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 47.1 48.5 45.9 40.1 37.4 43.2
Lee_SNU_task4_1 EMSI1 Lee2017a 52.3 77.1 39.6 47.6 68.3 36.5
Lee_SNU_task4_2 EMSI2 Lee2017a 52.3 77.1 39.6 47.5 66.7 36.8
Lee_SNU_task4_3 EMSI3 Lee2017a 52.6 69.7 42.3 57.0 70.3 47.9
Lee_SNU_task4_4 EMSI4 Lee2017a 52.1 77.4 39.3 48.9 70.3 37.4
Salamon_NYU_task4_1 Salamon_1 Salamon2017 46.0 50.7 42.1 45.9 44.7 47.0
Salamon_NYU_task4_2 Salamon_2 Salamon2017 45.3 46.8 43.8 44.0 39.9 49.0
Salamon_NYU_task4_3 Salamon_3 Salamon2017 44.9 62.8 35.0 45.5 53.7 39.4
Salamon_NYU_task4_4 Salamon_4 Salamon2017 38.1 73.9 25.7 38.0 63.0 27.2
Toan_NCU_task4_1 ToanVu1 Vu2017 48.5 54.7 43.6 51.8 54.2 49.5
Toan_NCU_task4_2 ToanVu2 Vu2017 46.5 47.3 45.6 49.5 45.2 54.6
Tseng_Bosch_task4_1 Bosch1 Tseng2017 35.0 34.1 36.0 29.5 26.8 32.7
Tseng_Bosch_task4_2 Bosch2 Tseng2017 35.1 34.0 36.2 29.0 26.5 31.9
Tseng_Bosch_task4_3 Bosch3 Tseng2017 35.2 31.6 39.7 33.1 27.9 40.6
Tseng_Bosch_task4_4 Bosch4 Tseng2017 35.2 33.9 36.7 31.2 28.0 35.3
Xu_CVSSP_task4_1 Surrey1AB Xu2017 54.4 57.8 51.3 61.9 59.4 64.7
Xu_CVSSP_task4_2 Surrey2AB Xu2017 55.6 61.4 50.8
Xu_CVSSP_task4_3 Surrey3AB Xu2017 54.2 58.9 50.2
Xu_CVSSP_task4_4 Surrey4AB Xu2017 52.8 53.5 52.1

Subtask B - Sound event detection

Columns: Code, Name, Technical Report, Evaluation dataset (segment-based, overall) ER / F1, Development dataset (segment-based, overall) ER / F1. Development results are blank where not reported.
Adavanne_TUT_task4_1 Ash_1 Adavanne2017 0.8100 47.9 0.8400 38.8
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 0.8000 48.3 0.8400 38.1
Adavanne_TUT_task4_3 Ash_3 Adavanne2017 0.8200 48.9 0.8400 38.6
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 0.7900 49.0 0.8100 41.1
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 0.8300 42.4
DCASE2017 baseline Baseline Badlani2017 0.9300 28.4 1.0200 13.8
Lee_KAIST_task4_1 SDCNN_MAC Lee2017 0.8200 39.4 0.8800 28.1
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 0.7800 42.6 0.8600 30.8
Lee_KAIST_task4_3 MLMS3_MAC Lee2017 0.7800 44.2 0.8600 31.3
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 0.7500 47.1 0.8400 34.2
Lee_SNU_task4_1 EMSI1 Lee2017a 0.6700 54.4 0.7200 45.9
Lee_SNU_task4_2 EMSI2 Lee2017a 0.6700 54.4 0.8300 42.9
Lee_SNU_task4_3 EMSI3 Lee2017a 0.6700 55.4 0.7000 47.7
Lee_SNU_task4_4 EMSI4 Lee2017a 0.6600 55.5 0.7100 47.1
Salamon_NYU_task4_1 Salamon_1 Salamon2017 0.8200 46.2 0.8400 40.3
Salamon_NYU_task4_2 Salamon_2 Salamon2017 0.8500 45.6 0.8605 39.3
Salamon_NYU_task4_3 Salamon_3 Salamon2017 0.7700 45.9 0.7607 41.0
Salamon_NYU_task4_4 Salamon_4 Salamon2017 0.7700 45.9 0.7607 41.0
Toan_NCU_task4_2 ToanVu2 Vu2017 0.9400 43.0 0.9300 40.9
Toan_NCU_task4_3 ToanVu3 Vu2017 0.9000 42.7 0.9000 39.9
Toan_NCU_task4_4 ToanVu4 Vu2017 0.8700 41.6 0.8900 37.9
Xu_CVSSP_task4_1 Surrey1AB Xu2017 0.7300 51.8 0.7200 49.7
Xu_CVSSP_task4_2 Surrey2AB Xu2017 0.7800 47.5
Xu_CVSSP_task4_3 Surrey3AB Xu2017 1.0100 52.1
Xu_CVSSP_task4_4 Surrey4AB Xu2017 0.8000 50.4

Teams ranking

The tables below include only the best-performing system per submitting team.

Subtask A - Audio tagging

Columns: Code, Name, Technical Report, Evaluation dataset (overall) F1 / Precision / Recall, Development dataset (overall) F1 / Precision / Recall.
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 46.6 58.0 38.9 43.2 47.5 39.6
Chou_SINICA_task4_3 FCNN_SM_3 Chou2017 47.9 48.4 47.4
DCASE2017 baseline Baseline Badlani2017 18.2 15.0 23.1 10.9 7.8 17.5
Kukanov_UEF_task4_1 K-CRNN-MFoM Kukanov2017 39.6 47.6 33.9 33.5 35.1 32.0
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 47.3 48.0 46.6 41.2 37.6 45.7
Lee_SNU_task4_3 EMSI3 Lee2017a 52.6 69.7 42.3 57.0 70.3 47.9
Salamon_NYU_task4_1 Salamon_1 Salamon2017 46.0 50.7 42.1 45.9 44.7 47.0
Toan_NCU_task4_1 ToanVu1 Vu2017 48.5 54.7 43.6 51.8 54.2 49.5
Tseng_Bosch_task4_3 Bosch3 Tseng2017 35.2 31.6 39.7 33.1 27.9 40.6
Xu_CVSSP_task4_2 Surrey2AB Xu2017 55.6 61.4 50.8

Subtask B - Sound event detection

Columns: Code, Name, Technical Report, Evaluation dataset (segment-based, overall) ER / F1, Development dataset (segment-based, overall) ER / F1.
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 0.7900 49.0 0.8100 41.1
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 0.8300 42.4
DCASE2017 baseline Baseline Badlani2017 0.9300 28.4 1.0200 13.8
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 0.7500 47.1 0.8400 34.2
Lee_SNU_task4_4 EMSI4 Lee2017a 0.6600 55.5 0.7100 47.1
Salamon_NYU_task4_3 Salamon_3 Salamon2017 0.7700 45.9 0.7607 41.0
Toan_NCU_task4_4 ToanVu4 Vu2017 0.8700 41.6 0.8900 37.9
Xu_CVSSP_task4_1 Surrey1AB Xu2017 0.7300 51.8 0.7200 49.7

Class-wise performance

Subtask A - Audio tagging

Columns: Code, Name, Technical Report, overall F1, followed by class-wise F1 scores.
Warning sounds: Air horn/truck horn, Ambulance (siren), Car alarm, Civil defense siren, Fire engine/fire truck (siren), Police car (siren), Reversing beeps, Screaming, Train horn.
Vehicle sounds: Bicycle, Bus, Car, Car passing by, Motorcycle, Skateboard, Train, Truck.
Adavanne_TUT_task4_1 Ash_1 Adavanne2017 45.5 7.8 0.0 0.0 78.6 51.7 48.9 8.5 77.9 21.8 39.0 21.1 67.2 0.0 54.7 79.5 71.8 50.0
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 46.6 43.2 0.0 0.0 82.3 50.8 45.0 0.0 78.7 27.9 37.5 23.9 68.5 0.0 60.2 80.2 70.6 53.7
Adavanne_TUT_task4_3 Ash_3 Adavanne2017 44.5 16.8 0.0 0.0 80.7 54.0 48.9 0.0 70.5 9.1 32.4 7.6 68.3 0.0 62.3 80.5 66.7 52.5
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 26.3 0.0 0.0 0.0 54.9 53.6 28.6 0.0 0.0 0.0 0.0 0.0 63.7 0.0 0.0 2.2 4.3 19.1
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 47.6 55.5 48.8 48.4 80.3 56.2 57.3 37.5 84.0 68.8 39.6 36.7 64.6 28.9 52.8 78.8 68.7 46.8
Chou_SINICA_task4_2 FCNN_SM_2 Chou2017 49.0 50.7 37.8 47.3 82.0 57.1 60.1 33.3 80.8 69.5 42.1 36.9 67.0 32.9 58.7 79.3 68.5 52.9
Chou_SINICA_task4_3 FCNN_SM_3 Chou2017 47.9 55.1 60.3 57.8 81.6 57.3 47.0 36.9 84.0 68.8 40.8 35.3 66.0 32.1 56.6 76.4 67.5 52.7
Chou_SINICA_task4_4 FCNN_SM_3 Chou2017 49.0 48.2 36.6 45.6 82.8 58.1 61.7 33.9 80.8 69.5 37.4 36.4 67.4 29.3 57.8 79.8 67.9 53.5
DCASE2017 baseline Baseline Badlani2017 18.2 0.0 0.0 0.0 48.0 19.4 38.8 0.0 0.0 14.2 4.2 0.0 30.0 0.0 14.3 0.0 7.8 0.0
Kukanov_UEF_task4_1 K-CRNN-MFoM Kukanov2017 39.6 0.0 3.8 0.0 80.7 0.5 55.1 0.0 45.4 27.9 21.4 10.8 57.1 0.0 63.5 57.4 61.8 41.7
Lee_KAIST_task4_1 SDCNN_MAC Lee2017 40.3 34.1 52.7 22.8 70.2 48.4 52.9 45.9 79.4 59.8 33.5 31.7 46.0 35.4 57.3 76.6 68.4 33.6
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 47.3 30.0 50.0 15.8 82.2 59.3 53.4 33.3 79.2 72.0 48.4 34.5 60.9 15.1 61.7 74.7 72.0 42.5
Lee_KAIST_task4_3 MLMS3_MAC Lee2017 47.2 26.2 36.8 5.6 78.2 56.7 57.7 30.2 78.2 53.9 33.7 40.0 62.5 21.1 63.9 78.5 70.0 48.0
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 47.1 24.1 38.4 10.9 80.0 53.9 57.0 30.2 78.0 54.8 35.7 37.2 63.8 18.2 64.3 74.3 69.8 44.3
Lee_SNU_task4_1 EMSI1 Lee2017a 52.3 58.8 3.8 35.3 89.5 59.6 43.6 26.4 72.4 50.3 23.1 9.4 78.9 0.0 66.7 82.3 84.8 39.5
Lee_SNU_task4_2 EMSI2 Lee2017a 52.3 58.8 3.8 35.3 89.5 59.6 43.6 26.4 72.4 50.3 23.1 9.4 78.9 0.0 66.7 82.3 84.8 39.5
Lee_SNU_task4_3 EMSI3 Lee2017a 52.6 58.8 0.0 37.6 87.2 53.7 52.0 31.0 74.1 64.0 24.6 8.8 74.6 0.0 65.5 81.5 85.2 45.2
Lee_SNU_task4_4 EMSI4 Lee2017a 52.1 57.8 3.8 35.3 87.5 54.2 44.4 29.6 71.6 50.3 25.9 6.5 79.1 0.0 66.0 81.5 85.2 39.7
Salamon_NYU_task4_1 Salamon_1 Salamon2017 46.0 52.0 36.6 31.5 79.0 55.8 57.1 48.5 66.2 68.4 38.9 21.1 31.5 2.4 59.5 61.7 63.0 31.6
Salamon_NYU_task4_2 Salamon_2 Salamon2017 45.3 0.5 41.4 39.1 80.0 56.6 44.3 36.1 65.5 74.7 36.2 20.2 61.9 0.0 56.4 65.3 67.7 36.3
Salamon_NYU_task4_3 Salamon_3 Salamon2017 44.9 45.9 16.4 18.4 81.1 48.6 60.4 32.1 62.7 59.4 27.3 6.2 70.9 2.8 63.6 55.8 59.2 30.6
Salamon_NYU_task4_4 Salamon_4 Salamon2017 38.1 26.8 0.0 2.9 81.5 47.3 39.6 20.0 32.8 38.8 11.1 0.0 75.7 0.0 56.2 41.4 49.4 17.5
Toan_NCU_task4_1 ToanVu1 Vu2017 48.5 47.0 57.1 38.6 82.9 54.1 55.8 51.5 0.8 69.9 28.6 31.1 70.5 35.2 60.5 63.9 73.5 42.9
Toan_NCU_task4_2 ToanVu2 Vu2017 46.5 54.8 46.3 51.0 67.9 57.0 44.6 61.0 66.7 67.3 31.1 31.2 66.1 24.1 58.5 65.9 73.1 43.1
Tseng_Bosch_task4_1 Bosch1 Tseng2017 35.0 44.4 36.4 14.1 69.7 46.4 49.8 4.2 47.1 33.3 20.0 17.1 61.1 17.9 37.5 35.6 31.0 34.4
Tseng_Bosch_task4_2 Bosch2 Tseng2017 35.1 44.4 36.4 16.3 69.7 46.1 49.5 6.9 57.6 41.3 20.2 17.1 60.5 17.9 37.5 37.5 26.5 34.4
Tseng_Bosch_task4_3 Bosch3 Tseng2017 35.2 42.9 35.9 21.2 70.1 40.5 46.8 15.1 46.6 41.8 18.0 19.7 54.8 20.6 36.9 36.9 46.1 36.3
Tseng_Bosch_task4_4 Bosch4 Tseng2017 35.2 43.4 36.4 16.3 69.7 46.4 48.2 16.1 49.7 40.5 20.0 17.1 60.7 17.9 37.5 36.9 30.0 34.4
Xu_CVSSP_task4_1 Surrey1AB Xu2017 54.4 54.3 59.5 78.8 83.9 63.2 62.2 65.8 86.6 80.2 39.1 32.5 71.5 41.8 64.5 71.6 80.1 46.4
Xu_CVSSP_task4_2 Surrey2AB Xu2017 55.6 63.7 35.6 72.9 86.4 65.7 63.8 60.3 91.2 73.6 40.5 39.7 72.9 27.1 63.5 74.5 79.2 52.3
Xu_CVSSP_task4_3 Surrey3AB Xu2017 54.2 59.5 52.7 72.5 85.0 53.2 43.0 65.9 88.9 74.9 44.2 41.7 73.0 39.1 69.4 73.1 80.5 46.5
Xu_CVSSP_task4_4 Surrey4AB Xu2017 52.8 63.1 58.9 70.9 81.5 62.1 57.6 66.7 82.3 76.5 28.6 36.6 69.4 32.8 65.0 72.0 75.1 44.0

Subtask B - Sound event detection

Columns: Code, Name, Technical Report, overall segment-based ER and F1 (evaluation dataset), followed by class-wise ER / F1 pairs.
Warning sounds: Air horn/truck horn, Ambulance (siren), Car alarm, Civil defense siren, Fire engine/fire truck (siren), Police car (siren), Reversing beeps, Screaming, Train horn.
Vehicle sounds: Bicycle, Bus, Car, Car passing by, Motorcycle, Skateboard, Train, Truck.
Adavanne_TUT_task4_1 Ash_1 Adavanne2017 0.8100 47.9 1.0700 0.0 1.0000 1.0300 2.8 0.3400 83.0 0.9700 54.4 1.1700 42.3 0.9700 6.8 1.3200 51.2 0.9600 47.3 1.5300 24.6 1.0100 8.1 1.3300 53.0 1.1000 3.3 0.9200 50.8 0.8500 62.6 0.7800 62.0 0.9900 44.4
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 0.8000 48.3 1.0000 0.6 1.0000 1.3200 0.4 0.4300 80.4 1.2000 44.5 1.2000 31.5 1.0000 0.9100 58.1 1.1500 45.3 1.3000 26.0 1.0400 10.3 1.0700 56.4 1.0000 0.8200 57.2 1.0400 57.7 0.8300 62.9 1.0000 46.9
Adavanne_TUT_task4_3 Ash_3 Adavanne2017 0.8200 48.9 1.1600 2.5 1.0000 1.0000 0.3100 84.8 0.9900 49.8 1.0800 38.1 1.0700 27.9 1.0500 55.7 0.9900 48.8 1.7300 25.3 1.0900 20.4 1.2800 54.1 1.0000 0.7100 60.1 1.0700 59.7 0.7100 63.5 1.5500 44.6
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 0.7900 49.0 1.1000 38.3 1.0000 1.0000 0.3200 84.0 1.1300 50.9 1.1800 31.2 1.0200 29.3 0.9900 54.5 1.0900 47.0 1.3200 32.5 1.1900 32.5 1.2200 55.1 1.0000 0.9200 54.8 0.8800 62.4 0.7800 60.5 1.0400 46.4
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 0.8300 42.4 0.8800 32.3 0.9000 25.7 0.8600 30.7 0.5900 73.5 1.3900 41.0 1.1500 47.9 1.0100 29.8 0.8700 51.9 0.9000 40.9 1.4400 15.4 1.3200 17.2 0.9600 47.3 1.1200 3.2 0.9400 46.4 1.3900 46.5 0.9300 46.3 1.6700 35.3
DCASE2017 baseline Baseline Badlani2017 0.9300 28.4 1.0000 1.0000 1.0000 0.6400 67.4 0.9800 16.5 1.0100 34.0 1.0000 1.0000 0.9800 3.9 0.9900 2.5 1.0000 1.7500 46.0 1.0000 0.9700 6.1 1.0000 0.9900 1.9 1.0000
Lee_KAIST_task4_1 SDCNN_MAC Lee2017 0.8200 39.4 0.9500 14.6 1.0000 0.9500 9.0 0.5000 75.1 1.0000 39.4 1.0100 34.3 0.8900 21.0 0.9100 31.2 0.9000 24.8 1.1000 15.6 1.0400 4.1 1.3900 50.7 1.0000 0.8000 44.7 0.9400 42.4 0.8300 37.4 0.9400 23.5
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 0.7800 42.6 0.9000 23.9 0.9900 5.3 0.9400 11.1 0.4300 77.7 1.0700 42.0 1.0300 37.8 0.8600 24.5 0.8800 34.4 0.9100 27.1 1.1900 13.6 1.0300 7.2 1.2100 53.5 1.0000 0.7600 48.5 0.9200 47.0 0.8100 43.1 0.9400 37.0
Lee_KAIST_task4_3 MLMS3_MAC Lee2017 0.7800 44.2 0.9600 13.9 1.0000 1.0 0.9400 10.6 0.4200 79.1 0.9900 43.1 0.9500 42.7 0.8500 26.0 0.8600 46.4 0.9500 27.7 1.1900 17.6 1.0500 4.4 1.2300 54.8 1.0000 0.7500 49.0 0.8300 52.3 0.7800 46.8 0.9600 33.4
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 0.7500 47.1 0.9500 15.1 1.0000 0.5 0.9000 17.6 0.3900 79.6 0.9700 44.4 0.9400 42.5 0.7800 38.6 0.8900 53.7 0.9100 33.3 1.4000 17.9 1.0000 4.6 1.1700 55.9 1.0000 0.7400 54.4 0.7900 56.1 0.7500 55.5 0.9300 37.1
Lee_SNU_task4_1 EMSI1 Lee2017a 0.6700 54.4 0.6600 53.9 1.0000 5.0 0.8300 31.1 0.2900 85.4 0.9100 52.7 0.9300 37.5 0.9000 22.6 0.7800 51.8 0.9100 38.2 0.9300 22.6 1.0600 2.0 0.7300 66.0 1.0000 0.6300 59.2 0.7200 63.6 0.5100 75.4 0.9300 32.4
Lee_SNU_task4_2 EMSI2 Lee2017a 0.6700 54.4 0.6600 53.9 1.0000 0.5 0.8300 31.1 0.2900 85.4 0.9100 52.7 0.9300 37.5 0.9000 22.6 0.7800 51.8 0.9100 38.2 0.9300 22.6 1.0600 2.0 0.7300 0.7 1.0000 0.6300 59.2 0.7200 63.6 0.5100 75.4 0.9300 32.4
Lee_SNU_task4_3 EMSI3 Lee2017a 0.6700 55.4 0.6600 54.1 1.0000 0.7700 39.8 0.3000 84.6 0.8700 54.8 0.9200 38.6 0.8600 29.4 0.8000 56.6 0.9400 40.2 0.9000 31.2 1.0400 1.3 0.7300 66.7 1.0000 0.6200 59.6 0.7500 64.0 0.5300 74.4 0.9300 33.8
Lee_SNU_task4_4 EMSI4 Lee2017a 0.6600 55.5 0.6700 53.2 1.0000 0.5 0.7800 38.2 0.3000 84.7 0.8600 54.4 0.9100 39.1 0.8800 26.4 0.7800 55.8 0.9800 37.8 0.8800 31.2 1.0400 1.3 0.7300 67.0 1.0000 0.6100 61.2 0.7300 64.1 0.5200 74.9 0.9200 33.5
Salamon_NYU_task4_1 Salamon_1 Salamon2017 0.8200 46.2 0.9500 39.8 0.9700 18.2 0.9200 20.3 0.3900 80.6 1.0100 53.1 1.0700 42.2 0.7400 48.0 0.9800 44.2 0.9500 47.3 1.8000 24.8 1.1500 11.5 1.3100 53.5 1.0900 1.1 0.9000 50.6 0.9900 50.8 0.8300 56.5 1.3100 23.7
Salamon_NYU_task4_2 Salamon_2 Salamon2017 0.8500 45.6 1.0000 38.8 1.0500 17.4 0.9300 22.5 0.3700 81.6 1.0600 52.0 1.1600 37.2 1.0000 33.0 1.0400 46.2 0.9600 55.5 2.1200 21.6 1.1900 10.0 1.4400 50.9 1.0600 0.0 1.0100 45.7 0.9200 50.8 0.7600 59.8 1.2000 26.1
Salamon_NYU_task4_3 Salamon_3 Salamon2017 0.7700 45.9 0.8700 33.1 1.0100 4.3 0.9600 8.4 0.3500 81.9 0.9100 50.1 0.9300 42.5 0.8500 27.6 0.8700 40.7 0.8200 45.6 1.2700 16.7 1.0300 1.7 0.9900 59.3 1.0100 0.0 0.6900 55.1 0.8900 39.1 0.8000 49.5 1.0100 22.7
Salamon_NYU_task4_4 Salamon_4 Salamon2017 0.7700 45.9 0.8700 33.1 1.0100 4.3 0.9600 8.4 0.3500 81.9 0.9100 50.1 0.9300 42.5 0.8500 27.6 0.8700 40.7 0.8200 45.6 1.2700 16.7 1.0300 1.7 0.9900 59.3 1.0100 0.0 0.6900 55.1 0.8900 39.1 0.8000 49.5 1.0100 22.7
Toan_NCU_task4_2 ToanVu2 Vu2017 0.9400 43.0 0.9100 38.3 1.1300 42.8 0.8500 38.6 0.6800 68.1 1.4200 45.3 0.9700 34.7 0.9400 43.9 1.0200 41.1 1.0400 44.4 2.3300 21.0 2.4000 26.4 0.9000 48.4 1.8900 22.0 1.0000 52.0 1.1900 41.2 0.8300 54.5 1.3100 33.3
Toan_NCU_task4_3 ToanVu3 Vu2017 0.9000 42.7 0.9100 36.5 1.1300 40.1 0.8100 39.8 0.6300 69.5 1.3800 44.9 0.9800 31.9 0.8900 45.0 0.9300 40.2 0.9300 45.1 2.1400 21.0 2.0900 27.8 0.8900 48.5 1.8600 20.4 0.9400 53.1 1.1100 41.3 0.8100 52.4 1.3100 32.4
Toan_NCU_task4_4 ToanVu4 Vu2017 0.8700 41.6 0.9000 34.1 1.1000 37.1 0.8200 36.0 0.5700 70.5 1.3100 45.5 0.9700 32.7 0.8200 45.5 0.9000 35.0 0.8500 43.2 1.8600 21.1 1.7500 26.3 0.8800 48.4 1.7600 22.0 0.9400 52.8 1.0200 41.1 0.8300 43.7 1.2600 32.1
Xu_CVSSP_task4_1 Surrey1AB Xu2017 0.7300 51.8 0.9000 47.6 0.9100 29.7 0.6700 53.3 0.2900 85.8 0.8500 55.9 0.8700 45.0 0.7900 48.1 0.7800 65.5 0.9700 53.7 1.1900 34.9 1.1400 21.6 0.7600 48.7 1.4000 18.1 0.8900 59.7 0.8600 58.7 0.6700 65.9 1.0900 35.2
Xu_CVSSP_task4_2 Surrey2AB Xu2017 0.7800 47.5 0.8700 48.4 0.9600 20.4 0.6600 58.6 0.3500 83.2 0.9800 52.4 0.9000 43.8 1.0100 39.7 0.8200 56.8 0.9900 53.3 1.4200 27.6 1.2500 23.2 0.8000 45.2 1.5100 17.8 0.9900 57.4 0.9600 50.1 0.7600 60.0 1.1800 23.1
Xu_CVSSP_task4_3 Surrey3AB Xu2017 1.0100 52.1 1.0600 45.0 1.1800 57.7 1.2500 53.9 0.3500 82.7 0.9400 60.2 0.9200 55.7 1.2500 51.7 1.6200 53.5 1.6700 50.2 2.4700 25.0 2.4900 29.3 0.8400 59.7 4.6800 24.1 1.1900 57.9 1.2600 52.8 0.9300 65.5 1.3400 44.5
Xu_CVSSP_task4_4 Surrey4AB Xu2017 0.8000 50.4 0.9700 45.9 0.9400 34.2 0.8900 50.7 0.3100 85.0 0.8600 56.2 0.9100 44.0 0.9200 52.4 1.0400 61.0 0.8000 65.1 1.9900 24.2 1.7100 28.9 0.9000 48.0 1.8100 21.0 1.0100 58.2 1.0100 55.4 0.8000 65.1 1.3000 38.3

System characteristics

Subtask A - Audio tagging

Columns: Code, Name, Technical Report, F1 (overall), Input, Sampling rate, Data augmentation, Features, Classifier, Decision making.
Adavanne_TUT_task4_1 Ash_1 Adavanne2017 45.5 mono 44.1kHz log-mel energies CRNN thresholding
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 46.6 mono 44.1kHz log-mel energies CRNN thresholding
Adavanne_TUT_task4_3 Ash_3 Adavanne2017 44.5 mono 44.1kHz log-mel energies CRNN thresholding
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 26.3 mono 44.1kHz log-mel energies CRNN thresholding
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 47.6 mono 44.1kHz spectrogram CNN majority vote
Chou_SINICA_task4_2 FCNN_SM_2 Chou2017 49.0 mono 44.1kHz spectrogram CNN majority vote
Chou_SINICA_task4_3 FCNN_SM_3 Chou2017 47.9 mono 44.1kHz spectrogram CNN majority vote
Chou_SINICA_task4_4 FCNN_SM_3 Chou2017 49.0 mono 44.1kHz spectrogram CNN majority vote
DCASE2017 baseline Baseline Badlani2017 18.2 mono 44.1kHz log-mel energies MLP median filtering
Kukanov_UEF_task4_1 K-CRNN-MFoM Kukanov2017 39.6 mono 44.1kHz log-mel energies CRNN-MFoM median filtering
Lee_KAIST_task4_1 SDCNN_MAC Lee2017 40.3 mono 44.1kHz raw waveforms CNN thresholding
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 47.3 mono 44.1kHz raw waveforms CNN thresholding
Lee_KAIST_task4_3 MLMS3_MAC Lee2017 47.2 mono 44.1kHz raw waveforms CNN thresholding
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 47.1 mono 44.1kHz raw waveforms CNN thresholding
Lee_SNU_task4_1 EMSI1 Lee2017a 52.3 mono 44.1kHz log-mel energies CNN, ensemble mean probability
Lee_SNU_task4_2 EMSI2 Lee2017a 52.3 mono 44.1kHz log-mel energies CNN, ensemble mean probability
Lee_SNU_task4_3 EMSI3 Lee2017a 52.6 mono 44.1kHz log-mel energies CNN, ensemble weighted mean probability
Lee_SNU_task4_4 EMSI4 Lee2017a 52.1 mono 44.1kHz log-mel energies CNN, ensemble weighted mean probability
Salamon_NYU_task4_1 Salamon_1 Salamon2017 46.0 mono 44.1kHz pitch shifting log-mel energies CRNN raw output
Salamon_NYU_task4_2 Salamon_2 Salamon2017 45.3 mono 44.1kHz pitch shifting log-mel energies CRNN raw output
Salamon_NYU_task4_3 Salamon_3 Salamon2017 44.9 mono 44.1kHz pitch shifting, dynamic range compression log-mel energies ensemble raw output
Salamon_NYU_task4_4 Salamon_4 Salamon2017 38.1 mono 44.1kHz pitch shifting, dynamic range compression log-mel energies ensemble raw output
Toan_NCU_task4_1 ToanVu1 Vu2017 48.5 mono 22050 Hz log-mel energies DenseNet
Toan_NCU_task4_2 ToanVu2 Vu2017 46.5 mono 22050 Hz log-mel energies DenseNet median filtering
Tseng_Bosch_task4_1 Bosch1 Tseng2017 35.0 mono 44.1kHz log-mel energies ensemble max pooling
Tseng_Bosch_task4_2 Bosch2 Tseng2017 35.1 mono 44.1kHz log-mel energies ensemble max pooling
Tseng_Bosch_task4_3 Bosch3 Tseng2017 35.2 mono 44.1kHz log-mel energies ensemble max pooling
Tseng_Bosch_task4_4 Bosch4 Tseng2017 35.2 mono 44.1kHz log-mel energies ensemble max pooling
Xu_CVSSP_task4_1 Surrey1AB Xu2017 54.4 mono 44.1kHz log-mel energies CRNN
Xu_CVSSP_task4_2 Surrey2AB Xu2017 55.6 mono 44.1kHz log-mel energies CRNN
Xu_CVSSP_task4_3 Surrey3AB Xu2017 54.2 mono 44.1kHz log-mel energies CRNN
Xu_CVSSP_task4_4 Surrey4AB Xu2017 52.8 mono 44.1kHz log-mel energies CRNN

Subtask B - Sound event detection

Columns: Code, Name, Technical Report, ER and F1 (segment-based, overall), Input, Sampling rate, Data augmentation, Features, Classifier, Decision making.
Adavanne_TUT_task4_1 Ash_1 Adavanne2017 0.8100 47.9 mono 44.1kHz log-mel energies CRNN thresholding
Adavanne_TUT_task4_2 Ash_2 Adavanne2017 0.8000 48.3 mono 44.1kHz log-mel energies CRNN thresholding
Adavanne_TUT_task4_3 Ash_3 Adavanne2017 0.8200 48.9 mono 44.1kHz log-mel energies CRNN thresholding
Adavanne_TUT_task4_4 Ash_4 Adavanne2017 0.7900 49.0 mono 44.1kHz log-mel energies CRNN thresholding
Chou_SINICA_task4_1 FCNN_SM_1 Chou2017 0.8300 42.4 mono 44.1kHz spectrogram CNN majority vote
DCASE2017 baseline Baseline Badlani2017 0.9300 28.4 mono 44.1kHz log-mel energies MLP median filtering
Lee_KAIST_task4_1 SDCNN_MAC Lee2017 0.8200 39.4 mono 44.1kHz raw waveforms CNN thresholding
Lee_KAIST_task4_2 MLMS5_MAC Lee2017 0.7800 42.6 mono 44.1kHz raw waveforms CNN thresholding
Lee_KAIST_task4_3 MLMS3_MAC Lee2017 0.7800 44.2 mono 44.1kHz raw waveforms CNN thresholding
Lee_KAIST_task4_4 MLMS8_MAC Lee2017 0.7500 47.1 mono 44.1kHz raw waveforms CNN thresholding
Lee_SNU_task4_1 EMSI1 Lee2017a 0.6700 54.4 mono 44.1kHz log-mel energies CNN, ensemble mean probability
Lee_SNU_task4_2 EMSI2 Lee2017a 0.6700 54.4 mono 44.1kHz log-mel energies CNN, ensemble mean probability
Lee_SNU_task4_3 EMSI3 Lee2017a 0.6700 55.4 mono 44.1kHz log-mel energies CNN, ensemble weighted mean probability
Lee_SNU_task4_4 EMSI4 Lee2017a 0.6600 55.5 mono 44.1kHz log-mel energies CNN, ensemble weighted mean probability
Salamon_NYU_task4_1 Salamon_1 Salamon2017 0.8200 46.2 mono 44.1kHz pitch shifting log-mel energies CRNN raw output
Salamon_NYU_task4_2 Salamon_2 Salamon2017 0.8500 45.6 mono 44.1kHz pitch shifting log-mel energies CRNN raw output
Salamon_NYU_task4_3 Salamon_3 Salamon2017 0.7700 45.9 mono 44.1kHz pitch shifting, dynamic range compression log-mel energies ensemble raw output
Salamon_NYU_task4_4 Salamon_4 Salamon2017 0.7700 45.9 mono 44.1kHz pitch shifting, dynamic range compression log-mel energies ensemble raw output
Toan_NCU_task4_2 ToanVu2 Vu2017 0.9400 43.0 mono 22050 Hz log-mel energies DenseNet median filtering
Toan_NCU_task4_3 ToanVu3 Vu2017 0.9000 42.7 mono 22050 Hz log-mel energies DenseNet median filtering
Toan_NCU_task4_4 ToanVu4 Vu2017 0.8700 41.6 mono 22050 Hz log-mel energies DenseNet median filtering
Xu_CVSSP_task4_1 Surrey1AB Xu2017 0.7300 51.8 mono 44.1kHz log-mel energies CRNN
Xu_CVSSP_task4_2 Surrey2AB Xu2017 0.7800 47.5 mono 44.1kHz log-mel energies CRNN
Xu_CVSSP_task4_3 Surrey3AB Xu2017 1.0100 52.1 mono 44.1kHz log-mel energies CRNN
Xu_CVSSP_task4_4 Surrey4AB Xu2017 0.8000 50.4 mono 44.1kHz log-mel energies CRNN

Technical reports

Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network

Abstract

This paper proposes a neural network architecture and training scheme to learn the start and end times of sound events (strong labels) in an audio recording, given only the list of sound events present in the audio without time information (weak labels). We achieve this with a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels. The network is trained using frame-wise log mel-band energies as the input audio feature, and the weak labels provided in the dataset as targets for the weak-label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature, and are used for the strong-label layer during training. We propose to control what the network learns from the weak and strong labels by weighting the losses computed at the two prediction layers differently. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves a best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
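The weighting between the two prediction layers can be sketched as follows. This is a hypothetical numpy illustration of the loss computation only; the function names and the 0.5 weight are assumptions, not the authors' implementation.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy between targets y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def combined_loss(weak_true, weak_pred, strong_pred, w_weak=1.0, w_strong=0.5):
    """Joint loss for the two prediction layers.

    Strong targets are generated by replicating the clip-level weak labels
    across every frame, as described in the abstract; the weights control
    how much the network trusts these replicated (noisy) strong labels.
    weak_true, weak_pred: (n_classes,); strong_pred: (n_frames, n_classes).
    """
    n_frames = strong_pred.shape[0]
    strong_true = np.tile(weak_true, (n_frames, 1))  # weak labels replicated frame-wise
    return w_weak * bce(weak_true, weak_pred) + w_strong * bce(strong_true, strong_pred)
```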

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
Decision making thresholding

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

The DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using a multilayer perceptron and log mel-band energies, but differ in the structure of the output layer and the decision-making process, as well as in the evaluation of system output using task-specific metrics.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier MLP
Decision making median filtering

FrameCNN: A Weakly-Supervised Learning Framework for Frame-Wise Acoustic Event Detection and Classification

Abstract

In this paper, we describe our contribution to the challenge of detection and classification of acoustic scenes and events (DCASE2017). We propose FrameCNN, a novel weakly supervised learning framework that improves the performance of a convolutional neural network (CNN) for acoustic event detection by attending to details of each sound at various temporal levels. Most existing weakly supervised frameworks replace the fully connected network with global average pooling after the final convolution layer. Such a method tends to identify only a few discriminative parts, leading to sub-optimal localization and classification accuracy. The key idea of our approach is to consciously classify the sound of each frame given the corresponding label. The idea is general and can be applied to any network for achieving sound event detection and improving the performance of sound event classification. In acoustic scene classification (Task 1), our approach obtained an average accuracy of 99.2% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 74.8%. In large-scale weakly supervised sound event detection for smart cars (Task 4), we obtained an F-score of 53.8% for sound event audio tagging (subtask A), compared to the baseline of 19.8%, and an F-score of 32.8% for sound event detection (subtask B), compared to the baseline of 11.4%.
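The frame-wise decision making reported for this system (majority vote over frames) can be illustrated with a small sketch; the function and thresholds here are illustrative assumptions, not the submission's code.

```python
import numpy as np

def clip_tags_from_frames(frame_probs, frame_threshold=0.5, vote_threshold=0.5):
    """Aggregate frame-wise class probabilities to clip-level tags.

    frame_probs: (n_frames, n_classes). Each frame votes for the classes
    whose probability exceeds frame_threshold; a class is tagged for the
    clip when the fraction of voting frames exceeds vote_threshold.
    """
    votes = (frame_probs > frame_threshold).mean(axis=0)
    return votes > vote_threshold
```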

System characteristics
Input mono
Sampling rate 44.1kHz
Features spectrogram
Classifier CNN
Decision making majority vote

Recurrent Neural Network and Maximal Figure of Merit for Acoustic Event Detection

Abstract

In this report, we describe the systems submitted to the DCASE 2017 challenge. In particular, we explored a convolutional recurrent neural network (CRNN) for acoustic scene classification (Task 1). For the weakly supervised sound event detection (Task 4), we utilized a CRNN with maximal figure-of-merit embedded into the binary cross-entropy objective function (CRNN-MFoM). On the development data set, the CRNN model achieves an average 14.7% relative accuracy improvement on the classification Task 1, and the CRNN-MFoM improves the F1-score from 10.9% to 33.5% on the detection Task 4 compared to the baseline system.
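Figure-of-merit style objectives replace hard classification counts with probabilistic surrogates so that an F1-like score becomes differentiable and can be optimized directly. A minimal sketch of that general idea (not the authors' exact MFoM formulation) is:

```python
import numpy as np

def soft_f1_loss(y_true, y_prob):
    """Differentiable F1 surrogate: hard TP/FP/FN counts are replaced
    by their expectations under the predicted probabilities, and the
    resulting soft F1 is turned into a loss to minimize."""
    tp = (y_true * y_prob).sum()
    fp = ((1 - y_true) * y_prob).sum()
    fn = (y_true * (1 - y_prob)).sum()
    soft_f1 = 2 * tp / (2 * tp + fp + fn)
    return 1.0 - soft_f1
```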

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN-MFoM
Decision making median filtering

Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection

Abstract

This paper describes our method submitted to large-scale weakly supervised sound event detection for smart cars in the DCASE Challenge 2017. It is based on two techniques that have previously been suggested for music auto-tagging. One is training sample-level deep convolutional neural networks using raw waveforms as feature extractors. The other is aggregating features over multi-scaled models of the CNNs and making final predictions from them. With this approach, our best results were an F-score of 44.3% on subtask A (audio tagging) and an error rate of 0.84 on subtask B (sound event detection). Finally, we visualize the hierarchically learned filters from the challenge dataset in each layer of the raw-waveform model to explain how they discriminate the events.
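A sample-level convolution on a raw waveform simply slides very small filters over the samples with a matching stride, and such layers are stacked many times to build up a frame-level representation. A toy single-layer sketch (illustrative, not the submitted architecture) is:

```python
import numpy as np

def sample_level_conv(wave, filters, stride):
    """One sample-level 1-D convolution layer on a raw waveform.

    wave: (n_samples,); filters: (n_filters, kernel_size), with small
    kernels (e.g. length 3) and a matching stride. Returns the ReLU
    activations, shape (n_output_steps, n_filters).
    """
    k = filters.shape[1]
    n_out = (len(wave) - k) // stride + 1
    frames = np.stack([wave[i * stride : i * stride + k] for i in range(n_out)])
    return np.maximum(frames @ filters.T, 0.0)  # ReLU
```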

System characteristics
Input mono
Sampling rate 44.1kHz
Features raw waveforms
Classifier CNN
Decision making thresholding

Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection Using Multiple Scale Input

Abstract

In this paper, we use an ensemble of convolutional neural network models with various analysis windows to detect audio events in the automotive environment. When detecting the presence of audio events, a global-input model that uses the entire audio clip works better. On the other hand, segmented-input models work better at finding the accurate position of an event. Experimental results for weakly labeled audio data confirm the performance trade-off between the two tasks depending on the length of the input audio. By combining the predictions of the various models, the proposed system achieved a clip-based F1-score of 0.4762 and a segment-based error rate of 0.7167.
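The mean-probability and weighted-mean-probability decision making listed for these systems can be sketched as follows; the weights are illustrative assumptions.

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Combine clip-level class probabilities from several models.

    prob_list: list of (n_classes,) probability arrays, one per model.
    With weights=None this is a plain mean over models; otherwise a
    weighted mean, normalized by the weight sum.
    """
    probs = np.stack(prob_list)
    if weights is None:
        return probs.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * probs).sum(axis=0) / w.sum()
```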

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN, ensemble
Decision making mean probability; weighted mean probability

DCASE 2017 Submission: Multiple Instance Learning for Sound Event Detection

Abstract

This extended abstract describes the design and implementation of a multiple instance learning model for sound event detection. The submitted systems use a convolutional-recurrent neural network (CRNN) architecture to learn strong (temporally localized) predictors from weakly labeled data. Four variants of the proposed methods were submitted to DCASE 2017, Task 4.
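Under the multiple instance learning view, a clip is a bag of frame instances and the weak label applies to the bag: the clip is positive for a class if at least one frame is. Max pooling over frame probabilities is a common way to obtain the bag-level prediction trained against the weak label; a minimal sketch of that reduction (not the submitted CRNN itself) is:

```python
import numpy as np

def mil_clip_probability(frame_probs):
    """Bag-level class probabilities under the standard MIL assumption.

    frame_probs: (n_frames, n_classes) frame-wise probabilities; the
    max over frames serves as the clip-level probability, so the weak
    clip label can supervise temporally localized frame predictions.
    """
    return frame_probs.max(axis=0)
```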

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation pitch shifting; pitch shifting, dynamic range compression
Features log-mel energies
Classifier CRNN; ensemble
Decision making raw output

Large-Scale Weakly Supervised Sound Event Detection

Abstract

State-of-the-art audio event detection (AED) systems rely fully on supervised learning based on strongly labeled data. The dependence on strong labels severely limits the scalability of AED work. Large-scale manually annotated datasets are difficult and expensive to collect [1], whereas weakly labeled data can be much easier to acquire. In weakly labeled data, we only need to determine whether an event in the recording is present or absent. This not only makes manual labeling significantly easier but also makes it possible to automatically infer labels from online multimedia or audio meta-information (titles, tags, etc.) [2]. This work employs a subset of Google's AudioSet [3], a large collection of weakly labeled YouTube video excerpts. The subset focuses on transportation and warning sounds and consists of 17 sound events divided into two categories: Warning and Vehicle. We perform experiments on three sets of features: standard Mel-frequency cepstral coefficients (MFCC), log-Mel spectrograms, and pre-trained embeddings extracted from a deep convolutional network. Our system employs multiple instance learning (MIL) [4] approaches to deal with weak labels by bagging them into positive or negative bags. We apply several models, including Deep Neural Networks (DNN), Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). Using the late-fusion approach, we improve the baseline audio tagging (subtask A) F1-score of 13.1% by 18.1%. The embeddings extracted by the convolutional neural network significantly boost the performance of all the models.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier ensemble
Decision making max pooling

Deep Learning for DCASE2017 Challenge

Abstract

This paper reports our results on all tasks of the DCASE 2017 challenge: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are built on two popular neural network architectures: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.

System characteristics
Input mono
Sampling rate 22050 Hz
Features log-mel energies
Classifier DenseNet
Decision making median filtering

Surrey-CVSSP System for DCASE2017 Challenge Task4

Abstract

In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warning sounds due to their industry applications. There are two subtasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found effective in improving the performance. In summary, we obtain a 61% F-score for the audio tagging subtask and a 0.72 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-score for audio tagging and a 1.02 ER for sound event detection.
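The learnable gating activation mentioned above is in the spirit of a gated linear unit: a linear branch multiplied element-wise by a sigmoid gate that learns to pass informative local features. A minimal sketch under that assumption (not the exact CVSSP implementation):

```python
import numpy as np

def gated_activation(x, w_lin, w_gate):
    """Gated linear unit: output = (x @ w_lin) * sigmoid(x @ w_gate).

    x: (batch, n_in); w_lin, w_gate: (n_in, n_out). The sigmoid branch
    acts as a learnable, feature-wise attention gate on the linear branch.
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    return (x @ w_lin) * gate
```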

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN