Acoustic scene classification


Task description
Challenge has ended. Full results for this task can be found here

Description

The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded — for example "park", "home", "office".

Figure 1: Overview of acoustic scene classification system.

Audio dataset

The TUT Acoustic Scenes 2017 dataset will be used as development data for the task. The dataset consists of recordings from various acoustic scenes, all with distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into segments with a length of 10 seconds. These audio segments are provided as individual files.

Acoustic scenes for the task (15):

  • Bus - traveling by bus in the city (vehicle)
  • Cafe / Restaurant - small cafe/restaurant (indoor)
  • Car - driving or traveling as a passenger, in the city (vehicle)
  • City center (outdoor)
  • Forest path (outdoor)
  • Grocery store - medium size grocery store (indoor)
  • Home (indoor)
  • Lakeside beach (outdoor)
  • Library (indoor)
  • Metro station (indoor)
  • Office - multiple persons, typical work day (indoor)
  • Residential area (outdoor)
  • Train (traveling, vehicle)
  • Tram (traveling, vehicle)
  • Urban park (outdoor)

A detailed description of the acoustic scenes included in the dataset can be found on the DCASE2016 Task 1 page.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 - 01/2017. The data collection has received funding from the European Research Council.


Recording and annotation procedure

For all acoustic scenes, the recordings were each captured in a different location: different streets, different parks, different homes. Recordings were made using a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Roland Edirol R-09 wave recorder, with a 44.1 kHz sampling rate and 24-bit resolution. The microphones are specifically designed to look like headphones and are worn in the ears. As a result, the recorded audio closely resembles the sound that reaches the auditory system of the person wearing the equipment.

Postprocessing of the recorded data addressed the privacy of recorded individuals. For audio material recorded in private places, written consent was obtained from all people involved. Material recorded in public places does not require such consent, but it was screened for content, and privacy-infringing segments were eliminated. Microphone failures and audio distortions were annotated, and the annotations are provided with the data. Based on experiments in DCASE 2016, eliminating the error regions from training does not influence the final classification accuracy. The evaluation set does not contain any such audio errors.

Download

If you are using the provided baseline system, there is no need to download the dataset manually, as the system will automatically download the needed datasets for you.

Development dataset



Evaluation dataset

Task setup

The TUT Acoustic Scenes 2017 dataset consists of two subsets: a development dataset and an evaluation dataset. The development dataset comprises the complete TUT Acoustic Scenes 2016 dataset (both the development and the evaluation data of the 2016 challenge). The partitioning of the data into subsets was done based on the location of the original recordings, so the evaluation dataset contains recordings of similar acoustic scenes but from different geographical locations. All segments obtained from the same original recording were included in a single subset - either the development dataset or the evaluation dataset. For each acoustic scene, there are 312 segments (52 minutes of audio) in the development dataset.

A detailed description of the data recording and annotation procedure is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016.


Abstract

We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.


Development dataset

A cross-validation setup is provided for the development dataset in order to make results reported on this dataset uniform. The setup consists of four folds that distribute the available segments based on location. The folds are provided with the dataset in the evaluation setup directory.

Fold 1 of the provided setup reproduces the DCASE 2016 challenge setup, by using the 2016 development set as training subset and the 2016 evaluation set as test subset.

Important: If you are not using the provided cross-validation setup, pay attention to segments extracted from the same original recordings. For each fold, make sure that ALL segments from the same location are either in the training subset or in the test subset, never both.
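As an illustration of this constraint, the following sketch (not part of the challenge kit; filenames and the segment-to-location mapping are invented for the example) checks that a custom fold never splits one recording location across the training and test subsets:

```python
# Hypothetical check: no recording location may appear in both the
# training and the test subset of a fold.
# `meta` maps segment filename -> location id (assumed structure).

def check_fold(train_files, test_files, meta):
    """Return True if no location appears in both subsets."""
    train_locs = {meta[f] for f in train_files}
    test_locs = {meta[f] for f in test_files}
    return train_locs.isdisjoint(test_locs)

meta = {
    "a1.wav": "loc1", "a2.wav": "loc1",   # two segments, same recording
    "b1.wav": "loc2", "c1.wav": "loc3",
}
# Valid split: all of loc1 stays in training
print(check_fold(["a1.wav", "a2.wav"], ["b1.wav", "c1.wav"], meta))  # True
# Invalid split: loc1 leaks into the test subset
print(check_fold(["a1.wav"], ["a2.wav", "b1.wav"], meta))            # False
```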

Evaluation dataset

The evaluation dataset (without ground truth) will be released one month before the submission deadline. Full ground truth metadata for it will be published after the DCASE 2017 challenge and workshop have concluded.

Submission

Detailed information for the challenge submission can be found on the submission page.

System output should be presented as a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
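A minimal sketch of writing this output file with Python's standard csv module (the filenames, labels, and output filename below are placeholders, not challenge-mandated names):

```python
import csv

# Hypothetical predictions: evaluation filename -> predicted scene label.
results = {
    "audio/eval_0001.wav": "park",
    "audio/eval_0002.wav": "office",
}

# Write one tab-separated [filename][tab][scene label] row per file.
with open("task1_results.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for filename, label in results.items():
        writer.writerow([filename, label])
```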

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

Task rules

These are the general rules valid for all tasks. The same rules, along with additional information on the technical report and submission requirements, can be found here. Task-specific rules are highlighted in green.

  • Participants are not allowed to use external data for system development. Data from another task is considered external data.
  • Manipulation of provided training and development data is allowed.

    The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a probability density function, or by using techniques such as pitch shifting or time stretching).

  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
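One simple augmentation that stays within these rules is mixing two development-set segments with random gains. The sketch below illustrates the idea with random arrays standing in for real audio (the function name and weights are assumptions for this example, not part of the challenge kit):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_segments(x, y, weight=None):
    """Mix two equal-length audio segments with a (random) weight."""
    w = rng.uniform(0.3, 0.7) if weight is None else weight
    return w * x + (1.0 - w) * y

# Stand-ins for two 10-second segments at 44.1 kHz.
a = rng.standard_normal(44100 * 10)
b = rng.standard_normal(44100 * 10)
augmented = mix_segments(a, b, weight=0.5)
print(augmented.shape)  # (441000,)
```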

Evaluation

The scoring of acoustic scene classification will be based on classification accuracy: the proportion of correctly classified segments out of the total number of segments. Each segment is considered an independent test sample.

The evaluation is done automatically in the baseline system, using the sed_eval toolbox.
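The metric itself is straightforward; a minimal sketch of segment-level accuracy (the actual challenge evaluation uses the sed_eval toolbox, and the labels below are invented for the example):

```python
def accuracy(reference, estimated):
    """Fraction of segments whose predicted label matches the reference."""
    correct = sum(r == e for r, e in zip(reference, estimated))
    return correct / len(reference)

ref = ["park", "office", "bus", "car", "home"]
est = ["park", "office", "bus", "tram", "home"]
print(accuracy(ref, est))  # 0.8
```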


Results

Each row lists: submission code, system name, author, affiliation, corresponding technical report, and classification accuracy with 95% confidence interval.
Abrol_IITM_task1_1 Baseline Vinayak Abrol Multimedia Analytics and Systems Lab, SCEE, Indian Institute of Technology Mandi, Mandi, India task-acoustic-scene-classification-results#Abrol2017 65.7 (63.4 - 68.0)
Amiriparian_AU_task1_1 S2S-AE Shahin Amiriparian Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany; Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany; Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany task-acoustic-scene-classification-results#Amiriparian2017 67.5 (65.3 - 69.8)
Amiriparian_AU_task1_2 Shahin_APTI Shahin Amiriparian Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany; Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany; Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany task-acoustic-scene-classification-results#Amiriparian2017a 59.1 (56.7 - 61.5)
Biho_Sogang_task1_1 Biho1 Biho Kim Sogang university, Seoul, Korea task-acoustic-scene-classification-results#Kim2017 56.5 (54.1 - 59.0)
Biho_Sogang_task1_2 Biho2 Biho Kim Sogang university, Seoul, Korea task-acoustic-scene-classification-results#Kim2017 60.5 (58.1 - 62.9)
Bisot_TPT_task1_1 TPT1 Victor Bisot Image Data and Signal, Telecom ParisTech, Paris, France task-acoustic-scene-classification-results#Bisot2017 69.8 (67.6 - 72.1)
Bisot_TPT_task1_2 TPT2 Victor Bisot Image Data and Signal, Telecom ParisTech, Paris, France task-acoustic-scene-classification-results#Bisot2017 69.6 (67.3 - 71.8)
Chandrasekhar_IIITH_task1_1 - Paseddula Chandrasekhar Speech Processing Lab, International Institute of Information Technology, Hyderabad, Hyderabad, India task-acoustic-scene-classification-results#Chandrasekhar2017 45.9 (43.4 - 48.3)
Chou_SINICA_task1_1 TP_CNN_cv1 Szu-Yu Chou Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan; Research Center for IT innovation, Academia Sinica, Taipei, Taiwan task-acoustic-scene-classification-results#Chou2017 57.1 (54.7 - 59.5)
Chou_SINICA_task1_2 SINICA Szu-Yu Chou Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan; Research Center for IT innovation, Academia Sinica, Taipei, Taiwan task-acoustic-scene-classification-results#Chou2017 61.5 (59.2 - 63.9)
Chou_SINICA_task1_3 SINICA Szu-Yu Chou Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan; Research Center for IT innovation, Academia Sinica, Taipei, Taiwan task-acoustic-scene-classification-results#Chou2017 59.8 (57.4 - 62.1)
Chou_SINICA_task1_4 SINICA Szu-Yu Chou Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan; Research Center for IT innovation, Academia Sinica, Taipei, Taiwan task-acoustic-scene-classification-results#Chou2017 57.1 (54.7 - 59.5)
Dang_NCU_task1_1 andang1 Jia-Ching Wang Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan task-acoustic-scene-classification-results#Dang2017 62.7 (60.4 - 65.1)
Dang_NCU_task1_2 andang1 Jia-Ching Wang Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan task-acoustic-scene-classification-results#Dang2017 62.7 (60.4 - 65.1)
Dang_NCU_task1_3 andang1 Jia-Ching Wang Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan task-acoustic-scene-classification-results#Dang2017 63.7 (61.4 - 66.0)
Duppada_Seernet_task1_1 Seernet Venkatesh Duppada Data Science, Seernet Technologies, LLC, Mumbai, India task-acoustic-scene-classification-results#Duppada2017 57.0 (54.6 - 59.4)
Duppada_Seernet_task1_2 Seernet Venkatesh Duppada Data Science, Seernet Technologies, LLC, Mumbai, India task-acoustic-scene-classification-results#Duppada2017 59.9 (57.5 - 62.3)
Duppada_Seernet_task1_3 Seernet Venkatesh Duppada Data Science, Seernet Technologies, LLC, Mumbai, India task-acoustic-scene-classification-results#Duppada2017 64.1 (61.7 - 66.4)
Duppada_Seernet_task1_4 Seernet Venkatesh Duppada Data Science, Seernet Technologies, LLC, Mumbai, India task-acoustic-scene-classification-results#Duppada2017 63.0 (60.7 - 65.4)
Foleiss_UTFPR_task1_1 MLPFeats Juliano Foleiss Computing Department, Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil task-acoustic-scene-classification-results#Foleiss2017 64.5 (62.2 - 66.8)
Foleiss_UTFPR_task1_2 MLPFeatRF Juliano Foleiss Computing Department, Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil task-acoustic-scene-classification-results#Foleiss2017 66.9 (64.6 - 69.2)
Fonseca_MTG_task1_1 MTG Eduardo Fonseca Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain task-acoustic-scene-classification-results#Fonseca2017 67.3 (65.1 - 69.6)
Fraile_UPM_task1_1 GAMMA-UPM Ruben Fraile Group on Acoustics and Multimedia Applications, Universidad Politecnica de Madrid, Madrid, Spain task-acoustic-scene-classification-results#Fraile2017 58.3 (55.9 - 60.7)
Gong_MTG_task1_1 MTG_GBMVGG Rong Gong Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain task-acoustic-scene-classification-results#Gong2017 61.2 (58.8 - 63.5)
Gong_MTG_task1_2 MTG_GBM Rong Gong Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain task-acoustic-scene-classification-results#Gong2017 61.5 (59.1 - 63.9)
Gong_MTG_task1_3 MTG_VGG Rong Gong Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain task-acoustic-scene-classification-results#Gong2017 61.9 (59.5 - 64.2)
Han_COCAI_task1_1 4fEnsemSel Yoonchang Han Cochlear.ai, Seoul, Korea task-acoustic-scene-classification-results#Han2017 79.9 (78.0 - 81.9)
Han_COCAI_task1_2 4fMeanAll Yoonchang Han Cochlear.ai, Seoul, Korea task-acoustic-scene-classification-results#Han2017 79.6 (77.7 - 81.6)
Han_COCAI_task1_3 FlEnsemSel Yoonchang Han Cochlear.ai, Seoul, Korea task-acoustic-scene-classification-results#Han2017 80.4 (78.4 - 82.3)
Han_COCAI_task1_4 flMeanAll Yoonchang Han Cochlear.ai, Seoul, Korea task-acoustic-scene-classification-results#Han2017 80.3 (78.4 - 82.2)
Hasan_BUET_task1_1 BUETBOSCH1 Taufiq Hasan Department of Biomedical Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh task-acoustic-scene-classification-results#Hyder2017 74.1 (72.0 - 76.3)
Hasan_BUET_task1_2 BUETBOSCH2 Taufiq Hasan Department of Biomedical Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh task-acoustic-scene-classification-results#Hyder2017 72.2 (70.0 - 74.3)
Hasan_BUET_task1_3 BUETBOSCH3 Taufiq Hasan Department of Biomedical Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh task-acoustic-scene-classification-results#Hyder2017 68.6 (66.3 - 70.8)
Hasan_BUET_task1_4 BUETBOSCH4 Taufiq Hasan Department of Biomedical Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh task-acoustic-scene-classification-results#Hyder2017 72.0 (69.8 - 74.2)
DCASE2017 baseline Baseline Toni Heittola Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland task-acoustic-scene-classification-results#Heittola2017 61.0 (58.7 - 63.4)
Huang_THU_task1_1 wjhta Taoan Huang Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China task-acoustic-scene-classification-results#Huang2017 65.5 (63.2 - 67.8)
Huang_THU_task1_2 wjhta Taoan Huang Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China task-acoustic-scene-classification-results#Huang2017 65.4 (63.1 - 67.7)
Hussain_NUCES_task1_1 - Khalid Hussain Department of electrical engineering, National University of computer and emerging sciences, Pakistan task-acoustic-scene-classification-results#Hussain2017 56.7 (54.3 - 59.1)
Hussain_NUCES_task1_2 - Khalid Hussain Department of electrical engineering, National University of computer and emerging sciences, Pakistan task-acoustic-scene-classification-results#Hussain2017 59.5 (57.1 - 61.9)
Hussain_NUCES_task1_3 - Khalid Hussain Department of electrical engineering, National University of computer and emerging sciences, Pakistan task-acoustic-scene-classification-results#Hussain2017 59.9 (57.5 - 62.3)
Hussain_NUCES_task1_4 - Khalid Hussain Department of electrical engineering, National University of computer and emerging sciences, Pakistan task-acoustic-scene-classification-results#Hussain2017 55.4 (52.9 - 57.8)
Jallet_TUT_task1_1 CRNN-1 Hugo Jallet Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland task-acoustic-scene-classification-results#Jallet2017 60.7 (58.4 - 63.1)
Jallet_TUT_task1_2 CRNN-2 Hugo Jallet Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland task-acoustic-scene-classification-results#Jallet2017 61.2 (58.8 - 63.5)
Jimenez_CMU_task1_1 LapKernel Abelino Jimenez Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA task-acoustic-scene-classification-results#Jimenez2017 59.9 (57.6 - 62.3)
Kukanov_UEF_task1_1 K-CRNN Ivan Kukanov School of Computing, University of Eastern Finland, Joensuu, Finland; Institute for Infocomm Research, A*Star, Singapore task-acoustic-scene-classification-results#Kukanov2017 71.7 (69.5 - 73.9)
Kun_TUM_UAU_UP_task1_1 Wav_SVMs Qian Kun MISP group, Technische Universität München, Munich, Germany; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany; Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany task-acoustic-scene-classification-results#Kun2017 64.2 (61.9 - 66.5)
Kun_TUM_UAU_UP_task1_2 Wav_GRUs Qian Kun MISP group, Technische Universität München, Munich, Germany; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany; Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany task-acoustic-scene-classification-results#Kun2017 64.0 (61.7 - 66.3)
Lehner_JKU_task1_1 JKU_IVEC Bernhard Lehner Department of Computational Perception, Johannes Kepler University, Linz, Austria task-acoustic-scene-classification-results#Lehner2017 68.7 (66.4 - 71.0)
Lehner_JKU_task1_2 JKU_ALL_av Bernhard Lehner Department of Computational Perception, Johannes Kepler University, Linz, Austria task-acoustic-scene-classification-results#Lehner2017 66.8 (64.5 - 69.1)
Lehner_JKU_task1_3 JKU_CNN Bernhard Lehner Department of Computational Perception, Johannes Kepler University, Linz, Austria task-acoustic-scene-classification-results#Lehner2017 64.8 (62.5 - 67.1)
Lehner_JKU_task1_4 JKU_All_ca Bernhard Lehner Department of Computational Perception, Johannes Kepler University, Linz, Austria task-acoustic-scene-classification-results#Lehner2017 73.8 (71.7 - 76.0)
Li_SCUT_task1_1 LiSCUTt1_1 Yanxiong Li School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China task-acoustic-scene-classification-results#Li2017 53.7 (51.3 - 56.1)
Li_SCUT_task1_2 LiSCUTt1_2 Yanxiong Li School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China task-acoustic-scene-classification-results#Li2017 63.6 (61.3 - 66.0)
Li_SCUT_task1_3 LiSCUTt1_3 Yanxiong Li School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China task-acoustic-scene-classification-results#Li2017 61.7 (59.4 - 64.1)
Li_SCUT_task1_4 LiSCUTt1_4 Yanxiong Li School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China task-acoustic-scene-classification-results#Li2017 57.8 (55.4 - 60.2)
Maka_ZUT_task1_1 ASAWI Tomasz Maka Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland task-acoustic-scene-classification-results#Maka2017 47.5 (45.1 - 50.0)
Mun_KU_task1_1 GAN_SKMUN Seongkyu Mun Intelligent Signal Processing Laboratory, Korea University, Seoul, South Korea task-acoustic-scene-classification-results#Mun2017 83.3 (81.5 - 85.1)
Park_ISPL_task1_1 ISPL Hanseok Ko School of Electrical Engineering, Korea University, Seoul, Republic of Korea task-acoustic-scene-classification-results#Park2017 72.6 (70.4 - 74.8)
Phan_UniLuebeck_task1_1 CNN Huy Phan Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Phan2017 59.0 (56.6 - 61.4)
Phan_UniLuebeck_task1_2 ACNN Huy Phan Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Phan2017 55.9 (53.5 - 58.3)
Phan_UniLuebeck_task1_3 CNN+ Huy Phan Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Phan2017 58.3 (55.9 - 60.7)
Phan_UniLuebeck_task1_4 ACNN+ Huy Phan Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Phan2017 58.0 (55.6 - 60.4)
Piczak_WUT_task1_1 amb200 Karol Piczak Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland task-acoustic-scene-classification-results#Piczak2017 70.6 (68.4 - 72.8)
Piczak_WUT_task1_2 dishes Karol Piczak Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland task-acoustic-scene-classification-results#Piczak2017 69.6 (67.3 - 71.8)
Piczak_WUT_task1_3 amb100 Karol Piczak Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland task-acoustic-scene-classification-results#Piczak2017 67.7 (65.4 - 69.9)
Piczak_WUT_task1_4 amb60 Karol Piczak Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland task-acoustic-scene-classification-results#Piczak2017 62.0 (59.6 - 64.3)
Rakotomamonjy_UROUEN_task1_1 HBGS CNN Alain Rakotomamonjy LITIS EA4108, Université de Rouen, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2017 61.5 (59.2 - 63.9)
Rakotomamonjy_UROUEN_task1_2 HBGS CNN-4 Alain Rakotomamonjy LITIS EA4108, Université de Rouen, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2017 62.7 (60.3 - 65.0)
Rakotomamonjy_UROUEN_task1_3 HBGS CNN-19 Alain Rakotomamonjy LITIS EA4108, Université de Rouen, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2017 62.8 (60.4 - 65.1)
Schindler_AIT_task1_1 multires Alexander Schindler Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria task-acoustic-scene-classification-results#Schindler2017 61.7 (59.4 - 64.1)
Schindler_AIT_task1_2 multires-p Alexander Schindler Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria task-acoustic-scene-classification-results#Schindler2017 61.7 (59.4 - 64.1)
Vafeiadis_CERTH_task1_1 CERTH_1 Anastasios Vafeiadis Information Technologies Institute, Center for Research & Technology Hellas, Thessaloniki, Greece task-acoustic-scene-classification-results#Vafeiadis2017 61.0 (58.6 - 63.4)
Vafeiadis_CERTH_task1_2 CERTH_2 Anastasios Vafeiadis Information Technologies Institute, Center for Research & Technology Hellas, Thessaloniki, Greece task-acoustic-scene-classification-results#Vafeiadis2017 49.5 (47.1 - 51.9)
Vij_UIET_task1_1 Vij_UIET_1 Dinesh Vij Computer Science and Engineering, University Institute of Engineering and Technology, Panjab University, Chandigarh, India task-acoustic-scene-classification-results#Vij2017 61.2 (58.9 - 63.6)
Vij_UIET_task1_2 Vij_UIET_2 Dinesh Vij Computer Science and Engineering, University Institute of Engineering and Technology, Panjab University, Chandigarh, India task-acoustic-scene-classification-results#Vij2017 57.5 (55.1 - 59.9)
Vij_UIET_task1_3 Vij_UIET_3 Dinesh Vij Computer Science and Engineering, University Institute of Engineering and Technology, Panjab University, Chandigarh, India task-acoustic-scene-classification-results#Vij2017 59.6 (57.2 - 62.0)
Vij_UIET_task1_4 Vij_UIET_4 Dinesh Vij Computer Science and Engineering, University Institute of Engineering and Technology, Panjab University, Chandigarh, India task-acoustic-scene-classification-results#Vij2017 65.0 (62.7 - 67.3)
Waldekar_IITKGP_task1_1 IITKGP_ABSP_Fusion Shefali Waldekar Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India task-acoustic-scene-classification-results#Waldekar2017 67.0 (64.7 - 69.3)
Waldekar_IITKGP_task1_2 IITKGP_ABSP_Hierarchical Shefali Waldekar Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India task-acoustic-scene-classification-results#Waldekar2017 64.9 (62.6 - 67.2)
Xing_SCNU_task1_1 DCNN_vote Xing Xiaotao School of Computer, South China Normal University, Guangzhou, China task-acoustic-scene-classification-results#Weiping2017 74.8 (72.6 - 76.9)
Xing_SCNU_task1_2 DCNN_SVM Xing Xiaotao School of Computer, South China Normal University, Guangzhou, China task-acoustic-scene-classification-results#Weiping2017 77.7 (75.7 - 79.7)
Xu_NUDT_task1_1 XuCnnMFCC Jinwei Xu Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China task-acoustic-scene-classification-results#Xu2017 68.5 (66.2 - 70.7)
Xu_NUDT_task1_2 XuCnnMFCC Jinwei Xu Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China task-acoustic-scene-classification-results#Xu2017 67.5 (65.3 - 69.8)
Xu_PKU_task1_1 autolog1 Xiaoshuo Xu Institute of Computer Science and Technology, Peking University, Beijing, China task-acoustic-scene-classification-results#Xu2017a 65.9 (63.6 - 68.2)
Xu_PKU_task1_2 autolog2 Xiaoshuo Xu Institute of Computer Science and Technology, Peking University, Beijing, China task-acoustic-scene-classification-results#Xu2017a 66.7 (64.4 - 69.0)
Xu_PKU_task1_3 autolog3 Xiaoshuo Xu Institute of Computer Science and Technology, Peking University, Beijing, China task-acoustic-scene-classification-results#Xu2017a 64.6 (62.3 - 67.0)
Yang_WHU_TASK1_1 MFS Yuhong Yang National Engineering Research Center for Multimedia Software, Wuhan University, Hubei, China; Collaborative Innovation Center of Geospatial Technology, Wuhan, China task-acoustic-scene-classification-results#Lu2017 61.5 (59.2 - 63.9)
Yang_WHU_TASK1_2 STD Yuhong Yang National Engineering Research Center for Multimedia Software, Wuhan University, Hubei, China; Collaborative Innovation Center of Geospatial Technology, Wuhan, China task-acoustic-scene-classification-results#Lu2017 65.2 (62.9 - 67.6)
Yang_WHU_TASK1_3 MFS+STD Yuhong Yang National Engineering Research Center for Multimedia Software, Wuhan University, Hubei, China; Collaborative Innovation Center of Geospatial Technology, Wuhan, China task-acoustic-scene-classification-results#Lu2017 62.8 (60.5 - 65.2)
Yang_WHU_TASK1_4 Pre-training Yuhong Yang National Engineering Research Center for Multimedia Software, Wuhan University, Hubei, China; Collaborative Innovation Center of Geospatial Technology, Wuhan, China task-acoustic-scene-classification-results#Lu2017 63.6 (61.3 - 66.0)
Yu_UOS_task1_1 UOS_DualIn Yu Ha-Jin School of Computer Science, University of Seoul, Seoul, Republic of South Korea task-acoustic-scene-classification-results#Jee-Weon2017 67.0 (64.7 - 69.3)
Yu_UOS_task1_2 UOS_BalCos Yu Ha-Jin School of Computer Science, University of Seoul, Seoul, Republic of South Korea task-acoustic-scene-classification-results#Jee-Weon2017 66.2 (63.9 - 68.5)
Yu_UOS_task1_3 UOS_DatDup Yu Ha-Jin School of Computer Science, University of Seoul, Seoul, Republic of South Korea task-acoustic-scene-classification-results#Jee-Weon2017 67.3 (65.1 - 69.6)
Yu_UOS_task1_4 UOS_res Yu Ha-Jin School of Computer Science, University of Seoul, Seoul, Republic of South Korea task-acoustic-scene-classification-results#Jee-Weon2017 70.6 (68.3 - 72.8)
Zhao_ADSC_task1_1 MResNet-34 Shengkui Zhao Illinois at Singapore, Advanced Digital Sciences Center, Singapore task-acoustic-scene-classification-results#Zhao2017 70.0 (67.8 - 72.2)
Zhao_ADSC_task1_2 Conv Shengkui Zhao Illinois at Singapore, Advanced Digital Sciences Center, Singapore task-acoustic-scene-classification-results#Zhao2017 67.9 (65.6 - 70.2)
Zhao_UAU_UP_task1_1 GRNN Ren Zhao Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany; Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany task-acoustic-scene-classification-results#Zhao2017a 63.8 (61.5 - 66.2)

Complete results and technical reports can be found here.

Baseline system

A baseline system for the task is provided. The system implements a basic approach to acoustic scene classification and serves as a comparison point for participants while they develop their systems. The baseline systems for all tasks share a common code base, implementing a similar approach for each task. The baseline system will download the needed datasets and produces the results below when run with the default parameters.

The baseline system is based on a multilayer perceptron architecture using log mel-band energies as features. A 5-frame context is used, resulting in a feature vector of length 200. Using these features, a neural network with two dense layers of 50 hidden units each and 20% dropout is trained for 200 epochs. The classification decision is based on the softmax output layer of the network. A detailed description is available in the baseline system documentation. The baseline system includes evaluation of results using accuracy as the metric.
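The input construction can be illustrated as follows: 40 log mel-band energies per frame, stacked over a 5-frame context, give the 200-dimensional feature vector. This is a sketch of the idea only, with random values standing in for real mel energies; the function name and edge-padding choice are assumptions, not the baseline's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
mel = rng.standard_normal((500, 40))  # (frames, mel bands) stand-in

def stack_context(features, context=5):
    """Concatenate each frame with its neighbours into one vector."""
    half = context // 2
    # Repeat the first/last frame so every frame has a full context.
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    return np.stack(
        [padded[i : i + context].reshape(-1) for i in range(len(features))]
    )

X = stack_context(mel)
print(X.shape)  # (500, 200): one 200-dim vector per frame
```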

The baseline system is implemented in Python (versions 2.7 and 3.6). Participants are allowed to build their systems on top of the given baseline system. The system has all the needed functionality for dataset handling, storing and accessing features and models, and evaluating results, making adaptation to one's needs rather easy. The baseline system is also a good starting point for entry-level researchers.

Python implementation


Results for TUT Acoustic scenes 2017, development dataset

Evaluation setup

  • 4-fold cross-validation, average classification accuracy over folds
  • 15 acoustic scene classes
  • Classification unit: one file (10 seconds of audio).
  • Python 2.7.13 used

System parameters

  • Frame size: 40 ms (with 50% hop size)
  • Feature vector: 40 log mel-band energies in 5 consecutive frames = 200 values
  • MLP: 2 layers x 50 hidden units, 20% dropout, 200 epochs (with an early stopping criterion: monitoring started after epoch 100, patience of 10 epochs), learning rate 0.001, softmax output layer
  • Trained and tested on full audio

Acoustic scene classification results, averaged over evaluation folds.

Acoustic scene Accuracy
Beach 75.3 %
Bus 71.8 %
Cafe / Restaurant 57.7 %
Car 97.1 %
City center 90.7 %
Forest path 79.5 %
Grocery store 58.7 %
Home 68.6 %
Library 57.1 %
Metro station 91.7 %
Office 99.7 %
Park 70.2 %
Residential area 64.1 %
Train 58.0 %
Tram 81.7 %
Overall accuracy 74.8 %