Acoustic scene classification

Task description


Annamaria Mesaros
Toni Heittola


Listening experiment


We are currently conducting a listening experiment to assess human performance for the task of acoustic scene classification. Help us by participating in this experiment.

The listening experiment should take approximately 25 minutes.


Reference data for evaluation dataset published


The reference data is now published. At the same time, the evaluation dataset has been moved to Zenodo for permanent storage.



The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterizes the environment in which it was recorded, for example "park", "home", or "office".

Figure 1: Overview of acoustic scene classification system.

Audio dataset

The TUT Acoustic scenes 2016 dataset will be used for the task. The dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into 30-second segments for the challenge.

Acoustic scenes for the task (15):

  • Bus - traveling by bus in the city (vehicle)
  • Cafe / Restaurant - small cafe/restaurant (indoor)
  • Car - driving or traveling as a passenger, in the city (vehicle)
  • City center (outdoor)
  • Forest path (outdoor)
  • Grocery store - medium size grocery store (indoor)
  • Home (indoor)
  • Lakeside beach (outdoor)
  • Library (indoor)
  • Metro station (indoor)
  • Office - multiple persons, typical work day (indoor)
  • Residential area (outdoor)
  • Train (traveling, vehicle)
  • Tram (traveling, vehicle)
  • Urban park (outdoor)

Detailed description of acoustic scenes included in the dataset can be found here.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 and 01/2016. The data collection has received funding from the European Research Council.


Recording and annotation procedure

For all acoustic scenes, each recording was captured in a different location: different streets, different parks, different homes. Recordings were made using a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Roland Edirol R-09 wave recorder, using a 44.1 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones and are worn in the ears. As a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment.

Postprocessing of the recorded data addressed the privacy of recorded individuals and possible errors in the recording process. For audio material recorded in private places, written consent was obtained from all people involved. Material recorded in public places does not require such consent, but it was screened for content, and privacy-infringing segments were eliminated. Microphone failures and audio distortions were annotated, and segments containing such errors were also eliminated.

After eliminating the problematic segments, the remaining audio material was cut into 30-second segments.

Challenge setup

The TUT Acoustic scenes 2016 dataset consists of two subsets: a development dataset and an evaluation dataset. The data was partitioned into the subsets based on the location of the original recordings. All segments obtained from the same original recording were included in a single subset, either the development dataset or the evaluation dataset. For each acoustic scene, 78 segments (39 minutes of audio) were included in the development dataset and 26 segments (13 minutes of audio) were kept for evaluation. The development set contains 9 h 45 min of audio in total, and the evaluation set 3 h 15 min.
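The totals quoted above follow from the per-scene segment counts. A quick check in Python, with illustrative constant and function names that are not part of the dataset tooling:

```python
# Sanity check of the dataset sizes quoted above.
NUM_SCENES = 15          # acoustic scene classes
SEGMENT_SECONDS = 30     # length of one segment


def total_duration(segments_per_scene):
    """Return (hours, minutes) of audio across all scenes."""
    total_seconds = NUM_SCENES * segments_per_scene * SEGMENT_SECONDS
    hours, remainder = divmod(total_seconds, 3600)
    return hours, remainder // 60


development = total_duration(78)   # (9, 45)
evaluation = total_duration(26)    # (3, 15)
print(development, evaluation)
```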

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.

Download datasets:

In publications using the datasets, cite as:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, TUT database for acoustic scene classification and sound event detection, in 24th European Signal Processing Conference 2016 (EUSIPCO 2016), Budapest, Hungary, 2016.

Cross-validation with development dataset

A cross-validation setup is provided for the development dataset so that results reported on this dataset are comparable. The setup consists of four folds that distribute the 78 available segments of each acoustic scene based on location. The folds are provided with the dataset in the directory evaluation setup.

If you do not use the provided cross-validation setup, pay attention to segments extracted from the same original recording. Make sure that all files recorded in the same location are placed on the same side of the evaluation.
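For anyone building a custom split, the location constraint can be sketched as follows. This is a minimal illustration and not the procedure used to create the provided folds. The mapping from filename to location identifier is a hypothetical input, and the dataset's own metadata layout may differ.

```python
from collections import defaultdict


def location_folds(segment_locations, n_folds=4):
    """Assign segments to folds so that each location stays in one fold.

    segment_locations: dict mapping segment filename -> location id
    (hypothetical identifiers, used here for illustration only).
    Returns a list of n_folds lists of filenames.
    """
    by_location = defaultdict(list)
    for filename, location in segment_locations.items():
        by_location[location].append(filename)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest locations first, each into the
    # currently smallest fold.
    ordered = sorted(by_location, key=lambda loc: (-len(by_location[loc]), loc))
    for location in ordered:
        smallest = min(folds, key=len)
        smallest.extend(sorted(by_location[location]))
    return folds


# Toy example with made-up filenames and locations.
segments = {
    "a1.wav": "park_1", "a2.wav": "park_1",
    "b1.wav": "park_2",
    "c1.wav": "park_3", "c2.wav": "park_3",
}
folds = location_folds(segments, n_folds=2)
print(folds)
```

Whichever fold is held out for testing, no location then appears on both the training and the test side.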

Evaluation dataset

The evaluation dataset without ground truth will be released shortly before the submission deadline. Full ground truth metadata for it will be published after the DCASE 2016 challenge.


Detailed information on the challenge submission can be found on the submission page.

Submit a single text file (CSV format with tab-separated fields) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
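A minimal sketch of writing results in this format with Python's standard csv module. The filenames and predicted labels are hypothetical placeholders, and the exact label spelling expected by the organizers should be taken from the dataset metadata.

```python
import csv
import io


def write_results(results, stream):
    """Write one 'filename<TAB>scene label' row per evaluated audio file."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for filename, scene_label in results:
        writer.writerow([filename, scene_label])


# Hypothetical filenames and predictions, for illustration only.
results = [("audio/example_1.wav", "park"), ("audio/example_2.wav", "home")]
buffer = io.StringIO()
write_results(results, buffer)
print(buffer.getvalue())
```

To produce an actual submission file, pass an opened file object, for example `open("results.txt", "w", newline="")`, in place of the in-memory buffer.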


The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments among the total number of segments. Each segment is considered an independent test sample.
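The metric can be sketched in a few lines. This is an illustration of the definition above, not the reference implementation shipped with the baseline; the function name and the choice to count missing predictions as errors are assumptions of the sketch.

```python
def classification_accuracy(reference, predicted):
    """Fraction of segments whose predicted scene matches the reference.

    Both arguments map segment filename -> scene label. Segments missing
    from `predicted` count as errors (an assumption of this sketch).
    """
    if not reference:
        raise ValueError("reference must contain at least one segment")
    correct = sum(
        1 for name, label in reference.items() if predicted.get(name) == label
    )
    return correct / len(reference)


# Toy example with made-up filenames.
reference = {"a.wav": "park", "b.wav": "home", "c.wav": "office", "d.wav": "tram"}
predicted = {"a.wav": "park", "b.wav": "home", "c.wav": "library", "d.wav": "tram"}
print(classification_accuracy(reference, predicted))  # 0.75
```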

Code for evaluation is available with the baseline system:

  • Python implementation: from src.evaluation import DCASE2016_SceneClassification_Metrics
  • Matlab implementation: use the class src/evaluation/DCASE2016_SceneClassification_Metrics.m


Submission code | Name | Corresponding author | Affiliation | Classification accuracy (%)
Aggarwal | - | Naveen Aggarwal | UIET, Panjab University, Chandigarh, India | 74.4
Bae | CLC | Soo Hyun Bae | Seoul National University Department of Electrical and Computer Engineering, South Korea | 84.1
Bao | - | Xiao Bao | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China | 83.1
Battaglino | - | Daniele Battaglino | EURECOM, France | 80.0
Bisot | - | Victor Bisot | Telecom ParisTech, France | 87.7
DCASE | DCASE2016_baseline | Toni Heittola | Tampere University of Technology, Department of Signal Processing, Tampere, Finland | 77.2
Duong_1 | Tec_SVM_A | Quang-Khanh-Ngoc Duong | Technicolor, France | 76.4
Duong_2 | Tec_SVM_V | Quang-Khanh-Ngoc Duong | Technicolor, France | 80.5
Duong_3 | Tec_MLP | Quang-Khanh-Ngoc Duong | Technicolor, France | 73.1
Duong_4 | Tec_CNN | Quang-Khanh-Ngoc Duong | Technicolor, France | 62.8
Eghbal-Zadeh_1 | CPJKU16_BMBI | Hamid Eghbal-Zadeh | Johannes Kepler University of Linz, Austria | 86.4
Eghbal-Zadeh_2 | CPJKU16_CBMBI | Hamid Eghbal-Zadeh | Johannes Kepler University of Linz, Austria | 88.7
Eghbal-Zadeh_3 | CPJKU16_DCNN | Hamid Eghbal-Zadeh | Johannes Kepler University of Linz, Austria | 83.3
Eghbal-Zadeh_4 | CPJKU16_LFCBI | Hamid Eghbal-Zadeh | Johannes Kepler University of Linz, Austria | 89.7
Foleiss | JFTT | Juliano Henrique Foleiss | Universidade Tecnologica Federal do Parana, Brazil | 76.2
Hertel | All-ConvNet | Lars Hertel | Institute for Signal Processing, University of Luebeck, Germany | 79.5
Kim | QRK | Taesu Kim | Qualcomm Research, South Korea | 82.1
Ko_1 | KU_ISPL1_2016 | Hanseok Ko | Korea University, South Korea | 87.2
Ko_2 | KU_ISPL2_2016 | Hanseok Ko | Korea University, South Korea | 82.3
Kong | QK | Qiuqiang Kong | University of Surrey, United Kingdom | 81.0
Kumar | Gauss | Anurag Kumar | Carnegie Mellon University, USA | 85.9
Lee_1 | MARGNet_MWFD | Kyogu Lee | Music and Audio Research Group (MARG), Seoul National University, Seoul, Korea | 84.6
Lee_2 | MARGNet_ZENS | Kyogu Lee | Music and Audio Research Group (MARG), Seoul National University, Seoul, Korea | 85.4
Liu_1 | liu-re | Jiaming Liu | Department of Control Science and Engineering, Tongji University, Shanghai, China | 83.8
Liu_2 | liu-pre | Jiaming Liu | Department of Control Science and Engineering, Tongji University, Shanghai, China | 83.6
Lostanlen | LostanlenAnden_2016 | Vincent Lostanlen | ENS Paris, France | 80.8
Marchi | Marchi_2016 | Erik Marchi | University of Passau, Germany; audEERING GmbH, Gilching, Germany | 86.4
Marques | DRKNN_2016 | Gonçalo Marques | Instituto Superior de Engenharia de Lisboa Electronic Telecom. and Comp. Dept., Portugal | 83.1
Moritz | - | Niko Moritz | Fraunhofer IDMT, Project Group for Hearing, Speech, and Audio Processing, Germany | 79.0
Mulimani | - | Manjunath Mulimani | National Institute of Technology, Karnataka | 65.6
Nogueira | - | Waldo Nogueira | Medical University Hannover and Cluster of Excellence Hearing4all, Hannover, Germany | 81.0
Patiyal | IITMandi_2016 | Rohit Patiyal | Indian Institute of Technology Mandi, Himachal Pradesh, India | 78.5
Phan | CNN-LTE | Huy Phan | Institute for Signal Processing, University of Luebeck, Germany | 83.3
Pugachev | - | Alexei Pugachev | ITMO University, St. Petersburg, Russia | 73.1
Qu_1 | - | Shuhui Qu | Stanford University, USA | 80.5
Qu_2 | - | Shuhui Qu | Stanford University, USA | 84.1
Qu_3 | - | Shuhui Qu | Stanford University, USA | 82.3
Qu_4 | - | Shuhui Qu | Stanford University, USA | 80.5
Rakotomamonjy_1 | RAK_2016_1 | Alain Rakotomamonjy | Normandie Université, France | 82.1
Rakotomamonjy_2 | RAK_2016_2 | Alain Rakotomamonjy | Normandie Université, France | 79.2
Santoso | SWW | Andri Santoso | National Central University, Taiwan | 80.8
Schindler_1 | CQTCNN_1 | Alexander Schindler | AIT Austrian Institute of Technology GmbH, Austria | 81.8
Schindler_2 | CQTCNN_2 | Alexander Schindler | AIT Austrian Institute of Technology GmbH, Austria | 83.3
Takahashi | UTNII_2016 | Gen Takahashi | University of Tsukuba, Japan | 85.6
Valenti | - | Michele Valenti | Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy | 86.2
Vikaskumar | ABSP_IITKGP_2016 | Ghodasara Vikaskumar | Electronics & Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, India | 81.3
Vu | - | Toan H. Vu | Department of Computer Science and Information Engineering, National Central University, Taiwan | 80.0
Xu | HL-DNN-ASC_2016 | Yong Xu | Centre for Vision, Speech and Signal Processing, University of Surrey, United Kingdom | 73.3
Zoehrer | - | Matthias Zöhrer | Signal Processing and Speech Communication Laboratory Graz University of Technology, Austria | 73.1

Complete results and technical reports can be found here.


  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • A technical report with a sufficient description of the system has to be submitted along with the system outputs.

More information is available on the submission process and Frequently Asked Questions pages.

Baseline system

A baseline system for the task is provided. It implements a basic approach to acoustic scene classification and gives participants a comparison point while developing their systems. The baseline systems for task 1 and task 3 share a code base and implement a similar approach for both tasks. When run with the default parameters, the baseline system downloads the needed datasets and produces the results below.

The baseline system is based on MFCC acoustic features and a GMM classifier. The acoustic features include MFCC static coefficients (0th coefficient included), delta coefficients, and acceleration coefficients. The system learns one acoustic model per acoustic scene class and classifies with a maximum likelihood scheme.
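The decision rule can be illustrated with a deliberately simplified sketch. It is not the baseline code: a single diagonal Gaussian per class stands in for the GMM, MFCC extraction is omitted, and summing frame-wise log-likelihoods over a recording is an assumption about how the scores are accumulated. All function names are hypothetical.

```python
import math


def fit_diagonal_gaussian(frames):
    """Estimate per-dimension mean and variance from a list of feature vectors."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-6)  # variance floor
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(frames, model):
    """Sum of frame-wise log-likelihoods under a diagonal Gaussian."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, mu, var in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def classify(frames, models):
    """Return the scene whose model gives the highest total log-likelihood."""
    return max(models, key=lambda scene: log_likelihood(frames, models[scene]))


# Toy 2-dimensional "features" standing in for 60-dimensional MFCC vectors.
models = {
    "office": fit_diagonal_gaussian([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]),
    "tram": fit_diagonal_gaussian([[5.0, 4.9], [5.2, 5.1], [4.8, 5.0]]),
}
print(classify([[0.1, 0.0], [0.0, 0.05]], models))
```

Replacing the single Gaussian with a mixture changes only the per-frame likelihood. The argmax over class models stays the same.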

The baseline system also provides a reference implementation of the evaluation metric. Baseline systems are provided for both Python and Matlab; the Python implementation is regarded as the main implementation.

Participants are allowed to build their system on top of the given baseline systems. The systems have all the functionality needed for dataset handling, storing and accessing features and models, and evaluating the results, which makes adapting them to one's needs rather easy. The baseline systems are also a good starting point for entry-level researchers.

In publications using the baseline, cite as:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, TUT database for acoustic scene classification and sound event detection, in 24th European Signal Processing Conference 2016 (EUSIPCO 2016), Budapest, Hungary, 2016.

Python implementation

Latest release (version 1.0.6) (.zip)

Matlab implementation

Latest release (version 1.0.5) (.zip)

Results for TUT Acoustic scenes 2016, development set

Evaluation setup

  • 4-fold cross-validation, average classification accuracy over folds
  • 15 acoustic scene classes
  • Classification unit: one file (30 seconds of audio).

System parameters

  • Frame size: 40 ms (with 50% hop size)
  • Number of Gaussians per acoustic scene class model: 16
  • Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
  • Trained and tested on full audio
  • Python implementation
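To make the parameters concrete, the following sketch converts them to sample counts. It assumes the audio is processed at the 44.1 kHz recording rate and that only frames lying fully inside a 30-second segment are used (no padding). Neither assumption is stated for the baseline, so the frame count is indicative only.

```python
SAMPLE_RATE_HZ = 44100   # assumption: audio processed at the recording rate
FRAME_MS = 40
HOP_FRACTION = 0.5
SEGMENT_SECONDS = 30
N_STATIC = 20

frame_length = SAMPLE_RATE_HZ * FRAME_MS // 1000      # samples per frame
hop_length = int(frame_length * HOP_FRACTION)         # samples between frame starts
n_samples = SAMPLE_RATE_HZ * SEGMENT_SECONDS
# Frames that fit entirely inside the segment (no padding assumed).
n_frames = 1 + (n_samples - frame_length) // hop_length
feature_dim = 3 * N_STATIC                            # static + delta + acceleration

print(frame_length, hop_length, n_frames, feature_dim)  # 1764 882 1499 60
```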

Acoustic scene classification results, averaged over evaluation folds.
Acoustic scene | Accuracy
Beach | 69.3 %
Bus | 79.6 %
Cafe / Restaurant | 83.2 %
Car | 87.2 %
City center | 85.5 %
Forest path | 81.0 %
Grocery store | 65.0 %
Home | 82.1 %
Library | 50.4 %
Metro station | 94.7 %
Office | 98.6 %
Park | 13.9 %
Residential area | 77.7 %
Train | 33.6 %
Tram | 85.4 %
Overall accuracy | 72.5 %
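
Because every scene contributes the same number of segments, the overall accuracy should equal the unweighted mean of the class-wise accuracies. A quick check with the values copied from the table above (the variable names are illustrative):

```python
# Class-wise accuracies (%) copied from the results table.
class_accuracy = {
    "Beach": 69.3, "Bus": 79.6, "Cafe / Restaurant": 83.2, "Car": 87.2,
    "City center": 85.5, "Forest path": 81.0, "Grocery store": 65.0,
    "Home": 82.1, "Library": 50.4, "Metro station": 94.7, "Office": 98.6,
    "Park": 13.9, "Residential area": 77.7, "Train": 33.6, "Tram": 85.4,
}

overall = sum(class_accuracy.values()) / len(class_accuracy)
print(round(overall, 1))  # 72.5
```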