Acoustic scene classification


Task description

Coordinators

Annamaria Mesaros
Toni Heittola
The challenge has ended. Full results for this task can be found here.

Description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterizes the environment in which it was recorded, for example "park", "home", or "office".

Figure 1: Overview of acoustic scene classification system.

Audio dataset

The TUT Acoustic scenes 2016 dataset will be used for the task. The dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into 30-second segments for the challenge.

Acoustic scenes for the task (15):

  • Bus - traveling by bus in the city (vehicle)
  • Cafe / Restaurant - small cafe/restaurant (indoor)
  • Car - driving or traveling as a passenger, in the city (vehicle)
  • City center (outdoor)
  • Forest path (outdoor)
  • Grocery store - medium size grocery store (indoor)
  • Home (indoor)
  • Lakeside beach (outdoor)
  • Library (indoor)
  • Metro station (indoor)
  • Office - multiple persons, typical work day (indoor)
  • Residential area (outdoor)
  • Train (traveling, vehicle)
  • Tram (traveling, vehicle)
  • Urban park (outdoor)

A detailed description of the acoustic scenes included in the dataset can be found here.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 and 01/2016. The data collection has received funding from the European Research Council.


Recording and annotation procedure

For all acoustic scenes, each recording was captured in a different location: different streets, different parks, different homes. Recordings were made using a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Roland Edirol R-09 wave recorder, using a 44.1 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones and are worn in the ears. As a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment.

Postprocessing of the recorded data addressed the privacy of recorded individuals and possible errors in the recording process. For audio material recorded in private places, written consent was obtained from all people involved. Material recorded in public places does not require such consent, but it was screened for content, and privacy-infringing segments were eliminated. Microphone failures and audio distortions were also annotated, and segments containing such errors were eliminated.

After eliminating the problematic segments, the remaining audio material was cut into 30-second segments.

Challenge setup

The TUT Acoustic scenes 2016 dataset consists of two subsets: a development dataset and an evaluation dataset. The data was partitioned into the subsets based on the location of the original recordings. All segments obtained from the same original recording were included in a single subset, either the development dataset or the evaluation dataset. For each acoustic scene, 78 segments (39 minutes of audio) were included in the development dataset and 26 segments (13 minutes of audio) were kept for evaluation. The development set contains 9 h 45 min of audio in total, and the evaluation set 3 h 15 min.
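The subset totals follow directly from the per-scene segment counts. The short check below uses only figures stated above (15 scenes, 30-second segments, 78 and 26 segments per scene); the function and constant names are illustrative.

```python
# Figures taken from the dataset description above.
NUM_SCENES = 15
SEGMENT_SECONDS = 30
SEGMENTS_PER_SCENE = {"development": 78, "evaluation": 26}


def subset_duration(segments_per_scene):
    """Return (total_segments, hours, minutes) for one subset."""
    total_segments = NUM_SCENES * segments_per_scene
    total_minutes = total_segments * SEGMENT_SECONDS // 60
    return total_segments, total_minutes // 60, total_minutes % 60


for name, count in SEGMENTS_PER_SCENE.items():
    segments, hours, minutes = subset_duration(count)
    print(f"{name}: {segments} segments, {hours} h {minutes} min")
```

This gives 1170 development segments and 390 evaluation segments, matching the stated 9 h 45 min and 3 h 15 min.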

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.

Download datasets:


In publications using the datasets, cite as:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, TUT database for acoustic scene classification and sound event detection, In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

Cross-validation with development dataset

A cross-validation setup is provided for the development dataset so that results reported with this dataset are comparable. The setup consists of four folds that distribute the 78 available segments per scene based on location. The folds are provided with the dataset in the directory evaluation_setup.

If you do not use the provided cross-validation setup, pay attention to segments extracted from the same original recording. Make sure that all files recorded in the same location are placed on the same side of the evaluation (either training or testing).
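For a custom split, one way to respect this constraint is to assign whole locations to folds rather than individual files. The sketch below is a minimal illustration and not the procedure used to build the official folds; it assumes the metadata is available as (filename, location_id) pairs, and the file names and location identifiers shown are hypothetical.

```python
from collections import defaultdict


def grouped_folds(items, num_folds=4):
    """Split (filename, location_id) pairs into folds, keeping each location whole."""
    by_location = defaultdict(list)
    for filename, location_id in items:
        by_location[location_id].append(filename)

    folds = [[] for _ in range(num_folds)]
    # Place the largest locations first, each into the currently smallest fold,
    # so that fold sizes stay roughly balanced.
    ordered = sorted(by_location, key=lambda loc: (-len(by_location[loc]), loc))
    for location_id in ordered:
        min(folds, key=len).extend(by_location[location_id])
    return folds


# Hypothetical metadata: two files each from locations a and c, one each from b and d.
items = [
    ("a1.wav", "loc_a"), ("a2.wav", "loc_a"),
    ("b1.wav", "loc_b"),
    ("c1.wav", "loc_c"), ("c2.wav", "loc_c"),
    ("d1.wav", "loc_d"),
]
folds = grouped_folds(items, num_folds=2)
for index, fold in enumerate(folds, start=1):
    print(f"fold {index}: {sorted(fold)}")
```

Each fold can then serve once as the test set while the remaining folds are used for training, and no location ever appears on both sides.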

Evaluation dataset

The evaluation dataset without ground truth will be released shortly before the submission deadline. Full ground truth metadata for it will be published after the DCASE 2016 challenge.

Submission

Detailed information on the challenge submission can be found on the submission page.

Submit a single text file (in CSV format, tab-separated) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
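A minimal sketch of producing a file in this format is shown below. The function names, file names, and output path are hypothetical; only the line layout (filename, tab, scene label) comes from the format above.

```python
def format_submission(results):
    """Return the submission text: one 'filename<TAB>scene label' line per file."""
    return "".join(f"{filename}\t{label}\n" for filename, label in results)


def write_submission(results, path):
    """Write the formatted results to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_submission(results))


# Hypothetical classification results for two evaluation files.
results = [("audio/example_1.wav", "park"), ("audio/example_2.wav", "home")]
print(format_submission(results), end="")
# To save to disk: write_submission(results, "task1_results.txt")
```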

Evaluation

The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments divided by the total number of segments. Each segment is considered an independent test sample.

Code for evaluation is available with the baseline system:

  • Python implementation: from src.evaluation import DCASE2016_SceneClassification_Metrics.
  • Matlab implementation: use class src/evaluation/DCASE2016_SceneClassification_Metrics.m.
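For readers who want to see the metric itself, the following is a minimal standalone sketch of segment-level accuracy. It is not the baseline's DCASE2016_SceneClassification_Metrics class; the function name is illustrative, and counting a file with no prediction as incorrect is an assumption.

```python
def classification_accuracy(reference, predicted):
    """Fraction of files whose predicted scene label matches the reference.

    Both arguments map filename -> scene label. A file missing from
    `predicted` is counted as incorrect (an assumption of this sketch).
    """
    if not reference:
        raise ValueError("reference must not be empty")
    correct = sum(
        1 for name, label in reference.items() if predicted.get(name) == label
    )
    return correct / len(reference)


# Hypothetical labels for four segments; one is misclassified.
reference = {"a.wav": "park", "b.wav": "home", "c.wav": "office", "d.wav": "park"}
predicted = {"a.wav": "park", "b.wav": "home", "c.wav": "home", "d.wav": "park"}
print(classification_accuracy(reference, predicted))  # 0.75
```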

Results

Code | Name | Author | Affiliation | Technical report | Classification accuracy (%)
Aggarwal_task1_1 | - | Naveen Aggarwal | UIET, Panjab University, Chandigarh, India | Vij2016 | 74.4
Bae_task1_1 | CLC | Soo Hyun Bae | Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea | Bae2016 | 84.1
Bao_task1_1 | - | Xiao Bao | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China | Bao2016 | 83.1
Battaglino_task1_1 | - | Daniele Battaglino | NXP Software, France; EURECOM, France | Battaglino2016 | 80.0
Bisot_task1_1 | - | Victor Bisot | Telecom ParisTech, Paris, France | Bisot2016 | 87.7
DCASE2016 baseline | DCASE2016_baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | Heittola2016 | 77.2
Duong_task1_1 | Tec_SVM_A | Quang-Khanh-Ngoc Duong | Technicolor, France | Sena_Mafra2016 | 76.4
Duong_task1_2 | Tec_SVM_V | Quang-Khanh-Ngoc Duong | Technicolor, France | Sena_Mafra2016 | 80.5
Duong_task1_3 | Tec_MLP | Quang-Khanh-Ngoc Duong | Technicolor, France | Sena_Mafra2016 | 73.1
Duong_task1_4 | Tec_CNN | Quang-Khanh-Ngoc Duong | Technicolor, France | Sena_Mafra2016 | 62.8
Eghbal-Zadeh_task1_1 | CPJKU16_BMBI | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | Eghbal-Zadeh2016 | 86.4
Eghbal-Zadeh_task1_2 | CPJKU16_CBMBI | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | Eghbal-Zadeh2016 | 88.7
Eghbal-Zadeh_task1_3 | CPJKU16_DCNN | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | Eghbal-Zadeh2016 | 83.3
Eghbal-Zadeh_task1_4 | CPJKU16_LFCBI | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | Eghbal-Zadeh2016 | 89.7
Foleiss_task1_1 | JFTT | Juliano Henrique Foleiss | Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil | Foleiss2016 | 76.2
Hertel_task1_1 | All-ConvNet | Alfred Mertins | Institute for Signal Processing, University of Luebeck, Luebeck, Germany | Hertel2016 | 79.5
Kim_task1_1 | QRK | Alfred Mertins | Institute for Signal Processing, University of Luebeck, Luebeck, Germany | Yun2016 | 82.1
Ko_task1_1 | KU_ISPL1_2016 | Hanseok Ko | School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea | Park2016 | 87.2
Ko_task1_2 | KU_ISPL2_2016 | Hanseok Ko | School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea | Mun2016 | 82.3
Kong_task1_1 | QK | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | Kong2016 | 81.0
Kumar_task1_1 | Gauss | Anurag Kumar | Carnegie Mellon University, Pittsburgh, USA | Elizalde2016 | 85.9
Lee_task1_1 | MARGNet_MWFD | Kyogu Lee | Music and Audio Research Group, Seoul National University, Seoul, South Korea | Han2016 | 84.6
Lee_task1_2 | MARGNet_ZENS | Kyogu Lee | Music and Audio Research Group, Seoul National University, Seoul, South Korea | Kim2016 | 85.4
Liu_task1_1 | liu-re | Jiaming Liu | Department of Control Science and Engineering, Tongji University, Shanghai, China | Liu2016 | 83.8
Liu_task1_2 | liu-pre | Jiaming Liu | Department of Control Science and Engineering, Tongji University, Shanghai, China | Liu2016 | 83.6
Lostanlen_task1_1 | LostanlenAnden_2016 | Vincent Lostanlen | Departement d’Informatique, Ecole normale superieure, Paris, France | Lostanlen2016 | 80.8
Marchi_task1_1 | Marchi_2016 | Erik Marchi | Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany; audEERING GmbH, Gilching, Germany | Marchi2016 | 86.4
Marques_task1_1 | DRKNN_2016 | Gonçalo Marques | Electronic Telecom. and Comp. Dept., Instituto Superior de Engenharia de Lisboa, Lisboa, Portugal | Marques2016 | 83.1
Moritz_task1_1 | - | Niko Moritz | Project Group for Hearing, Speech, and Audio Processing, Fraunhofer IDMT, Oldenburg, Germany | Moritz2016 | 79.0
Mulimani_task1_1 | - | Manjunath Mulimani | Dept. of Computer Science & Engineering, National Institute of Technology, Karnataka, India | Mulimani2016 | 65.6
Nogueira_task1_1 | - | Waldo Nogueira | Medical University Hannover, Hannover, Germany; Cluster of Excellence Hearing4all, Hannover, Germany | Nogueira2016 | 81.0
Patiyal_task1_1 | IITMandi_2016 | Rohit Patiyal | School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Himachal Pradesh, India | Patiyal2016 | 78.5
Phan_task1_1 | CNN-LTE | Huy Phan | Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany | Phan2016 | 83.3
Pugachev_task1_1 | - | Alexei Pugachev | Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia | Pugachev2016 | 73.1
Qu_task1_1 | - | Shuhui Qu | Stanford University, Stanford, USA | Dai2016 | 80.5
Qu_task1_2 | - | Shuhui Qu | Stanford University, Stanford, USA | Dai2016 | 84.1
Qu_task1_3 | - | Shuhui Qu | Stanford University, Stanford, USA | Dai2016 | 82.3
Qu_task1_4 | - | Shuhui Qu | Stanford University, Stanford, USA | Dai2016 | 80.5
Rakotomamonjy_task1_1 | RAK_2016_1 | Alain Rakotomamonjy | Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France | Rakotomamonjy2016 | 82.1
Rakotomamonjy_task1_2 | RAK_2016_2 | Alain Rakotomamonjy | Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France | Rakotomamonjy2016 | 79.2
Santoso_task1_1 | SWW | Andri Santoso | National Central University, Taiwan | Santoso2016 | 80.8
Schindler_task1_1 | CQTCNN_1 | Alexander Schindler | Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria | Lidy2016 | 81.8
Schindler_task1_2 | CQTCNN_2 | Alexander Schindler | Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria | Lidy2016 | 83.3
Takahashi_task1_1 | UTNII_2016 | Gen Takahashi | University of Tsukuba, Tsukuba, Japan | Takahashi2016 | 85.6
Valenti_task1_1 | - | Michele Valenti | Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy | Valenti2016 | 86.2
Vikaskumar_task1_1 | ABSP_IITKGP_2016 | Ghodasara Vikaskumar | Electronics & Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India | Vikaskumar2016 | 81.3
Vu_task1_1 | - | Toan H. Vu | Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | Vu2016 | 80.0
Xu_task1_1 | HL-DNN-ASC_2016 | Yong Xu | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | Xu2016 | 73.3
Zoehrer_task1_1 | - | Matthias Zöhrer | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria | Zoehrer2016 | 73.1

Complete results and technical reports can be found here.

Rules

  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing in data sampled from a probability density function (pdf); use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • A technical report with a sufficient description of the system must be submitted along with the system outputs.

More information can be found on the submission process and Frequently Asked Questions pages.

Baseline system

A baseline system for the task is provided. The system is meant to implement a basic approach to acoustic scene classification and to provide a comparison point for participants while they develop their systems. The baseline systems for task 1 and task 3 share a code base and implement quite similar approaches for both tasks. The baseline system downloads the needed datasets and produces the results below when run with the default parameters.

The baseline system is based on MFCC acoustic features and a GMM classifier. The acoustic features include MFCC static coefficients (0th coefficient included), delta coefficients and acceleration coefficients. The system learns one acoustic model per acoustic scene class and classifies using a maximum likelihood scheme: a test recording is assigned to the class whose model gives its features the highest likelihood.
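The decision rule can be illustrated with a small standalone sketch. To stay short and dependency-free, it swaps in a single diagonal-covariance Gaussian per class in place of the baseline's GMMs, and uses toy 2-dimensional feature vectors in place of MFCCs; all function names and data are illustrative and are not taken from the baseline code.

```python
import math


def fit_diagonal_gaussian(frames):
    """Estimate per-dimension mean and variance from a list of feature vectors."""
    count = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dims)]
    variances = [
        # Floor the variance so the log density stays finite.
        max(sum((f[d] - means[d]) ** 2 for f in frames) / count, 1e-6)
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(frames, model):
    """Sum of per-frame log densities under a diagonal Gaussian."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, mean, var in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total


def classify(frames, models):
    """Return the scene label whose model gives the highest total log-likelihood."""
    return max(models, key=lambda label: log_likelihood(frames, models[label]))


# Toy training data: one model is learned per scene class.
training = {
    "park": [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
    "office": [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]],
}
models = {label: fit_diagonal_gaussian(frames) for label, frames in training.items()}
print(classify([[0.1, 0.0], [0.0, 0.2]], models))  # park
```

Summing log-likelihoods over frames treats the frames of a recording as independent, which is the usual simplification when a frame-level model is used to label a whole file.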

The baseline system also provides a reference implementation of the evaluation metric. Baseline systems are provided for both Python and Matlab. The Python implementation is regarded as the main implementation.

Participants are allowed to build their systems on top of the given baseline systems. The systems have all the functionality needed for dataset handling, storing and accessing features and models, and evaluating the results, which makes adapting them to one's needs rather easy. The baseline systems are also a good starting point for entry-level researchers.

In publications using the baseline, cite as:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, TUT database for acoustic scene classification and sound event detection, In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

Python implementation

Latest release (version 1.0.6) (.zip)

Matlab implementation

Latest release (version 1.0.5) (.zip)

Results for TUT Acoustic scenes 2016, development set

Evaluation setup

  • 4-fold cross-validation, average classification accuracy over folds
  • 15 acoustic scene classes
  • Classification unit: one file (30 seconds of audio).

System parameters

  • Frame size: 40 ms (with 50% hop size)
  • Number of Gaussians per acoustic scene class model: 16
  • Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
  • Trained and tested on full audio
  • Python implementation
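To make the framing parameters concrete, the following arithmetic sketch converts them to sample counts. It assumes features are extracted at the 44.1 kHz recording rate and that no padding is applied at the segment edges; the baseline's actual framing may differ, so the frame count is indicative only.

```python
SAMPLE_RATE_HZ = 44100   # recording sampling rate from the dataset description
FRAME_SECONDS = 0.040    # 40 ms frame
HOP_FRACTION = 0.5       # 50% hop size
SEGMENT_SECONDS = 30     # one classification unit

frame_length = round(SAMPLE_RATE_HZ * FRAME_SECONDS)  # samples per frame
hop_length = round(frame_length * HOP_FRACTION)       # samples between frame starts
segment_samples = SAMPLE_RATE_HZ * SEGMENT_SECONDS

# Number of full frames in one segment, assuming no padding.
num_frames = 1 + (segment_samples - frame_length) // hop_length

# Static + delta + acceleration coefficients.
feature_dim = 20 + 20 + 20

print(frame_length, hop_length, num_frames, feature_dim)
```

Under these assumptions a frame spans 1764 samples with a hop of 882 samples, so each 30-second file yields 1499 feature vectors of 60 values.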
Acoustic scene classification results, averaged over evaluation folds.

Acoustic scene | Accuracy
Beach | 69.3 %
Bus | 79.6 %
Cafe / Restaurant | 83.2 %
Car | 87.2 %
City center | 85.5 %
Forest path | 81.0 %
Grocery store | 65.0 %
Home | 82.1 %
Library | 50.4 %
Metro station | 94.7 %
Office | 98.6 %
Park | 13.9 %
Residential area | 77.7 %
Train | 33.6 %
Tram | 85.4 %
Overall accuracy | 72.5 %
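The overall figure is consistent with the unweighted mean of the per-scene accuracies, which is what one would expect when every scene contributes the same number of segments (78 in the development set). The check below uses only the values from the table; how the baseline itself aggregates over folds is not shown here.

```python
# Per-scene accuracies (%) copied from the table above.
class_accuracy = {
    "Beach": 69.3, "Bus": 79.6, "Cafe / Restaurant": 83.2, "Car": 87.2,
    "City center": 85.5, "Forest path": 81.0, "Grocery store": 65.0,
    "Home": 82.1, "Library": 50.4, "Metro station": 94.7, "Office": 98.6,
    "Park": 13.9, "Residential area": 77.7, "Train": 33.6, "Tram": 85.4,
}

# Unweighted mean over the 15 scene classes.
overall = sum(class_accuracy.values()) / len(class_accuracy)
print(f"{overall:.1f} %")  # 72.5 %
```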