Sound event detection
in synthetic audio

Task 2


Emmanouil Benetos
Mathieu Lagrange
Grégoire Lafay


This task focuses on event detection of overlapping office sounds in synthetic mixtures. By using synthetic mixtures in testing, the task studies the behaviour of tested algorithms when facing different levels of complexity (noise, polyphony), with the added benefit of a very accurate ground truth.

Figure 1: Overview of sound event detection system.

Audio dataset

Training material for this task consists of isolated sound events for each class and synthetic mixtures of the same examples in multiple SNR and event density conditions. The participants are allowed to use any combination of them for training their system. The test data will consist of synthetic mixtures of (source-independent) sound examples at various SNR levels, event density conditions and polyphony.

The 11 provided sound event categories are:

  • Clearing throat
  • Coughing
  • Door knock
  • Door slam
  • Drawer
  • Human laughter
  • Keyboard
  • Keys (put on table)
  • Page turning
  • Phone ringing
  • Speech

There will be 20 samples provided for each sound event class in the training set, plus a development set consisting of 18 minutes of synthetic mixture material in 2-minute audio files. The test set will be provided close to the challenge deadline.

Recording and annotation procedure

Audio is provided by IRCCYN, École Centrale de Nantes. The material was recorded in a calm environment, using an AT8035 shotgun microphone connected to a ZOOM H4n recorder. Audio files are sampled at 44.1 kHz and are monophonic. Parameters controlling the synthesized material include the event-to-background ratio (EBR, with values -6, 0 and 6 dB), the presence or absence of overlapping events (monophonic/polyphonic scene), and the number of events per class. Isolated examples in the training set are annotated with start time, end time and event label for all sound events, while for the synthetic mixtures annotations are produced automatically by the event sequence synthesizer.
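The exact mixing procedure is defined by the event sequence synthesizer; as a rough illustration of how a sound event could be scaled to a target EBR before being added to a background recording, here is a minimal sketch. The RMS-based definition of EBR used below is an assumption for illustration, not the synthesizer's documented behaviour:

```python
import numpy as np

def mix_at_ebr(event, background, ebr_db):
    """Scale `event` so that its RMS level sits `ebr_db` dB relative to
    the background RMS, then add it at the start of the background.

    Simplified illustration only; the actual DCASE synthesizer may
    define EBR differently (e.g. over event-active regions only).
    """
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    gain = 10 ** (ebr_db / 20) * rms(background) / rms(event)
    mixture = background.copy()
    mixture[:len(event)] += gain * event
    return mixture
```

At an EBR of 0 dB, the scaled event and the background have equal RMS; at -6 dB the event sits roughly at half the background's amplitude.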

Challenge setup

Task 2 consists of two public subsets: a training dataset and a development dataset. The training dataset consists of 20 isolated sound segments per event class. The development dataset consists of eighteen 2-minute recordings, in various noise and event density conditions (see the README.txt file in the dataset folder for more details).

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed. Participants are allowed to use any combination of the training and development datasets for training their systems.


Test dataset

The test dataset without ground truth annotations will be released shortly before the submission deadline. Full ground truth metadata for it will be published after the DCASE 2016 challenge.


Detailed information about the challenge submission can be found on the submission page. Participants should submit a single .txt file per evaluated audio recording. The output file should contain a list of detected events, each specified by the onset, offset and event ID, separated by tabs. Format:

[event onset in seconds (float)][tab][event offset in seconds (float)][tab][event ID (string)]

Example file

1.387392290    3.262403627    pageturn
5.073560090    5.793378684    knock

There should be no additional tab characters anywhere, and there should be no whitespace added after the label, just the newline. The 11 event IDs to be used for the .txt output are: clearthroat, cough, doorslam, drawer, keyboard, keys, knock, laughter, pageturn, phone, speech.
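As an illustration, a small helper for writing detections in this format might look as follows (the nine-decimal formatting simply mirrors the example above and is an assumption, not a stated requirement):

```python
def write_submission(events, path):
    """Write detected events to a DCASE Task 2 output file.

    `events` is a list of (onset_s, offset_s, event_id) tuples. Each
    event becomes one tab-separated line, with no extra whitespace
    after the label, just the newline.
    """
    with open(path, "w") as f:
        for onset, offset, event_id in events:
            f.write(f"{onset:.9f}\t{offset:.9f}\t{event_id}\n")
```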


Tasks 2 and 3 will use the same metrics. The main metric for the challenge is the total error rate (ER), evaluated in one-second segments over the entire test set; ranking of submitted systems will be done using this metric. The onset-only event-based F-measure (with 200 ms tolerance) will be used as an additional metric.

Detailed description of metrics used can be found here.
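The official implementations are the Matlab classes listed below. As an informal sketch of how a segment-based error rate is typically computed (counting substitutions, deletions and insertions per one-second segment and normalising by the number of reference events), consider:

```python
def segment_based_er(ref_segments, sys_segments):
    """Segment-based total error rate, following the usual DCASE
    convention. `ref_segments` and `sys_segments` are parallel lists
    of sets of event labels active in each one-second segment.

    Sketch only; the reference implementation is the Matlab class
    provided with the baseline system.
    """
    S = D = I = N = 0
    for ref, sys in zip(ref_segments, sys_segments):
        fn = len(ref - sys)   # reference events the system missed
        fp = len(sys - ref)   # system events with no reference match
        S += min(fn, fp)      # substitutions: paired miss + false alarm
        D += max(0, fn - fp)  # remaining deletions
        I += max(0, fp - fn)  # remaining insertions
        N += len(ref)
    return (S + D + I) / N if N else 0.0
```

Note that ER can exceed 1.0 when a system produces many insertions, as seen in some of the results below.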

Code for evaluation is available with the baseline system, in the following classes:

  • metrics/DCASE2016_EventDetection_SegmentBasedMetrics.m
  • metrics/DCASE2016_EventDetection_EventBasedMetrics.m


Submission code | System name | Corresponding author | Affiliation | Technical report | Segment-based ER (overall) | Segment-based F1 (%, overall)
Choi_task2_1 | Choi | Inkyu Choi | Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea | task-results-sound-event-detection-in-synthetic-audio#Choi2016 | 0.3660 | 78.7
DCASE2016 baseline | DCASE2016_Baseline | Emmanouil Benetos | Queen Mary University of London, London, United Kingdom | task-results-sound-event-detection-in-synthetic-audio#Benetos2016 | 0.8933 | 37.0
Giannoulis_task2_1 | Giannoulis | Panagiotis Giannoulis | School of ECE, National Technical University of Athens, Athens, Greece; Athena Research and Innovation Center, Maroussi, Greece | task-results-sound-event-detection-in-synthetic-audio#Giannoulis2016 | 0.6774 | 55.8
Gutierrez_task2_1 | Gutierrez | J.M. Gutiérrez-Arriola | Escuela Técnica Superior de Ingeniería y Sistemas de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain | task-results-sound-event-detection-in-synthetic-audio#Gutirrez-Arriola2016 | 2.0870 | 25.0
Hayashi_task2_1 | BLSTM-PP | Tomoki Hayashi | Nagoya University, Nagoya, Japan | task-results-sound-event-detection-in-synthetic-audio#Hayashi2016 | 0.4082 | 78.1
Hayashi_task2_2 | BLSTM-HMM | Tomoki Hayashi | Nagoya University, Nagoya, Japan | task-results-sound-event-detection-in-synthetic-audio#Hayashi2016 | 0.4958 | 76.0
Komatsu_task2_1 | Komatsu | Tatsuya Komatsu | Data Science Research Laboratories, NEC Corporation, Kawasaki, Japan | task-results-sound-event-detection-in-synthetic-audio#Komatsu2016 | 0.3307 | 80.2
Kong_task2_1 | Kong | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | task-results-sound-event-detection-in-synthetic-audio#Kong2016 | 3.5464 | 12.6
Phan_task2_1 | Phan | Huy Phan | Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany | task-results-sound-event-detection-in-synthetic-audio#Phan2016 | 0.5901 | 64.8
Pikrakis_task2_1 | Pikrakis | Aggelos Pikrakis | Department of Informatics, University of Piraeus, Piraeus, Greece | task-results-sound-event-detection-in-synthetic-audio#Pikrakis2016 | 0.7499 | 37.4
Vu_task2_1 | Vu | Toan H. Vu | Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-results-sound-event-detection-in-synthetic-audio#Vu2016 | 0.8979 | 52.8

Complete results and technical reports can be found here.


  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a pdf; use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • Technical report with sufficient description of the system has to be submitted along with the system outputs.

More information is available on the submission process page and in the Frequently Asked Questions.

Baseline system

A baseline system for the task is provided. The system implements a basic approach for detecting overlapping acoustic events, and provides a comparison point for participants while developing their systems.

The baseline system is based on supervised non-negative matrix factorization (NMF), and uses a dictionary of spectral templates for performing detection, which is extracted during the training phase. The output of the NMF system is a non-binary matrix denoting event activation, which is post-processed into a list of detected events.
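As a rough sketch of this pipeline, not the baseline's actual implementation: the update rule below uses the simpler Euclidean cost rather than the baseline's beta-divergence with beta=0.6, and the dictionary W of per-class spectral templates is assumed to be already learned from the training set:

```python
import numpy as np

def supervised_nmf_activations(V, W, n_iter=30, eps=1e-12):
    """Estimate an event activation matrix H for spectrogram V, with
    the dictionary W held fixed, via multiplicative updates on the
    Euclidean cost (a simplification of the baseline's beta=0.6)."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + 0.1  # positive init
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def activations_to_events(H, threshold, hop_s, labels):
    """Binarise H and turn each contiguous active run per class into
    an (onset_s, offset_s, label) event."""
    events = []
    for k, row in enumerate(H >= threshold):
        start = None
        for t, active in enumerate(np.append(row, False)):
            if active and start is None:
                start = t
            elif not active and start is not None:
                events.append((start * hop_s, t * hop_s, labels[k]))
                start = None
    return events
```

Because several rows of H can be active in the same frame, this naturally produces overlapping (polyphonic) detections.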

The baseline system also provides a reference implementation of the evaluation metrics (contributed by Toni Heittola). The baseline system is provided for Matlab.

Matlab implementation

Latest release (version 1.0.2)

Baseline results for development set

System parameters

  • Input: variable-Q transform spectrogram (60 bins/octave, 10ms step)
  • NMF with beta-divergence (30 iterations, beta=0.6, activation threshold=1.0)
  • Postprocessing: 90ms median filter span, up to 5 concurrent events, 60ms minimum event duration
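The postprocessing stage above can be sketched as follows for a single class's binarised activation track. This is a simplified approximation of the baseline's behaviour using the stated parameters (90 ms median span, 60 ms minimum duration at a 10 ms step), not its actual code:

```python
import numpy as np

def postprocess(active, step_s=0.01, median_span_s=0.09, min_dur_s=0.06):
    """Median-filter a boolean activation sequence, then drop active
    runs shorter than the minimum event duration."""
    k = max(1, int(round(median_span_s / step_s)) | 1)  # odd window length
    pad = k // 2
    padded = np.pad(active.astype(int), pad)
    smoothed = np.array(
        [np.median(padded[i:i + k]) for i in range(len(active))]
    ) >= 0.5
    min_frames = int(round(min_dur_s / step_s))
    out = smoothed.copy()
    start = None
    for t, a in enumerate(np.append(smoothed, False)):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start < min_frames:  # too short: discard the run
                out[start:t] = False
            start = None
    return out
```

The median filter removes isolated single-frame detections, and the duration check removes any surviving events shorter than 60 ms.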
Sound event detection results:

Segment-based overall ER: 0.7859
Segment-based overall F-score: 41.6 %
Event-based overall F-score (onset-only): 30.3 %