Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Supporting web site (last updated 27.03.2019)

Publication info

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Emre Cakir, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen

Transactions on Audio, Speech and Language Processing: Special issue on Sound Scene and Event Analysis, 25(6):1291–1303, June 2017. doi:10.1109/TASLP.2017.2690575


This is the supporting web site for the article "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection".

The evaluations in the paper are conducted with four datasets: TUT-SED Synthetic 2016, TUT-SED 2009, TUT-SED 2016, and CHiME-Home. This page collects basic information about these datasets and provides the information necessary to reproduce the evaluation setups. In addition, download links to the datasets are provided where permitted.


TUT-SED Synthetic 2016

TUT-SED Synthetic 2016 contains mixture signals artificially generated from isolated sound event samples. This approach yields more accurate onset and offset annotations than datasets based on recordings from real acoustic environments, where the annotations are always somewhat subjective. The dataset was created for this publication to serve as the primary evaluation dataset.

Mixture signals in the dataset were created by randomly selecting and mixing together isolated sound events from 16 sound event classes. The resulting mixtures contain sound events with varying degrees of polyphony. Altogether, 994 sound event samples were purchased from Stockmusic. Of the 100 mixtures created, 60% were assigned for training, 20% for testing and 20% for validation. The total amount of audio material in the dataset is 566 minutes.

Different instances of the sound events are used to synthesize the training, validation and test partitions. Each event track was created by randomly selecting an event instance and, from it, a random segment 10–15 seconds in length; between events, silent regions of random length were inserted. Such tracks were created for four to nine event classes and mixed together to form the mixture signal. Because sound events are not consistently active throughout a sample (e.g. footsteps), automatic annotation based on signal energy was applied to obtain accurate event activity within each sample. The annotation of the mixture signal was created by pooling together the event activity annotations of the samples used.
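The generation procedure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual generator used for the dataset; the function names, the frame size of the energy-based annotation, and the silent-gap lengths are assumptions.

```python
import numpy as np

FS = 44100  # assumed sampling rate

def make_track(samples, total_len_s=60.0, seg_s=(10.0, 15.0), rng=None):
    """One event track: random 10-15 s segments from one class's samples,
    separated by random-length silences (sketch, not the exact generator)."""
    rng = rng if rng is not None else np.random.default_rng()
    track = np.zeros(int(total_len_s * FS))
    pos = int(rng.uniform(0, 5) * FS)           # random initial offset
    while pos < len(track):
        sample = samples[rng.integers(len(samples))]
        seg_len = int(rng.uniform(*seg_s) * FS)
        start = rng.integers(0, len(sample) - seg_len) if len(sample) > seg_len else 0
        seg = sample[start:start + seg_len]
        end = min(pos + len(seg), len(track))
        track[pos:end] += seg[:end - pos]
        pos = end + int(rng.uniform(1, 5) * FS)  # random silent gap (assumed 1-5 s)
    return track

def energy_annotation(track, frame=2048, thresh_db=-60.0):
    """Frame-level event activity from signal energy, mimicking the
    automatic energy-based annotation (threshold is an assumption)."""
    n = len(track) // frame
    frames = track[:n * frame].reshape(n, frame)
    e = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return e > (e.max() + thresh_db)

# A mixture would then be the sum of 4-9 such tracks, one per chosen class,
# and the mixture annotation the pooled per-track activity annotations.
```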

In publications using the datasets, cite as:

Emre Cakir, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection, IEEE/ACM Transactions on Audio, Speech and Language Processing, 25(6):1291–1303, June 2017.

Detailed information of the dataset


Dataset access rights

The dataset is intended for academic research only. The content of this dataset is password protected. In order to obtain a user name and password to download this dataset, please read, sign, and send a scanned copy of the End User License Agreement (below) to

Audio 1/5 (1.0 GB)
Audio 2/5 (1.0 GB)
Audio 3/5 (1.1 GB)
Audio 4/5 (1.0 GB)
Audio 5/5 (265 MB)

Acoustic feature extraction parameters:

  • log mel energies
  • 40 mel bands (HTK-style mel equation, 0 Hz–22050 Hz) from magnitude spectra
  • Frame length 40ms, frame hop length 20ms
  • FFT length 1024
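The parameters above can be turned into a feature extractor along these lines. This is a dependency-free sketch under assumed conditions (44.1 kHz audio, Hamming window, a small floor added before the log); it is not the code used in the paper. Note that a 40 ms frame at 44.1 kHz is longer than the stated FFT length of 1024 samples, so this sketch simply truncates each frame; the paper does not state how this was handled.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale, matching the "HTK-style mel equation" above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=1024, fs=44100, fmin=0.0, fmax=22050.0):
    """Triangular mel filterbank on the rfft bins."""
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def log_mel_energies(x, fs=44100, n_mels=40, n_fft=1024,
                     frame_s=0.04, hop_s=0.02):
    """Log mel energies: 40 ms frames, 20 ms hop, FFT length 1024."""
    frame, hop = int(frame_s * fs), int(hop_s * fs)
    fb = mel_filterbank(n_mels, n_fft, fs)
    win = np.hamming(frame)
    feats = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * win
        mag = np.abs(np.fft.rfft(seg, n_fft))   # magnitude spectrum (truncated to n_fft)
        feats.append(np.log(fb @ mag + 1e-10))  # small floor avoids log(0)
    return np.array(feats)
```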

TUT-SED 2009

This dataset consists of 8 to 14 binaural recordings from each of 10 real-life scenes. Over the years it has been used in many publications from our research group, and it is used in the current publication to provide a point of comparison with a large selection of previous methods.

Each recording in the dataset is 10 to 30 minutes long, for a total of 1133 minutes. The 10 scenes are: basketball game, beach, inside a bus, inside a car, hallway, office, restaurant, shop, street and stadium with track and field events. A total of 61 classes were defined, including wind, yelling, car and shoe squeaks, plus one extra class for unknown or rare events. The average number of events active at the same time is 2.53. Event activity annotations were done manually, which introduces a degree of subjectivity. The dataset has a five-fold cross-validation setup with training, validation and test set splits, consisting of approximately 60%, 20% and 20% of the data from each scene, respectively.
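A per-scene five-fold split of this kind can be sketched as follows. This is a hypothetical helper illustrating how the 60/20/20 ratio is maintained within every scene; it is not the official fold definition, and the shuffling and block assignment are assumptions.

```python
import random

def scene_folds(recordings_by_scene, n_folds=5, seed=0):
    """Build train/val/test splits for each fold, keeping roughly
    60%/20%/20% of every scene's recordings in train/val/test."""
    rng = random.Random(seed)
    folds = [{"train": [], "val": [], "test": []} for _ in range(n_folds)]
    for scene, recs in recordings_by_scene.items():
        recs = recs[:]
        rng.shuffle(recs)
        # partition this scene's recordings into n_folds roughly equal blocks
        blocks = [recs[i::n_folds] for i in range(n_folds)]
        for f in range(n_folds):
            folds[f]["test"] += blocks[f]                     # one block (~20%) for testing
            folds[f]["val"] += blocks[(f + 1) % n_folds]      # next block (~20%) for validation
            for b in range(n_folds):
                if b not in (f, (f + 1) % n_folds):
                    folds[f]["train"] += blocks[b]            # remaining ~60% for training
    return folds
```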

Unfortunately, this is a proprietary dataset and we cannot publish it.

TUT-SED 2016

This dataset was published as the development dataset for the DCASE2016 challenge, and it consists of recordings from two real-life scenes: residential area and home [Mesaros2016].

The recordings were each captured in a different location (i.e. different streets, different homes), leading to large variability in the active sound event classes between recordings. For each location, a 3–5 minute binaural audio recording is provided, adding up to 78 minutes of audio. The recordings have been manually annotated. In total, there are seven annotated sound event classes for the residential area recordings and 11 for the home recordings. The four-fold cross-validation setup published along with the dataset is used in the evaluations for this paper. Since this paper concentrates on scene-independent sound event detection, we discard the scene information (contrary to the DCASE2016 challenge setup): instead of training a separate classifier for each scene, we train a single classifier to be used in all scenes. Twenty percent of the training set recordings are assigned for validation during the training of the neural networks.

In publications using the datasets, cite as:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, TUT database for acoustic scene classification and sound event detection, in Proceedings of the 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016. PDF



CHiME-Home

The CHiME-Home dataset was used in the DCASE2016 challenge domestic audio tagging task. In the dataset, the prominent sound sources in 4-second chunks are annotated. The audio was recorded in a home environment, and the annotated sounds are related to human activity: female speech, male speech, child speech, video games, percussive sounds (e.g. crash, bang, knock, or footsteps), broadband noise (e.g. household appliances), and other identifiable sounds. The dataset contains audio at two sampling rates (16 kHz and 48 kHz); in this work, only the 16 kHz signals are used.

In publications using the datasets, cite as:

Foster et al., CHiME-Home: A dataset for sound source recognition in a domestic environment, Proc WASPAA, Oct 2015. PDF



Baseline systems

The paper uses two baselines: a simple GMM-based sound event detection system (the DCASE2016 baseline) and a basic feed-forward neural network.

GMM baseline

The GMM-based baseline system implements a basic approach to sound event detection: MFCCs are used as acoustic features and a GMM as the classifier. The acoustic features include the MFCC static coefficients (0th coefficient excluded), delta coefficients and acceleration coefficients, for a feature vector length of 59. For each event class, a binary classifier is set up: the class model is trained using the audio segments annotated as belonging to the modeled event class, and a negative model is trained using the rest of the audio. The detection decision is based on the likelihood ratio between the positive and negative models for each individual class, computed over a sliding window of one second.
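The likelihood-ratio decision rule can be sketched as follows. This is an illustrative reconstruction, not the baseline's actual code: the scorer assumes already-trained diagonal-covariance GMM parameters (EM fitting is omitted), and the windowed decision uses a simple moving average of per-frame log-likelihood ratios with a zero threshold, which is an assumption.

```python
import numpy as np

class DiagGMM:
    """Minimal diagonal-covariance GMM scorer; parameters are assumed
    to have been trained elsewhere (e.g. by EM)."""
    def __init__(self, weights, means, variances):
        self.w = np.asarray(weights)      # (K,)   component weights
        self.mu = np.asarray(means)       # (K, D) component means
        self.var = np.asarray(variances)  # (K, D) diagonal variances

    def log_likelihood(self, X):
        """Per-frame log p(x) under the mixture; X has shape (T, D)."""
        diff = X[:, None, :] - self.mu[None, :, :]
        comp = -0.5 * (np.sum(diff ** 2 / self.var, axis=2)
                       + np.sum(np.log(2.0 * np.pi * self.var), axis=1))
        return np.logaddexp.reduce(np.log(self.w) + comp, axis=1)

def detect(features, pos_model, neg_model, fs_frames=50):
    """Binary detection for one event class: log-likelihood ratio between
    the positive and negative models, averaged over a one-second sliding
    window (fs_frames frames per second; 50 with a 20 ms hop)."""
    llr = pos_model.log_likelihood(features) - neg_model.log_likelihood(features)
    kernel = np.ones(fs_frames) / fs_frames
    smoothed = np.convolve(llr, kernel, mode="same")
    return smoothed > 0.0   # event active where the positive model wins
```

One such positive/negative model pair is trained per class, so overlapping events are handled naturally: each class's detector fires independently of the others.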