This task will focus on event detection of overlapping office sounds in synthetic mixtures. By using synthetic mixtures in testing, the task will study the behaviour of the tested algorithms when facing different levels of complexity (noise, polyphony), with the added benefit of a very accurate ground truth.
Training material for this task consists of isolated sound events for each class and synthetic mixtures of the same examples in multiple SNR and event density conditions. The participants are allowed to use any combination of them for training their system. The test data will consist of synthetic mixtures of (source-independent) sound examples at various SNR levels, event density conditions and polyphony.
The provided sound event categories (11 in total) are:
- Clearing throat
- Door knock
- Door slam
- Human laughter
- Keys (put on table)
- Page turning
- Phone ringing
There will be 20 samples provided for each sound event class in the training set, plus a development set consisting of 18 minutes of synthetic mixture material in 2-minute audio files. The test set will be provided close to the challenge deadline.
Recording and annotation procedure
Audio is provided by IRCCYN, École Centrale de Nantes. The material was recorded in a calm environment using an AT8035 shotgun microphone connected to a Zoom H4n recorder. Audio files are sampled at 44.1 kHz and are monophonic. Parameters controlling the synthesized material include the event-to-background ratio (EBR; -6, 0, or 6 dB), the presence or absence of overlapping events (monophonic vs. polyphonic scenes), and the number of events per class. Isolated examples in the training set are annotated with start time, end time and event label for all sound events, while annotations for the synthetic mixtures are produced automatically by the event sequence synthesizer.
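As an illustration of the event-to-background ratio, the sketch below scales an event signal so that its RMS level sits a given number of dB above (or below) the background. This is a hypothetical helper (`scale_event_to_ebr` is not from the official synthesizer, whose exact EBR definition may differ):

```python
import numpy as np

def scale_event_to_ebr(event, background, ebr_db):
    """Scale `event` so its RMS energy is `ebr_db` decibels above
    the RMS of `background`. Illustrative only; the DCASE event
    sequence synthesizer may define EBR differently."""
    rms_event = np.sqrt(np.mean(event ** 2))
    rms_background = np.sqrt(np.mean(background ** 2))
    target_rms = rms_background * 10.0 ** (ebr_db / 20.0)
    return event * (target_rms / rms_event)
```

With `ebr_db=0` the event is matched to the background level; `ebr_db=-6` roughly halves its amplitude relative to the background.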
Task 2 consists of two public subsets: a training dataset and a development dataset. The training dataset consists of 20 isolated sound segments per event class. The development dataset consists of 18 two-minute recordings in various noise and event density conditions (see the README.txt file in the dataset folder for more details).
Participants are not allowed to use external data for system development. Manipulation of provided data is allowed. Participants are allowed to use any combination of the training and development datasets for training their systems.
The test dataset, without ground truth annotations, will be released shortly before the submission deadline. Full ground truth metadata for it will be published after the DCASE 2016 challenge.
Detailed information on the challenge submission can be found on the submission page. Participants should submit a single .txt file per evaluated audio recording. The output file should contain a list of detected events, each specified by its onset, offset and event ID, separated by tabs. Format:
[event onset in seconds (float)][tab][event offset in seconds (float)][tab][event ID (string)]
1.387392290	3.262403627	pageturn
5.073560090	5.793378684	knock
...
There should be no additional tab characters anywhere, and there should be no whitespace added after the label, just the newline.
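A minimal Python sketch of writing this output format; the helper name and filename are illustrative, not part of the challenge kit:

```python
def write_detections(path, events):
    """Write detected events, one per line, as
    onset<TAB>offset<TAB>event_id with no trailing whitespace."""
    with open(path, "w") as f:
        for onset, offset, event_id in events:
            f.write(f"{onset:.9f}\t{offset:.9f}\t{event_id}\n")

# Hypothetical output file for one evaluated recording.
write_detections("mixture_01.txt",
                 [(1.387392290, 3.262403627, "pageturn"),
                  (5.073560090, 5.793378684, "knock")])
```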
The 11 event IDs to be used for the .txt output are:
Tasks 2 and 3 will use the same metrics. The main metric for the challenge will be the total error rate (ER), evaluated in one-second segments over the entire test set; the ranking of submitted systems will be based on this metric. The onset-only event-based F-measure (with 200 ms tolerance) will be used as an additional metric.
A detailed description of the metrics used can be found here.
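A simplified Python sketch of the segment-based ER computation. This is a hypothetical helper, not the official implementation (which ships with the baseline); for brevity it ignores per-class event multiplicity within a segment:

```python
import math

def segment_error_rate(ref, est, segment=1.0):
    """Segment-based total error rate: compare event activity in
    fixed-length segments, accumulate substitutions (S), deletions (D)
    and insertions (I), and normalize by the number of active reference
    events (N). ref/est: lists of (onset, offset, event_id) tuples."""
    end = max((off for _, off, _ in ref + est), default=0.0)
    S = D = I = N = 0
    for k in range(int(math.ceil(end / segment))):
        t0, t1 = k * segment, (k + 1) * segment
        # Event classes active at any point within this segment.
        r = {eid for on, off, eid in ref if on < t1 and off > t0}
        e = {eid for on, off, eid in est if on < t1 and off > t0}
        fn, fp = len(r - e), len(e - r)
        S += min(fn, fp)            # matched miss/false-alarm pairs
        D += max(0, fn - fp)        # remaining misses
        I += max(0, fp - fn)        # remaining false alarms
        N += len(r)
    return (S + D + I) / N if N else 0.0
```

A perfect system yields ER = 0; outputting nothing yields ER = 1 (all deletions), and ER can exceed 1 when insertions dominate.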
Code for evaluation is available with the baseline system.
| Submission | System | Author | Affiliation | Technical report | ER | F (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Choi_task2_1 | Choi | Inkyu Choi | Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea | task-results-sound-event-detection-in-synthetic-audio#Choi2016 | 0.3660 | 78.7 |
| DCASE2016 baseline | DCASE2016_Baseline | Emmanouil Benetos | Queen Mary University of London, London, United Kingdom | task-results-sound-event-detection-in-synthetic-audio#Benetos2016 | 0.8933 | 37.0 |
| Giannoulis_task2_1 | Giannoulis | Panagiotis Giannoulis | School of ECE, National Technical University of Athens, Athens, Greece; Athena Research and Innovation Center, Maroussi, Greece | task-results-sound-event-detection-in-synthetic-audio#Giannoulis2016 | 0.6774 | 55.8 |
| Gutierrez_task2_1 | Gutierrez | J.M. Gutiérrez-Arriola | Escuela Técnica Superior de Ingeniería y Sistemas de Telecomunicacíon, Universidad Politécnica de Madrid, Madrid, Spain | task-results-sound-event-detection-in-synthetic-audio#Gutirrez-Arriola2016 | 2.0870 | 25.0 |
| Hayashi_task2_1 | BLSTM-PP | Tomoki Hayashi | Nagoya University, Nagoya, Japan | task-results-sound-event-detection-in-synthetic-audio#Hayashi2016 | 0.4082 | 78.1 |
| Hayashi_task2_2 | BLSTM-HMM | Tomoki Hayashi | Nagoya University, Nagoya, Japan | task-results-sound-event-detection-in-synthetic-audio#Hayashi2016 | 0.4958 | 76.0 |
| Komatsu_task2_1 | Komatsu | Tatsuya Komatsu | Data Science Research Laboratories, NEC Corporation, Kawasaki, Japan | task-results-sound-event-detection-in-synthetic-audio#Komatsu2016 | 0.3307 | 80.2 |
| Kong_task2_1 | Kong | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | task-results-sound-event-detection-in-synthetic-audio#Kong2016 | 3.5464 | 12.6 |
| Phan_task2_1 | Phan | Huy Phan | Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany | task-results-sound-event-detection-in-synthetic-audio#Phan2016 | 0.5901 | 64.8 |
| Pikrakis_task2_1 | Pikrakis | Aggelos Pikrakis | Department of Informatics, University of Piraeus, Piraeus, Greece | task-results-sound-event-detection-in-synthetic-audio#Pikrakis2016 | 0.7499 | 37.4 |
| Vu_task2_1 | Vu | Toan H. Vu | Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-results-sound-event-detection-in-synthetic-audio#Vu2016 | 0.8979 | 52.8 |
Complete results and technical reports can be found here.
- Only the provided development dataset can be used to train the submitted system.
- The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); the use of real recordings is forbidden.
- The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
- A technical report with a sufficient description of the system has to be submitted along with the system outputs.
A baseline system for the task is provided. The system implements a basic approach for detecting overlapping acoustic events and provides a comparison point for participants while developing their systems.
The baseline system is based on supervised non-negative matrix factorization (NMF), and uses a dictionary of spectral templates for performing detection, which is extracted during the training phase. The output of the NMF system is a non-binary matrix denoting event activation, which is post-processed into a list of detected events.
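The activation-estimation step can be sketched as follows. Note the actual baseline is implemented in Matlab; this is an illustrative Python version with a hypothetical helper name, using the standard multiplicative beta-divergence update with a fixed dictionary:

```python
import numpy as np

def nmf_activations(V, W, beta=0.6, n_iter=30, eps=1e-12):
    """Estimate event activations H for a spectrogram V (freq x time)
    given a fixed dictionary W (freq x events) of spectral templates.
    Only H is updated (supervised NMF); beta=0.6 follows the baseline
    settings."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Multiplicative update minimizing the beta-divergence D_beta(V | WH).
        H *= (W.T @ (V * WH ** (beta - 2))) / (W.T @ WH ** (beta - 1) + eps)
    return H
```

The returned H is the non-binary activation matrix that the post-processing stage turns into a list of detected events.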
The baseline system also provides a reference implementation of the evaluation metrics (provided by Toni Heittola). The baseline system is provided for Matlab.
Baseline results for development set
- Input: variable-Q transform spectrogram (60 bins/octave, 10ms step)
- NMF with beta-divergence (30 iterations, beta=0.6, activation threshold=1.0)
- Postprocessing: 90ms median filter span, up to 5 concurrent events, 60ms minimum event duration
| Segment-based ER | Segment-based F | Event-based F (onset-only) |
| --- | --- | --- |
| 0.7859 | 41.6 % | 30.3 % |
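The post-processing settings listed above (10 ms step, 90 ms median filter span, 60 ms minimum event duration) can be sketched as follows; `activations_to_events` is a hypothetical Python helper, not the baseline's Matlab code:

```python
import numpy as np
from scipy.signal import medfilt

def activations_to_events(h, event_id, threshold=1.0, step=0.01,
                          median_span=0.09, min_duration=0.06):
    """Turn one activation row h (one value per 10 ms frame) into a list
    of (onset, offset, event_id): binarize at `threshold`, smooth with a
    median filter of `median_span` seconds, drop events shorter than
    `min_duration` seconds."""
    kernel = max(1, int(round(median_span / step)) // 2 * 2 + 1)  # odd size
    active = medfilt((h > threshold).astype(float), kernel) > 0.5
    events, onset = [], None
    # Append a final inactive frame so a run reaching the end is closed.
    for i, a in enumerate(np.append(active, False)):
        if a and onset is None:
            onset = i
        elif not a and onset is not None:
            if (i - onset) * step >= min_duration:
                events.append((onset * step, i * step, event_id))
            onset = None
    return events
```

The median filter suppresses isolated spurious frames, while the minimum-duration check removes remaining short blips; running this per activation row and concatenating the results yields the final event list.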