Sounds carry a large amount of information about our everyday environment and physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content, making context-aware mobile devices, robots, cars etc., and intelligent monitoring systems to recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Public evaluations are common in many areas of research, with some challenges being active for many consecutive years. They help push the boundaries of algorithm development to deal with increasingly complex tasks. TRECVID Multimedia Event Detection is one such long-running evaluation, focusing on audiovisual, multimodal event detection in video recordings. Such public evaluations provide a good opportunity for code dissemination and for unifying and defining terms, procedures, benchmark datasets and evaluation metrics. It is our wish to provide a similar tool for computational auditory scene analysis, specifically for detection and classification of sound scenes and events.
The previously organized DCASE2013 challenge (sponsored by the IEEE AASP TC, and held at WASPAA 2013) attracted the interest of the research community and had a good participation rate. It also contributed to creating benchmark datasets and fostered reproducible research (6 out of 18 participating teams had their source code released through the challenge). Based on its success, we propose to organize a follow-up challenge on the performance evaluation of systems for the detection and classification of sound events. This challenge will move the DCASE setup closer to real-world applications by providing more complex problems. This will help define a common ground for researchers actively pursuing research in this field, and offer a reference point for systems developed to perform parts of this task.
|8th February 2016|Publication of development datasets and baseline systems|
|1st June 2016|Release of evaluation datasets|
|20th June 2016|Workshop paper submission|
|7th July 2016|Submission of results and technical report|
|3rd September 2016|Workshop (satellite of EUSIPCO2016)|
Continuing the tasks of the previous DCASE, the proposed tasks for the challenge are acoustic scene classification and sound event detection within a scene.
Acoustic scene classification
The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterizes the environment in which it was recorded, for example "park", "street", "office". The acoustic data will include recordings from 15 contexts, with approximately one hour of data from each context. The setup is similar to the previous DCASE challenge, but with a higher number of classes and greater diversity of data.
Sound event detection in synthetic audio
The goal of sound event detection is to detect the sound events (for example “bird singing”, “car passing by”) that are present within an audio signal, estimate their start and end times, and give a class label to each of the events.
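As an illustration of the expected kind of output, the following minimal sketch turns a per-frame activity sequence for one class into a list of (start time, end time, label) events. The function name `frames_to_events`, the frame-wise representation and the `hop_seconds` parameter are assumptions made for this example only; the challenge does not prescribe how a system arrives at its event list.

```python
def frames_to_events(activity, label, hop_seconds=0.02):
    """Convert a per-frame activity sequence into (onset, offset, label) tuples.

    activity    : sequence of truthy/falsy values, one per analysis frame
    label       : class label attached to every detected event
    hop_seconds : assumed time step between frames (hypothetical default)
    """
    events = []
    start = None  # index of the frame where the current event began
    for i, active in enumerate(activity):
        if active and start is None:
            start = i  # event onset
        elif not active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds, label))
            start = None  # event offset
    if start is not None:
        # The event is still active at the end of the signal.
        events.append((start * hop_seconds, len(activity) * hop_seconds, label))
    return events


if __name__ == "__main__":
    for onset, offset, name in frames_to_events([0, 1, 1, 0, 0, 1], "bird singing", 0.5):
        print(onset, offset, name)
```

Running one such conversion per class and merging the lists gives a polyphonic annotation in which events of different classes may overlap in time.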
The sound event detection challenge will consist of two distinct tasks. This task will focus on event detection of office sounds. Its training material will be provided as isolated sound events for each class, together with synthetic mixtures of the same examples in multiple SNR and event density conditions (sounds have been recorded at IRCCYN, École Centrale de Nantes). The participants will be allowed to use any combination of them for training their system. The test data will consist of synthetic mixtures of (source-independent) sound examples at various SNR levels, event density conditions and polyphony. Thus, the aim of this task is to study the behaviour of tested algorithms when facing different levels of complexity, with the added benefit that the ground truth will be highly accurate, even for polyphonic mixtures.
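To make the idea of a synthetic mixture at a given SNR concrete, here is a minimal sketch that scales an isolated event and adds it to a background recording. It assumes that SNR is measured as the ratio of event RMS to background RMS over the span the event occupies; the challenge data may use a different definition, and the names `rms` and `mix_at_snr` are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(event, background, snr_db, offset=0):
    """Add `event` to `background` at sample `offset`, scaled to `snr_db`.

    Assumption: SNR is the event-to-background RMS ratio (in dB), with the
    background level measured only over the samples the event overlaps.
    Returns the mixture and the gain applied to the event.
    """
    if offset < 0 or offset + len(event) > len(background):
        raise ValueError("event does not fit inside the background")
    segment = background[offset:offset + len(event)]
    event_rms, segment_rms = rms(event), rms(segment)
    if event_rms == 0 or segment_rms == 0:
        raise ValueError("SNR is undefined for a silent event or background")
    gain = segment_rms / event_rms * 10 ** (snr_db / 20)
    mixture = list(background)
    for i, sample in enumerate(event):
        mixture[offset + i] += gain * sample
    return mixture, gain


if __name__ == "__main__":
    background = [0.1, -0.1] * 8
    event = [1.0, -1.0, 1.0, -1.0]
    mixture, gain = mix_at_snr(event, background, snr_db=6, offset=4)
    print(len(mixture), round(gain, 4))
```

Because the event's position and gain are chosen by the mixing procedure, its onset and offset are known exactly, which is why ground truth for synthetic mixtures can be precise even when events overlap.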
Sound event detection in real life audio
The third task will use training and testing material recorded in real-life environments. This task evaluates the performance of sound event detection systems in multisource conditions similar to our everyday life, where sound sources are rarely heard in isolation. In this case, there is no control over the number of overlapping sound events at any given time, neither in the training nor in the testing audio data. The annotations of event activities are done manually, and can therefore be somewhat subjective.
Domestic audio tagging
This task will use binaural audio recordings made in a domestic environment. Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house. The audio data is provided as four-second chunks. The objective of the task is to assign each chunk one or more labels from a predefined set, such as Child speech, Adult male speech and/or Video game/TV.
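Since a chunk may carry several labels at once, tagging is a multi-label problem. The sketch below shows one common way to represent such annotations as a binary vector. The `LABELS` list contains only the three example labels named above, not the full label set of the task, and the function names are hypothetical.

```python
# Only the three labels mentioned in the task description; the real set is larger.
LABELS = ["Child speech", "Adult male speech", "Video game/TV"]


def encode_tags(tags, labels=LABELS):
    """Return a 0/1 vector with one entry per label, 1 where the tag is present."""
    unknown = set(tags) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in tags else 0 for label in labels]


def decode_tags(vector, labels=LABELS):
    """Inverse of encode_tags: list the labels whose entry is set."""
    if len(vector) != len(labels):
        raise ValueError("vector length does not match the label set")
    return [label for label, flag in zip(labels, vector) if flag]


if __name__ == "__main__":
    chunk_tags = {"Child speech", "Video game/TV"}
    vector = encode_tags(chunk_tags)
    print(vector)
    print(decode_tags(vector))
```

With this representation each label can be predicted independently, so a chunk containing both a child speaking and a television playing sets two entries rather than forcing a single choice.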