This task is based on audio recordings made in a domestic environment. The objective of the task is to perform multi-label classification on 4-second audio chunks (i.e. assign zero or more labels to each 4-second audio chunk). We motivate this task for applications such as human activity monitoring, where identifying the precise boundaries of acoustic events is of secondary importance compared to determining the presence of events in the acoustic scene. Furthermore, when obtaining annotations for this task we observed that manually tagging audio chunks was much less time-consuming than manually locating event boundaries within recordings. We believe our chosen approach carries substantial potential for reducing the time cost, and thus improving the tractability, of obtaining manual annotations for large audio databases.
Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house [Christensen2010]. The audio data are provided as 4-second chunks at two sampling rates (48kHz and 16kHz) with the 48kHz data in stereo and with the 16kHz data in mono. The 16kHz recordings were obtained by downsampling the right-hand channel of the 48kHz recordings. Each audio file corresponds to a single chunk.
All available audio data may be used for system development; however, the evaluation will be performed using the monophonic audio data sampled at 16kHz, with the aim of approximating typical recording capabilities of commodity hardware.
Out of 6137 chunks, 4378 chunks are available for system development, based on partitioning at the level of 5-minute recording segments.
PLEASE NOTE: If you downloaded the CHiME-Home dataset prior to the release of this information, please make sure to use the latest version of the development dataset, which includes monophonic audio data sampled at 16kHz.
The annotations are based on a set of 7 label classes, listed in Table 1. For each chunk, multi-label annotations were first obtained for each of 3 annotators. A detailed description of the annotation procedure is provided in [Foster2015]. In this task, the evaluation is based on those chunks where 2 or more annotators agreed about label presence across label classes. There are 1946 such 'strong agreement' chunks in the development dataset, and 816 such 'strong agreement' chunks in the evaluation dataset. Based on a majority vote, annotations are combined across annotators to form a single multi-label annotation (referred to as CHiME-Home-refine in [Foster2015]).
|Label||Description||Number of occurrences (development dataset strong agreement chunks)|
|m||Adult male speech||174|
|f||Adult female speech||409|
|v||Video game / TV||1181|
|p||Percussive sounds, e.g. crash, bang, knock, footsteps||765|
|b||Broadband noise, e.g. household appliances||19|
|o||Other identifiable sounds||361|
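The majority vote used to combine the three annotators' labels can be illustrated with a minimal sketch. It assumes each annotator's annotation is a string of single-character labels (for example "fv"); that representation and the function name `majority_vote` are assumptions made for illustration, not the dataset's documented format. The sketch covers only the per-label vote, not the selection of 'strong agreement' chunks.

```python
from collections import Counter


def majority_vote(annotations, min_votes=2):
    """Combine per-annotator label strings into one multi-label annotation.

    annotations: one label string per annotator, e.g. ["fv", "v", "vp"]
    (hypothetical representation). A label is kept when at least
    `min_votes` annotators marked it as present.
    """
    # set(ann) guards against a label being repeated within one annotation.
    votes = Counter(label for ann in annotations for label in set(ann))
    return "".join(sorted(label for label, n in votes.items() if n >= min_votes))


if __name__ == "__main__":
    print(majority_vote(["fv", "v", "vp"]))
```

With three annotators and `min_votes=2`, a label survives only if a majority marked it, so a chunk can end up with an empty annotation.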
For the 1946 'strong agreement' chunks in the development dataset, label occurrences are summarised in Table 1. These chunks have been partitioned at the level of 5-minute recording segments for 5-fold cross validation (please refer to the file development_chunks_refined_crossval_dcase2016.csv in the updated version of the CHiME-Home dataset). In the partition, the 5-minute recording constraint was omitted for chunks labelled 'b', owing to the low number of associated label occurrences. While not used for evaluation, the remaining 2432 chunks in the development dataset may be used to train models, for example for unsupervised learning.
Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.
For each chunk, output a classification score for each of the 7 label classes listed in Table 1.
Label prediction performance is quantified using the equal error rate (EER), which is defined as the fixed point of the graph of false negative rate versus false positive rate [Murphy2012, p. 181], for which Python and Matlab implementations are provided. The EER is computed individually for each label. When using the development data, EERs are computed individually for each cross-validation fold, before averaging the obtained EERs across folds.
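As an illustration of the metric, the following is a minimal sketch of an EER computation for one label, followed by averaging across folds. It is not the provided Python or Matlab implementation: it sweeps every distinct score as a threshold and takes the point where the false positive and false negative rates are closest, without interpolation, so its values may differ slightly from the official ones. The function names are hypothetical.

```python
def equal_error_rate(y_true, scores):
    """Approximate EER for one label from binary ground truth and scores."""
    pos = [s for s, y in zip(scores, y_true) if y]
    neg = [s for s, y in zip(scores, y_true) if not y]
    if not pos or not neg:
        raise ValueError("EER needs at least one positive and one negative example")

    # Candidate thresholds from strictest (nothing predicted present)
    # to most lenient (everything predicted present).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)

    best_gap, best_eer = None, None
    for t in thresholds:
        fpr = sum(s >= t for s in neg) / len(neg)  # false positive rate
        fnr = sum(s < t for s in pos) / len(pos)   # false negative rate
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer


def mean_eer_across_folds(folds):
    """Average per-fold EERs; folds is a list of (y_true, scores) pairs."""
    eers = [equal_error_rate(y, s) for y, s in folds]
    return sum(eers) / len(eers)


if __name__ == "__main__":
    fold_a = ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
    fold_b = ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    print(mean_eer_across_folds([fold_a, fold_b]))
```

A label with no positive or no negative examples in a fold has no defined EER, which is why the sketch raises an error in that case rather than returning a number.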
|Submission||Author||Affiliation||Technical report||Equal Error Rate|
|Cakir_task4_1||Emre Cakir||Tampere University of Technology, Tampere, Finland||task-results-audio-tagging#Cakir2016||16.8|
|DCASE2016 baseline||Peter Foster||Queen Mary University of London, London, United Kingdom||task-results-audio-tagging#Foster2016||20.9|
|Hertel_task4_1||Lars Hertel||Institute for Signal Processing, University of Luebeck, Luebeck, Germany||task-results-audio-tagging#Hertel2016||22.1|
|Kong_task4_1||Qiuqiang Kong||Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom||task-results-audio-tagging#Kong2016||18.9|
|Lidy_task4_1||Thomas Lidy||Institute of Software Technology, Vienna University of Technology, Vienna, Austria||task-results-audio-tagging#Lidy2016||16.6|
|Vu_task4_1||Toan H. Vu||Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan||task-results-audio-tagging#Vu2016||21.1|
|Xu_task4_1||Yong Xu||Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom||task-results-audio-tagging#Xu2016||19.5|
|Xu_task4_2||Yong Xu||Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom||task-results-audio-tagging#Xu2016||19.8|
|Yun_task4_1||Sungrack Yun||Qualcomm Research, Seoul, South Korea||task-results-audio-tagging#Yun2016||17.4|
Complete results and technical reports can be found here.
- Only the provided development dataset can be used to train the submitted system.
- The development dataset can be augmented only by mixing in data sampled from a probability density function (pdf); use of real recordings is forbidden.
- The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
- A technical report with a sufficient description of the system has to be submitted along with the system outputs.
A baseline system implemented in Python using MFCCs as features performs multi-label classification by associating a binary classifier with each label class. Classification scores are obtained as log-likelihood ratios using a pair of GMMs, respectively associated with label presence/absence.
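The scoring idea behind such a baseline can be sketched as follows. This is not the baseline code itself: it assumes scikit-learn's `GaussianMixture` with diagonal covariances, uses synthetic 14-dimensional frames in place of real MFCCs, and averages the per-frame log-likelihoods over a chunk. The helper names `train_tag_model` and `chunk_score` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def train_tag_model(frames_present, frames_absent, n_components=8, seed=0):
    """Fit one GMM on frames where the label is present, one where it is absent."""
    present = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=seed).fit(frames_present)
    absent = GaussianMixture(n_components=n_components, covariance_type="diag",
                             random_state=seed).fit(frames_absent)
    return present, absent


def chunk_score(frames, model):
    """Log-likelihood ratio for one chunk (mean per-frame log-likelihoods)."""
    present, absent = model
    # GaussianMixture.score returns the average log-likelihood per sample.
    return present.score(frames) - absent.score(frames)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for 14-dimensional MFCC frames (assumption).
    frames_present = rng.normal(loc=1.0, size=(400, 14))
    frames_absent = rng.normal(loc=-1.0, size=(400, 14))
    model = train_tag_model(frames_present, frames_absent)
    print(chunk_score(rng.normal(loc=1.0, size=(50, 14)), model))
```

One such pair of models would be trained per label class, giving one score per (chunk, label) combination; a positive score favours label presence.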
PLEASE NOTE: The provided baseline system should attempt to download the dataset by default, prior to training and testing models using the provided cross-validation partition. The relevant script for invoking this procedure is
Results for CHiME-Home, development set
- 5-fold cross-validation
- 7 classes
- Average EER across folds
- Frame size: 20 ms (50% hop size)
- Number of Gaussians per audio tag model: 8
- Features: 14 MFCC static coefficients (excluding 0th)
|adult female speech||0.29|
|adult male speech||0.30|
Classification scores should be output to a text file containing the score associated with each (chunk, label) combination. Each line should contain three comma-separated fields: a file name identifying a chunk, a single character indicating the label, and the classification score, e.g. file1.wav,f,0.8751. (Since there are 7 label classes, there should be 7 lines output for each chunk.) The file should be ASCII-formatted and lines should be terminated by the newline (\n) character. Relative to the baseline script location, example output may be found at
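The output format described above can be produced with a short routine such as the following sketch. The nested-dictionary layout of `scores`, the four-decimal precision, and the function name `write_scores` are assumptions made for illustration; only the line format (file name, label character, score, comma-separated, newline-terminated) comes from the description.

```python
import io


def write_scores(scores, stream):
    """Write one 'filename,label,score' line per (chunk, label) combination.

    scores: dict mapping chunk file name -> dict mapping label -> score
    (hypothetical in-memory layout).
    """
    for chunk in sorted(scores):
        for label in sorted(scores[chunk]):
            stream.write(f"{chunk},{label},{scores[chunk][label]:.4f}\n")


if __name__ == "__main__":
    example = {"file1.wav": {"f": 0.8751, "m": 0.1}}
    buffer = io.StringIO()
    write_scores(example, buffer)
    print(buffer.getvalue(), end="")
    # To write a real file with ASCII encoding and "\n" line endings:
    # with open("scores.txt", "w", encoding="ascii", newline="\n") as fh:
    #     write_scores(example, fh)
```

Passing `newline="\n"` when opening the file keeps line endings as a bare newline on every platform.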
Detailed information on the challenge submission can be found on the submission page.
[Christensen2010] Christensen et al., "The CHiME corpus: a resource and a challenge for computational hearing in multisource environments", Proc INTERSPEECH, pp. 1918-1921, 2010.
[Foster2015] Foster et al., "CHiME-Home: A dataset for sound source recognition in a domestic environment", Proc WASPAA, Oct 2015.
[Murphy2012] Murphy, K. P., "Machine Learning: A Probabilistic Perspective", MIT Press, 2012.