Domestic audio tagging

Task 4


Peter Foster
Mark D. Plumbley


This task is based on audio recordings made in a domestic environment. The objective is to perform multi-label classification on 4-second audio chunks, i.e. to assign zero or more labels to each chunk. We motivate this task by applications such as human activity monitoring, where identifying the precise boundaries of acoustic events is less important than determining which events are present in the acoustic scene. Furthermore, when obtaining annotations for this task, we observed that manually tagging audio chunks was much less time-consuming than manually locating event boundaries within recordings. We believe this approach has substantial potential to reduce the time cost, and thus improve the tractability, of obtaining manual annotations for large audio databases.

Figure 1: Overview of audio tagging system.

Audio dataset

Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house [Christensen2010]. The audio data are provided as 4-second chunks at two sampling rates (48kHz and 16kHz) with the 48kHz data in stereo and with the 16kHz data in mono. The 16kHz recordings were obtained by downsampling the right-hand channel of the 48kHz recordings. Each audio file corresponds to a single chunk.

All available audio data may be used for system development. However, the evaluation will be performed using the monophonic audio data sampled at 16kHz, with the aim of approximating the typical recording capabilities of commodity hardware.

Out of 6137 chunks, 4378 chunks are available for system development, based on partitioning at the level of 5-minute recording segments.

Download datasets:

PLEASE NOTE: If you downloaded the CHiME-Home dataset prior to the release of this information, please make sure to use the latest version of the development dataset, which includes monophonic audio data sampled at 16kHz.


The annotations are based on a set of 7 label classes, listed in Table 1. For each chunk, multi-label annotations were first obtained from each of 3 annotators. A detailed description of the annotation procedure is provided in [Foster2015]. In this task, the evaluation is based on those chunks where 2 or more annotators agreed about label presence across label classes. There are 1946 such 'strong agreement' chunks in the development dataset, and 816 such 'strong agreement' chunks in the evaluation dataset. Based on a majority vote, annotations are combined across annotators to form a single multi-label annotation (referred to as CHiME-Home-refine in [Foster2015]).
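The majority vote described above can be illustrated with a minimal sketch. It assumes that each annotator's labels for a chunk are given as a string of label characters (e.g. "cv") and that a label is kept when at least 2 of the 3 annotators assigned it. The function name, the input format and the `min_votes` parameter are hypothetical; the actual procedure is the one documented in [Foster2015].

```python
from collections import Counter

# The 7 label classes from Table 1.
LABELS = "cmfvpbo"


def majority_vote(annotations, min_votes=2):
    """Combine per-annotator label strings into a single label string.

    annotations: list of strings, one per annotator, e.g. ["cv", "c", "cvp"].
    A label is kept when at least `min_votes` annotators assigned it
    (assumption: simple per-label majority among 3 annotators).
    """
    # set(labels) guards against an annotator listing the same label twice.
    counts = Counter(label for labels in annotations for label in set(labels))
    return "".join(label for label in LABELS if counts[label] >= min_votes)


# 'c' has 3 votes, 'v' has 2 votes, 'p' has only 1 vote and is dropped.
combined = majority_vote(["cv", "c", "cvp"])
```

A chunk can end up with an empty annotation under this scheme, which is consistent with the task definition of assigning zero or more labels per chunk.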

Table 1: Labels used in annotations.
Label  Description                                            Number of occurrences (development dataset, 'strong agreement' chunks)
c      Child speech                                           1214
m      Adult male speech                                      174
f      Adult female speech                                    409
v      Video game / TV                                        1181
p      Percussive sounds, e.g. crash, bang, knock, footsteps  765
b      Broadband noise, e.g. household appliances             19
o      Other identifiable sounds                              361

Development dataset

For the 1946 'strong agreement' chunks in the development dataset, label occurrences are summarised in Table 1. These chunks have been partitioned at the level of 5-minute recording segments for 5-fold cross validation (please refer to the file development_chunks_refined_crossval_dcase2016.csv in the updated version of the CHiME-Home dataset). In the partition, the 5-minute recording constraint was omitted for chunks labelled 'b', owing to the low number of associated label occurrences. While not used for evaluation, the remaining 2432 chunks in the development dataset may be used to train models, for example for unsupervised learning.

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.

Prediction task

For each chunk, output a classification score for each of the 7 label classes listed in Table 1.


Label prediction performance is quantified using the equal error rate (EER), which is defined as the fixed point of the graph of false negative rate versus false positive rate [Murphy2012, p. 181], for which Python and Matlab implementations are provided. The EER is computed individually for each label. When using the development data, EERs are computed individually for each cross-validation fold, before averaging the obtained EERs across folds.
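To make the definition concrete, the following is a minimal sketch of an EER computation for a single label. It is not the provided Python or Matlab implementation: it sweeps the decision threshold over the observed scores and reports the average of the false negative and false positive rates at the threshold where the two are closest, which approximates the crossing point. The function name and the tie handling are assumptions.

```python
def equal_error_rate(y_true, scores):
    """Approximate the EER for one label.

    y_true: sequence of 0/1 ground-truth values (1 = label present).
    scores: sequence of classification scores (higher = more likely present).
    """
    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("EER needs at least one positive and one negative")

    tp = fp = 0
    # Before any chunk is accepted: FNR = 1, FPR = 0.
    best_gap, eer = 1.0, 0.5
    i = 0
    while i < len(pairs):
        # Accept all chunks sharing this score so that ties move together.
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fnr = 1.0 - tp / n_pos
        fpr = fp / n_neg
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, eer = gap, (fnr + fpr) / 2.0
    return eer


def mean_eer_across_folds(folds):
    """folds: list of (y_true, scores) pairs, one per cross-validation fold."""
    eers = [equal_error_rate(y_true, scores) for y_true, scores in folds]
    return sum(eers) / len(eers)
```

With the development data, `equal_error_rate` would be called once per label and fold, and the per-fold values averaged as in `mean_eer_across_folds`.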

Repository (Python & Matlab)


Submission  Corresponding author  Affiliation                                   Equal error rate (%)
Cakir       Emre Cakir            Tampere University of Technology, Finland     16.8
DCASE       Peter Foster          Queen Mary University of London, United Kingdom  20.9
Hertel      Lars Hertel           University of Luebeck, Germany                22.1
Kong        Qiuqiang Kong         University of Surrey, United Kingdom          18.9
Lidy        Thomas Lidy           Vienna University of Technology, Austria      16.6
Vu          Toan H. Vu            National Central University, Taiwan           21.1
Xu_1        Yong Xu               University of Surrey, United Kingdom          19.5
Xu_2        Yong Xu               University of Surrey, United Kingdom          19.8
Yun         Sungrack Yun          Qualcomm Research, South Korea                17.4

Complete results and technical reports can be found here.


  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • A technical report with a sufficient description of the system must be submitted along with the system outputs.

More information on submission process and Frequently Asked Questions.

Baseline system

A baseline system implemented in Python using MFCCs as features performs multi-label classification by associating a binary classifier with each label class. Classification scores are obtained as log-likelihood ratios using a pair of GMMs, respectively associated with label presence/absence.
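The scoring step can be sketched as follows. This is an illustration rather than the baseline code: it assumes diagonal-covariance GMMs given as lists of (weight, mean, variance) components, and it averages the per-frame log-likelihood ratio over the frames of a chunk. All function names and both assumptions are hypothetical; the baseline implementation may differ on either point.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """gmm: list of (weight, mean, var) components (assumed diagonal)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    # Log-sum-exp with the maximum factored out for numerical stability.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, gmm_present, gmm_absent):
    """Mean per-frame log-likelihood ratio for one label.

    frames: list of feature vectors (e.g. MFCCs), one per frame.
    Positive scores favour label presence, negative scores favour absence.
    """
    total = sum(
        gmm_log_likelihood(x, gmm_present) - gmm_log_likelihood(x, gmm_absent)
        for x in frames
    )
    return total / len(frames)


# Toy 1-D example with single-component models centred at +1 and -1.
present = [(1.0, [1.0], [1.0])]
absent = [(1.0, [-1.0], [1.0])]
score = llr_score([[1.0]], present, absent)
```

One such pair of models would be trained per label class, giving 7 independent binary classifiers.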

Python implementation

PLEASE NOTE: The provided baseline system should attempt to download the dataset by default, prior to training and testing models using the provided cross-validation partition. The relevant script for invoking this procedure is

Results for CHiME-Home, development set

Evaluation setup

  • 5-fold cross-validation
  • 7 classes
  • Average EER across folds

System parameters

  • Frame size: 20 ms (50% hop size)
  • Number of Gaussians per audio tag model: 8
  • Features: 14 MFCC static coefficients (excluding 0th)

Audio tagging results over evaluation folds.
Audio tag            EER
adult female speech  0.29
adult male speech    0.30
broadband noise      0.09
child speech         0.20
other                0.29
percussive sound     0.25
video game/tv        0.07
Mean error           0.21
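For reference, the frame parameters listed under System parameters translate into sample counts as in the short sketch below. It assumes the 16kHz evaluation audio, 4-second chunks, and no padding at the chunk edges; the padding behaviour of the baseline feature extractor is not stated here, so the frame count is indicative only. The function name is hypothetical.

```python
def frame_layout(sample_rate_hz=16000, chunk_seconds=4.0,
                 frame_ms=20.0, hop_fraction=0.5):
    """Return (frame length, hop length, frame count) in samples.

    Assumes frames are taken only where they fit entirely inside the chunk.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(frame_len * hop_fraction))
    n_samples = int(round(sample_rate_hz * chunk_seconds))
    n_frames = 1 + (n_samples - frame_len) // hop_len
    return frame_len, hop_len, n_frames


# 20 ms at 16kHz is 320 samples; a 50% hop is 160 samples.
frame_len, hop_len, n_frames = frame_layout()
```

Under these assumptions each chunk yields 399 frames of 14 MFCCs.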


Classification scores should be output to a text file containing the score associated with each (chunk, label) combination. Each line should contain three comma-separated fields: the file name identifying a chunk, the single character indicating the label, and the classification score, e.g. file1.wav,f,0.8751. Since there are 7 label classes, there should be 7 lines of output for each chunk. The file should be ASCII-formatted and lines should be terminated by the newline (\n) character. Relative to the baseline script location, example output may be found at data/CHiMeHome-audiotag-development/evaluation_setup.
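A minimal sketch of writing scores in this format is given below. The function names and the in-memory layout (a dict mapping each file name to a dict of per-label scores) are hypothetical, and the four-decimal precision simply mirrors the example line rather than a stated requirement.

```python
# The 7 label classes from Table 1.
LABELS = ["c", "m", "f", "v", "p", "b", "o"]


def format_scores(scores_by_chunk):
    """Render scores as 'file,label,score' lines, 7 lines per chunk.

    scores_by_chunk: dict mapping file name -> dict mapping label -> score.
    """
    lines = []
    for file_name, scores in scores_by_chunk.items():
        for label in LABELS:
            lines.append(f"{file_name},{label},{scores[label]:.4f}")
    return "\n".join(lines) + "\n"


def write_scores(path, scores_by_chunk):
    """Write the output as ASCII with '\\n' line endings on every platform."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(format_scores(scores_by_chunk))


example = format_scores({"file1.wav": {label: 0.8751 for label in LABELS}})
```

Passing `newline="\n"` to `open` prevents the line endings from being translated to `\r\n` on Windows.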

Detailed information on the challenge submission can be found on the submission page.


[Christensen2010] Christensen et al., "The CHiME corpus: a resource and a challenge for computational hearing in multisource environments", Proc. INTERSPEECH, pp. 1918-1921, 2010.

[Foster2015] Foster et al., "CHiME-Home: A dataset for sound source recognition in a domestic environment", Proc. WASPAA, Oct 2015.

[Murphy2012] Murphy, K. P., "Machine Learning: A Probabilistic Perspective", MIT Press, 2012.