TUT Acoustic scenes 2016
Not necessarily, but we recommend using it.
The original data collected for the dataset consists of 3–5 minute long audio recordings, which were cut into 30-second segments. The partitioning of the data into the cross-validation subsets was done based on the location of the original recordings. In the provided setup, all segments obtained from the same original recording were included in a single subset - either the train subset or the test subset.
If you create folds without accounting for this, you may obtain over-optimistic results, because the system learns the acoustic characteristics of the scene at the specific location rather than the scene class itself.
The file names for the segments indicate which files are part of the same longer recording, so if you use different cross-validation splits, you should partition the data based on this information.
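As a minimal sketch of such a recording-aware split, the helper below groups segments by a recording identifier parsed from the file name and assigns each whole group to a single fold. The assumed file-name pattern (recording ID before the first underscore, e.g. `a001_0_30.wav`) and the function name are illustrative, not part of the official setup; adapt the parsing to the actual naming scheme of the dataset.

```python
from collections import defaultdict

def group_folds(filenames, n_folds=4):
    """Assign segments to cross-validation folds so that all segments
    cut from the same original recording end up in the same fold.

    Assumes (hypothetically) that the recording identifier is the part
    of the file name before the first underscore, e.g. 'a001_0_30.wav'.
    """
    groups = defaultdict(list)
    for name in filenames:
        rec_id = name.split('_')[0]  # recording/location identifier
        groups[rec_id].append(name)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place each recording's segments into the
    # currently smallest fold, processing larger recordings first.
    for rec_id, segs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(segs)
    return folds
```

With this grouping, no recording is split between a train fold and a test fold, which avoids the over-optimistic evaluation described above.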
These errors affect a very small proportion of the total duration of the audio. In our tests with the development set folds, excluding the annotated error regions from both the training audio and the test samples produced no statistically significant difference in baseline system performance compared to using the full recordings.
Annotations of these errors are provided with the development dataset, and you are allowed to take them into account when training your system. The errors are radio interference from a mobile phone and temporary microphone failures; they affect approximately 4% of the files but, time-wise, only approximately 1% of the total duration of the audio.
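If you choose to use the error annotations, one simple option is to cut the annotated regions out of the signal before feature extraction. The sketch below assumes the annotations have been parsed into (onset, offset) pairs in seconds; the function name and the plain-list signal representation are illustrative.

```python
def remove_error_regions(signal, sr, error_regions):
    """Return a copy of the signal with annotated error regions removed.

    `signal` is a sequence of samples, `sr` the sample rate in Hz, and
    `error_regions` a list of (onset_s, offset_s) pairs in seconds,
    assumed to be parsed from the provided error annotations.
    """
    clean = []
    pos = 0
    for onset, offset in sorted(error_regions):
        start = int(onset * sr)
        clean.extend(signal[pos:start])  # keep audio before the error
        pos = int(offset * sr)           # skip the error region itself
    clean.extend(signal[pos:])           # keep the tail after the last error
    return clean
```

Since the error regions cover only about 1% of the total audio duration, this step removes very little training material.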
The evaluation data was selected such that it contains no audio errors.