Drum sound detection in polyphonic music with hidden Markov models

EURASIP Journal on Audio, Speech, and Music Processing. Volume 2009 (2009), Article ID 497292, 9 pages. (pdf). DOI:10.1155/2009/497292
Jouni Paulus, Anssi Klapuri

The method is essentially a reimplementation of the HMM baseline method from the ICASSP06 and ISMIR07 papers. The main difference is that the feature set was reduced to only 13 MFCCs and their first-order derivatives. Feature processing with PCA and LDA is evaluated, and the results suggest that LDA provides a considerable performance increase. Both drum combination modelling and drumwise detector modelling are evaluated, and the results suggest that the detectors work better. Finally, unsupervised model parameter adaptation with MLLR is evaluated for the Gaussian mean parameters of the models. The results suggest that the adaptation improves the performance slightly, but with the test set used the difference is not statistically significant.
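
As a rough illustration of the reduced feature set, the sketch below appends first-order time derivatives to a matrix of 13 MFCCs, giving 26 values per frame. The function name `add_deltas` is hypothetical, random numbers stand in for real MFCCs, and central differences are an assumption; the derivative estimator used in the paper may differ.

```python
import numpy as np

def add_deltas(mfcc):
    """Append first-order time derivatives to a (frames, coeffs) matrix."""
    # Assumption: central differences along the time axis.
    delta = np.gradient(mfcc, axis=0)
    return np.hstack([mfcc, delta])

# Stand-in for real MFCCs: 100 frames of 13 coefficients.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))
features = add_deltas(mfcc)  # shape (100, 26)
```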

The demo signals are selected from the data set used in the evaluations. They contain real drumming mixed with synthetic accompaniment so that the overall drums-to-accompaniment ratio is -1.2 dB. The system input is in the right channel of each signal. It was analysed with the proposed method (drumwise detectors, LDA, no MLLR adaptation) to retrieve the locations of bass drum, snare drum, and hi-hat. The analysis result was then synthesized with Timidity to produce the signal in the left channel. The provided signals are only 15-second excerpts from the longer signals and have been resampled to 22.05 kHz.
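
To make the -1.2 dB figure concrete, the sketch below computes the gain that would be applied to an accompaniment track to reach a target drums-to-accompaniment ratio. It assumes the ratio is an energy ratio measured over the whole signal, which this page does not specify; the function names and the toy sample values are hypothetical.

```python
import math

def energy(x):
    """Sum of squared samples."""
    return sum(s * s for s in x)

def accompaniment_gain(drums, accomp, target_db):
    """Gain g such that 10*log10(E(drums) / E(g * accomp)) equals target_db."""
    current_db = 10.0 * math.log10(energy(drums) / energy(accomp))
    # Scaling the accompaniment amplitude by g lowers the ratio by 20*log10(g) dB.
    return 10.0 ** ((current_db - target_db) / 20.0)

# Toy signals, not real audio.
drums = [0.5, -0.5, 0.25, -0.25]
accomp = [0.1, 0.2, -0.1, -0.2]

g = accompaniment_gain(drums, accomp, -1.2)
scaled_accomp = [g * s for s in accomp]
ratio_db = 10.0 * math.log10(energy(drums) / energy(scaled_accomp))
```

A negative ratio means the accompaniment carries slightly more energy than the drums.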

Signals: signal 1, signal 2, signal 3, signal 4, signal 5.

Combining temporal and spectral features in HMM-based drum transcription

8th International Conference on Music Information Retrieval, Vienna, Austria, September 23-27, 2007. (pdf) (presentation)
Jouni Paulus, Anssi Klapuri

The following signals were generated to demonstrate the performance of the proposed HMM-based drum transcription method. The target drums are kick drum, snare drum and hi-hat. For each combination of the target drums, a 4-state HMM is trained. In addition to the combination models, one 1-state model is created to serve as a background model. The transcription does not use any onset detection information. Instead, it finds an optimal path through the network of models based only on the features.
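
The decoding step can be sketched as a standard Viterbi search over per-frame observation log-likelihoods. This is a generic illustration, not the authors' implementation: it assumes one model per non-empty drum combination (seven in total), and the two-state toy network at the end uses made-up probabilities.

```python
from itertools import combinations
import numpy as np

DRUMS = ("kick", "snare", "hi-hat")
# Assumption: one 4-state model per non-empty combination, plus a 1-state background model.
COMBOS = [c for r in (1, 2, 3) for c in combinations(DRUMS, r)]
N_STATES = 4 * len(COMBOS) + 1

def viterbi(log_obs, log_trans, log_init):
    """Most likely state path.

    log_obs: (frames, states), log_trans: (states, states), log_init: (states,)
    """
    n_frames, n_states = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans  # cand[i, j]: arrive in j from i
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_states)] + log_obs[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state network with made-up numbers: the features favour
# state 0 for three frames, then state 1 for three frames.
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
log_obs = np.log(np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3))
path = viterbi(log_obs, log_trans, log_init)
```

In the real system each state would correspond to one state of a combination model or the background model, and the decoded path would be mapped back to drum onsets.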

In addition to the basic HMM relying only on short-time features, we also propose using long-term temporal pattern features. The information acquired from the temporal features is incorporated into the baseline HMM in the observation probability stage before model decoding.

The following signals are taken from the data set used in the evaluations for the publication. The first 7 signals are randomly selected from the subset of the RWC Popular music database [1] that was used, and the last 6 are from the "complex drums" dataset, which does not contain any other instruments. The original signal that was used as the input to the system is in the right channel, and the synthesized transcription result is in the left channel. The signals have been trimmed to more convenient lengths.

[1] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical, and jazz music databases," in Proc. 3rd International Conference on Music Information Retrieval, Oct. 2002.

RWC Popular music db
RWC-MDB-P-2001 No. 14 mp3
RWC-MDB-P-2001 No. 22 mp3
RWC-MDB-P-2001 No. 30 mp3
RWC-MDB-P-2001 No. 32 mp3
RWC-MDB-P-2001 No. 42 mp3
RWC-MDB-P-2001 No. 43 mp3
RWC-MDB-P-2001 No. 48 mp3
Drums only
16-beat mp3
free improvisation mp3
free improvisation mp3
free improvisation mp3
shuffle (interesting double snare error) mp3
tomtom (toms not in target drums) mp3

Drum transcription with non-negative spectrogram factorisation

13th European Signal Processing Conference (EUSIPCO2005), Antalya, Turkey, September 4-8, 2005. (pdf) (poster)
Jouni Paulus, Tuomas Virtanen

The following table contains demonstration signals for a simple drum transcription system. The first column contains the dry down-mix of close-miked drums, and the second the "production-grade" mixdown with compression and reverb. These signals are the input to the system. The third column contains the resynthesised transcription results. The results were approximately the same with both input signal types, and the overall hit rate for the three drums was 96%.

The algorithm uses pre-calculated spectral models for bass drum, snare drum and hi-hat. The main aim is to model the spectrogram of the input signal as a combination of the spectrograms of these three sources. The source spectrograms are estimated by finding the gains for the pre-calculated spectral models that reproduce the input spectrogram with as little error as possible. The gain estimation algorithm is derived from non-negative matrix factorisation, hence the name. The transcription is done by estimating the instrument onset times from the resulting time-varying gain curves.
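
The gain estimation can be sketched as non-negative matrix factorisation with the spectral models held fixed, so that only the gains are updated. The sketch assumes the Euclidean-distance multiplicative update for brevity; the cost function in the paper may differ. The 3-bin templates, the toy input and the 0.5 threshold (a stand-in for proper onset detection on the gain curves) are all made up.

```python
import numpy as np

def estimate_gains(X, W, n_iter=500, eps=1e-12):
    """Non-negative gains H such that W @ H approximates X, with W held fixed."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; keeps H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H

# Toy spectral models: rows are frequency bins, columns are
# bass drum, snare drum and hi-hat.
W = np.array([[1.0, 0.2, 0.0],
              [0.3, 1.0, 0.1],
              [0.0, 0.3, 1.0]])

# Toy input: bass drum in frame 0, hi-hat in frames 1 and 3, snare in frame 2.
H_true = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])
X = W @ H_true

H = estimate_gains(X, W)
active = H > 0.5  # crude stand-in for onset detection on the gain curves
```

Because the spectral models stay fixed, each row of `H` can be read directly as the gain curve of one drum.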

original unprocessed signal | original processed signal | resynthesized transcription result
unprocessed 1 | processed 1 | result 1
unprocessed 2 | processed 2 | result 2
unprocessed 3 | processed 3 | result 3
unprocessed 4 | processed 4 | result 4

If the signal to be analysed does not fit the model used (here, a signal containing only bass drum, snare drum and hi-hat), the transcription result degrades rapidly. The following table contains a few examples of such cases, where the signal and the model do not fit. The sequences were played with all the instruments in the drum kits used, and only bass drum, snare drum and hi-hat were transcribed.

original processed signal | resynthesized transcription result | comments
1 in | 1 out | different playing styles are not conveyed
2 in | 2 out | surprisingly good
3 in | 3 out | severe problems with ghost hits on snare
4 in | 4 out | ringing of the cymbals is not transcribed

Model-based event labeling in the transcription of the percussive audio signals

6th Int. Conference on Digital Audio Effects (DAFx-03), London, UK, September 8-11, 2003. (pdf) (presentation)
Jouni Paulus, Anssi Klapuri

The following table contains demonstrations of using metrical-position-based models to transcribe percussive signals generated with non-drum sounds. All the signals are synthesized from MIDI files. The first column contains the excerpt synthesized with a normal drum set using the Timidity program; the second contains the same signal synthesized monophonically using a sound set based on speech sounds; the third contains its transcribed and resynthesized version; the fourth contains the same signal synthesized with tapping sounds; and the fifth contains the resynthesized transcription result for that signal.

In short, the algorithm is as follows: first, all sound event onsets are detected and the signal is segmented at these locations. Then a set of features is extracted from each segment, and the segments are blindly clustered into three clusters. Finally, the clusters are labeled Bass drum, Snare drum and Hi-hat based on the temporal positions of the cluster occurrences. (All transcription results are obtained using the post-labeling cluster change enhancement.)
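
The final labeling step can be sketched as a search over all assignments of the three labels to the three clusters, scoring each assignment by how well the cluster occurrences match a model of where each drum tends to fall in the bar. This is a simplified stand-in for the metrical position models of the paper: the probability table, the eighth-note grid and the function name `label_clusters` are hypothetical.

```python
from itertools import permutations
import math

LABELS = ("bass drum", "snare drum", "hi-hat")

# Hypothetical P(label occurs | metrical position) for the 8 eighth-note
# slots of a 4/4 bar. Made-up numbers, not taken from the paper.
PRIOR = {
    "bass drum":  [0.8, 0.1, 0.1, 0.1, 0.6, 0.1, 0.1, 0.1],
    "snare drum": [0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.8, 0.1],
    "hi-hat":     [0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7],
}

def label_clusters(events):
    """events: list of (cluster_id, position) pairs; returns {cluster_id: label}."""
    clusters = sorted({c for c, _ in events})
    best, best_score = None, -math.inf
    for perm in permutations(LABELS, len(clusters)):
        mapping = dict(zip(clusters, perm))
        # Log-likelihood of the observed occurrences under this labeling.
        score = sum(math.log(PRIOR[mapping[c]][p]) for c, p in events)
        if score > best_score:
            best, best_score = mapping, score
    return best

# Two bars: cluster 0 on beats 1 and 3, cluster 1 on beats 2 and 4,
# cluster 2 on every eighth note.
events = []
for _ in range(2):
    events += [(0, 0), (0, 4), (1, 2), (1, 6)] + [(2, p) for p in range(8)]

labels = label_clusters(events)
```

Because the labels depend only on when the clusters occur, the same procedure applies to speech-based or tapping sounds that share no timbre with real drums.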

original signal | signal synthesized with speech-based sounds | resynthesized transcription result | signal synthesized with tapping sounds | resynthesized transcription result
blues | blues_1 | blues_1_trans | blues_2 | blues_2_trans
dance/pop | dance/pop_1 | dance_pop_1_trans | dance/pop_2 | dance/pop_2_trans
hard rock | hard rock_1 | hard rock_1_trans | hard rock_2 | hard_rock2_trans
rap | rap_1 | rap_1_trans | rap_2 | rap_2_trans
soft rock | soft rock_1 | soft rock_1_trans | soft rock_2 | soft rock_2_trans

27.1.2010 - Jouni Paulus