The method is essentially a reimplementation of the HMM baseline method from the ICASSP06 and ISMIR07 papers. The main difference is that the feature set was reduced to only 13 MFCCs and their first-order derivatives. Feature processing with PCA and LDA is evaluated, and the results suggest that LDA provides a considerable performance increase. Both drum combination modelling and drumwise detector modelling are evaluated, and the results suggest that the detectors work better. Finally, unsupervised model parameter adaptation with MLLR is evaluated for the Gaussian mean parameters of the models. The results suggest that the adaptation improves the performance slightly, but with the test set used the difference is not statistically significant.
The demo signals are selected from the data set used in the evaluations. They contain real drumming mixed with synthetic accompaniment so that the overall drums-to-accompaniment ratio is -1.2 dB. The system input is in the right channel of each signal. It was analysed with the proposed method (drumwise detectors, LDA, no MLLR adaptation) to retrieve the locations of bass drum, snare drum, and hi-hat. The analysis result was then synthesized with Timidity to produce the signal in the left channel. The provided signals are only 15-second excerpts from the longer signals, resampled to 22.05 kHz.
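As a rough illustration of what a drums-to-accompaniment ratio of -1.2 dB means, the sketch below scales an accompaniment signal so that the ratio of mean powers reaches a target value in decibels. It assumes the ratio is defined on mean signal power, which the page does not state; the function names and the toy sample values are hypothetical.

```python
import math

def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def ratio_db(drums, accompaniment):
    """Drums-to-accompaniment power ratio in dB (assumed definition)."""
    return 10.0 * math.log10(mean_power(drums) / mean_power(accompaniment))

def accompaniment_gain(drums, accompaniment, target_db):
    """Gain to apply to the accompaniment so that the ratio equals target_db."""
    linear_ratio = 10.0 ** (target_db / 10.0)  # desired power ratio
    return math.sqrt(mean_power(drums) / (linear_ratio * mean_power(accompaniment)))

# Toy sample values, not taken from the actual data set.
drums = [0.0, 0.8, -0.6, 0.1, 0.0, 0.5, -0.4, 0.0]
accomp = [0.3, -0.2, 0.25, -0.3, 0.2, -0.25, 0.3, -0.2]

gain = accompaniment_gain(drums, accomp, -1.2)
scaled = [gain * x for x in accomp]
mix = [d + a for d, a in zip(drums, scaled)]  # the mixed system input
```

A negative ratio means the accompaniment carries slightly more power than the drums, so the drums are not the dominant component of the mix.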
Signals: signal 1, signal 2, signal 3, signal 4, signal 5.
The following signals were generated to demonstrate the performance of the proposed HMM-based drum transcription method. The target drums are kick drum, snare drum and hi-hat. For each combination of the target drums, a 4-state HMM is trained. In addition to the combination models, one 1-state model is created to serve as a background model. The transcription does not use any onset detection information. Instead, it finds an optimal path through the network of models based only on the features.
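Finding an optimal path through a network of models is typically done with Viterbi decoding. The sketch below is a generic log-domain Viterbi decoder, not the system's actual implementation: the two-state toy network (background and a single drum state) and all probabilities are made up for illustration, whereas the real network connects several 4-state models and a 1-state background model.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state path.

    log_init[s]      log prior of starting in state s
    log_trans[p][s]  log probability of moving from state p to state s
    log_obs[t][s]    log observation likelihood of frame t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def to_log(rows):
    """Convert nested lists of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in rows]

# Hypothetical toy network: state 0 = background, state 1 = drum sounding.
log_init = [math.log(0.9), math.log(0.1)]
log_trans = to_log([[0.9, 0.1], [0.3, 0.7]])
log_obs = to_log([[0.9, 0.1], [0.8, 0.2], [0.05, 0.95], [0.2, 0.8], [0.9, 0.1]])

path = viterbi(log_init, log_trans, log_obs)
```

In this toy case the decoded path enters the drum state at the third frame, and that transition would be read as the drum onset. No separate onset detector is involved, which matches the description above.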
In addition to the basic HMM, which relies only on short-time features, we propose also using long-term temporal pattern features. The information acquired from the temporal features is incorporated into the baseline HMM at the observation probability stage, before model decoding.
The following signals are taken from the data set used in the evaluations for the publication. The first 7 signals are randomly selected from the subset of the RWC Popular music database [1] that was used, and the last 6 are from the "complex drums" dataset, which does not contain any other instruments. The original signal, which was used as the input to the system, is in the right channel, and the synthesized transcription result is in the left channel. The signals have been shortened to more convenient lengths.
[1] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical, and jazz music databases," in Proc. 3rd International Conference on Music Information Retrieval, Oct. 2002.
| RWC Popular music db | |
| RWC-MDB-P-2001 No. 14 | mp3 |
| RWC-MDB-P-2001 No. 22 | mp3 |
| RWC-MDB-P-2001 No. 30 | mp3 |
| RWC-MDB-P-2001 No. 32 | mp3 |
| RWC-MDB-P-2001 No. 42 | mp3 |
| RWC-MDB-P-2001 No. 43 | mp3 |
| RWC-MDB-P-2001 No. 48 | mp3 |
| Drums only | |
| 16-beat | mp3 |
| free improvisation | mp3 |
| free improvisation | mp3 |
| free improvisation | mp3 |
| shuffle (interesting double snare error) | mp3 |
| tomtom (toms not in target drums) | mp3 |
The following table contains some demonstration signals for a simple drum transcription system. The first column contains the dry down-mix of close-miked drums, and the second the "production-grade" mixdown with compression and reverb effects. These signals are the input to the system. The third column contains the resynthesized transcription results. The results were approximately the same with both input signal types, and the overall hit rate for the three drums was 96%.
The algorithm uses pre-calculated spectral models for bass drum, snare drum and hi-hats. The main aim is to model the spectrogram of the input signal as a combination of the spectrograms of these three sources. The source spectrograms are estimated by finding the gains for the pre-calculated spectral models that reproduce the input spectrogram with as little error as possible. The algorithm used for estimating the gains derives from the non-negative matrix factorisation (NMF) algorithm, hence the name. The transcription is done by estimating the instrument onset times from the resulting time-varying gain curves.
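The gain estimation can be sketched as NMF with the spectral templates held fixed, so that only the gains are updated. The sketch below uses the standard multiplicative update for the Euclidean cost; the cost function actually used by the system is not stated here, so that choice is an assumption, as are the function name `estimate_gains` and the toy templates.

```python
import numpy as np

def estimate_gains(X, B, n_iter=500, eps=1e-12):
    """Estimate non-negative gains G such that X is approximately B @ G.

    X: (n_freq, n_frames) magnitude spectrogram of the input
    B: (n_freq, n_sources) fixed spectral templates, one column per drum
    """
    G = np.ones((B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; B is never changed.
        G *= (B.T @ X) / (B.T @ B @ G + eps)
    return G

# Hypothetical templates: columns stand for bass drum, snare drum, hi-hat.
B = np.array([[1.0, 0.1, 0.0],
              [0.5, 0.6, 0.0],
              [0.0, 1.0, 0.2],
              [0.0, 0.3, 1.0]])
# Toy gain curves over four frames, used only to build a test spectrogram.
true_G = np.array([[1.0, 0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
X = B @ true_G

G = estimate_gains(X, B)
relative_error = np.linalg.norm(B @ G - X) / np.linalg.norm(X)
```

Each row of `G` is a gain curve for one drum. Onsets would then be picked from these curves, for example at sharp rises, although the onset picking rule is not specified here.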
| original unprocessed signal | original processed signal | resynthesized transcription result |
| unprocessed 1 | processed 1 | result 1 |
| unprocessed 2 | processed 2 | result 2 |
| unprocessed 3 | processed 3 | result 3 |
| unprocessed 4 | processed 4 | result 4 |
If the signal to be analysed does not fit the model used (which in this case assumes that the signal contains only bass drum, snare drum and hi-hats), the transcription result degrades rapidly. The following table contains a few examples of such cases, where the signal and the model do not fit. The sequences were played with all the instruments in the drum kits used, and only bass drum, snare drum and hi-hats were transcribed.
| original processed signal | resynthesized transcription result | comments |
| 1 in | 1 out | different playing styles are not conveyed |
| 2 in | 2 out | surprisingly good |
| 3 in | 3 out | severe problems with ghost hits on snare |
| 4 in | 4 out | ringing of the cymbals is not transcribed |
The following table contains demonstrations of using metrical-position-based models to transcribe percussive signals generated with non-drum sounds. All the signals are synthesized from MIDI files. The first column contains the excerpt synthesized with a normal drum set using the Timidity program; the second contains the same signal synthesized monophonically using a sound set based on speech sounds; the third contains its transcribed and resynthesized version; the fourth contains the same signal synthesized with tapping sounds; and the fifth contains the corresponding resynthesized transcription result.
In short, the algorithm is the following: first, all sound event onsets are detected and the signal is segmented at those locations; then a set of features is extracted from each segment, and the segments are blindly clustered into three clusters. Finally, the clusters are labeled Bass drum, Snare drum and Hi-hat based on the temporal positions of the cluster occurrences. (All transcription results are obtained using the post-labeling cluster change enhancement.)
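The blind clustering step can be illustrated with plain k-means, used here as a stand-in because the clustering algorithm is not named above. The two features per segment (a spectral-centroid-like value and a duration) and their values are hypothetical, and the labeling step based on metrical positions is not shown.

```python
def kmeans(points, centroids, n_iter=20):
    """Plain k-means on tuples of equal length.

    Returns the cluster index of every point and the final centroids.
    """
    assignment = []
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared distance).
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster in place
        centroids = updated
    return assignment, centroids

# Hypothetical per-segment features: (spectral centroid in kHz, duration in s).
segments = [(0.10, 0.30), (5.0, 0.05), (1.5, 0.15), (5.2, 0.05),
            (0.12, 0.28), (4.9, 0.06), (1.6, 0.14), (5.1, 0.05)]

# Deterministic initialisation from the first three segments.
assignment, centroids = kmeans(segments, list(segments[:3]))
```

The cluster indices carry no instrument meaning at this point. In the described method, they are mapped to drum labels afterwards by looking at where in the metrical structure each cluster tends to occur.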
| original signal | signal synthesized with speech-based sounds | resynthesized transcription result | signal synthesized with tapping sounds | resynthesized transcription result |
| blues | blues_1 | blues_1_trans | blues_2 | blues_2_trans |
| dance/pop | dance/pop_1 | dance_pop_1_trans | dance/pop_2 | dance/pop_2_trans |
| hard rock | hard rock_1 | hard rock_1_trans | hard rock_2 | hard_rock2_trans |
| rap | rap_1 | rap_1_trans | rap_2 | rap_2_trans |
| soft rock | soft rock_1 | soft rock_1_trans | soft rock_2 | soft rock_2_trans |