Automatic Transcription of Music
Anssi Klapuri

Literature Review

"Someone's got to do the interesting jobs, too"
    -Esa Hämäläinen, physics student at TUT

A literature review was conducted on the automatic transcription of music and related areas of interest. These are psychoacoustics, computational auditory scene analysis, musical instrument analysis, rhythm tracking, fundamental frequency tracking, and digital signal processing. The meaning of these terms was given in the introduction.

In this document, we first describe the methods that were used to search the literature. Then we briefly describe the state of the art and the history of designing automatic transcription systems. Finally, we summarize and discuss the most important references we found in the abovementioned research areas.

1 Methods

The research in the Signal Processing Laboratory at Tampere University of Technology comprises several areas of interest, including audio signal processing, but the analysis of musical signals had not been attempted before this project. Therefore, the first step in finding music transcription literature was to contact the Acoustic Laboratory of Helsinki University of Technology. The staff of that department provided us with initial references and a wider view of the research topics that should be involved.

After becoming familiar with the relevant topics, we started searching for publications on them. A major facilitating observation was that most universities of technology make their publication indices available online on the World Wide Web. In addition, we conducted searches in the proceedings of international signal processing conferences, and in Ei Compendex, which is the most comprehensive interdisciplinary engineering information database in the world. The keywords that we used in the searches were the names of the research areas and their principal concepts. The most important international journals in the search were Journal of the Acoustical Society of America, International Computer Music Journal, IEEE Transactions on Acoustics, Speech and Signal Processing, and Journal of the Audio Engineering Society.

We then studied in more detail the publication indices of the most promising research institutes, which were the Machine Listening Group of the Massachusetts Institute of Technology, the Center for Computer Research in Music and Acoustics of Stanford University, and the Institut de Recherche et Coordination Acoustique / Musique in Paris. After obtaining the publications, we studied their reference lists. In the final phase we followed reference chains and, when necessary, corresponded with the authors via e-mail. This resulted in a collection that we consider sufficiently comprehensive.

The scope of our treatment is limited by several factors, but especially by the limited amount of resources compared to the wide range of topics that are related to music transcription. Moreover, because the analysis of musical signals is quite new to our laboratory, we had to pay particular attention to finding the right points of emphasis and to avoiding wrong assumptions at an early phase. Obtaining the publications was easier than we expected: there were only very few that we failed to obtain.

2 Published transcription systems

The state of the art in music transcription is discussed in the abovementioned MSc thesis of Klapuri, where the subproblems of the task are considered. There we also refer to the most up-to-date research in the different areas. In this section we take a brief look at the transcription systems that have been presented so far.

We summarize some figures of merit of the different systems in Table 2. These performance statistics were not explicitly stated in some publications, but had to be deduced from the presented simulation material and results. For this reason, the figures should be taken as rough characterizations only. Furthermore, it is always hard to know how selective the presentation of simulation results for each system has been, and how much has been attained just by careful tuning of parameters. In the table, polyphony refers to the maximum polyphony in the presented transcription simulations, sounds lists the instruments that were involved, note range gives the number of different note pitches involved, and the knowledge used column lists the types of knowledge that were incorporated into each system.

Table 2: Transcription systems

Reference | Institute | Performance | Knowledge used
Moorer75a,b | Stanford University | Polyphony: 2 (severe limitations on content). Sounds: violin, guitar. Note range: 24. | Heuristic approach.
Chafe82,85,86 | Stanford University | Polyphony: 2 (presented simulation results insufficient). Sound: piano. Note range: 19. | Heuristic approach.
Maher89,90 | Illinois University | Polyphony: 2. Sounds: clarinet, bassoon, trumpet, tuba, synthesized. Note ranges: severe limitation, pitch ranges must not overlap. | Heuristic approach.
Katayose89 | Osaka University | Polyphony: 5 (several errors allowed). Sounds: piano, guitar, shamisen. Note range: 32. | Heuristic approach.
Nunn94 | Durham University | Polyphony: up to 8 (several errors allowed, perceptual similarity). Sound: organ. Note range: 48. | Perceptual rules. Architecture: bottom-up abstraction hierarchy.
Kashino93,95 | Tokyo University | Polyphony: 3 (quite reliable). Sounds: flute, piano, trumpet, automatic adaptation to tone. Note range: 18. | Perceptual rules, timbre models, tone memories, statistical chord transition dictionary. Architecture: blackboard, Bayesian probability network.
Martin96a,b | MIT | Polyphony: 4 (quite reliable). Sound: piano. Note range: 33. | Perceptual rules. Architecture: blackboard.

The first polyphonic transcription system, that of Moorer, was presented in the introduction [Moorer75b]. Moorer's work was carried on by a group of researchers at Stanford in the beginning of the 1980s [Chafe82,85,86]. Further development was made by Maher [Maher89,90]. However, polyphony was still restricted to two voices, and the fundamental frequencies of the two voices were restricted to nonoverlapping ranges.

In the late 1980s, Osaka University in Japan started a project which was aimed at extracting sentiments (feelings) from musical signals, and at constructing a robotic system that could respond to music as a human listener does [Katayose89]. Two transcription systems were designed in the course of the project. One of them transcribed monophonic Japanese folksongs, and employed knowledge of the scale in Japanese songs to cope with the ambiguity of the human voice. The other transcribed polyphonic compositions for piano, guitar, or shamisen. The polyphony of this system was extended up to five simultaneous voices, but only at the expense of allowing some more errors to occur in the output.

In 1993, Hawley published his research on computational auditory scene analysis, and also addressed the problem of transcribing polyphonic piano compositions [Hawley93]. We were unable to obtain his publication, but according to Martin [Martin96b], Hawley's system was reported to be fairly successful.

Douglas Nunn works on transcription at Durham University, UK. His transcription system is characterized by a mainly heuristic signal processing approach, and has been applied to synthetic signals which involve up to eight simultaneous organ voices [Nunn94]. However, Nunn emphasizes perceived similarity between the original and transcribed pieces, allowing a few more errors to occur in the output.

A significant stride was taken in the history of automatic transcription when a group of researchers at the University of Tokyo published their transcription system, which employed several new techniques [Kashino93]. They were the first to clearly list and apply human auditory separation rules, i.e., auditory cues that promote either fusion or segregation of simultaneous frequency components in a signal. Further, they introduced tone model based processing (using information about instrument sounds in processing), and proposed an algorithm for automatic tone modeling, which means automatic extraction of the tone models from the analysed signal. In 1995, they further improved the system by employing a so-called blackboard architecture, which seems to be particularly well suited to transcription since it allows a flexible integration of information from several diverse knowledge sources without a global control module [Kashino95]. The architecture was used to implement a Bayesian probability network, which propagates the impact of new information through the system.
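
The idea of a blackboard, where independent knowledge sources read and post hypotheses on a shared data structure, can be illustrated with a toy sketch. This is not the implementation of [Kashino95] or of any other cited system: the class Blackboard, the two knowledge sources, the level names, the tolerance value and the example partial frequencies are all hypothetical, and the probabilistic (Bayesian) propagation is left out entirely. The small loop in run() only keeps offering the blackboard to each knowledge source until nothing new is posted; it does not decide what any source should do.

```python
import math


class Blackboard:
    """Shared workspace: hypotheses are stored per abstraction level."""

    def __init__(self):
        self.levels = {}  # level name -> list of hypotheses

    def post(self, level, hypothesis):
        """Add a hypothesis; return True only if it was not there already."""
        entries = self.levels.setdefault(level, [])
        if hypothesis in entries:
            return False
        entries.append(hypothesis)
        return True

    def get(self, level):
        return list(self.levels.get(level, []))


def group_harmonics(board, tolerance=0.03):
    """Knowledge source (hypothetical): partials -> note hypotheses.

    Partials lying close to integer multiples of a candidate fundamental
    are fused into one note, a crude stand-in for a harmonicity cue.
    """
    changed = False
    partials = sorted(board.get("partials"))
    explained = set()
    for f0 in partials:
        if f0 in explained:
            continue
        members = []
        for p in partials:
            ratio = p / f0
            harmonic = round(ratio)
            if harmonic >= 1 and abs(ratio - harmonic) < tolerance:
                members.append(p)
        if len(members) >= 3:  # assumed minimum evidence for a note
            explained.update(members)
            changed |= board.post("notes", f0)
    return changed


def label_notes(board):
    """Knowledge source (hypothetical): note frequency -> MIDI note number."""
    changed = False
    for f0 in board.get("notes"):
        midi = int(round(69 + 12 * math.log2(f0 / 440.0)))
        changed |= board.post("midi_notes", midi)
    return changed


def run(board, knowledge_sources):
    """Let every knowledge source contribute until no one adds anything."""
    progress = True
    while progress:
        progress = False
        for source in knowledge_sources:
            if source(board):
                progress = True


# Usage: partials of two overlapping harmonic sounds (660 Hz is shared).
board = Blackboard()
for partial in [220.0, 440.0, 660.0, 330.0, 990.0]:
    board.post("partials", partial)

# The order of the knowledge sources does not matter for the result.
run(board, [label_notes, group_harmonics])
print(board.get("notes"), board.get("midi_notes"))
```

The point of the design is that new knowledge sources (for example timbre models or chord statistics) could be added to the list without changing the existing ones, since they communicate only through the blackboard.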

Another recent transcription system, that of Keith Martin (MIT), also uses a blackboard architecture [Martin96a]. He has put considerable effort into implementing the blackboard structure, but his system does not utilize high-level musical knowledge to the same extent as Kashino's, and does not build a probabilistic information propagation network. Automatic tone modeling is not addressed, either. However, along with Kashino's system, Martin's approach represents the state of the art in music transcription. Later, Martin further upgraded his system by adding a perceptually more motivated front end, which employs correlograms in signal analysis [Martin96b].

3 Related work

The discussion above concerned implemented transcription systems that are intended to transcribe polyphonic music consisting of harmonic sounds (no drums). As stated earlier, there are several other fields of science that are related to music transcription. Here we introduce some of the most important research.

Two excellent sources of information in the field of psychoacoustics are [Bregman90] and [Moore95]. Albert Bregman's book "Auditory Scene Analysis - the Perceptual Organization of Sound" (773 pages) comprises the results of three decades of research by this experimental psychologist, and has been widely referenced in the branches of computer science that are related to auditory perception. Music perception is also addressed in the book. The other, newer and less well known, is "Hearing - Handbook of Perception and Cognition" (468 pages), which is edited by Brian Moore, and also covers research on auditory perception over time. Both provide psychoacoustic information that is directly useful for the design of an automatic transcription system.

Computational auditory scene analysis (CASA) refers to the computational analysis of the acoustic information coming from a physical environment, and the interpretation of numerous distinct events in it. In 1991, David Mellinger prepared a review of psychoacoustic and neuropsychological studies concerning human auditory scene analysis [Mellinger91]. He did not implement a complete computer model of the auditory system, but tested the findings of these studies computationally, using musical signals as test material. More recently, the work of Daniel Ellis represents the up-to-date research on CASA [Ellis96]. His study also comprises prediction-driven processing, which means utilizing the predictions of an internal world model and higher-level knowledge. Ellis evaluated his computational model, and obtained a good agreement between the events detected by the model and by human listeners.

Our research on the analysis of musical instrument sounds was limited by time, and mainly covers the sinusoidal representation, which we found the most useful. This will be discussed in Section 3.3. Some references in that area are [McAulay86, Smith87, Serra89,97, Qian97]. Meillier's comment on the importance of the attack transient of a sound is worth noting [Meillier91].

Some systems have been proposed that are aimed at transcribing polyphonic music which consists of drum-like instruments only [Stautner82, Schloss85]. Since they are more closely related to the systems that track rhythm, they are discussed in the appropriate chapter of the MSc thesis of Klapuri. There we summarize the different rhythm tracking systems. Monophonic transcription, more generally called fundamental frequency tracking, is also treated separately.

4 Commercial products

No commercial system has yet been released that can transcribe polyphonic music reliably. In contrast, monophonic transcription has been integrated into several pieces of studio equipment. These include pitch-to-MIDI converters, and they allow symbolic editing and the correction of mistuned singing, for example [Opcode96]. The acronym MIDI stands for Musical Instrument Digital Interface, a standard way of representing and communicating musical notes and their parameters between digital devices [General91].
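
To make the pitch-to-MIDI idea concrete, the final mapping step can be sketched as follows. This is only an illustration and does not describe how any commercial product works. It assumes that a fundamental frequency estimate is already available, and it assumes equal temperament with A4 = 440 Hz assigned to MIDI note number 69 and the octave naming convention in which note 60 is called C4. The function names are hypothetical.

```python
import math

A4_FREQUENCY_HZ = 440.0  # assumed tuning reference
A4_MIDI_NOTE = 69        # MIDI note number of A4 under that assumption

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_midi(frequency_hz):
    """Map a fundamental frequency to the nearest MIDI note number.

    Returns (note, cents), where cents is the deviation of the input
    from the nominal pitch of that note (positive means sharp).
    """
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    # 12 semitones per octave, one octave per doubling of frequency.
    exact = A4_MIDI_NOTE + 12.0 * math.log2(frequency_hz / A4_FREQUENCY_HZ)
    note = int(round(exact))
    cents = 100.0 * (exact - note)
    return note, cents


def midi_note_name(note):
    """Name a MIDI note number, using the convention that 60 is C4."""
    octave = note // 12 - 1
    return f"{NOTE_NAMES[note % 12]}{octave}"


# Usage: a slightly sharp sung A4 maps to note 69; the cents value is the
# kind of deviation a tuning correction tool would have to remove.
note, cents = frequency_to_midi(450.0)
print(midi_note_name(note), round(cents, 1))
```

The rounding step is where a mistuned input is quantized to a note of the scale; the cents deviation is kept so that the amount of mistuning remains available.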

In this web version we include links to some commercial transcription systems, since they were not covered above. Polyphonic transcribers, sad to say, work very poorly. Monophonic systems are, of course, more robust. The transcribers are listed in arbitrary order, and the list is not exhaustive.

References



Last modified: Mon May 19 11:10:27 EEST 2003 - Anssi Klapuri, klap @ cs tut fi