Automatic Transcription of Music
Anssi Klapuri

Top-Down Processing

Bottom-up processing techniques are characterized by the fact that all information flows bottom-up: information is observed in an acoustic waveform, combined to provide meaningful auditory cues, and passed to higher level processes for further interpretation. This approach is also called data-driven processing.

Top-down processing utilizes internal, high-level models of the acoustic environment and prior knowledge of the properties and dependencies of the objects in it. In this approach information also flows top-down: a sensing system collects evidence that would either justify or cause a change in an internal world model and in the state of the objects in it. This approach is also called prediction-driven processing, because it is strongly dependent on the predictions of an abstracted internal model, and on prior knowledge of the sound sources [Ellis96].

The science of perception and the perceptual models of audition are dominated by a bottom-up approach, most often ignoring the top-down flow of information in the human auditory system [Slaney95]. In our transcription system, the only top-down mechanism is the use of tone models, i.e., prior knowledge of the instrument sounds. Musical knowledge, on the other hand, was ignored altogether: all note combinations and sequences were assumed equally probable.

The foundation of signal analysis is still in reliable low-level observations. Without the ability to reliably extract information at the lowest level, no amount of higher-level processing is going to resolve a musical signal. It is therefore clear that top-down processing cannot replace the functions of a bottom-up analysis. Instead, top-down techniques can complement bottom-up processing and help it resolve otherwise ambiguous situations. Top-down rules may confirm one interpretation or cancel out others. Conversely, high-level knowledge can guide the attention and sensitivity of the low-level analysis.

It should be noted that top-down processing does not require that the information originate at a high-level process. On the contrary, all top-down processing should be highly interactive and adapt to the signal at hand. For example, context is utilized when information is collected at a low level, interpreted at a higher level, and brought back to affect low-level processes. Similarly, internal tone models should originate from low-level observations.

A hierarchy of information representations, as discussed in Chapter 3 of [Klapuri98], does not need to imply bottom-up processing: representations may be fixed, and still information flows both ways. Data representations, processing algorithms, and an implementation architecture should be discussed separately, according to the desirable properties of each.

In this chapter we first discuss the shortcomings of pure bottom-up systems and present some top-down phenomena in human hearing. Second, we present top-down knowledge sources that can be used in music transcription: the use of context, instrument models, and primitive `linguistic' dependencies of note sequences and combinations in music. Then we investigate how that information could be used in practice, and finally propose some implementation architectures.

1     Shortcomings of pure bottom-up systems

The critique of pure bottom-up models of the human auditory system is not only due to the fact that they fail to model some important processes in human perception. The very basic reason for criticism is that often these models are not sufficient to provide a reliable enough interpretation of acoustic information that comes from an auditory scene. This applies especially to a general purpose computational auditory scene analysis, where some top-down predictions of an internal world model are certainly needed. For example, even a human listener will find it difficult to analyze the number of persons and their actions in a room by acoustic information only, without prior knowledge of the actual physical setting in that room.

Transcription of musical signals is a substantially more specialized task, and thus more tractable for bottom-up systems. Nevertheless, inflexibility with respect to the generality of sounds and musical styles is a problem that largely derives from such systems' inability to make overall observations of the data and let that knowledge flow back top-down to adapt the low-level signal analysis. Another missing facet of flexibility is the ability to work with obscured or partly corrupted data, which would still be easily handled by the human auditory system.

In the following sections, we present knowledge sources and mechanisms that add to the intelligence of bottom-up systems. That knowledge can be used to prevent the bottom-up analysis from being misled in some situations, or to resolve otherwise ambiguous mixtures of sounds.

2     Top-down phenomena in hearing

Psychoacoustic experiments have revealed several phenomena which suggest that top-down processing takes place in human auditory perception. Although this has been known for years, explicit critique of pure bottom-up perception models has emerged only recently [Slaney95, Ellis96, Scheirer96]. In this section we review some phenomena that motivate a top-down viewpoint on perception.

Auditory restoration refers to an unconscious mechanism in hearing which compensates for the effect of masking sounds [Bregman90]. In an early experiment, listeners were played a speech recording in which a certain syllable had been deleted and replaced by a noise burst [Warren70]. Because of the linguistic context, the listeners nevertheless `heard' the removed syllable, and were even unable to identify exactly where the masking noise burst had occurred.

Slaney describes an experiment of Remez and Rubin which indicates that top-down processing takes place in organizing simultaneous spectral features [Slaney95]. Sine-wave speech, in which the acoustic signal was modelled by a small number of sinusoids, was played to a group of listeners. Most listeners first recognized the signal as a series of tones, chirps, and blips with no apparent linguistic meaning. But after some period of time, all listeners unmistakably heard the words and had difficulty in separating the tones and blips. The linguistic information changed the perception of the signal. In music, internal models of the instrument sounds and of the tonal context have an analogous effect.

Scheirer mentions Thomassen's observation, which indicates that high-level melodic understanding in music may affect the low-level perception of the attributes of a single sound in a stream [Scheirer96a]. Thomassen observed that certain frequency contours of melodic lines lead to the percept of an accented sound, as if it had been played more strongly, although there was no change in the loudness of the sounds [Thomassen82].

Slaney illustrates the effect of context by explaining Ladefoged's experiment, where the same constant sample was played after two different introductory sentences [Slaney95]. Depending on the speaker of the introductory sentence "Please say what this word is: -", the listeners heard the subsequent constant sample as either "bit" or "bet" [Ladefoged89].

Memory and hearing interact. In [Klapuri98] we stated that paying attention to time intervals in rhythm and to frequency intervals of concurrent sounds has a certain goal among others: to unify the sounds into a coherent structure that is able to express more than any of the sounds alone. We propose that structure in music has this function, too: similarities in two sound sequences tie these bigger entities together, although they may be separated in time and may differ from each other in details. These redundancies and repetitions facilitate the task of a human listener, and raise expectations in his mind. Only portions of a common theme need to be explicitly repeated to reconstruct the whole sequence in a listener's mind, and special attention can be paid to intentional variations in repeated sequences.

3     Utilization of context

Certain aspects of human auditory restoration can be modelled quite reliably, if the low-level conditions and functions of the phenomenon can be determined. These include illusory continuity in either the time or the frequency dimension. If a stable sound breaks off for a short time and an explaining event (such as a noise burst) is present, that period can be interpolated, since the human auditory system will do the same. On the other hand, if the harmonic partials of a sound are missing or corrupted at a certain frequency band, the sound can still be recovered by finding a harmonic series at adjacent bands.
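The first, time-dimension kind of restoration can be sketched as a simple interpolation rule. The function below is only an illustration, not part of the actual system; the maximum gap length and the requirement that a masking event explain the gap are our own illustrative assumptions.

```python
# Sketch of "illusory continuity" restoration: if a stable partial's
# amplitude track is broken for a short stretch while a masking event
# is present, interpolate over the gap as the auditory system would.

def restore_track(amplitudes, masked, max_gap=3):
    """amplitudes: per-frame partial amplitudes (None where missing);
    masked: per-frame flags marking an explaining event (noise burst)."""
    out = list(amplitudes)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            # Interpolate only short gaps that a masking event explains
            # and that have reliable values on both sides.
            if (j - i) <= max_gap and i > 0 and j < len(out) \
                    and all(masked[i:j]):
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    t = (k - i + 1) / (j - i + 1)
                    out[k] = left + t * (right - left)
            i = j
        else:
            i += 1
    return out

track = [1.0, 1.1, None, None, 1.3, 1.4]
burst = [False, False, True, True, False, False]
print(restore_track(track, burst))
```

A gap with no explaining event, or a gap at the edge of the track, is deliberately left unrestored, mirroring the condition stated above.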

The auditory restoration example in Section 2 represents a type of restoration that is based on linguistic knowledge and cannot be done as easily: it requires applying higher-level knowledge. Primitive `linguistic' dependencies in music will be discussed in Sections 5 and 6.

In [Klapuri98] we pointed out that separating harmonically related sounds is theoretically ambiguous without internal models of the component sounds. These models do not necessarily need to be pre-defined, but may be found in the context of an ambiguous mixture. Internal models should be assembled and adapted at moments when that information is laid bare from under the interference of other sounds, and then used at more complicated points. It is well known that if two sound sequences move in parallel in music, playing the same notes or two similar melodies at a harmonically related frequency separation, the auditory system will inevitably consider them a single unit. Typically this is not the composer's intention, but music must introduce its atomic patterns, sounds, by presenting them in varying combinations. By recovering these atomic sounds, the melodic lines of two instruments can be perceptually separated from each other, even though at some instants they may totally overlap.

Musical signals are redundant at many levels. Not only do they consist of a limited set of instrumental sounds, but they also have structural redundancy. The ability of a listener to follow rich polyphonies can be improved by thematic repetition, and by adding new melodic layers or variations one by one. This strategy is often employed, for example, in classical and in jazz music, where the richness and originality of the climax points could not otherwise be directly taken in by a listener.

4     Instrument models

In our transcription system, tone models were the only top-down knowledge source. However, careful compilation of these models, coupled with the number-theoretical methods presented earlier, could solve transcription problems that have previously been impossible. Indeed, we want to emphasize that the use of instrument models is a powerful top-down processing mechanism, provided that this information can be collected. Moreover, we showed in Section 5.5 that these models are indispensable because of the theoretical ambiguity that is otherwise met in separating harmonic mixtures of musical sounds. Separating such mixtures without instrument models is possible only if there are other distinguishing perceptual cues. In practice, however, every single piece of music contains a multitude of note mixtures that cannot be resolved without instrument models.

This does not mean that instrument models must be built beforehand from a training material. They can be assembled either from separate training material, or during the transcription from the musical signal itself, or by combining these two: starting from a set of standard instrument models that are then adapted, reconciled using the musical signal. We use the terms tone model and timbre model in different meanings, the former referring to a model of a sound at a certain pitch, and the latter to the model of instrument characteristics that apply to its whole range of sounds. The term timbre alone means the distinctive character of the sounds of a musical instrument, apart from their pitch and intensity, `sound colour'. It seems that the human auditory system uses timbre models, since it tends to model the source, not single sounds.

If our transcription system were able to extract its tone models from the musical piece itself, it would be an efficient and general-purpose transcription tool. A critical problem in doing so is recognizing an instrument: knowing which model to use when needed, or reconciling a model by adaptation when possible. Controlled adaptation, accounting for only the non-interfered partials, would then be possible, based on the analysis in Chapter 6. Another problem is finding a parametrization for the models that allows the information of just one sound to be used in refining the timbre model of the whole instrument.

Since we think that distinguishing different instruments from each other is of critical importance, we concentrate on what information should be stored in the models, i.e., on the types of information that give a distinctive character to the sounds of an instrument. Parametrization and use of that information go beyond the scope of this thesis.

We have taken a dimensional approach to organizing and distinguishing the timbres of different instruments. This means that their relations or similarities cannot be expressed using just one attribute: a timbre can resemble another in different ways, just as a physical object may resemble another in size, shape, material, or colour. A classic study on organizing timbres into a feature space was published by Grey in 1977 [Grey77]. Grey found three distinctive dimensions of timbre:

  1. The brightness of sounds: the distribution of spectral energy between high and low harmonic partials.
  2. The amount of spectral fluctuation through time, i.e., the degree of synchronicity in the amplitude envelopes of the partials.
  3. The presence and strength of high-frequency energy during the attack period of the sounds.
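Grey's three dimensions can be approximated by simple measures computed from the amplitude envelopes of the partials. The sketch below uses our own simplified feature definitions, not Grey's exact measures; the two example "sounds" and all frame counts are invented for illustration.

```python
# Crude approximations of Grey's three timbre dimensions, computed
# from partial amplitude envelopes (a list of frames, each frame a
# list of per-partial amplitudes).

def brightness(frame):
    """Amplitude-weighted mean partial number: a simple stand-in for
    the high/low spectral energy distribution (higher = brighter)."""
    return sum((i + 1) * a for i, a in enumerate(frame)) / sum(frame)

def spectral_fluctuation(frames):
    """Mean frame-to-frame change in the normalized spectral shape:
    large values mean the partial envelopes evolve asynchronously."""
    flux = 0.0
    for prev, cur in zip(frames, frames[1:]):
        p, c = sum(prev), sum(cur)
        flux += sum(abs(a / p - b / c) for a, b in zip(prev, cur))
    return flux / (len(frames) - 1)

def attack_hf_energy(frames, attack_frames=2, split=3):
    """Share of energy above partial number `split` during the attack
    (the first `attack_frames` frames)."""
    attack = frames[:attack_frames]
    hi = sum(a * a for f in attack for a in f[split:])
    total = sum(a * a for f in attack for a in f)
    return hi / total

# Two illustrative 4-partial sounds: a dull steady tone, and a
# brighter tone whose attack carries more high-frequency energy.
dull   = [[1.0, 0.5, 0.2, 0.1]] * 4
bright = [[0.5, 0.6, 0.7, 0.9], [0.6, 0.6, 0.6, 0.7],
          [0.8, 0.5, 0.4, 0.3], [0.9, 0.4, 0.2, 0.1]]
print(brightness(dull[0]) < brightness(bright[0]))                  # True
print(spectral_fluctuation(dull) < spectral_fluctuation(bright))    # True
print(attack_hf_energy(dull) < attack_hf_energy(bright))            # True
```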

We propose a selection of dimensions that pays attention to the fact that a human tends to hear the source of a sound, and understands a sound by imagining its source. Therefore we propose that the fundamental dimensions of a timbre space are properties of sound sources: factors that leave an informative `fingerprint' on the sound, and enable a listener to distinguish different sources, instruments, from each other. These dimensions bear resemblance to those of Grey, but the attention is paid not to the signal but to its source, which we consider more natural, informative, and descriptive. The `fingerprints' of a sound are set by

  1. Properties of the vibrating source.
  2. Resonance properties of the body of a musical instrument. The body is the immediate pathway of the produced sound, intensifying or attenuating certain frequencies.
  3. Properties of the driving force that sets a sound playing, and the mechanism by which it interacts with the vibrating source to give birth to the sound.
It is a well-known fact that the human auditory system is able to separate the first two aspects above [Bregman90]. Since the third aspect in particular is usually ignored in the literature, we will discuss it more than the others. It seems that these three properties of the source effectively explain the information that is present in the resulting acoustic signal.

Vibrating source

A vibrating source is the very place where a sound is produced. It causes at least two kinds of spectral features. First, brightness of a sound, which is determined by the energy distribution between high and low frequency components. Brightness was found to be the most prominent distinguishing feature in the experiments of Grey. Second, a vibrating source often produces certain regularities in the harmonic series of a sound. In a clarinet sound, for example, odd harmonics are stronger than the even ones.
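The second kind of regularity can be measured directly from the harmonic amplitudes. As a small illustration, a crude odd-to-even energy ratio (our own illustrative measure, with made-up example spectra) scores a clarinet-like spectrum well above one whose harmonics decay smoothly:

```python
# Sketch: odd-to-even harmonic energy ratio as a crude "regularity"
# cue of the vibrating source. partials[0] is the fundamental
# (harmonic number 1), so odd harmonics sit at even list indices.

def odd_even_ratio(partials):
    odd  = sum(a * a for i, a in enumerate(partials) if (i + 1) % 2 == 1)
    even = sum(a * a for i, a in enumerate(partials) if (i + 1) % 2 == 0)
    return odd / even

clarinet_like = [1.0, 0.1, 0.7, 0.1, 0.5, 0.1]   # odd harmonics dominate
flute_like    = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]   # smooth decay
print(odd_even_ratio(clarinet_like) > odd_even_ratio(flute_like))  # True
```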

Body of a musical instrument

The term formant refers to any of several characteristic bands of resonance, intensified frequency bands that together determine the quality of a spoken vowel, for example. The structure of the body of a musical instrument causes formants. Its size, shape, and material together make it function as a filter which intensifies some frequency bands and attenuates others. Harmonic partials at formant frequencies sound louder and decay more slowly, which causes the timbre of the sound to change smoothly in the course of its playing.

One good way to model the body of an instrument is to determine its frequency response. The pattern of intensified and attenuated frequency bands strongly affects the perceived timbre of a sound. Bregman clarifies this by taking different spoken vowels as an example. Their distinction depends on the locations of the lowest three formants of the vocal tract. The lowest two are the most important, and are enough to provide a clear distinction between the vowels [Bregman90].
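As a sketch of this idea, a body can be approximated by a frequency response sampled at the harmonic frequencies. The Gaussian-shaped resonances below, with invented centre frequencies, bandwidths, and gains, stand in for a measured response; the 1/n source spectrum is likewise an idealization.

```python
# Sketch of the instrument body as a filter: a frequency response
# built from a few resonance peaks boosts partials near formants
# and attenuates the rest. All parameter values are made up.
import math

def body_response(freq, formants):
    """Sum of Gaussian-shaped resonance peaks (center_hz, bandwidth_hz,
    gain) over a small residual transmission floor."""
    floor = 0.2
    return floor + sum(
        g * math.exp(-((freq - c) / bw) ** 2) for c, bw, g in formants)

formants = [(500.0, 150.0, 1.0), (1500.0, 250.0, 0.8)]
f0 = 250.0
source = [1.0 / n for n in range(1, 9)]        # idealized source spectrum
radiated = [a * body_response(n * f0, formants)
            for n, a in enumerate(source, start=1)]
# The 2nd harmonic (500 Hz) sits on a formant, so it is boosted
# relative to the 3rd harmonic (750 Hz).
print(radiated[1] > radiated[2])   # True
```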

Driving force

A musical sound typically consists of an attack transient followed by a steady sound. This structure derives from the physical production of the sound: the driving phenomenon that sets the sound playing leaves its `fingerprint' on the sound before it stabilizes or starts decaying. A human listener is able to recognize, for example, the way in which a guitar string is plucked, or to hear out the noisy turbulence of air at the beginning of the sound of a transverse flute. The transient, in spite of its very short duration, is very important for the perceived quality of the sound [Meillier91]. An analysis of musical instrument sounds shows that a large portion of the spectral information of a sound comes out during the first tens of milliseconds. This applies especially to percussive sounds [Meillier91]. This initial burst of information might be utilized as a `label' introducing the sound source.

5     Sequential dependencies in music

Most speech recognition systems today use linguistic information in interpreting real signals, where single phonemes may be obscured or too short to be recognized by the information in a single time frame alone. The linguistic rules are typically very primitive, such as a table of statistical probabilities for the occurrence of certain two- or three-letter sequences in a given language. This kind of data can be utilized in a hidden Markov model (HMM), which basically implements a state machine where single letters are the states, and the probabilities of transitions between the states have been estimated from the language. Based on both low-level observations and these high-level constraints, a sensing system then determines the most probable letter sequence.
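The mechanism described above can be sketched in a few lines: bigram transition probabilities over the states are combined with per-frame observation scores, and the Viterbi algorithm picks the most probable state sequence. The note states, the transition table, and the observation scores below are invented for illustration.

```python
# Minimal Viterbi decoding over a bigram (first-order HMM) model.

def viterbi(obs_scores, states, trans, init):
    """obs_scores: list of {state: P(observation | state)} per frame;
    trans[p][s]: probability of moving from state p to state s."""
    best = {s: init[s] * obs_scores[0][s] for s in states}
    path = {s: [s] for s in states}
    for frame in obs_scores[1:]:
        new_best, new_path = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] * trans[p][s])
            new_best[s] = best[prev] * trans[prev][s] * frame[s]
            new_path[s] = path[prev] + [s]
        best, path = new_best, new_path
    winner = max(states, key=lambda s: best[s])
    return path[winner]

states = ['C', 'E', 'G']
init = {'C': 0.5, 'E': 0.25, 'G': 0.25}
trans = {'C': {'C': 0.2, 'E': 0.5, 'G': 0.3},   # toy bigram statistics
         'E': {'C': 0.3, 'E': 0.2, 'G': 0.5},
         'G': {'C': 0.5, 'E': 0.3, 'G': 0.2}}
# The middle frame is ambiguous (obscured); the transition model
# resolves it in favour of the likely C -> E -> G path.
obs = [{'C': 0.9, 'E': 0.05, 'G': 0.05},
       {'C': 0.3, 'E': 0.4, 'G': 0.3},
       {'C': 0.05, 'E': 0.05, 'G': 0.9}]
print(viterbi(obs, states, trans, init))   # ['C', 'E', 'G']
```

The same machinery transfers directly to notes instead of letters once transition statistics for note sequences are available.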

In [Scheirer96a], Scheirer asks whether primitive `linguistic' dependencies of this kind could be found for note sequences in music, and under what circumstances following single notes and polyphonic lines is possible, difficult, or impossible for humans. Fortunately, Bregman has studied these dependencies and has collected the results of both his own and other researchers' work in [Bregman90]. Since that information has been practically ignored by the music transcription community, we briefly review it in the following. Bregman's intention is not to enumerate the compositional rules of any certain musical style, but rather to try to understand how the primitive principles of perception are used in music to make complicated sound combinations suit human perception.

Sequential coherence of melodies - a review on Bregman's results

Both time and frequency separations affect the integrity of perceived sequences, according to Bregman's experiments. The two are correlated: as the frequency separation becomes wider, a note sequence must slow down in order to maintain its coherence. The duration of a note in Western music typically falls between one tenth of a second and one second, and if a note is shorter than this, it tends to stay close to its neighbours in frequency, and is used to create what Bregman calls an ornamental effect. An ornament note is perceived only in relation to another, and does not itself help to define a melody.

Note sequences can be effectively integrated by letting them advance in small pitch transitions. To illustrate this, Bregman refers to the results of Otto Ortmann, who surveyed the sizes of the frequency intervals between successive notes in several classical music compositions, totalling 23000 intervals [Ortmann26]. He found that the smallest transitions were the most frequently used, and that the number of occurrences dropped roughly in inverse proportion to the size of the interval. Most interestingly, harmonic intervals do not play a special role in sequences; it is only the size of the difference in log frequency that affects sequential integration. Sequential integration by frequency proximity can be illustrated by letting an instrument play a sequence of notes where every second note is played in a high frequency range and every other in a low range. This will inevitably be perceived as two distinct melodies: frequency proximity overcomes the temporal order of the notes. A melody may remain unbroken over a large frequency leap only if it does not find a better continuation [Bregman90].
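Grouping by frequency proximity can be sketched as a simple streaming rule: each new note joins the stream whose last note is nearest in log frequency, or opens a new stream when every existing stream is too far away. The seven-semitone threshold is an illustrative assumption, not a measured perceptual limit.

```python
# Sketch of sequential grouping by log-frequency proximity.
import math

def semitones(f1, f2):
    """Unsigned interval between two frequencies, in semitones."""
    return abs(12 * math.log2(f2 / f1))

def stream_segregate(freqs, max_jump=7.0):
    streams = []
    for f in freqs:
        near = [s for s in streams if semitones(s[-1], f) <= max_jump]
        if near:
            min(near, key=lambda s: semitones(s[-1], f)).append(f)
        else:
            streams.append([f])
    return streams

# Alternating high/low notes are heard as two melodies, not one:
notes = [220.0, 880.0, 247.0, 988.0, 262.0, 1047.0]
print(stream_segregate(notes))
# [[220.0, 247.0, 262.0], [880.0, 988.0, 1047.0]]
```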

Timbre is an important perceptual cue in grouping sounds to their sources of production. This is why timbral similarities can be effectively used for the purpose of grouping sounds in music, and for carrying a melody through the sounds of an accompanying orchestration [Erickson75].

As mentioned earlier, Bregman does not intend to enumerate musical rules above those that can be explained by universal perceptual principles. However, he pays attention to a certain higher-level sequential principle that seems to be justified by a general rule of perception: a dissonant combination of notes is usually followed by a note or a mixture of notes which is more stable in the musical system and is able to provide a cognitive reference point, a `melodic anchor'. Furthermore, these inharmonious combinations of notes typically do not begin together with other notes, but are placed in between the strongest rhythmic pulses to serve as a short moment of tension between two more stable states.

6     Simultaneity dependencies

In Section 6.3 we discussed how music uses harmonic frequency relations to group concurrent notes together. This takes place for two reasons. First, notes must be knitted together to be able to represent higher-level forms that cannot be expressed by single or unrelated atomic sounds. Second, the spectral components of harmonically related sounds match together, which effectively reduces the complexity of the calculations that the human auditory system needs to group frequency components into acoustic entities. Harmonically related sounds appear as a single coherent entity and are more easily followed.

Ironically, the perceptual intentions of music directly oppose those of its transcription. Dissonant sounds are easier to separate, and a computer would have no difficulty in paying attention to any number of unrelated melodies and note combinations, but it has critical problems in resolving chimeric (fused) note combinations into their member notes. Human perception does not want to break down chimeras, but listens to them as single objects. Therefore music may recruit a large number of harmonically related sounds - which are hard to transcribe - without adding much complexity for a human listener. On the other hand, music has strict rules for the use of dissonances - which are easily detected by a transcriber - since they appear as separate objects to a listener.

In the previous section we already stated some regularities in the use of dissonance. Since transcribing dissonances is not the bottleneck of transcription systems, we consider it irrelevant to list more rules that govern it. Instead, we propose a principle that can be used in resolving rich harmonic polyphonies.

Dependencies in production

We spent time analyzing our personal strategies in listening to classical and band music, and here propose a top-down transcription procedure that seems natural, effective, and musically relevant - in the sense that it is used by musicians who need to break down chimeric sound mixtures for transcription purposes. The listening tests were mostly performed with band music called fusion (which combines features of jazz, blues, and rock), and with classical music.

The procedure that we suggest consists of two fundamental phases.

  1. Recognize the type, and count the number of the instruments that are present in a musical performance.
  2. Utilize two principles: a) Apply instrument specific rules concerning the range of notes that can be produced, and the restrictions on simultaneously possible sounds for each single instrument. b) Make global observations on the grouping strategy (described below) and repetitions of each instrument.
Sound production restrictions of an instrument can often be explicitly and unfailingly listed. For example, a saxophone can produce only one note at a time, has a certain range of pitches, and can vary in timbre only according to a known set of playing techniques. A guitar can produce up to six-voice polyphony, but the potential note combinations are significantly limited by the dimensions of the player's hand and potential playing techniques.
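Such production restrictions can be written down as explicit feasibility tests. The checks below are deliberately crude sketches: the saxophone pitch range, the standard-tuning string set, the 19-fret limit, and the four-fret hand span are simplified assumptions, not a full model of playing technique.

```python
# Sketch of explicit production constraints per instrument,
# using MIDI note numbers.

def sax_can_play(notes):
    """Monophonic, within an assumed range of roughly MIDI 49-81."""
    return len(notes) == 1 and all(49 <= n <= 81 for n in notes)

def guitar_can_play(notes, open_strings=(40, 45, 50, 55, 59, 64),
                    hand_span=4):
    """At most six voices, one per string; fretted notes must fit
    under one hand position (a crude fingering constraint)."""
    if not 1 <= len(notes) <= 6:
        return False
    frets, free = [], list(open_strings)
    for n in sorted(notes):
        # assign each note greedily to the lowest free string that
        # can reach it within an assumed 19 frets
        string = next((s for s in free if 0 <= n - s <= 19), None)
        if string is None:
            return False
        free.remove(string)
        if n - string > 0:                 # open strings cost nothing
            frets.append(n - string)
    return not frets or max(frets) - min(frets) <= hand_span

print(sax_can_play([60]), sax_can_play([60, 64]))   # True False
print(guitar_can_play([40, 47, 52, 56, 59, 64]))    # open E major: True
print(guitar_can_play([40, 41, 42, 43, 44, 45]))    # chromatic cluster: False
```

A hypothesis that violates such a test can be discarded before any further analysis, which is exactly the pruning role the procedure above assigns to production restrictions.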

The auditory system naturally tends to assign a coherent sequence of notes to a common source, and larger structures of music are understood by interpreting them as movements, transformations, or expressions of these sources of production [Bregman90]. This seems to explain what we have noticed: distinct instruments tend to choose their own strategies for grouping notes into meaningful entities. A flute, for example, integrates notes by forming sequences. A guitar has more possibilities: in addition to producing sequences, it can accompany a piece of music by producing groups of simultaneous notes, chords, that are integrated by harmonic frequency relations. These can be further integrated at a higher level by forming a certain rhythmic pattern over time. A rhythmic pattern may provide a framework for combining sequential and simultaneous grouping principles. For example, a piano may adopt a pattern of four-unit sequences, where the first two units are single notes, the third is a simultaneously grouped note combination, and the fourth is a single note again. The grouping strategy of an instrument may change in different parts of a piece, but typically remains consistent over several measures of musical notation. This enables predictive listening, which is important for a human listener.

Counting the number of instruments is not needed for a solo (one-player) performance, where the restrictions on production can be readily used. The other extreme is classical symphony music, where counting the instruments will surely prove impossible at some points - but, interestingly enough, such passages are also impossible to transcribe for even an experienced musician. Reasonable cases, such as telling apart the instruments in classical chamber music or in standard pop music, will not pose a problem for an experienced listener. Still, resolving the instruments automatically is anything but a trivial task. The method of utilizing context (see Section 3) also seems promising for this purpose: instrumental sounds may be recovered because they appear in varying combinations with each other, and may sometimes even play monophonic parts in the piece.

Synthesizer instruments introduce an interesting case, since a synthesizer can produce arbitrary timbres and even arbitrary note combinations preprogrammed into the memory of the instrument. However, we suggest that since music must be tailored to suit perception, analogous rules will apply to synthesized music as well. Usually certain entities called tracks are established, and each track in practice takes the role of a single performer.

7     How to use the dependencies in transcription

"That style is actually not music at all, this is why it could not be transcribed."
    -Unacceptable Excuses, 1997, unpublished

We have presented psychoacoustic dependencies and restrictions that are used in music to make it suit human perception. But so far we have given only a few cues as to exactly where and how that knowledge can be used in an automatic transcription system. Indeed, we found only very little literature on how to apply top-down knowledge in computational models. One reference is the earlier-mentioned work of Ellis [Ellis96].

Human perception is amazing in its interactivity: internal models, for example instrumental timbres, affect the interpretation of acoustic data, but at the same time the acoustic data create and reconcile the internal models. This is of critical importance: we must be careful not to reduce the generality and flexibility of a transcription system by sticking to predefined internal models, such as a detailed musical style or the peculiarities of a single instrument. Thus the first principle in using internal models is that they should be exposed to the acoustic context: the models should be adapted and reconciled at the points where the relevant information is laid bare, and then utilized at the more complicated or ambiguous points.

An important fact noted earlier is that predictions guide the attention and affect the sensitivity of the auditory system. Therefore we know that even a quiet note positioned to form a transition between two other notes will inevitably be heard as intended, and such a weak note candidate can safely be confirmed by the top-down rules. On the other hand, surprising events must be indicated clearly in music to be perceived and not to frustrate inner predictions. Using this rule, a weak note candidate that is grouped neither sequentially, simultaneously, nor rhythmically can be canceled out, because it would most probably be interpreted as an accidental and meaningless artefact or interference by a human listener. An essential operation in automatic transcription is cancelling out single erroneous interpretations, `false' notes, and therefore we think that these rules would significantly improve its performance.
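The confirm/cancel rule can be sketched as a test of whether any grouping principle supports a weak candidate: sequential proximity to a melodic stream, a harmonic relation to a concurrent note, or alignment with a rhythmic pulse. Every threshold below is an illustrative assumption, not a measured value.

```python
# Sketch of confirming or canceling a weak note candidate by
# sequential, simultaneous, and rhythmic grouping cues.
import math

def harmonically_related(f1, f2, max_den=4, tol=0.02):
    """True if f1/f2 is close to a ratio of small integers."""
    r = max(f1, f2) / min(f1, f2)
    return any(abs(r - round(r * q) / q) <= tol * r
               for q in range(1, max_den + 1))

def confirm_candidate(cand_hz, melody_prev_hz, concurrent_hz, onset_to_beat):
    """cand_hz: candidate fundamental frequency;
    melody_prev_hz: previous note of the nearest melodic stream;
    concurrent_hz: notes sounding at the same time;
    onset_to_beat: onset distance to the nearest rhythmic pulse (s)."""
    sequential = abs(12 * math.log2(cand_hz / melody_prev_hz)) <= 7.0
    harmonic = any(harmonically_related(cand_hz, f) for f in concurrent_hz)
    rhythmic = abs(onset_to_beat) <= 0.05
    return sequential or harmonic or rhythmic

# A quiet note a third above the previous melody note is confirmed ...
print(confirm_candidate(330.0, 262.0, [], 0.2))       # True
# ... while an isolated, off-beat, unrelated candidate is canceled.
print(confirm_candidate(453.0, 220.0, [440.0], 0.2))  # False
```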

Top-down processing rules should not be used too blindly or confidently, or they will start to produce errors. Using the rules to guide attention and sensitivity is quite safe. By contrast, assuming too much about the probabilities of different note sequences and combinations will certainly limit the generality of a system. We think that top-down rules should primarily be used for solving ambiguous or otherwise unsolvable situations, where a human listener also has to guess the most probable interpretation. The style-dependent excuse quoted under this section's title certainly does not satisfy a revolutionary musician.

In any case, there are universal perceptual rules that are not style-dependent and can be used to reject certain interpretations and to favour others in an ambiguous situation. Musical styles vary, but all of them need to employ a strategy of some kind to integrate sounds into a coherent piece of music that can be followed by listeners other than the composer himself. Bregman takes an example from Bach's keyboard music: as the polyphony of a composition increases, the perceptual techniques for organizing the notes into meaningful groups increase, too.


Last modified: Tue Dec 11 13:51:10 EET 2001 - Anssi Klapuri, klap @ cs tut fi