| Automatic Transcription of Music | Anssi Klapuri |
Top-down processing utilizes internal, high-level models of the acoustic environment and prior knowledge of the properties and dependencies of the objects in it. In this approach information also flows top-down: a sensing system collects evidence that would either justify or cause a change in an internal world model and in the state of the objects in it. This approach is also called prediction-driven processing, because it is strongly dependent on the predictions of an abstracted internal model, and on prior knowledge of the sound sources [Ellis96].
The science of perception and the perceptual models of audition are dominated by a bottom-up approach, most often ignoring top-down flow of information in the human auditory system [Slaney95]. In our transcription system, the only top-down mechanism is its use of tone models, i.e., prior knowledge of the instrument sounds. On the other hand, musical knowledge was totally ignored, assuming all note combinations and sequences equally probable.
The foundation of signal analysis is still in reliable low-level observations. Without being able to reliably extract information at the lowest level, no amount of higher level processing is going to resolve a musical signal. It is therefore clear that top-down processing cannot replace the functions of a bottom-up analysis. Instead, top-down techniques can add to bottom-up processing and help it to solve otherwise ambiguous situations. Top-down rules may confirm an interpretation or cancel out some others. On the other hand, high-level knowledge can guide the attention and sensitivity of the low-level analysis.
It should be noted that top-down processing does not require that the information should have originated at a high-level process. On the contrary, all top-down processing should be highly interactive and adapt to the signal at hand. For example, context is utilized when information is collected at a low level, interpreted at a higher level, and brought back to affect low-level processes. Similarly, internal tone models should originate from low-level observations.
A hierarchy of information representations, as discussed in Chapter 3 of [Klapuri98], does not need to imply bottom-up processing: representations may be fixed, and still information flows both ways. Data representations, processing algorithms, and an implementation architecture should be discussed separately, according to the desirable properties of each.
In this chapter we first discuss the shortcomings of pure bottom-up systems and present some top-down phenomena in human hearing. Second, we present top-down knowledge sources that can be used in music transcription: use of context, instrument models, and primitive `linguistic' dependencies of note sequences and combinations in music. Then we investigate how that information could be practically used, and finally propose some implementation architectures.
The critique of pure bottom-up models of the human auditory system is not only due to the fact that they fail to model some important processes in human perception. The very basic reason for criticism is that often these models are not sufficient to provide a reliable enough interpretation of acoustic information that comes from an auditory scene. This applies especially to a general purpose computational auditory scene analysis, where some top-down predictions of an internal world model are certainly needed. For example, even a human listener will find it difficult to analyze the number of persons and their actions in a room by acoustic information only, without prior knowledge of the actual physical setting in that room.
Transcription of musical signals is a substantially more specialized task, and thus more tractable for bottom-up systems. Nevertheless, inflexibility, when it comes to the generality of sounds and musical styles, is a problem that is largely derived from systems' inability to make overall observations of the data and let that knowledge flow back top-down for the purpose of adapting the low-level signal analysis. Another lacking facet of flexibility is the ability to work with obscured or partly corrupted data, which would still be easily handled by the human auditory system.
In the following sections, we will present knowledge sources and mechanisms that add to the intelligence of bottom-up systems. That knowledge can be used to prevent the bottom-up analysis from being misled in some situations, or to resolve otherwise ambiguous mixtures of sounds.
Psychoacoustic experiments have revealed several phenomena that suggest that top-down processing takes place in human auditory perception. Although this has been known for years, explicit critique of pure bottom-up perception models has emerged only very recently [Slaney95, Ellis96, Scheirer96]. In this section we review some phenomena that motivate a top-down viewpoint to perception.
Auditory restoration refers to an unconscious mechanism in hearing, which compensates for the effect of masking sounds [Bregman90]. In an early experiment, listeners were played a speech recording in which a certain syllable had been deleted and replaced by a noise burst [Warren70]. Because of the linguistic context, the listeners also `heard' the removed syllable, and were even unable to identify exactly where the masking noise burst had occurred.
Slaney describes an experiment of Remez and Rubin which indicates that top-down processing takes place in organizing simultaneous spectral features [Slaney95]. Sine-wave speech, in which the acoustic signal was modelled by a small number of sinusoid waves, was played to a group of listeners. Most listeners first recognized that signal as a series of tones, chirps, and blips with no apparent linguistic meaning. But after some period of time, all listeners unmistakably heard the words and had difficulties in separating the tones and blips. The linguistic information changed the perception of the signal. In music, internal models of the instrument sounds and tonal context have an analogous effect.
Scheirer mentions Thomassen's observation, which indicates that high-level melodic understanding in music may affect the low-level perception of the attributes of a single sound in a stream [Scheirer96a]. Thomassen observed that certain frequency contours of melodic lines lead to a percept of an accented sound, as if it had been played stronger, although there was no change in the loudness of the sounds [Thomassen82].
Slaney illustrates the effect of context by explaining Ladefoged's experiment, where the same constant sample was played after two different introductory sentences [Slaney95]. Depending on the speaker of the introductory sentence "Please say what this word is: -", the listeners heard the subsequent constant sample to be either "bit" or "bet" [Ladefoged89].
Memory and hearing interact. In [Klapuri98] we have stated that paying attention to time intervals in rhythm and to frequency intervals of concurrent sounds has a certain goal among others: to unify the sounds to form a coherent structure that is able to express more than any of the sounds alone. We propose that the structure in music also has this function: similarities in two sound sequences tie these bigger entities together, although they may be separated in time and may differ from each other in details. These redundancies and repetitions facilitate the task of a human listener, and raise expectations in his mind. Only portions of a common theme need to be explicitly repeated to reconstruct the whole sequence in a listener's mind, and special attention can be paid to intentional variations in repeated sequences.
Certain aspects of the human auditory restoration can be quite reliably modelled, if the low-level conditions and functions of that phenomenon can just be determined. These include illusory continuity either in time or in frequency dimension. If a stable sound becomes broken for a short time and an explaining event (such as a noise burst) is present, that period can be interpolated, since the human auditory system will do the same. On the other hand, if the harmonic partials of a sound are missing or corrupted at a certain frequency band, the sound can still be recovered by finding a harmonic series at adjacent bands.
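The time-domain case of illusory continuity can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of the transcription system: the function name `restore_gaps`, the frame-wise amplitude track, the `masked` flags that mark frames where an explaining event is present, the `max_gap` limit, and the choice of linear interpolation are all hypothetical.

```python
def restore_gaps(amps, masked, max_gap=5, floor=1e-3):
    """Linearly interpolate short dropouts in an amplitude track.

    amps   : per-frame amplitudes of one stable sound (hypothetical input)
    masked : per-frame booleans, True where an explaining event
             (for example a noise burst) is present
    A gap is filled only if it is short, fully covered by the masker,
    and bounded by sounding frames on both sides.
    """
    out = list(amps)
    n = len(out)
    i = 0
    while i < n:
        if out[i] > floor:
            i += 1
            continue
        # Find the end of the silent run [i, j).
        j = i
        while j < n and out[j] <= floor:
            j += 1
        bounded = i > 0 and j < n
        short = (j - i) <= max_gap
        explained = all(masked[i:j])
        if bounded and short and explained:
            left, right = out[i - 1], out[j]
            span = j - i + 1
            for k in range(i, j):
                t = (k - i + 1) / span
                out[k] = left + t * (right - left)
        i = j
    return out
```

A gap that has no explaining event is deliberately left untouched, since in that case a listener would also hear the sound as interrupted.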
The auditory restoration example in Section 2 represents a type of restoration that is based on linguistic knowledge and cannot be done as easily, since it requires applying higher level knowledge. Primitive `linguistic' dependencies in music will be discussed in Sections 5 and 6.
In [Klapuri98] we have pointed out that separating harmonically related sounds is theoretically ambiguous without internal models of the component sounds. These models do not necessarily need to be pre-defined, but may be found in the context of an ambiguous mixture. Internal models should be assembled and adapted at moments when that information is laid bare from under the interference of other sounds, and used at more complicated points. It is well known that if two sound sequences move in parallel in music, playing the same notes or two similar melodies with a harmonically related frequency separation, the auditory system will inevitably consider them a single unit. Typically this is not the composer's intention, but music must introduce the atomic patterns, sounds, by representing them as varying combinations. By recovering these atomic sounds, melodic lines of two instruments can be perceptually separated from each other, although they would instantaneously totally overlap each other.
Musical signals are redundant at many levels. Not only do they consist of a limited set of instrumental sounds, but they also have structural redundancy. The ability of a listener to follow rich polyphonies can be improved by a thematic repetition, and by adding new melodic layers or variations one by one. This strategy is often employed, for example, in classical and in jazz music, when the richness and originality of climax points could not be directly accepted by a listener.
This does not mean that instrument models must be built beforehand from a training material. They can be assembled either from separate training material, or during the transcription from the musical signal itself, or by combining these two: starting from a set of standard instrument models that are then adapted, reconciled using the musical signal. We use the terms tone model and timbre model in different meanings, the former referring to a model of a sound at a certain pitch, and the latter to the model of instrument characteristics that apply to its whole range of sounds. The term timbre alone means the distinctive character of the sounds of a musical instrument, apart from their pitch and intensity, `sound colour'. It seems that the human auditory system uses timbre models, since it tends to model the source, not single sounds.
If our transcription system were just able to extract its tone models from the musical piece itself, it would be an efficient and general purpose transcription tool. A critical problem in doing that is to recognize an instrument, to know what model to use when needed, or to reconcile by adapting when it is possible. Controlled adapting by accounting for only the non-interfered partials would then be possible, based on the analysis in Chapter 6. Another problem then is to find a parametrization for the models that allows using the information of just one sound in refining the timbre model of the whole instrument.
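Controlled adapting that uses only the non-interfered partials could look roughly like the sketch below. The update rule (exponential smoothing with a fixed `rate`), the function name `adapt_tone_model`, and the representation of a tone model as a list of per-partial amplitudes are assumptions made for illustration; the text does not specify any of them.

```python
def adapt_tone_model(model, observed, interfered, rate=0.2):
    """Update the per-partial amplitudes of a tone model.

    model      : current amplitude estimate for each harmonic partial
    observed   : amplitudes measured for the same partials in the signal
    interfered : booleans, True where another sound overlaps the partial
    rate       : adaptation step in [0, 1] (assumed value)
    Interfered partials are left untouched, so the model is refined
    only by evidence that is laid bare in the signal.
    """
    return [
        m if bad else (1.0 - rate) * m + rate * o
        for m, o, bad in zip(model, observed, interfered)
    ]
```

For example, `adapt_tone_model([1.0, 0.5, 0.25], [0.0, 1.0, 0.25], [False, True, False], rate=0.5)` moves the first partial halfway towards its observation and keeps the second, interfered partial at its previous value.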
Since we think that distinguishing different instruments from each other is of critical importance, we concentrate on studying what is the information that should be stored in the models, i.e., what are the types of information that give a distinctive character to the sounds of an instrument. Parametrization and use of that information goes beyond the scope of this thesis.
We have taken a dimensional approach in organizing and distinguishing timbres of different instruments. This means that their relations or similarities cannot be expressed using just one attribute, but a timbre can resemble another in different ways, just as a physical object may resemble another in size, shape, material, or colour. A classic study on organizing timbres into a feature space was published by Grey in 1977 [Grey77]. Grey found three distinctive dimensions of timbres. 1) The brightness of sounds: spectral energy distribution to high and low harmonic partials. 2) The amount of spectral fluctuation through time, i.e., the degree of synchronicity in the amplitude envelopes of the partials. 3) Presence and strength of high frequency energy during the attack period of the sounds.
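As an illustration of the first dimension, brightness is often approximated by the spectral centroid, the amplitude-weighted mean frequency of the partials. This proxy is an assumption introduced here for the example and is not claimed to be Grey's own measurement procedure; the function name and the input format are likewise hypothetical.

```python
def spectral_centroid(freqs_hz, amps):
    """Amplitude-weighted mean frequency of a set of partials.

    freqs_hz : frequencies of the harmonic partials
    amps     : non-negative amplitudes of the same partials
    A higher value means that more of the spectrum lies in the
    high partials, i.e. a 'brighter' sound under this proxy.
    """
    total = sum(amps)
    if total == 0:
        return 0.0  # silent frame: centroid undefined, return 0 by convention
    return sum(f * a for f, a in zip(freqs_hz, amps)) / total
```

Two sounds at the same pitch can then be compared directly: the one whose energy is concentrated in the low partials yields the smaller centroid.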
We propose a selection of dimensions that pays attention to the fact that a human tends to hear the source of a sound, and understands a sound by imagining its source. Therefore we propose that the fundamental dimensions of a timbre space are properties of sound sources, factors that leave an informative `fingerprint' to the sound, and enable a listener to distinguish different sources, instruments, from each other. These dimensions bear resemblance to those of Grey, but the attention is not paid to the signal but to its source, which we consider more natural, informative, and descriptive. The `fingerprints' to a sound are set by
One good way to model the body of an instrument is to determine its frequency response. The pattern of intensified and attenuated frequency bands affects the perceived timbre of a sound a lot. Bregman clarifies this by taking different spoken vowels as an example. Their distinction depends on the locations of the lowest three formants of the vocal tract. The lowest two are the most important and are enough to provide a clear distinction between the vowels [Bregman90].
In [Scheirer96a], Scheirer asks whether primitive `linguistic' dependencies of this kind could be found for note sequences in music, and under what circumstances following single notes and polyphonic lines is possible, difficult or impossible for humans. Fortunately, Bregman has studied these dependencies and has collected the results of both his own and other researchers' work in [Bregman90]. Since that information has been practically ignored by the music transcription community, we will briefly review it in the following. Bregman's intention is not to enumerate the compositional rules of any certain musical style, but rather to try to understand how the primitive principles of perception are being used in music to make complicated sound combinations suit human perception.
Note sequences can be effectively integrated together by letting them advance by small pitch transitions. To illustrate that, Bregman refers to the results of Otto Ortmann, who surveyed the sizes of the frequency intervals between successive notes in several classical music compositions, totalling 23000 intervals [Ortmann26]. He found that the smallest transitions were the most frequently used, and the number of occurrences dropped roughly in inverse proportion to the size of the interval. What is most interesting, harmonic intervals do not play a special role in sequences, but it is only the size of the difference in log frequency that affects sequential integration. Sequential integration by frequency proximity can be illustrated by letting an instrument play a sequence of notes, where every second note is played at a high frequency range and every other at a low range. This will be inevitably perceived as two distinct melodies: frequency proximity overcomes the temporal order of notes. A melody may be unbroken over a large frequency leap only in the case it does not find a better continuation [Bregman90].
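Sequential integration by frequency proximity can be sketched with a simple greedy grouping. The function name `assign_streams`, the semitone distance measure, and the `max_leap_semitones` threshold are assumptions for the example. The fixed threshold is also a simplification: it always breaks a melody at a large leap, whereas the text notes that a melody may survive such a leap when no better continuation exists.

```python
import math

def assign_streams(freqs_hz, max_leap_semitones=7.0):
    """Greedy sequential grouping by log-frequency proximity.

    Each note joins the stream whose last note is nearest in log
    frequency, unless that leap exceeds max_leap_semitones, in which
    case the note starts a new stream.
    Returns a list of streams, each a list of note indices.
    """
    streams = []
    for idx, f in enumerate(freqs_hz):
        best, best_dist = None, None
        for s in streams:
            # Distance in semitones depends only on the frequency ratio.
            dist = abs(12.0 * math.log2(f / freqs_hz[s[-1]]))
            if best_dist is None or dist < best_dist:
                best, best_dist = s, dist
        if best is not None and best_dist <= max_leap_semitones:
            best.append(idx)
        else:
            streams.append([idx])
    return streams
```

With a sequence that alternates between a high and a low range, such as `[880, 220, 988, 247, 1047, 262]`, the notes split into one stream of even indices and one of odd indices, which mirrors the two distinct melodies that a listener perceives.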
Timbre is an important perceptual cue in grouping sounds to their sources of production. This is why timbral similarities can be effectively used for the purpose of grouping sounds in music, and for carrying a melody through the sounds of an accompanying orchestration [Erickson75].
As mentioned earlier, Bregman does not intend to enumerate musical rules that go above those that can be explained by universal perceptual principles. However, he pays attention to a certain higher-level sequential principle that seems to be justified by some general rule of perception. That is the fact that a dissonant combination of notes is usually followed by a note or a mixture of notes which is more stable in the musical system and is able to provide a cognitive reference point, a `melodic anchor'. Furthermore, these inharmonious combinations of notes typically do not set on together with other notes, but are placed in between the strongest rhythmic pulses to serve as a short moment of tension between two stabler states.
Ironically, the perceptual intentions of music directly oppose those of its transcription. Dissonant sounds are easier to separate, and a computer would have no difficulty in paying attention to whatever number of unrelated melodies and note combinations, but has critical problems in resolving chimeric (fused) note combinations into their member notes. Human perception does not want to break down chimeras, but listens to them as a single object. Therefore music may recruit a large number of harmonically related sounds - that are hard to transcribe - without adding much complexity to a human listener. On the other hand, music has strict rules for the use of dissonances - that are easily detected by a transcriber - since they appear as separate objects for a listener.
In the previous section we already stated some regularities in the use of dissonance. Since transcribing dissonances is not the bottleneck of transcription systems, we consider it irrelevant to list more rules that govern it. Instead, we propose a principle that can be used in resolving rich harmonic polyphonies.
The procedure that we suggest consists of two fundamental phases.
The auditory system naturally tends to assign a coherent sequence of notes to a common source, and larger structures of music are understood by interpreting them as movements, transformations, or expressions of these sources of production [Bregman90]. This seems to explain what we have noticed, that distinct instruments tend to choose their own strategies of grouping notes to meaningful entities. A flute, for example, integrates notes by forming sequences. A guitar has more possibilities: in addition to the ability of producing sequences, it can accompany a piece of music by dropping groups of simultaneous notes, chords, that are integrated by harmonic frequency relations. They can be further integrated on a higher level by forming a certain rhythmic pattern over time. A rhythmic pattern may provide a framework for combining sequential and simultaneous grouping principles. For example, a piano may take a pattern of playing four unit sequences, where the first two are single notes, the third is a simultaneously grouped note combination, and the fourth is a single note again. The grouping strategy of an instrument may change in different parts of a piece, but typically remains consistent over several measures in a musical notation. This enables predictive listening, which is important for a human listener.
Counting the number of instruments is not needed for a solo (one player) performance, where the restrictions on production can be readily used. The other extreme is classical symphony music, where counting the instruments will surely prove impossible at some points; interestingly enough, such music is also impossible for even an experienced musician to transcribe. Reasonable cases, such as telling the instruments apart in classical chamber music or standard pop music, will not pose a problem for an experienced listener. Still, resolving the instruments automatically is anything but a trivial task. The method of utilizing context (see Section 3) also seems promising for this purpose: instrumental sounds may be recovered because they take varying combinations with each other, and may sometimes even play monophonic parts in the piece.
Usage of synthesizer instruments introduces an interesting case, since it can produce arbitrary timbres and even arbitrary note combinations by preprogramming them to the memory of the instrument. However, we suggest that since music must be tailored to suit perception, analogous rules will apply to synthesized music as well. Usually certain entities called tracks are established, and each track practically takes the role of a single performer.
"That style is actually not music at all, this is why it could not be transcribed."
    -Unacceptable Excuses, 1997, unpublished
We have presented psychoacoustic dependencies and restrictions that are being used in music to make it suit human perception. But so far we have given only a few cues as to exactly where and how that knowledge can be used in an automatic transcription system. Actually we found very little literature on how to apply top-down knowledge in computational models. A single reference is the earlier mentioned work of Ellis [Ellis96].
Human perception is amazing in its interactivity: internal models, for example instrumental timbres, affect the interpretation of acoustic data, but at the same time the acoustic data create and reconcile the internal models. This is of critical importance: we must be careful not to reduce the generality and flexibility of a transcription system by sticking to predefined internal models, such as a detailed musical style or the peculiarities of a single instrument. Thus the first principle in using internal models is that they should be exposed to the acoustic context: the models should be adapted and reconciled at the points where the relevant information is laid bare, and then utilized at the more complicated or ambiguous points.
An important fact recovered earlier is that predictions guide the attention and affect the sensitivity of the auditory system. Therefore we know that even a quiet note in a position to form a transition between two other notes will inevitably be heard as intended, and such a weak note candidate can be surely confirmed by the top-down rules. On the other hand, surprising events must be indicated clearly in music to be perceived and not to frustrate inner predictions. Using this rule, a weak note candidate that is grouped neither sequentially nor simultaneously nor rhythmically can be canceled out, because it would most probably be interpreted as an accidental and meaningless artefact or interference by a human listener. An essential operation in automatic transcription is that of canceling out single erroneous interpretations, `false' notes, and therefore we think that these rules would significantly improve its performance.
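The confirm-or-cancel rule can be written down as a small filter. This is a hedged sketch only: the candidate representation (a dict with a `strength` value and three grouping flags), the function name `prune_candidates`, and the `weak_threshold` value are hypothetical, and deciding whether a note is grouped is assumed to happen elsewhere.

```python
def prune_candidates(candidates, weak_threshold=0.3):
    """Cancel weak note candidates that have no grouping support.

    Each candidate is a dict with a 'strength' in [0, 1] and three
    booleans: 'sequential', 'simultaneous', and 'rhythmic', telling
    whether the note is grouped with others in that respect.
    Strong candidates are always kept (surprising events are allowed
    if they are clearly indicated); weak ones need some grouping.
    """
    kept = []
    for c in candidates:
        grouped = c["sequential"] or c["simultaneous"] or c["rhythmic"]
        if c["strength"] >= weak_threshold or grouped:
            kept.append(c)
    return kept
```

A quiet passing note that forms a transition between two other notes is therefore confirmed through its sequential grouping, while an equally quiet isolated candidate is dropped.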
Top-down processing rules should not be used too blindly or confidently, or they will start to produce errors. Using the rules for guiding attention and sensitivity is quite safe. In contrast, assuming too much about the probabilities of different note sequences and combinations will certainly limit the generality of a system. We think that top-down rules should be primarily used in solving ambiguous or otherwise unsolvable situations, where a human listener also has to make a guess of the most probable interpretation. The musical style dependent excuse, which is quoted below this section's title, certainly does not satisfy a revolutionary musician.
In any case, there are universal perceptual rules that are not style dependent and can be used to reject certain interpretations and to favour others in an ambiguous situation. Musical styles vary, but all of them need to employ a strategy of some kind to integrate sounds into a coherent piece of music, which can also be followed by other listeners than the composer himself. Bregman takes an example from Bach's keyboard music: as the polyphony of a composition increases, perceptual techniques in organizing the notes to meaningful groups increase, too.