Popular music pieces in particular have a distinct structure defined by repetitions of different parts (e.g., verse and chorus). The ability to infer this structure from the audio enables several applications, such as easier navigation within the piece, music thumbnailing, and mash-ups.
The first approach attempted to locate the repeated parts of the piece. The method used a cost function for the resulting structural description, consisting of three terms: the dissimilarity of the segments assigned to the same group, the amount of the piece left unexplained, and the complexity of the explanation. These were motivated intuitively: a good explanation covers the majority of the piece with a relatively small number of different musical parts, while keeping the parts within a group similar to each other. This was published as
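As a rough sketch of how such a three-term cost could be combined (the term definitions and weights below are illustrative, not the published formulation):

```python
def description_cost(segments, groups, piece_len,
                     w_dissim=1.0, w_unexplained=1.0, w_complexity=1.0):
    """Score a candidate structural description (lower is better).

    segments  : list of (start, end, features) tuples
    groups    : list of group labels, one per segment
    piece_len : total duration of the piece
    """
    # Term 1: within-group dissimilarity (mean pairwise feature distance).
    dissim, pairs = 0.0, 0
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if groups[i] == groups[j]:
                fi, fj = segments[i][2], segments[j][2]
                dissim += sum((a - b) ** 2 for a, b in zip(fi, fj)) ** 0.5
                pairs += 1
    if pairs:
        dissim /= pairs

    # Term 2: fraction of the piece not covered by any segment.
    covered = sum(end - start for start, end, _ in segments)
    unexplained = max(0.0, 1.0 - covered / piece_len)

    # Term 3: complexity, here simply the number of distinct musical parts.
    complexity = len(set(groups))

    return (w_dissim * dissim + w_unexplained * unexplained
            + w_complexity * complexity)
```

The relative weights correspond to the sliders in the prototype GUI described below: changing a weight re-balances how much coverage, simplicity, and within-group similarity each contribute to the cost.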
A prototype analysis system was implemented in Matlab; it looks like this:
The main illustration in the GUI consists of a self-similarity matrix of the piece overlaid with the analysis result (and possibly also the ground truth annotation, if available). The relative weights of the three cost terms are set with sliders and the result is updated accordingly. The same GUI can also be used to evaluate the result, both mathematically and aurally.
The system was evaluated on 50 songs with manual annotations. The main limitation of these annotations is that only the repeated parts were annotated.
When collecting a new, larger database of songs with manually annotated structural descriptions, sequential dependencies between structural parts became visible. Earlier there was only an intuition that, given some sequence of parts, the continuation can be predicted quite well: e.g., if the beginning is "intro", "verse", then the next part is quite likely "chorus". Having data to back up the intuition showed that fairly simple N-grams can be used to model the sequential dependencies. Taking the 80% most often occurring part names in the songs of the Beatles and calculating bigram transition probabilities produces a graph like this:
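The bigram estimation itself is straightforward; a minimal sketch on a toy corpus (the part sequences below are invented, not the actual annotations):

```python
from collections import Counter, defaultdict

def bigram_probs(sequences):
    """Estimate P(next_part | part) from annotated part-label sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {part: {nxt: c / sum(nc.values()) for nxt, c in nc.items()}
            for part, nc in counts.items()}

# Toy corpus of part-label sequences.
songs = [
    ["intro", "verse", "chorus", "verse", "chorus", "outro"],
    ["intro", "verse", "chorus", "chorus", "outro"],
]
probs = bigram_probs(songs)
# In this toy corpus, every "verse" is followed by a "chorus".
```

The resulting transition probabilities are exactly the edge weights of a graph like the one shown above.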
Using these N-grams in labelling the parts of musical structure descriptions is described in
It is often said that the chorus is the most energetic section of a piece. Based on this assumption, a small addition to the labelling method was implemented. It uses the loudness of a part relative to the average loudness of the piece as the acoustic feature (the standard deviation of the loudness within the part is used as a second feature), and acoustic models are constructed for the different parts. This information is then incorporated into the N-gram likelihoods assuming statistical independence of the two. The evaluations show that the performance of the acoustic information alone is not very good, but it is capable of improving the result of the plain N-gram labelling method slightly.
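Under the independence assumption, the combination amounts to adding log-probabilities. A toy sketch with a Gaussian model of relative loudness per label (the model parameters and probabilities are invented for illustration, not taken from the published models):

```python
import math

def gaussian_loglik(x, mean, std):
    """Log-likelihood of x under a univariate Gaussian."""
    return (-0.5 * math.log(2 * math.pi * std ** 2)
            - (x - mean) ** 2 / (2 * std ** 2))

# Hypothetical per-label models of loudness relative to the piece average:
# (mean, std). A chorus is assumed louder than average, a verse quieter.
LOUDNESS_MODELS = {"chorus": (1.2, 0.15), "verse": (0.9, 0.15)}

def label_score(label, ngram_logp, rel_loudness):
    """Combine N-gram and acoustic evidence; independence means the
    log-probabilities simply add."""
    mean, std = LOUDNESS_MODELS[label]
    return ngram_logp + gaussian_loglik(rel_loudness, mean, std)

# A louder-than-average segment pushes the combined score toward "chorus"
# even when the N-gram gives both labels the same probability.
s_chorus = label_score("chorus", math.log(0.4), 1.25)
s_verse = label_score("verse", math.log(0.4), 1.25)
```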
The acoustic information incorporation is described in more detail in
The collected TUTstructure07 data set contains 557 files. At this time, the annotations are not publicly available. However, the second data set used in the experiments consisting of Beatles' songs is available:
Having the data set enabled more systematic development of a structure analysis system. The first step was to analyse different acoustic features and their suitability for the task. The features selected for closer inspection were mel-frequency cepstral coefficients (MFCCs), chroma, and rhythmogram. The main goal when selecting the features was that the feature values should be similar across different occurrences of a single musical part, while being as dissimilar as possible between different musical parts. The similarity was measured both from the overall feature values during the segments and by inspecting the sequential evolution of the features within a part occurrence. The feature experiments and the obtained results are presented in:
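The two views of similarity can be sketched as follows: Euclidean distance between segment-mean feature vectors for the overall value, and a crude resampling-based frame-by-frame distance for the sequential evolution (the actual experiments used more refined measures; this is only an illustration):

```python
def mean_vector(frames):
    """Mean feature vector over a segment's frames."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def segment_distances(frames_a, frames_b):
    """Return (mean-vector distance, frame-wise distance) between two
    segments given as lists of per-frame feature vectors."""
    d_mean = euclidean(mean_vector(frames_a), mean_vector(frames_b))
    # Resample both sequences to a common length before frame-wise
    # comparison (a crude stand-in for proper time alignment).
    n = min(len(frames_a), len(frames_b))
    resample = lambda fr: [fr[i * len(fr) // n] for i in range(n)]
    a, b = resample(frames_a), resample(frames_b)
    d_seq = sum(euclidean(x, y) for x, y in zip(a, b)) / n
    return d_mean, d_seq
```

Two occurrences of the same part should score low on both distances; two different parts should ideally score high on at least one of them.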
Another thing the manually annotated data set enables is estimating the posterior probability that two segments are occurrences of the same musical part after observing the acoustic distance between them. The probability that the segments are not repeats is simply the complement of this value. The overall fitness measure then becomes the sum of the (weighted) log-probabilities of all segment pairs in the given structural description. In the following illustration
The plain log-probability is taken from the darkened blocks while the complement event is taken from the white blocks. (The description candidate whose fitness is evaluated in this case is "A1, B1, C, A2, B2".)
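In code, the pairwise fitness amounts to the following sketch; the mapping from acoustic distance to repeat probability is an invented placeholder for the trained posterior model:

```python
import math

def same_part_prob(distance):
    # Hypothetical monotone mapping: small acoustic distance means a
    # high probability that the two segments repeat the same part.
    return 1.0 / (1.0 + distance)

def description_fitness(labels, distances, weight=1.0):
    """labels: one group label per segment, e.g. ["A", "B", "C", "A", "B"]
    distances: dict mapping (i, j) with i < j to acoustic distance."""
    fitness = 0.0
    for (i, j), d in distances.items():
        p = same_part_prob(d)
        if labels[i] == labels[j]:
            fitness += weight * math.log(p)        # darkened blocks
        else:
            fitness += weight * math.log(1.0 - p)  # white blocks
    return fitness
```

A description that groups acoustically close segments together (and keeps distant ones apart) collects high log-probabilities from both kinds of blocks and thus scores better than a mismatched grouping.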
Since the musical part sequence N-grams were earlier shown to contain a substantial amount of musicological information, their probabilities were also included in the overall fitness measure. The main problem with the final fitness function is that its optimisation is a combinatorially very complex problem. For this optimisation, a new controllably greedy search algorithm, Bubble Token Passing (BTP), was proposed. These things are detailed in the produced publication:
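The BTP algorithm itself is specified in the publications referenced here. Purely as an illustration of what "controllably greedy" means, below is a generic beam search over segment labellings, where the beam width controls the degree of greediness (width 1 is fully greedy, larger widths explore more candidates). The scoring function is a toy bigram model with invented probabilities; this is not BTP.

```python
import math

def beam_search(n_segments, labels, score, beam_width=3):
    """Return a high-scoring label sequence for n_segments segments.
    score(seq) gives the log-score of a label prefix; beam_width trades
    greediness (1 = fully greedy) against search coverage."""
    beams = [[]]
    for _ in range(n_segments):
        candidates = [seq + [lab] for seq in beams for lab in labels]
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]  # keep only the best prefixes
    return beams[0]

# Toy scoring: initial and bigram probabilities for two part labels.
INIT = {"verse": 0.9, "chorus": 0.1}
BIGRAM = {"verse": {"verse": 0.1, "chorus": 0.9},
          "chorus": {"verse": 0.8, "chorus": 0.2}}

def score(seq):
    logp = math.log(INIT[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(BIGRAM[prev][nxt])
    return logp
```

In this toy setting, `beam_search(4, ["verse", "chorus"], score)` finds the alternating verse/chorus sequence favoured by the bigram model.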
The method proposed in the publication above was documented in more detail, including a pseudo-code description of the BTP algorithm and evaluations on three large data sets, in the following publication:
Two overview publications on the music structure analysis work in general have been published. The first is an overview chapter in my dissertation, and the second is a state-of-the-art review given at ISMIR 2010, largely based on the thesis chapter.
Back to my main page.
- 15.8.2010 - Jouni Paulus