Supervisions

2023

Enhancing Domain-Specific Automated Audio Captioning: a Study on Adaptation Techniques and Transfer Learning

Abstract

Automated audio captioning is a challenging cross-modal task that takes an audio sample as input, analyzes it, and generates a natural-language caption as output. Existing audio captioning datasets such as AudioCaps and Clotho encompass a diverse range of domains, and currently proposed systems focus primarily on generic audio captioning. This thesis delves into the adaptation of generic audio captioning systems to domain-specific contexts, while simultaneously aiming to enhance generic captioning performance. Adaptation of the generic models to specific domains is explored using two techniques: complete fine-tuning of the neural model layers and layer-wise fine-tuning within the transformers. The process involves initial training with a generic captioning setup, followed by adaptation using domain-specific training data. For generic captioning, the model is first trained on the AudioCaps dataset and then fine-tuned on the Clotho dataset. The system uses a transformer-based architecture that combines a patchout fast spectrogram transformer (PaSST) for audio embeddings with a BART transformer. Word embeddings are generated using a byte-pair encoding (BPE) tokenizer built on the training datasets' unique words, aligning the vocabulary with the generic captioning task. The experimental adaptation focuses mainly on audio clips related to animals and vehicles. The results demonstrate notable improvements in the performance of both the generic and the domain-adapted systems. In generic captioning, the SPIDEr score increases from 0.291 with fine-tuning to 0.301 with layer-wise fine-tuning. For domain adaptation, we observed a notable increase in SPIDEr scores, from 0.315 to 0.323 for animal-related audio clips and from 0.298 to 0.308 for vehicle-related audio clips.
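
As a rough illustration of the layer-wise adaptation idea, the sketch below freezes a BART-style model and re-enables gradients only for the top decoder layers. It uses the public facebook/bart-base checkpoint as a stand-in; the thesis's actual PaSST + BART captioning model and its training setup are not reproduced here.

```python
# Minimal sketch of layer-wise fine-tuning on a BART-style decoder.
# facebook/bart-base stands in as a placeholder checkpoint; the thesis's
# PaSST audio encoder + BART captioning model is not reproduced.
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

def unfreeze_top_decoder_layers(model, n_layers):
    """Enable gradients only for the top n decoder layers (layer-wise adaptation)."""
    for layer in model.model.decoder.layers[-n_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# Adapt progressively: start with the top layer, widen if validation improves.
unfreeze_top_decoder_layers(model, n_layers=1)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```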

PDF

2020

Real-Time Sound Event Detection With Python

Abstract

Python is a popular programming language for rapid research prototyping in various research fields, owing to its massive repository of well-maintained third-party packages, the built-in capabilities of the language, and a strong community. This work investigates the feasibility of Python for performing sound event detection (SED) in real time, which is important for demonstrating project research results to interested parties, or for practical purposes such as acoustic health care monitoring, e.g. in attempts to reduce the transmission of the COVID-19 disease. The relevant background theory for detecting sound events in pre-recorded audio is first provided, followed by an introduction to the basic concepts that enable performing the same in real time. Then, Python real-time system designs based on two related approaches are proposed, and their feasibility is evaluated with the help of corresponding reference system implementations. The results acquired with the implementations strongly suggest that Python is indeed very feasible for performing real-time SED, even when using a sophisticated model with 3.7M total parameters.
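
As a minimal illustration of the real-time processing loop, the sketch below reads microphone audio block by block through the sounddevice bindings to PortAudio. The detector itself is a placeholder energy threshold, not the thesis's neural model.

```python
# Minimal real-time detection loop in Python via sounddevice (PortAudio).
# The neural SED model from the thesis is replaced by a placeholder
# RMS energy threshold detector.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
BLOCK_SIZE = 1024  # ~64 ms of audio per callback

def detect(block: np.ndarray) -> bool:
    """Placeholder detector: flags blocks whose RMS exceeds a threshold.
    A trained SED model's forward pass would go here."""
    return float(np.sqrt(np.mean(block ** 2))) > 0.05

def callback(indata, frames, time, status):
    if status:
        print(status)
    if detect(indata[:, 0]):
        print("sound event active")

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=BLOCK_SIZE, callback=callback):
    sd.sleep(10_000)  # run for 10 seconds
```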

PDF

Sound Based Classification of Studded Tires: Automatic Tire Classification System

Abstract

The use of studded tires causes rutting of asphalt pavements and generates street dust into the environment. The maintenance of paved roads and the cleaning of street dust require resources and cause health risks. These effects are especially notable in springtime, when the snow and ice have melted away from road surfaces. In order to predict these phenomena, the number of vehicles using studded tires should be measured continuously. Previously, estimates of the proportions of winter and summer tires have been based on figures provided by car service companies that offer tire changing services. Occasional hearing-based roadside sample surveys have also been made. Unlike the statistics from car service companies, hearing-based data collection provides location- and time-specific information about the use of studded tires. However, hearing-based data collection is a difficult and labour-consuming task, and it has not been applied widely. The purpose of this thesis was to find out whether an automatic tire classification system could be implemented to collect data about the use of studded tires. A dataset of in-road audio recordings was exploited in the study. The dataset was collected from two measurement sites using contact microphones placed under the road pavement. The measuring points were located next to automatic traffic measurement stations used by the Finnish Transport Infrastructure Agency for data collection purposes. Digital signal processing and machine learning were applied in the design of the tire classification system. A passenger car detector was implemented to restrict the classification to the tires of passenger cars and to determine the exact bypass times of detected vehicles. Feature extraction from the audio data was based on modeling of the human auditory system. Two versions of the tire classifier were designed, one based on a support vector machine and the other on a multilayer perceptron. The dataset was annotated by labelling the recordings with the vehicle class and the tire type used in the vehicle. The recordings of passenger cars were used in the training and testing of the classifier models. The data was split into a training set and a test set according to recording location: data from one location served as the training set, while the data from the other location was used as the test set. This way the generalization of the system could be verified, as the classifier models could not learn the location-specific factors of the test set during training. The two classifier models were compared based on the results of the experiments carried out with the test set. The results show that automatic and instant tire classification is possible with the proposed methods. Both the passenger car detector and the tire classifier performed well in the experiments, scoring about 95% test accuracy. The differences between the results of the classifier models were small. The results imply that the system is able to generalize from one recording environment to another without being explicitly trained to do so. However, due to the small number of measurement sites used in the experiments, reliable conclusions about the general adaptivity of the system cannot be made without further research. In order to improve the performance and reliability of the system, more data from new measurement sites should be collected in follow-up research.
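
The location-based evaluation protocol can be sketched in a few lines. The snippet below trains an SVM on feature vectors from one measurement site and tests on the other; the random feature arrays are placeholders for the auditory-model features described above.

```python
# Sketch of the location-based train/test split with an SVM classifier,
# assuming per-bypass feature vectors have already been extracted.
# The random arrays are placeholders, not the thesis's actual data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Rows = passenger-car bypasses, columns = auditory-model features.
X_site_a = rng.normal(size=(200, 40)); y_site_a = rng.integers(0, 2, 200)  # training site
X_site_b = rng.normal(size=(100, 40)); y_site_b = rng.integers(0, 2, 100)  # test site

# Train on one measurement site, test on the other to probe generalization.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_site_a, y_site_a)
print("test accuracy:", accuracy_score(y_site_b, clf.predict(X_site_b)))
```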

PDF

2019

Environmental sound recognition and prototype game design

Abstract

This project consists of creating a game based on environmental sound recognition. The basic idea of the game is an escape room: the player has to solve a series of enigmas by finding the right sounds to make in order to get out of the room. We will train a machine learning model to recognize the sounds used in the game. The dataset will consist of object and human-made sounds, retrieved from existing datasets or recorded by us where available resources are lacking. The sound recognizer model will be made in Python and the game with Unity.

Clients

Toni Heittola, and Tuomas Virtanen

Synthetic generation of environmental audio learning examples for neural networks

Abstract

Deep neural network methods need a wide range of varied training examples in order to train a classifier for predicting different classes. A big dataset also exposes the classifier to different conditions, which results in better generalization. In audio processing, it is often difficult to find a large dataset; therefore, data needs to be generated synthetically by mixing audio signals from different sources. In this project, we are going to develop a method based on the Keras data generator class in which environmental audio sounds are generated for a binary classification application. While generating the synthetic sounds, variations in acoustic conditions, such as the signal-to-noise ratio, indoor and outdoor acoustic scene conditions, and shifts in time and pitch, should be considered.
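
A minimal sketch of such a generator is given below, assuming the event and background clips are already loaded as fixed-length arrays; the batch layout, labels, and SNR range are illustrative choices, and the time/pitch shifts mentioned above are omitted for brevity.

```python
# Minimal Keras Sequence that mixes event and background clips at a random
# SNR on the fly; shapes and binary labels are illustrative assumptions.
import numpy as np
from tensorflow.keras.utils import Sequence

class MixAtSNRGenerator(Sequence):
    def __init__(self, events, backgrounds, batch_size=16, snr_db_range=(-6, 12)):
        self.events = events            # array: (n_events, n_samples)
        self.backgrounds = backgrounds  # array: (n_bg, n_samples)
        self.batch_size = batch_size
        self.snr_db_range = snr_db_range

    def __len__(self):
        return len(self.events) // self.batch_size

    def __getitem__(self, idx):
        rng = np.random.default_rng(idx)
        x, y = [], []
        for _ in range(self.batch_size):
            bg = self.backgrounds[rng.integers(len(self.backgrounds))]
            if rng.random() < 0.5:   # positive example: event mixed into background
                ev = self.events[rng.integers(len(self.events))]
                snr_db = rng.uniform(*self.snr_db_range)
                # Scale the background so the event-to-background power
                # ratio matches the drawn SNR.
                gain = np.sqrt(np.mean(ev**2) /
                               (np.mean(bg**2) * 10**(snr_db / 10) + 1e-12))
                x.append(ev + gain * bg); y.append(1)
            else:                    # negative example: background only
                x.append(bg); y.append(0)
        return np.stack(x), np.array(y)
```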

Clients

Toni Heittola, and Tuomas Virtanen

2017

Organizing acoustic scene excerpts into 2D map with t-SNE

Abstract

The aim of this project was to develop a Python program that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize high-dimensional audio scene feature vector data as a 2D map. This can be used to visualize audio scene feature vectors and to see how well the data is separable using the gathered features and the t-SNE method.
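
A compact sketch of the described visualization, using scikit-learn's t-SNE and matplotlib; the acoustic scene feature vectors and class labels are faked with random data for illustration.

```python
# t-SNE projection of high-dimensional scene features onto a 2D map.
# Random vectors stand in for the project's acoustic scene features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
features = rng.normal(size=(300, 60))   # placeholder scene feature vectors
labels = rng.integers(0, 5, size=300)   # placeholder scene classes

embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=10)
plt.title("Acoustic scene excerpts embedded with t-SNE")
plt.show()
```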

Clients

Toni Heittola, and Tuomas Virtanen

Acoustic scene classification on Android platform

Abstract

The project implemented a neural network classifier on Android. The classifier used TensorFlow as the backend for managing the classification flow, and was trained to classify auditory scenes from extracted features. The client application was implemented in the Kotlin programming language and requires Android 7.1 to operate.

Clients

Toni Heittola, and Tuomas Virtanen

Animal onomatopoeia game

Abstract

The project is part of the course SGN-81006 Signal Processing Innovation Project, and the topic was given to us by our clients, researcher Toni Heittola and associate professor Tuomas Virtanen, working in the Audio Research Group at Tampere University of Technology, Laboratory of Signal Processing. In the project we gathered a small dataset of animal onomatopoeias and trained a simple classifier on it. The classifier was then used for controlling a simple game where the player guides animals by imitating the sounds they make.

Clients

Toni Heittola, and Tuomas Virtanen

2014

Real-Time Audio Analysis

Abstract

The application areas of audio analysis have gained popularity over the last decades because of their support for numerous industrial products. Conventional audio analysis algorithms based on the pattern recognition approach often operate in non-real-time settings. Most non-real-time audio analysis systems are designed for rapid development and for readability and maintainability of code, whereas a real-time system must additionally provide cross-platform functionality, efficient audio data analysis, and low latency. Meeting these requirements keeps the overall development cost and portability at the same level while notably improving the performance of a system. Generally, such programs are written using poor programming styles or in programming languages, such as Java, that are not suitable for real-time applications. This makes changing the existing source code to meet real-time requirements, and extending it, hard and tedious work, and forces researchers to deal with programming problems instead of speech and audio analysis innovations. In addition to addressing these issues, a real-time audio analysis system also provides a platform for testing and researching audio analysis algorithms. The purpose of this study is to research APIs that offer a low-latency, high-efficiency option for developing a real-time audio analysis system. The basic components of the pattern recognition front-end are block framing, windowing, and mel-frequency cepstral coefficient (MFCC) extraction. The presented program is implemented in real time using efficient APIs such as PortAudio and LibXtract. The program computes MFCCs over small audio frames without loss of signal power, improving the performance of the audio analysis system. Such a system can be used in numerous products, not only for audio content analysis, audio classification, pattern recognition, and music information retrieval, but also, from a practical engineering viewpoint, for real-time input applications such as automatic sound event detection. The experiments also indicate that the system ports successfully to Linux (Ubuntu) and other major platforms for real-time audio input, which is often restricted in audio analysis systems built with conventional approaches.
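
The framing/windowing/MFCC front-end can be sketched offline in Python for clarity, here with librosa standing in for the C libraries (PortAudio, LibXtract) actually used in the thesis; the frame sizes are illustrative.

```python
# Framing, windowing, and MFCC extraction sketched with librosa; the thesis
# itself uses the C libraries PortAudio and LibXtract for the real-time path.
import numpy as np
import librosa

sr = 16000
signal = np.random.randn(sr)  # one second of placeholder audio

# 25 ms frames with a 10 ms hop (Hann-windowed internally), 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
print(mfcc.shape)  # (13, n_frames)
```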

Music Video Analysis Using Signal Processing Tools

Abstract

Visual cut points in music videos are often aligned with the musical beat, and on a higher level with musical structural change points (e.g. chorus-verse). The idea of this study is to investigate this relation more closely by using automatic video cut point detection and automatic musical structure analysis.

Clients

Toni Heittola, Tuomas Virtanen, Joni Kämäräinen, and Katariina Mahkonen

Real-time sound classification system using Python

Abstract

Python has gained wide popularity in the research community in recent years, and a wide range of pattern recognition toolboxes is already available for it. The aim of this project was to investigate the possibilities of using Python for acoustic pattern recognition and to develop a system capable of real-time sound classification.

Clients

Toni Heittola, and Tuomas Virtanen

Acoustic context recognition using i-vector

Abstract

The aim of this project was to study the i-vector approach for acoustic context recognition.

Clients

Toni Heittola, and Tuomas Virtanen

2013

Semi-supervised musical instrument recognition

Abstract

The application areas of music information retrieval have been gaining popularity over the last decades. Musical instrument recognition is an example of a specific research topic in the field. In this thesis, semi-supervised learning techniques are explored in the context of musical instrument recognition. The conventional approaches to musical instrument recognition rely on annotated data, i.e. example recordings of the target instruments with associated target labels, in order to perform training. This implies highly laborious and tedious manual annotation of the collected training data. Semi-supervised methods enable incorporating additional unannotated data into training. Such data consists merely of recordings of the instruments and is therefore significantly easier to acquire. Hence, these methods allow keeping the overall development cost at the same level while notably improving the performance of a system. The implemented musical instrument recognition system utilises a mixture-model semi-supervised learning scheme in the form of two EM-based algorithms. Furthermore, upgraded versions, namely additional labelled-data weighting and class-wise retraining, are proposed for improved performance, together with convergence criteria for the particular classification scenario. The evaluation is performed on sets consisting of four and ten instruments and yields overall average recognition accuracies of 95.3% and 68.4%, respectively. These correspond to absolute gains of 6.1% and 9.7% compared to the initial, purely supervised cases. Additional experiments are conducted on the effects of the proposed modifications, as well as on the optimal relative labelled dataset size. In general, the obtained performance improvement is quite noteworthy, and future research directions include investigating the behaviour of the implemented algorithms along with the proposed and further extended approaches.
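
A toy version of the mixture-model semi-supervised EM scheme is sketched below, with one Gaussian per class: labelled examples keep hard class weights while unlabelled examples are weighted by their class posteriors on each iteration. This is purely illustrative and does not include the thesis's two specific algorithms, the proposed labelled-data weighting, or class-wise retraining.

```python
# Toy semi-supervised EM: one Gaussian per class, labelled data fixed,
# unlabelled data soft-weighted by class posteriors each iteration.
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em(X_lab, y_lab, X_unl, n_classes, n_iter=20):
    d = X_lab.shape[1]
    means = np.array([X_lab[y_lab == c].mean(axis=0) for c in range(n_classes)])
    covs = np.array([np.cov(X_lab[y_lab == c].T) + 1e-6 * np.eye(d)
                     for c in range(n_classes)])
    priors = np.bincount(y_lab, minlength=n_classes) / len(y_lab)
    onehot = np.eye(n_classes)[y_lab]
    for _ in range(n_iter):
        # E-step: class posteriors for the unlabelled examples.
        lik = np.column_stack([priors[c] * multivariate_normal.pdf(X_unl, means[c], covs[c])
                               for c in range(n_classes)])
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate with labelled (hard) + unlabelled (soft) weights.
        W = np.vstack([onehot, resp])
        X = np.vstack([X_lab, X_unl])
        for c in range(n_classes):
            w = W[:, c]
            means[c] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - means[c]
            covs[c] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(d)
        priors = W.mean(axis=0)
    return means, covs, priors

# Usage on synthetic two-class data.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_lab = np.repeat([0, 1], 20)
X_unl = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
means, covs, priors = semi_supervised_em(X_lab, y_lab, X_unl, n_classes=2)
print(np.round(means, 2))
```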

PDF

Classification of the Sounds of Footsteps and Person Identification

Abstract

The sound of footsteps contains a wide range of information about the person producing them. Humans quite often use this information to identify persons in situations without visual contact. For example, they can tell how fast a person is walking, what kind of shoes a person is wearing, how tall a person is, or even the mood of a person. The combination of these features makes the sound of footsteps characteristic of a certain person. The aim of the project is to study the automatic classification of the sound of footsteps and to see how reliably persons can be identified automatically based on it.

Clients

Toni Heittola, and Tuomas Virtanen

Organizing a Database of Sound Samples

Abstract

In modern sample-based music production, managing large sample libraries intuitively is a challenging problem. The aim of the project is to study various ways of organizing a sample library according to the acoustic properties of its samples.

Clients

Toni Heittola, and Tuomas Virtanen

2012

Automatic Guitar Chord Detection

Abstract

Automatic guitar chord detection is a process that attempts to detect a guitar chord from a piece of audio. Generally, automatic chord detection is considered to be part of a larger problem termed automatic transcription. Although there has been a lot of research in the field of automatic transcription, a reliable transcription system is still a distant prospect. Chord detection is interesting because chords have a comparatively stable structure and they completely describe the occurring harmonies in a piece of music. This thesis presents a novel approach for detecting the correctness of musical chords played on a guitar. The approach is based on a pattern matching technique applied to a database of chords and their typical mistakes. Mistakes are versions of a chord in which typical playing errors are made. The transient of a chord is skipped and its spectrum is whitened. A certain region of the whitened spectrum is chosen as a feature vector. The cosine distance is computed between the extracted features and the data in a reference chord database. Finally, the system detects the correctness of a played chord using a k-nearest neighbour (k-NN) classifier. The developed system uses two spectral whitening techniques: one based on Linear Predictive Coding (LPC) and the other on Phase Transform-beta (PHAT-beta). The average accuracy of the LPC-based system is 72%, while that of the PHAT-beta-based system is 82.5%. The system was also evaluated under different noise conditions.
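
The matching step can be sketched as cosine-distance k-NN over a reference database, assuming the whitening (LPC or PHAT-beta) stage has already produced the feature vectors; the array sizes and labels below are illustrative.

```python
# Cosine-distance k-NN matching of a whitened spectral feature vector
# against a reference chord database. Whitening is assumed done upstream.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
ref_features = rng.random((60, 128))   # whitened spectra of reference chords
ref_labels = rng.integers(0, 2, 60)    # 1 = correctly played, 0 = typical mistake

# Cosine distance is available directly as a k-NN metric in scikit-learn.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(ref_features, ref_labels)

query = rng.random((1, 128))           # whitened spectrum of a played chord
print("correct" if knn.predict(query)[0] == 1 else "mistake")
```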

PDF

Classification of Insects Based on Sound

Abstract

Insect-borne diseases kill a million people and destroy tens of billions of euros worth of crops annually. At the same time, beneficial insects pollinate the majority of crop species, and it has been estimated that approximately one third of all food consumed by humans is directly pollinated by bees alone. If we could inexpensively count and classify insects, we could plan interventions more accurately, thus saving lives in the case of insect-vectored disease and growing more food in the case of insect crop pests. The aim of the project is to classify insects based on the sound they produce while flying.

Clients

Toni Heittola, and Tuomas Virtanen

2010

Parameter Adaptation in Nonlinear Loudspeaker Models

Abstract

A loudspeaker is a device that converts an electric input signal into acoustic output. The most common type of loudspeaker is the moving-coil transducer. The behaviour of a moving-coil transducer can be considered linear only when the displacement of the coil-diaphragm assembly is small. When the input signal level rises, nonlinearities start to cause audible distortion. In this thesis we examine a microspeaker, a small loudspeaker used in mobile phones. The electro-mechanical process that converts the electrical signal into sound waves is explained. Based on this, we present a continuous-time, linear model of a loudspeaker mounted in a closed box. The model describes the loudspeaker's small-signal behaviour using only a few parameters. We then consider the main sources of nonlinearities and how to model them, and the two major sources of nonlinearities are added to the continuous-time model. Then, transformations from continuous-time models to discrete-time models are considered, and the nonlinear model is converted to discrete time while taking into account the properties of the microspeaker. The main purpose of this thesis is to study the performance of an algorithm that finds the parameter values of the nonlinear loudspeaker model. Its performance is compared to that of an earlier algorithm for the linear loudspeaker model. The parameter values are found, and changes in them are tracked, using an adaptive signal processing method called system identification. The parameter values are updated using the LMS algorithm. Since the discrete-time mechanical model of the microspeaker is based on a recursive filter, the LMS algorithm for recursive filters is presented. We also review previous research related to parameter identification in linear and nonlinear loudspeaker models. Based on the experimental results, the studied algorithm is deemed to be as yet incomplete. The linear parameters generally adapt quickly, whereas the nonlinear parameters adapt too slowly and sometimes erroneously. The difference between the output predicted by the nonlinear loudspeaker model and the actual output of the loudspeaker (the prediction error) is too high, meaning the parameters do not adapt to their true values. The model is also prone to instability. The algorithm requires further development regarding adaptation speed and prevention of instability. Future work should also address the choice of initial parameter values and operation during silent passages.
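
For reference, a bare-bones LMS system identification loop for an FIR model is sketched below; the thesis extends LMS to a recursive (IIR) nonlinear loudspeaker model, which this sketch does not attempt to reproduce.

```python
# LMS system identification of an unknown FIR "loudspeaker" response.
import numpy as np

rng = np.random.default_rng(0)
true_h = np.array([0.6, -0.3, 0.1])   # unknown impulse response to identify
x = rng.normal(size=5000)             # input signal
d = np.convolve(x, true_h)[:len(x)]   # observed output to be modelled

mu, n_taps = 0.01, 3
w = np.zeros(n_taps)                  # adaptive filter weights
for n in range(n_taps, len(x)):
    x_vec = x[n - n_taps + 1:n + 1][::-1]   # most recent samples first
    e = d[n] - w @ x_vec                    # prediction error
    w += mu * e * x_vec                     # LMS weight update
print("estimated:", np.round(w, 3), "true:", true_h)
```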