Latest update: 28th September 2006

The Audio Research Group conducts research on understanding audio content, including speech, music, and spatial analysis. This page gives short descriptions of our projects and links to more detailed descriptions of each topic.


Music

Research topics include:

Speech

Training of Neural Networks using Evolutionary Computation for Phoneme Recognition

Neural networks are commonly trained with the backpropagation algorithm. This algorithm is very efficient and gives a monotonic increase in performance. However, it restricts the types of neurons that can be used in the artificial neural network, since their activation functions must be differentiable. Furthermore, the topology of the network generally has to be fixed before training.

In our approach, we train the neural networks using evolutionary computation. That is, we maintain a population of neural networks that are randomly mutated, and we prune networks from the population based on their performance. This allows great flexibility in the types of neurons that can be used, without modifying the training algorithm. The topology of the network can also be learned together with the weights, simply by including topology-changing mutations, i.e. the addition or removal of neurons and connections. The downside of this approach is that it learns much more slowly than backpropagation.
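The mutate-and-prune loop described above can be illustrated with a minimal sketch. It is not the group's implementation: the network encoding, the single hidden layer, the mutation rates, the keep-the-better-half selection rule and the toy data are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random

N_IN, N_OUT = 4, 3  # toy sizes (assumption); the text uses 19 inputs and 8 phonemes


def new_network():
    """A network is a hidden-neuron count plus a sparse {(src, dst): weight} map."""
    src = ("in", random.randrange(N_IN))
    dst = ("out", random.randrange(N_OUT))
    return {"hidden": 0, "w": {(src, dst): random.gauss(0, 1)}}


def forward(net, x):
    """Evaluate the network: inputs -> tanh hidden neurons -> linear outputs."""
    hidden = [0.0] * net["hidden"]
    for ((skind, s), (dkind, d)), w in net["w"].items():
        if dkind == "hid":
            hidden[d] += w * x[s]
    hidden = [math.tanh(h) for h in hidden]
    out = [0.0] * N_OUT
    for ((skind, s), (dkind, d)), w in net["w"].items():
        if dkind == "out":
            out[d] += w * (x[s] if skind == "in" else hidden[s])
    return out


def mutate(net):
    """Return a mutated copy; weights and topology can both change."""
    child = {"hidden": net["hidden"], "w": dict(net["w"])}
    op = random.choice(["weight", "weight", "add_conn", "del_conn",
                        "add_neuron", "del_neuron"])
    if op == "weight" and child["w"]:
        key = random.choice(list(child["w"]))
        child["w"][key] += random.gauss(0, 0.5)
    elif op == "add_conn":
        sources = [("in", i) for i in range(N_IN)]
        sources += [("hid", h) for h in range(child["hidden"])]
        src = random.choice(sources)
        if src[0] == "in" and child["hidden"] and random.random() < 0.5:
            dst = ("hid", random.randrange(child["hidden"]))
        else:
            dst = ("out", random.randrange(N_OUT))
        child["w"].setdefault((src, dst), random.gauss(0, 1))
    elif op == "del_conn" and len(child["w"]) > 1:
        del child["w"][random.choice(list(child["w"]))]
    elif op == "add_neuron":
        h = child["hidden"]
        child["hidden"] += 1
        child["w"][(("in", random.randrange(N_IN)), ("hid", h))] = random.gauss(0, 1)
        child["w"][(("hid", h), ("out", random.randrange(N_OUT)))] = random.gauss(0, 1)
    elif op == "del_neuron" and child["hidden"] > 0:
        h = child["hidden"] - 1  # drop the newest neuron and its connections
        child["hidden"] = h
        child["w"] = {k: v for k, v in child["w"].items() if ("hid", h) not in k}
    return child


def accuracy(net, data):
    """Fraction of samples whose largest output matches the label."""
    hits = 0
    for x, label in data:
        out = forward(net, x)
        hits += out.index(max(out)) == label
    return hits / len(data)


def evolve(data, pop_size=20, generations=30):
    """Keep the better half each generation and refill with mutated copies."""
    population = [new_network() for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda n: accuracy(n, data), reverse=True)
        history.append(accuracy(population[0], data))
        survivors = population[: pop_size // 2]  # prune the weaker half
        children = [mutate(random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=lambda n: accuracy(n, data)), history


def make_toy_data(n):
    """Toy task (assumption): the label is the index of the largest of the first N_OUT inputs."""
    data = []
    for _ in range(n):
        x = [random.uniform(-1, 1) for _ in range(N_IN)]
        data.append((x, max(range(N_OUT), key=lambda k: x[k])))
    return data


if __name__ == "__main__":
    random.seed(0)
    train = make_toy_data(60)
    best, history = evolve(train)
    print("training accuracy:", accuracy(best, train))
    print("connections:", len(best["w"]), "hidden neurons:", best["hidden"])
```

Because the unmutated survivors are carried over, the best training accuracy in this sketch never decreases from one generation to the next. The number of connections is whatever the mutations happen to leave in place, which is the property the text contrasts with a fixed, fully connected backpropagation network.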

We have tested this approach by training a network as a phoneme recognizer. The inputs of the network are 19 Mel-scale log energies of a single frame, and the outputs are the possible phonemes. The first table below gives the recognition rate (% of correct classifications) on the TIMIT database, using a collection of 8 phonemes, for fully connected backpropagation networks with different numbers of neurons. The second table gives the performance of the evolutionary algorithm for a similar running time. Backpropagation performs better on the training data, but on the testing data the performance of the two algorithms is similar. However, the backpropagation networks have 152, 760, 1520, 3040 and 4560 connections, respectively. The evolutionary algorithm develops the nodes and connections during training, so their number is not fixed. Interestingly, the evolved networks in the second table had only 32, 32, 38, 33 and 35 connections, indicating that the 'relevance' of the parameters is higher in the evolutionary network than in the backpropagation network.

Recognition rates of the backpropagation network with different numbers of nodes.

                1 node    5 nodes   10 nodes  20 nodes  30 nodes
training data   10.17%    49.94%    55.34%    60.81%    64.7%
testing data    11.67%    42.3%     33.7%     36.73%    28.5%

Recognition rates with different runs of the evolutionary algorithm.

training data 40.66% 40.66% 40.50% 40.22% 40.22% 40.16%
testing data 39.50% 39.50% 40.00% 42.50% 42.40% 39.90%
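
The input representation mentioned above, 19 Mel-scale log energies per frame, might be computed roughly as in the following sketch. The text does not give the analysis details, so the triangular filter shape, the Mel formula, the Hamming window, the 400-sample frame, the 512-point transform and all function names are assumptions; only the 19 bands come from the text, and 16 kHz is the TIMIT sampling rate. A naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math


def hz_to_mel(f):
    """Common Mel-scale formula (assumed; the text does not name one)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame, n_fft):
    """Naive DFT power spectrum of a Hamming-windowed frame, bins 0..n_fft/2."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n_fft) for i in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_energies(frame, sample_rate=16000, n_bands=19, n_fft=512):
    """Log energies in n_bands triangular bands spaced evenly on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * j / (n_bands + 1)) for j in range(n_bands + 2)]
    power = power_spectrum(frame, n_fft)
    feats = []
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        energy = 0.0
        for k, p in enumerate(power):
            f = k * sample_rate / n_fft
            # Triangle rising from lo to mid and falling from mid to hi.
            weight = min((f - lo) / (mid - lo), (hi - f) / (hi - mid))
            energy += max(weight, 0.0) * p
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return feats


def tone(freq, n=400, sample_rate=16000):
    """Test signal: a pure sine wave, one 25 ms frame at 16 kHz."""
    return [math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    features = log_mel_energies(tone(1000.0))
    print(len(features), "features; strongest band:", features.index(max(features)))
```

Each frame yields one 19-element vector, which would be presented to the network as its input.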

Contact person: Konsta Koppinen.

Neural networks for text-to-phoneme mapping

In speech synthesis, text-to-phoneme (TTP) mapping translates written text into the corresponding phonetic transcriptions, from which the synthetic speech is then generated. In speech recognition, a dictionary of phonetic transcriptions must be built by mapping words to their phonetic transcriptions, and this is also done by TTP mapping. Neural networks have been applied with success to the problem of text-to-phoneme mapping.

Contact person: Beatrice Bilcu.

Spatial

The spatial team focuses on robust and low complexity methods for source localization and detection using multiple microphones. The specific research areas are

The techniques are currently developed for two application areas: