Research areas: speech recognition, language modelling, environmental sound event detection
Using Sequential Information in Polyphonic Sound Event Detection
Detecting the class, onset time, and offset time of sound events in real-world recordings is a challenging task. Current computer systems often show relatively high frame-wise accuracy but low event-wise accuracy. In this paper, we attempt to bridge this gap by explicitly including sequential information to improve the performance of a state-of-the-art polyphonic sound event detection system. We propose to 1) use delayed predictions of event activities as additional input features that are fed back to the neural network; 2) build N-grams to model the co-occurrence probabilities of different events; 3) use a sequential loss to train the neural networks. Our experiments on a corpus of real-world recordings show that the N-grams can smooth the spiky output of a state-of-the-art neural network system and improve both the frame-wise and the event-wise metrics.
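The N-gram smoothing idea can be sketched in a few lines: estimate bigram transition probabilities over event-activity states from training labels, then Viterbi-decode the network's frame-wise posteriors against those transitions so that isolated spiky frames are suppressed. This is a minimal single-event sketch, not the paper's implementation; the function names and the add-alpha smoothing constant are illustrative assumptions.

```python
import numpy as np

def estimate_bigram(label_seq, n_states=2, alpha=1.0):
    """Estimate P(s_t | s_{t-1}) from a frame-wise label sequence,
    with add-alpha smoothing of the counts."""
    counts = np.full((n_states, n_states), alpha)
    for prev, cur in zip(label_seq[:-1], label_seq[1:]):
        counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def viterbi_smooth(posteriors, trans):
    """Decode the most likely state sequence given frame posteriors
    and bigram transition probabilities (log domain)."""
    T, S = posteriors.shape
    log_post = np.log(posteriors + 1e-12)
    log_trans = np.log(trans + 1e-12)
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_post[0]
    for t in range(1, T):
        # scores[i, j]: best score ending in state i, then moving to j
        scores = delta[t - 1][:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_post[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because training labels make self-transitions far more probable than switches, a single high-posterior frame inside a long inactive region is decoded as inactive, which is the smoothing effect described above.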
An Investigation Into Language Model Data Augmentation For Low-resourced ASR and KWS
This paper reports on investigations of two techniques for language model text data augmentation for low-resourced automatic speech recognition and keyword search. Low-resourced languages are characterized by limited training materials, which typically results in high out-of-vocabulary (OOV) rates and poor language model estimates. One technique makes use of recurrent neural networks (RNNs) using word or subword units. Word-based RNNs keep the same system vocabulary, so they cannot reduce the OOV rate, whereas subword units can reduce it but may generate many spurious combinations. A complementary technique is based on automatic machine translation, which requires parallel texts and is able to add words to the vocabulary. These methods were assessed on 10 languages in the context of the Babel program and the NIST OpenKWS evaluation. Although improvements vary across languages, small gains were generally observed with both methods in terms of word error rate reduction and improved keyword search performance.
Machine Translation Based Data Augmentation For Cantonese Keyword Spotting
This paper presents a method to improve a language model for a low-resourced language using statistical machine translation from a related language to generate data for the target language. In this work, the machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, which has limited resources. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. This method enables the keyword search system to detect 1.5 times more out-of-vocabulary words and achieves a 1.7% absolute improvement in actual term-weighted value.
Language Model Data Augmentation for Keyword Spotting in Low-Resourced Training Conditions
This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better than just using the 1-best translation. Target language texts collected from the Web and filtered to select conversational-like data are used in several manners. In addition to using Web data for training the language model of the speech recognizer, we further investigate using this data to improve the language model and phrase table of the MT system to get better translations of the English data. Finally, generating text data with a character-based recurrent neural network is investigated. This approach allows new word forms to be produced, providing a way to reduce the out-of-vocabulary rate and thereby improve keyword spotting performance. We study how these different methods of language model data augmentation impact speech-to-text and keyword spotting performance for the Lithuanian and Amharic languages. The best results are obtained by combining all of the explored methods.
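As a framework-free stand-in for the character-based recurrent neural network, a character-bigram sampler illustrates the underlying point: generating text character by character can produce word forms unseen in training, which is how such generation reduces the OOV rate. This is an illustrative sketch of the principle, not the system above; the function names and boundary markers are assumptions.

```python
import random
from collections import defaultdict

def train_char_bigrams(words):
    """Collect character-bigram continuations, using ^ and $ as
    word-start and word-end markers."""
    model = defaultdict(list)
    for w in words:
        chars = ['^'] + list(w) + ['$']
        for a, b in zip(chars[:-1], chars[1:]):
            model[a].append(b)
    return model

def sample_word(model, rng, max_len=12):
    """Sample a word character by character; recombining bigrams can
    yield forms absent from the training vocabulary."""
    out, c = [], '^'
    while len(out) < max_len:
        c = rng.choice(model[c])
        if c == '$':
            break
        out.append(c)
    return ''.join(out)
```

A character RNN plays the same role with far richer context; the bigram model is only the shortest runnable illustration of character-level generation.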
An Adaptive Neural Control Scheme For Articulatory Synthesis Of CV Sequences
Reproducing smooth vocal tract trajectories is critical for high-quality articulatory speech synthesis. This paper presents an adaptive neural control scheme based on fuzzy logic and neural networks. The control scheme estimates motor commands from trajectories of flesh-points on selected articulators. These motor commands are then used to reproduce the trajectories with second-order dynamical systems for the underlying articulators. Experiments show that the control scheme is able to manipulate the mass-spring based elastic tract walls in a 2-D articulatory synthesizer to realize efficient speech motor control. In particular, the proposed controller achieves high accuracy during on-line tracking of the lips, the tongue, and the jaw in the simulation of consonant-vowel sequences. It also offers salient features such as generality and adaptability for future developments of control models in articulatory synthesis.
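The second-order dynamics mentioned above can be sketched as a critically damped mass-spring system driven toward a motor-command target: a step change in the target produces a smooth, overshoot-free articulator trajectory. This is an illustrative simulation under assumed parameters (stiffness omega, semi-implicit Euler integration), not the synthesizer's actual model.

```python
import numpy as np

def simulate_articulator(targets, omega=30.0, dt=0.001):
    """Critically damped second-order articulator model:
    x'' = omega^2 * (target - x) - 2 * omega * x'.
    `targets` gives the motor-command equilibrium position per time step."""
    x, v = float(targets[0]), 0.0
    traj = []
    for g in targets:
        a = omega ** 2 * (g - x) - 2.0 * omega * v  # spring + damping
        v += a * dt   # semi-implicit Euler: update velocity first
        x += v * dt   # then position, for numerical stability
        traj.append(x)
    return np.array(traj)
```

Critical damping (damping ratio 1) is the usual choice for such articulator models because it reaches the target as fast as possible without oscillation, matching the smooth flesh-point trajectories the controller tracks.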
Computer Speech and Language, 2014
Multi-view features in a DNN-CRF model for improved sentence unit detection on English Broadcast News
This paper presents a deep neural network-conditional random field (DNN-CRF) system with multi-view features for sentence unit detection on English broadcast news. We propose a set of multi-view features extracted from the acoustic, articulatory, and linguistic domains and use them together in the DNN-CRF model to predict sentence boundaries. We test the accuracy of the multi-view features on the standard NIST RT-04 English broadcast news speech data. Experiments show that the best system significantly outperforms the state-of-the-art sentence unit detection system, with a 13.2% absolute NIST sentence error rate reduction using the reference transcription. However, the performance gain is limited on the recognized transcription, partly due to the high word error rate.
A Deep Neural Network Approach For Sentence Boundary Detection In Broadcast News
This paper presents a deep neural network (DNN) approach to sentence boundary detection in broadcast news. We extract prosodic and lexical features at each inter-word position in the transcripts and learn a sequential classifier to label these positions as either boundary or non-boundary. This work is realized by a hybrid DNN-CRF (conditional random field) architecture. The DNN accepts prosodic feature inputs and non-linearly maps them into boundary/non-boundary posterior probability outputs. Subsequently, the posterior probabilities are combined with lexical features and the integrated features are modeled by a linear-chain CRF. The CRF finally labels the inter-word positions as boundary or non-boundary by Viterbi decoding. Experiments show that, as compared with the state-of-the-art DT-CRF approach, the proposed DNN-CRF approach achieves 16.7% and 4.1% reduction in NIST boundary detection error in reference and speech recognition transcripts, respectively.
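The integration step can be sketched as follows: the DNN's log posteriors are fused with lexical features into per-position unary scores, and a linear-chain CRF scores a boundary/non-boundary label sequence as the sum of unary and label-transition scores. A hedged sketch only; the linear fusion weights and feature shapes here are assumptions, not the paper's exact parameterization.

```python
import numpy as np

def combine_unary(dnn_log_post, lexical_feats, w_dnn, W_lex):
    """Unary scores per inter-word position: DNN boundary posteriors
    (prosodic view) fused with lexical features via a linear map."""
    return w_dnn * dnn_log_post + lexical_feats @ W_lex

def sequence_score(unary, trans, labels):
    """Unnormalized linear-chain CRF log score of a label sequence:
    per-position unary scores plus label-to-label transition scores."""
    score = unary[0, labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1], labels[t]] + unary[t, labels[t]]
    return score
```

Viterbi decoding then simply returns the label sequence maximizing this score; training fits the fusion weights and transition matrix so that the reference boundary sequence outscores all alternatives.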
A Novel Neural-based Pronunciation Modeling Method For Robust Speech Recognition
This paper describes a recurrent neural network (RNN) based articulatory-phonetic inversion (API) model for improved speech recognition. A specialized optimization algorithm is introduced to enable human-like heuristic learning in an efficient data-driven manner, capturing the dynamic nature of English speech pronunciations. The API model demonstrates superior pronunciation modeling ability and robustness against noise contamination in large-vocabulary speech recognition experiments. Using a simple rescoring formula, it improves on the hidden Markov model (HMM) baseline speech recognizer with consistent error rate reductions of 5.30% and 10.14% for phoneme recognition on clean and noisy speech, respectively, on the selected TIMIT datasets. An error rate reduction of 3.35% is obtained for the SCRIBE-TIMIT word recognition tasks. The proposed system is a competitive candidate for robust pronunciation modeling, with intrinsic salient features such as generality and portability.
Articulatory Phonetic Features for Robust Speech Recognition
This thesis elaborates on the use of speech production knowledge, in the form of articulatory phonetic features, to improve the robustness of speech recognition in practical situations. The main concept is that natural speech has three attributes in the human speech processing system: motor activation, articulatory trajectory, and auditory perception. First, the thesis describes an adaptive neural control model that reproduces articulatory trajectories and retrieves motor activation patterns in a biomechanical speech synthesizer. Second, by manipulating the elastic vocal tract walls, the synthesizer produces an overall articulatory-to-acoustic trajectory map for English pronunciations. Third, articulatory phonetic features are extracted with neural networks for speech recognition in cross-speaker and noisy conditions. The experimental results are compared with a traditional hidden Markov model baseline system.
Ph.D. Thesis, 2012
2016 - present Post-doc: Acoustic Events Analysis
2015 - 2016 Post-doc: [Babel] - Text Data Augmentation via Statistical Machine Translation and RNNLMs (Cantonese, English)
2012 - 2015 Post-doc: Speech Segmentation with Multi-View Features (Malay, English)
2008 - 2012 Doctoral: Automatic Speech Recognition with Articulatory Phonetic Features (English)
2004 - 2008 Bachelor: Digital Signal Processing: Audio/Image/Video; Information Theory; Literature Studies
Audio Signal Processing. This course introduces the basics of audio signal processing, especially with speech and music. Project assignments are designed for hands-on experiences.
Engineering Mathematics. This was an undergraduate-level course introducing the basics of mathematics to second-year university students.