Hou, Jun. On the use of frame and segment-based methods for the detection and classification of speech sounds and features. Retrieved from https://doi.org/doi:10.7282/T3P26Z9W
Description: Statistical data-driven methods and knowledge-based methods are two recent trends in Automatic Speech Recognition (ASR). Hidden Markov Model (HMM)-based speech recognition techniques have achieved great success for controlled tasks and environments. However, when we require improved accuracy and robustness (closer to Human Speech Recognition (HSR)), HMM algorithms for speech recognition gradually fail. Hence, a need has emerged to incorporate higher-level linguistic information into ASR systems in order to further discriminate between speech classes or phonemes with high confusion rates. The Automatic Speech Attribute Transcription (ASAT) project is one of the recent research efforts that has tried to bridge the gap between ASR and HSR.
In this thesis, we focus on the design and optimization of the front-end processing of the ASAT system. Its goal is to estimate a set of attribute and phoneme probability lattices, which can be combined with information from higher-level knowledge sources in a set of speech event verification modules to make a final recognition decision.
We propose a set of frame-based and segment-based methods to improve the recognition performance of distinctive features and phonemes in English. We also study and evaluate both a parallel speech feature organization and a hierarchical phoneme topology. This thesis has four main parts. In the first part, we use frame-based methods to estimate the likelihood of static sounds (e.g., steady vowels, fricatives, etc.), and implement parallel feature detection using Multi-Layer Perceptrons (MLPs) in order to detect the 14 Sound Pattern of English (SPE) features. In the second part, we use segment-based methods to classify dynamic sounds (e.g., stop consonants, diphthongs, etc.), and use Time-Delay Neural Networks (TDNNs) to recognize phoneme classes in a hierarchical phoneme and feature organization. In the third and fourth parts, we combine the frame-based parallel speech feature detection system and the segment-based hierarchical phoneme classification system to improve the overall phoneme classification performance and the speech feature detection performance.
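As a rough illustration of the frame-based parallel organization described above, the following Python sketch runs a bank of 14 independent one-hidden-layer MLPs over a sequence of frames, each producing a per-frame score for one feature. It is not the thesis implementation: the frame dimension, hidden layer size, random weights, and all function names are assumptions made purely for illustration, and no training is shown.

```python
import math
import random

N_FEATURES = 14   # number of SPE features named in the abstract
FRAME_DIM = 39    # ASSUMED acoustic vector size; not given in the source
HIDDEN = 8        # ASSUMED hidden layer width; not given in the source


def make_mlp(rng, n_in, n_hidden):
    """Build a randomly initialised MLP: one tanh hidden layer, one sigmoid output."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(mlp, frame):
    """Score a single frame with a single detector; the result lies in (0, 1)."""
    w1, b1, w2, b2 = mlp
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, frame)) + b)
        for row, b in zip(w1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


def detect_features(detectors, frames):
    """Apply every detector to every frame: one row per frame, one column per feature."""
    return [[mlp_forward(d, f) for d in detectors] for f in frames]


# Hypothetical usage with untrained detectors and synthetic frames.
rng = random.Random(0)
detectors = [make_mlp(rng, FRAME_DIM, HIDDEN) for _ in range(N_FEATURES)]
frames = [[rng.gauss(0.0, 1.0) for _ in range(FRAME_DIM)] for _ in range(5)]
posteriors = detect_features(detectors, frames)
```

Each detector sees the same frame but is otherwise independent, which is what makes the organization parallel; the resulting frame-by-feature matrix is the kind of per-frame evidence that a later stage could combine with segment-based scores.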
The main contribution of this thesis is the creation of a phoneme recognizer that overcomes the disadvantages of purely statistical or purely knowledge-based systems, and provides a way to incorporate acoustic, phonetic, and linguistic knowledge into an existing (HMM-based) automatic speech recognition system.