Research

Implicit Pronunciation Modeling for Speech Recognition Using Syllable-Centric Models

Speech recognition is an essential component of any human-computer interaction (HCI) scheme that aspires to be natural; high-accuracy speech recognition is therefore critical for building natural man-machine interfaces. Most systems today are based on phonemes, which are considered the fundamental units of speech-based communication. For recognition purposes the phoneme is a convenient unit in terms of training-data requirements and availability. However, its short duration limits the system to correlations and information present at time scales of around 30-40 ms. The goal of this project is to design training and recognition algorithms for systems that use larger units, such as the syllable or the word, to provide a much larger acoustic context for recognition. Larger units can implicitly handle minor baseform variations without the need for dictionary augmentation, which is important for a linguistically diverse population such as that of the USA. Our current experimental results indicate that mixed syllable- and phoneme-based systems can significantly improve ASR performance.
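The dictionary-augmentation point above can be illustrated with a minimal sketch. The lexicon format, unit names, and the word chosen here are hypothetical illustrations, not the project's actual data: a phoneme-based lexicon must list every pronunciation variant explicitly, whereas a syllable-based lexicon keeps a single entry and lets each syllable unit's acoustic model absorb minor baseform variation.

```python
# Illustrative sketch (hypothetical lexicon format and unit names).
# Phoneme-based: every surface variant needs an explicit dictionary entry.
phoneme_lexicon = {
    "data": [
        ["d", "ey", "t", "ax"],   # "DAY-tuh"
        ["d", "ae", "t", "ax"],   # "DAT-uh"
    ],
}

# Syllable-based: one entry; the first syllable's acoustic model is trained
# on all realizations of that syllable, so the ey/ae variation is handled
# inside the unit rather than by augmenting the dictionary.
syllable_lexicon = {
    "data": [["d_V", "t_ax"]],   # hypothetical syllable unit names
}

def n_baseforms(lexicon, word):
    """Number of explicit baseforms the dictionary stores for a word."""
    return len(lexicon[word])

print(n_baseforms(phoneme_lexicon, "data"))   # 2
print(n_baseforms(syllable_lexicon, "data"))  # 1
```

The count drops from two entries to one for this word; across a large vocabulary with many pronunciation variants, the syllable representation keeps the dictionary compact while the variation is modeled acoustically.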

NSF Report (Year 8)