Open Access Dissertation

Automatic syllable detection for vowel landmarks

TLDR
The acoustic theory of speech production was used to predict characteristics of vowels, studies on a speech database tested those predictions, and the resulting data guided the development of an improved vowel landmark detector (VLD).
Abstract
Lexical Access From Features (LAFF) is a proposed knowledge-based speech recognition system which uses landmarks to guide the search for distinctive features. The first stage in LAFF must find vowel landmarks, a task similar to automatic syllable detection (ASD), i.e., the detection of syllable nuclei. This thesis adapts and extends ASD algorithms for vowel landmark detection. In addition to existing work on ASD, the acoustic theory of speech production was used to predict characteristics of vowels, and studies were done on a speech database to test the predictions. The resulting data guided the development of an improved vowel landmark detector (VLD). Studies of the TIMIT database showed that about 94% of vowels have a peak of energy in the F1 region, and that about 89% of vowels have a peak in F1 frequency. Energy and frequency peaks were fairly highly correlated, with both peaks tending to appear before the midpoint of the vowel duration (as labeled), and frequency peaks tending to appear before energy peaks. Landmark-based vowel classification was not found to be sensitive to the precise location of the landmark. Energy in a fixed frequency band (300 to 900 Hz) was found to be as good for finding landmarks as the energy at F1, enabling a simple VLD design without the complexity of formant tracking. The VLD was based on a peak-picking technique, using a recursive convex hull algorithm. Three acoustic cues (peak-to-dip depth, duration, and level) were combined using a multilayer perceptron with two hidden units. The perceptron was trained by matching landmarks to syllabic nuclei derived from the TIMIT aligned phonetic transcription; pairs of abutting vowels were allowed to match either one or two landmarks without penalty. The perceptron was trained first by backpropagation using mean squared error, and then by gradient descent using error rate. The final VLD's error rate was about 12%, with about 3.5% insertions and 8.5% deletions, which compares favorably to the 6% of vowels without peaks. Most errors occurred in predictable circumstances, such as high vowels adjacent to semivowels, or very reduced schwas. Further work should include improvements to the output confidence score, and error correction as part of vowel quality detection. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
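The pipeline outlined in the abstract (band-limited energy from 300 to 900 Hz, recursive convex-hull peak picking, and a two-hidden-unit perceptron combining peak-to-dip depth, duration, and level) can be illustrated with a short Python sketch. This is not the thesis's implementation: the frame and hop sizes, the dip-depth threshold, the exact cue definitions, and the untrained random perceptron weights below are placeholder assumptions, and all function names are hypothetical.

import numpy as np

def band_energy_db(signal, fs, lo=300.0, hi=900.0, frame_ms=25.0, hop_ms=10.0):
    """Short-time log energy in a fixed frequency band (frame/hop sizes assumed)."""
    frame = int(fs * frame_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    energy = []
    for start in range(0, len(signal) - frame, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window)) ** 2
        energy.append(10.0 * np.log10(spectrum[band].sum() + 1e-10))
    return np.asarray(energy)

def _upper_hull(y):
    """Indices of the upper convex hull of contour y (monotone-chain scan)."""
    hull = []
    for i in range(len(y)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # pop i1 if it lies on or below the chord from i0 to i
            if (i1 - i0) * (y[i] - y[i0]) - (y[i1] - y[i0]) * (i - i0) >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return hull

def hull_dips(env, lo, hi, min_depth_db=2.0, dips=None):
    """Recursive convex-hull segmentation: split the contour at the point
    lying deepest below its upper convex hull, then recurse on both halves."""
    if dips is None:
        dips = []
    if hi - lo >= 3:
        seg = env[lo:hi + 1]
        h = _upper_hull(seg)
        depth = np.interp(np.arange(len(seg)), h, seg[h]) - seg
        k = int(np.argmax(depth))
        if depth[k] >= min_depth_db:
            dips.append(lo + k)
            hull_dips(env, lo, lo + k, min_depth_db, dips)
            hull_dips(env, lo + k, hi, min_depth_db, dips)
    return dips

def vowel_landmarks(env, hop_s=0.010, min_depth_db=2.0):
    """Energy peaks between convex-hull dips, each with three cues:
    peak-to-dip depth (dB), duration (s), and level (dB)."""
    bounds = [0] + sorted(hull_dips(env, 0, len(env) - 1, min_depth_db)) + [len(env) - 1]
    landmarks = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b - a < 2:
            continue
        peak = a + int(np.argmax(env[a:b + 1]))
        depth = env[peak] - max(env[a], env[b])   # drop to the higher adjacent dip
        landmarks.append((peak, np.array([depth, (b - a) * hop_s, env[peak]])))
    return landmarks

def mlp_score(cues, W1, b1, W2, b2):
    """Two-hidden-unit perceptron mapping the three cues to a confidence in (0, 1).
    The thesis trained such weights on TIMIT-derived nuclei; these are placeholders."""
    h = np.tanh(W1 @ cues + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 1.0, 1.0 / fs)
    # toy signal: a 500 Hz tone with slow amplitude modulation -> several energy peaks
    x = np.sin(2 * np.pi * 500 * t) * np.sin(2 * np.pi * 3 * t) ** 2
    env = band_energy_db(x, fs)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # untrained placeholder weights
    W2, b2 = rng.normal(size=2), 0.0
    for frame, cues in vowel_landmarks(env):
        print(frame, cues, mlp_score(cues, W1, b1, W2, b2))

With trained weights and tuned thresholds, the same structure yields the confidence score described in the abstract; the random weights here only show how the three cues flow into a single score.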


Citations
Journal Article

Toward a model for lexical access based on acoustic landmarks and distinctive features

TL;DR: A model is proposed in which the acoustic speech signal is processed to yield a discrete representation of the speech stream in terms of a sequence of segments, each described by a set (or bundle) of binary distinctive features.
Journal Article

Robust Speech Rate Estimation for Spontaneous Speech

TL;DR: This paper compares various spectral and temporal signal-analysis and smoothing strategies for better characterizing the underlying syllable structure in order to derive speech rate, describes an automated approach for learning algorithm parameters from data, and finds optimal settings through Monte Carlo simulations and parameter sensitivity analysis.
Journal Article

Acoustic-Emergent Phonology in the Amplitude Envelope of Child-Directed Speech.

TL;DR: It is demonstrated that the modulation statistics within this AM hierarchy indeed parse the speech signal into a primitive hierarchically-organised phonological system comprising stress feet, syllables and onset-rime units, termed Acoustic-Emergent Phonology (AEP) theory.
Journal Article

An Acoustic Measure for Word Prominence in Spontaneous Speech

TL;DR: A comparative analysis of various acoustic features for word prominence detection is presented, with results reported on a spoken dialog corpus with manually assigned prominence labels; the proposed acoustic score discriminates between content words and function words in a statistically significant way.
Journal Article

An Automatic System for Detecting Prosodic Prominence in American English Continuous Speech

TL;DR: A careful measurement of acoustic parameters, as well as the identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable with the inter-human agreement reported in the literature.
References
Book

Neural networks for pattern recognition

TL;DR: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition, and is designed as a text, with over 100 exercises, to benefit anyone involved in the fields of neural computation and pattern recognition.
Book

Pattern classification and scene analysis

TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Book Chapter

Neural Networks for Pattern Recognition

TL;DR: The chapter discusses two important directions of research for improving learning algorithms: dynamic node generation, as used by the cascade-correlation algorithm, and the design of learning algorithms in which the choice of parameters is not an issue.