Open Access Dissertation

Automatic syllable detection for vowel landmarks

TLDR
The acoustic theory of speech production was used to predict characteristics of vowels, studies on a speech database tested those predictions, and the resulting data guided the development of an improved vowel landmark detector (VLD).
Abstract
Lexical Access From Features (LAFF) is a proposed knowledge-based speech recognition system which uses landmarks to guide the search for distinctive features. The first stage in LAFF must find vowel landmarks, a task similar to automatic syllable detection (ASD), i.e., the detection of syllable nuclei. This thesis adapts and extends ASD algorithms for vowel landmark detection. In addition to existing work on ASD, the acoustic theory of speech production was used to predict characteristics of vowels, and studies were done on a speech database to test the predictions. The resulting data guided the development of an improved vowel landmark detector (VLD). Studies of the TIMIT database showed that about 94% of vowels have a peak of energy in the F1 region, and that about 89% of vowels have a peak in F1 frequency. Energy and frequency peaks were fairly highly correlated, with both peaks tending to appear before the midpoint of the vowel duration (as labeled), and frequency peaks tending to appear before energy peaks. Landmark-based vowel classification was not found to be sensitive to the precise location of the landmark. Energy in a fixed frequency band (300 to 900 Hz) was found to be as good for finding landmarks as the energy at F1, enabling a simple VLD design without the complexity of formant tracking. The VLD was based on a peak-picking technique, using a recursive convex hull algorithm. Three acoustic cues (peak-to-dip depth, duration, and level) were combined using a multilayer perceptron with two hidden units. The perceptron was trained by matching landmarks to syllabic nuclei derived from the TIMIT aligned phonetic transcription; pairs of abutting vowels were allowed to match either one or two landmarks without penalty. The perceptron was trained first by backpropagation using mean squared error, and then by gradient descent using error rate. The final VLD's error rate was about 12%, with about 3.5% insertions and 8.5% deletions, which compares favorably to the 6% of vowels without peaks. Most errors occurred in predictable circumstances, such as high vowels adjacent to semivowels, or very reduced schwas. Further work should include improvements to the output confidence score, and error correction as part of vowel quality detection. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
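The pipeline outlined in the abstract (band-limited energy from 300 to 900 Hz, recursive convex-hull peak picking, and a two-hidden-unit perceptron combining peak-to-dip depth, duration, and level) can be illustrated with a short Python sketch. This is not the thesis's implementation: the frame and hop sizes, the dip-depth threshold, the exact cue definitions, and the untrained random perceptron weights below are placeholder assumptions, and all function names are hypothetical.

import numpy as np

def band_energy_db(signal, fs, lo=300.0, hi=900.0, frame_ms=25.0, hop_ms=10.0):
    """Short-time log energy in a fixed frequency band (frame/hop sizes assumed)."""
    frame = int(fs * frame_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    energy = []
    for start in range(0, len(signal) - frame, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window)) ** 2
        energy.append(10.0 * np.log10(spectrum[band].sum() + 1e-10))
    return np.asarray(energy)

def _upper_hull(y):
    """Indices of the upper convex hull of contour y (monotone-chain scan)."""
    hull = []
    for i in range(len(y)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # pop i1 if it lies on or below the chord from i0 to i
            if (i1 - i0) * (y[i] - y[i0]) - (y[i1] - y[i0]) * (i - i0) >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return hull

def hull_dips(env, lo, hi, min_depth_db=2.0, dips=None):
    """Recursive convex-hull segmentation: split the contour at the point
    lying deepest below its upper convex hull, then recurse on both halves."""
    if dips is None:
        dips = []
    if hi - lo >= 3:
        seg = env[lo:hi + 1]
        h = _upper_hull(seg)
        depth = np.interp(np.arange(len(seg)), h, seg[h]) - seg
        k = int(np.argmax(depth))
        if depth[k] >= min_depth_db:
            dips.append(lo + k)
            hull_dips(env, lo, lo + k, min_depth_db, dips)
            hull_dips(env, lo + k, hi, min_depth_db, dips)
    return dips

def vowel_landmarks(env, hop_s=0.010, min_depth_db=2.0):
    """Energy peaks between convex-hull dips, each with three cues:
    peak-to-dip depth (dB), duration (s), and level (dB)."""
    bounds = [0] + sorted(hull_dips(env, 0, len(env) - 1, min_depth_db)) + [len(env) - 1]
    landmarks = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b - a < 2:
            continue
        peak = a + int(np.argmax(env[a:b + 1]))
        depth = env[peak] - max(env[a], env[b])   # drop to the higher adjacent dip
        landmarks.append((peak, np.array([depth, (b - a) * hop_s, env[peak]])))
    return landmarks

def mlp_score(cues, W1, b1, W2, b2):
    """Two-hidden-unit perceptron mapping the three cues to a confidence in (0, 1).
    The thesis trained such weights on TIMIT-derived nuclei; these are placeholders."""
    h = np.tanh(W1 @ cues + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 1.0, 1.0 / fs)
    # toy signal: a 500 Hz tone with slow amplitude modulation -> several energy peaks
    x = np.sin(2 * np.pi * 500 * t) * np.sin(2 * np.pi * 3 * t) ** 2
    env = band_energy_db(x, fs)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # untrained placeholder weights
    W2, b2 = rng.normal(size=2), 0.0
    for frame, cues in vowel_landmarks(env):
        print(frame, cues, mlp_score(cues, W1, b1, W2, b2))

With trained weights and tuned thresholds, the same structure yields the confidence score described in the abstract; the random weights here only show how the three cues flow into a single score.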


Citations
Journal Article

Toward a model for lexical access based on acoustic landmarks and distinctive features

TL;DR: A model is proposed in which the acoustic speech signal is processed to yield a discrete representation of the speech stream in terms of a sequence of segments, each described by a set (or bundle) of binary distinctive features.
Journal Article

Robust Speech Rate Estimation for Spontaneous Speech

TL;DR: This paper compares various spectral and temporal signal-analysis and smoothing strategies for better characterizing the underlying syllable structure in order to derive speech rate, describes an automated approach for learning algorithm parameters from data, and finds optimal settings through Monte Carlo simulations and parameter sensitivity analysis.
Journal Article

Acoustic-Emergent Phonology in the Amplitude Envelope of Child-Directed Speech.

TL;DR: It is demonstrated that the modulation statistics within this AM hierarchy indeed parse the speech signal into a primitive hierarchically-organised phonological system comprising stress feet, syllables and onset-rime units, termed Acoustic-Emergent Phonology (AEP) theory.
Journal Article

An Acoustic Measure for Word Prominence in Spontaneous Speech

TL;DR: A comparative analysis of various acoustic features for word prominence detection is presented, with results reported on a spoken dialog corpus with manually assigned prominence labels; the proposed acoustic score discriminates between content words and function words in a statistically significant way.
Journal Article

An Automatic System for Detecting Prosodic Prominence in American English Continuous Speech

TL;DR: A careful measurement of acoustic parameters, as well as the identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable with the inter-human agreement reported in the literature.
References
Book

Neural networks for pattern recognition

TL;DR: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition, and is designed as a text, with over 100 exercises, to benefit anyone involved in the fields of neural computation and pattern recognition.
Book

Pattern classification and scene analysis

TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Book Chapter

Neural Networks for Pattern Recognition

TL;DR: The chapter discusses two important directions of research for improving learning algorithms: dynamic node generation, as used by the cascade-correlation algorithm, and the design of learning algorithms in which the choice of parameters is not an issue.