Journal Article

A review of large-vocabulary continuous-speech recognition

01 Sep 1996, IEEE Signal Processing Magazine (IEEE), Vol. 13, Iss. 5, p. 45
TL;DR: The principles and architecture of current LVR systems are discussed and the key issues affecting their future deployment are identified; to illustrate the various points raised, the Cambridge University HTK system is described.
Abstract: Considerable progress has been made in speech-recognition technology over the last few years, and nowhere has this progress been more evident than in the area of large-vocabulary recognition (LVR). Current laboratory systems are capable of transcribing continuous speech from any speaker with average word-error rates between 5% and 10%. If speaker adaptation is allowed, then after 2 or 3 minutes of speech the error rate will drop well below 5% for most speakers. Previously, LVR systems were limited to dictation applications, since they were speaker dependent and required words to be spoken with a short pause between them. However, the capability to recognize natural continuous-speech input from any speaker opens up many more applications. As a result, LVR technology appears to be on the brink of widespread deployment across a range of information technology (IT) systems. This article discusses the principles and architecture of current LVR systems and identifies the key issues affecting their future deployment. To illustrate the various points raised, the Cambridge University HTK system is described. This system is a modern design that gives state-of-the-art performance, and it is typical of the current generation of recognition systems.
Citations
Journal Article
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
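
As a rough illustration of the hybrid scoring described above, the sketch below converts DNN state posteriors into the scaled likelihoods that stand in for GMM output densities during decoding. The shapes, names, and toy values are illustrative assumptions, not the authors' implementation.

    # Hybrid DNN-HMM scoring sketch: the network outputs P(state | frame),
    # which is converted to a scaled likelihood P(frame | state), up to a
    # constant, by dividing out the state prior counted from alignments.
    import numpy as np

    def scaled_log_likelihoods(log_posteriors, log_state_priors):
        """log P(o_t | s) + const = log P(s | o_t) - log P(s).

        log_posteriors:   (T, S) frame-level log-softmax DNN outputs
        log_state_priors: (S,)   log relative frequencies of HMM states
        """
        return log_posteriors - log_state_priors[np.newaxis, :]

    # Toy usage: 3 frames, 4 HMM states (values invented for illustration).
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 4))
    log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    log_prior = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
    print(scaled_log_likelihoods(log_post, log_prior))

These scaled likelihoods can then be plugged into the usual Viterbi decoder wherever the GMM state likelihoods would have been evaluated.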

9,091 citations

Journal Article
TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN). The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in the positive and negative time directions. The structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency. In the second part of this paper, it is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.
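
A minimal sketch of the bidirectional idea, assuming a plain tanh recurrence and invented weight shapes (the paper's trained network and output layers are not reproduced here):

    # One recurrent pass runs in the positive time direction, another in the
    # negative direction; concatenating their states lets the output at
    # frame t depend on the entire input sequence.
    import numpy as np

    def rnn_pass(x, W_in, W_rec, reverse=False):
        T = len(x)
        H = W_rec.shape[0]
        h = np.zeros(H)
        out = np.zeros((T, H))
        for t in (range(T - 1, -1, -1) if reverse else range(T)):
            h = np.tanh(x[t] @ W_in + h @ W_rec)  # simple tanh recurrence
            out[t] = h
        return out

    def brnn_forward(x, fwd_weights, bwd_weights):
        # Frame-by-frame concatenation of forward and backward states.
        return np.concatenate([rnn_pass(x, *fwd_weights),
                               rnn_pass(x, *bwd_weights, reverse=True)], axis=1)

    rng = np.random.default_rng(1)
    T, D, H = 5, 3, 4
    x = rng.normal(size=(T, D))
    fwd = (0.1 * rng.normal(size=(D, H)), 0.1 * rng.normal(size=(H, H)))
    bwd = (0.1 * rng.normal(size=(D, H)), 0.1 * rng.normal(size=(H, H)))
    print(brnn_forward(x, fwd, bwd).shape)  # (5, 8): forward + backward states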

7,290 citations

Journal Article
TL;DR: A survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system: the choice of suitable features for speech representation, the design of an appropriate classification scheme, and the proper preparation of an emotional speech database for evaluating system performance.

1,735 citations

Journal Article
TL;DR: Comparisons suggest that the human-machine performance gap can be reduced by basic research on improving low-level acoustic-phonetic modeling, on improving robustness to noise and channel variability, and on more accurately modeling spontaneous speech.

606 citations


Additional excerpts

  • Young, 1996; Bourlard et al., 1996

Journal Article
TL;DR: Data acquisition, feature extraction, and classification methods employed for the analysis of sign language gestures are examined, and the overall progress toward a true test of sign recognition systems, dealing with natural signing by native signers, is discussed.
Abstract: Research in automatic analysis of sign language has largely focused on recognizing the lexical (or citation) form of sign gestures as they appear in continuous signing, and on developing algorithms that scale well to large vocabularies. However, successful recognition of lexical signs is not sufficient for a full understanding of sign language communication. Nonmanual signals and grammatical processes which result in systematic variations in sign appearance are integral aspects of this communication but have received comparatively little attention in the literature. In this survey, we examine data acquisition, feature extraction, and classification methods employed for the analysis of sign language gestures. These are discussed with respect to issues such as modeling transitions between signs in continuous signing, modeling inflectional processes, signer independence, and adaptation. We further examine works that attempt to analyze nonmanual signals and discuss issues related to integrating these with (hand) sign gestures. We also discuss the overall progress toward a true test of sign recognition systems: dealing with natural signing by native signers. We suggest some future directions for this research and also point to contributions it can make to other fields of research. Web-based supplemental materials (appendices), which contain several illustrative examples and videos of signing, can be found at www.computer.org/publications/dlib.

574 citations

References
Journal Article
S. Boll
TL;DR: A stand-alone noise suppression algorithm that resynthesizes a speech waveform and can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
Abstract: A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital wave-form. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during nonspeech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
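
The core subtraction step can be sketched as follows, assuming the first few STFT frames are non-speech; the secondary residual-noise attenuation procedures are omitted and the parameter values are illustrative.

    # Magnitude spectral subtraction: estimate the noise bias during
    # non-speech activity, subtract it, half-wave rectify, and resynthesize
    # with the noisy phase.
    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtract(y, fs, noise_frames=10):
        f, t, Y = stft(y, fs=fs, nperseg=256)
        mag, phase = np.abs(Y), np.angle(Y)
        noise_bias = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        clean_mag = np.maximum(mag - noise_bias, 0.0)  # half-wave rectification
        _, x_hat = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=256)
        return x_hat

    fs = 8000
    t = np.arange(fs) / fs
    rng = np.random.default_rng(2)
    noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.normal(size=fs)
    print(spectral_subtract(noisy, fs).shape)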

4,862 citations

Journal Article
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
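
A compact sketch of the mel-frequency cepstrum pipeline follows: framed power spectrum, triangular mel filterbank, log compression, then a DCT. The frame length and filter counts are assumptions, not the paper's exact 1980 configuration.

    import numpy as np
    from scipy.fft import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_filterbank(n_filters, n_fft, fs):
        # Triangular filters centered at equally spaced points on the mel scale.
        mel_pts = np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2)
        hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
        bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        return fb

    def mfcc(frame, fs, n_filters=20, n_ceps=10):
        spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        fb = mel_filterbank(n_filters, len(frame), fs)
        log_energy = np.log(fb @ spec + 1e-10)        # floor avoids log(0)
        return dct(log_energy, type=2, norm='ortho')[:n_ceps]

    frame = np.random.default_rng(3).normal(size=400)  # 25 ms at 16 kHz
    print(mfcc(frame, fs=16000))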

4,822 citations

Journal Article
TL;DR: This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
Abstract: We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
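
The class-based bigram factorization P(w_i | w_{i-1}) = P(c(w_i) | c(w_{i-1})) * P(w_i | c(w_i)) can be illustrated with a toy, hand-built class map; the paper itself derives the classes automatically from co-occurrence statistics, and the corpus below is invented.

    from collections import Counter

    word2class = {'the': 'DET', 'a': 'DET', 'cat': 'N', 'dog': 'N',
                  'runs': 'V', 'sleeps': 'V'}
    corpus = 'the cat runs the dog sleeps a cat sleeps'.split()
    classes = [word2class[w] for w in corpus]

    class_bigrams = Counter(zip(classes, classes[1:]))  # class transition counts
    class_counts = Counter(classes)
    word_counts = Counter(corpus)

    def p_word_given_prev(w, prev):
        c, cp = word2class[w], word2class[prev]
        p_class = class_bigrams[(cp, c)] / class_counts[cp]  # P(c | c_prev)
        p_word = word_counts[w] / class_counts[c]            # P(w | c)
        return p_class * p_word

    print(p_word_given_prev('dog', 'the'))  # 1.0 * 1/3

Because probabilities are shared across a class, the model assigns sensible mass to word pairs never seen together, which is the practical appeal of class-based n-grams.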

3,336 citations

Journal Article
Hynek Hermansky
TL;DR: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.
Abstract: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.
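
A simplified sketch of the PLP chain follows. The critical-band integration and equal-loudness curve are crudely approximated here, while the cube-root intensity-loudness compression and the 5th-order all-pole fit via the Levinson-Durbin recursion are shown explicitly; all details are assumptions rather than the paper's exact implementation.

    import numpy as np

    def levinson_durbin(r, order):
        # Solve the Toeplitz normal equations for all-pole coefficients.
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                       # reflection coefficient
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
            err *= 1.0 - k * k
        return a, err

    def plp_allpole(power_spectrum, order=5):
        # (1) Crude critical-band integration (true Bark bands are unequal).
        audspec = np.array([b.mean() for b in np.array_split(power_spectrum, 18)])
        # (2) Stand-in weighting for the 40-phon equal-loudness curve.
        audspec *= np.linspace(0.5, 1.0, len(audspec))
        # (3) Intensity-loudness power law: cube-root amplitude compression.
        audspec **= 1.0 / 3.0
        # (4) Inverse DFT of the (symmetrized) compressed spectrum gives
        #     autocorrelations for the all-pole fit.
        sym = np.concatenate([audspec, audspec[-2:0:-1]])
        r = np.fft.ifft(sym).real[:order + 1]
        return levinson_durbin(r, order)

    spec = np.abs(np.fft.rfft(np.random.default_rng(4).normal(size=512))) ** 2
    a, err = plp_allpole(spec)
    print(a)  # 1, a_1, ..., a_5 of the 5th-order all-pole model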

2,969 citations

Journal Article
TL;DR: An important feature of the method is that arbitrary adaptation data can be used (no special enrolment sentences are needed) and that adaptation performance improves as more data is used.
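
As a minimal sketch in the spirit of the linear-regression mean adaptation this describes, the code below estimates a single unweighted least-squares affine transform from arbitrary adaptation frames and applies it to every Gaussian mean; covariance weighting, regression-class trees, and the full maximum-likelihood estimation are omitted, and all data are invented.

    import numpy as np

    def estimate_transform(frames, means, assign):
        # Least-squares W minimizing sum_t || o_t - W [1; mu_assign(t)] ||^2.
        xi = np.hstack([np.ones((len(frames), 1)), means[assign]])
        W, *_ = np.linalg.lstsq(xi, frames, rcond=None)
        return W.T                                   # (d, d+1) = [b A]

    def adapt_means(means, W):
        xi = np.hstack([np.ones((len(means), 1)), means])
        return xi @ W.T                              # mu_hat = A mu + b

    rng = np.random.default_rng(5)
    d, n_gauss, n_frames = 3, 4, 50
    means = rng.normal(size=(n_gauss, d))
    assign = rng.integers(0, n_gauss, size=n_frames)   # frame-to-Gaussian alignment
    A_true, b_true = 1.2 * np.eye(d), np.full(d, 0.5)  # simulated mismatch
    frames = means[assign] @ A_true.T + b_true + 0.05 * rng.normal(size=(n_frames, d))
    W = estimate_transform(frames, means, assign)
    print(np.round(adapt_means(means, W) - (means @ A_true.T + b_true), 2))  # ~0

Because one transform is pooled over all Gaussians, any adaptation utterance contributes, which matches the TL;DR's point that no special enrolment sentences are needed.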

2,504 citations