
Showing papers on "Speaker recognition" published in 1991


Journal ArticleDOI
TL;DR: The role of statistical methods in this powerful technology as applied to speech recognition is addressed and a range of theoretical and practical issues that are as yet unsolved in terms of their importance and their effect on performance for different system implementations are discussed.
Abstract: The use of hidden Markov models for speech recognition has become predominant in the last several years, as evidenced by the number of published papers and talks at major speech conferences. The reasons this method has become so popular are the inherent statistical (mathematically precise) framework; the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data; the flexibility of the resulting recognition system in which one can easily change the size, type, or architecture of the models to suit particular words, sounds, and so forth; and the ease of implementation of the overall recognition system. In this expository article, we address the role of statistical methods in this powerful technology as applied to speech recognition and discuss a range of theoretical and practical issues that are as yet unsolved in terms of their importance and their effect on performance for different system implementations.

1,480 citations
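
The forward-algorithm likelihood evaluation sits at the core of the HMM framework the article surveys. A minimal sketch in Python, assuming a toy discrete-observation model (all parameter values are invented for illustration):

```python
# Minimal sketch of the HMM forward algorithm (discrete observations),
# the likelihood evaluation at the core of HMM-based recognizers.
# All model parameters here are toy values, not from the paper.
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) via the scaled forward recursion.

    pi  : (N,)   initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(j | i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : sequence of integer observation symbols
    """
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()        # rescale to avoid underflow
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_lik + np.log(alpha.sum())

# Toy 2-state, 3-symbol model: score one observation sequence.
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 1, 2, 2]))
```

In an isolated-word recognizer of the kind the article describes, one such model is trained per word and the word whose model yields the highest likelihood wins.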


PatentDOI
TL;DR: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance.
Abstract: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance. The method and apparatus provide user correction actions representing the accuracy of a speech recognition, dynamically, during the recognition of unknown incoming speech utterances and after training of the system. The quality values are updated, during the speech recognition process, for at least a portion of those reference patterns used during the speech recognition process. Reference patterns having low quality values, indicative of either inaccurate representation of the unknown speech or non-use, can be deleted so long as the reference pattern is not needed, for example, where the reference pattern is the last instance of a known word or phrase. Various methods and apparatus are provided for determining when reference patterns can be deleted or added, to the reference memory, and when the scores or values associated with a reference pattern should be increased or decreased to represent the "goodness" of the reference pattern in recognizing speech.

263 citations
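
A minimal sketch of the quality-value bookkeeping the patent describes; the data layout, update step, and deletion threshold below are illustrative assumptions, not the patent's actual values:

```python
# Sketch of reference-pattern quality bookkeeping in the spirit of the
# patent. Reward/penalty step and deletion threshold are assumptions.
from collections import defaultdict

class ReferenceStore:
    def __init__(self, delete_below=0.2):
        self.patterns = {}                 # pattern_id -> (word, quality)
        self.instances = defaultdict(set)  # word -> set of pattern_ids
        self.delete_below = delete_below

    def add(self, pattern_id, word, quality=0.5):
        self.patterns[pattern_id] = (word, quality)
        self.instances[word].add(pattern_id)

    def update(self, pattern_id, correct, step=0.1):
        """Raise quality after a confirmed recognition; lower it after
        a user correction action."""
        word, q = self.patterns[pattern_id]
        q = min(1.0, q + step) if correct else max(0.0, q - step)
        self.patterns[pattern_id] = (word, q)

    def prune(self):
        """Delete low-quality patterns, but never the last instance of
        a word (that word would otherwise become unrecognizable)."""
        for pid, (word, q) in list(self.patterns.items()):
            if q < self.delete_below and len(self.instances[word]) > 1:
                del self.patterns[pid]
                self.instances[word].discard(pid)

store = ReferenceStore()
store.add("p1", "open"); store.add("p2", "open"); store.add("p3", "close")
for _ in range(4):
    store.update("p2", correct=False)      # repeated user corrections
store.prune()                              # "p2" goes; "p3" is kept
print(sorted(store.patterns))              # ['p1', 'p3']
```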


Proceedings Article
01 Jan 1991
TL;DR: This paper presents some of the design considerations of BREF, a large read-speech corpus for French designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems, and for the study of phonological variations.
Abstract: This paper presents some of the design considerations of BREF, a large read-speech corpus for French. BREF was designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and for the study of phonological variations. The texts to be read were selected from 5 million words of the French newspaper, Le Monde. In total, 11,000 texts were selected, with selection criteria that emphasized maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. Ninety speakers have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min.) of speech.

225 citations
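
A greedy selection loop of the kind implied by the triphone-coverage criterion, as a sketch; here characters stand in for phonemes, whereas the real selection would phonemize each candidate text first:

```python
# Greedy text selection maximizing distinct-triphone coverage, a sketch
# of the BREF-style criterion. Characters stand in for phonemes.
def triphones(s):
    s = s.replace(" ", "")
    return {s[i:i + 3] for i in range(len(s) - 2)}

def select_texts(candidates, n):
    covered, chosen = set(), []
    pool = list(candidates)
    for _ in range(n):
        # Pick the text contributing the most triphones not yet covered.
        best = max(pool, key=lambda t: len(triphones(t) - covered))
        chosen.append(best)
        covered |= triphones(best)
        pool.remove(best)
    return chosen, covered

texts = ["le monde entier", "entier ou en partie", "la partie du monde"]
chosen, covered = select_texts(texts, 2)
print(chosen, len(covered))
```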


Proceedings ArticleDOI
19 Feb 1991
TL;DR: DECIPHER as discussed by the authors is a speaker-independent continuous speech recognition system based on hidden Markov model (HMM) technology, which is used in SRI's Air Travel Information Systems (ATIS) and Resource Management systems.
Abstract: This paper describes improvements to DECIPHER, the speech recognition component in SRI's Air Travel Information Systems (ATIS) and Resource Management systems. DECIPHER is a speaker-independent continuous speech recognition system based on hidden Markov model (HMM) technology. We show significant performance improvements in DECIPHER due to (1) the addition of tied-mixture HMM modeling, (2) rejection of out-of-vocabulary speech and background noise while continuing to recognize speech, (3) adaptation to the current speaker, and (4) the implementation of N-gram statistical grammars. Finally, we describe our performance in the February 1991 DARPA Resource Management evaluation (4.8 percent word error) and in the February 1991 DARPA-ATIS speech and SLS evaluations (95 sentences correct, 15 wrong of 140). We show that, for the ATIS evaluation, a well-conceived system integration can be relatively robust to speech recognition errors and to linguistic variability and errors.

172 citations
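
A minimal bigram instance of the N-gram statistical grammars mentioned above, with add-one smoothing; the training sentences and smoothing choice are toy assumptions, not DECIPHER's:

```python
# Minimal bigram statistical grammar with add-one smoothing.
# Training sentences are toy ATIS-flavored examples.
import math
from collections import Counter

sentences = [["show", "flights", "to", "boston"],
             ["show", "fares", "to", "denver"],
             ["list", "flights", "to", "denver"]]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = ["<s>"] + s + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

vocab = len(set(unigrams) | {w for _, w in bigrams})

def log_prob(sentence):
    """Smoothed log P(sentence) = sum of log P(w_i | w_{i-1})."""
    toks = ["<s>"] + sentence + ["</s>"]
    lp = 0.0
    for a, b in zip(toks[:-1], toks[1:]):
        lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return lp

print(log_prob(["show", "flights", "to", "denver"]))
```

In recognition, such a grammar score is combined with the acoustic HMM score to rank competing word-sequence hypotheses.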


Journal ArticleDOI
N. Z. Tishby
TL;DR: The results show that even with a short sequence of only four isolated digits, a speaker can be verified with an average equal-error rate of less than 3%, and the small improvement over the vector quantization approach indicates the weakness of the Markovian transition probabilities for characterizing speaker-dependent transitional information.
Abstract: Linear predictive hidden Markov models have proved to be efficient for statistically modeling speech signals. The possible application of such models to statistical characterization of the speaker himself is described and evaluated. The results show that even with a short sequence of only four isolated digits, a speaker can be verified with an average equal-error rate of less than 3%. These results are slightly better than the results obtained using speaker-dependent vector quantizers, with comparable numbers of spectral vectors. The small improvement over the vector quantization approach indicates the weakness of the Markovian transition probabilities for characterizing speaker-dependent transitional information.

121 citations
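
The equal-error rate used to report these results is the operating point where false rejection of true speakers equals false acceptance of impostors. A short sketch that estimates it from two synthetic score distributions:

```python
# Equal-error rate (EER): sweep a decision threshold and find where the
# false-rejection and false-acceptance rates cross. Scores are synthetic.
import numpy as np

def equal_error_rate(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = (2.0, None)                   # (|FRR - FAR|, EER estimate)
    for t in thresholds:
        frr = np.mean(genuine < t)       # true speakers rejected
        far = np.mean(impostor >= t)     # impostors accepted
        if abs(frr - far) < best[0]:
            best = (abs(frr - far), (frr + far) / 2)
    return best[1]

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 500)      # scores for true speakers
impostor = rng.normal(0.0, 1.0, 500)     # scores for impostors
print(f"EER ~ {equal_error_rate(genuine, impostor):.3f}")
```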


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors describe two systems in which neural network classifiers are merged with dynamic programming (DP) time alignment methods to produce high-performance continuous speech recognizers.
Abstract: The authors describe two systems in which neural network classifiers are merged with dynamic programming (DP) time alignment methods to produce high-performance continuous speech recognizers. One system uses the connectionist Viterbi-training (CVT) procedure, in which a neural network with frame-level outputs is trained using guidance from a time alignment procedure. The other system uses multi-state time-delay neural networks (MS-TDNNs), in which embedded DP time alignment allows network training with only word-level external supervision. The CVT results on the TI Digits are 99.1% word accuracy and 98.0% string accuracy. The MS-TDNNs are described in detail, with attention focused on their architecture, the training procedure, and results of applying the MS-TDNNs to continuous speaker-dependent alphabet recognition: on two speakers, word accuracy is respectively 97.5% and 89.7%.

111 citations
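
A sketch of the DP time alignment both systems couple to frame-level network outputs: Viterbi-style alignment of frame scores to a word's state sequence. The score matrix here is random; a trained network would supply it.

```python
# DP time alignment of frame-level classifier outputs to a word's state
# sequence, the coupling used in connectionist Viterbi training.
import numpy as np

def align(log_probs, states):
    """log_probs: (T, C) frame-level log class scores.
    states: the word's class sequence (e.g. its phone states).
    Returns the best path score and frame-to-state assignment."""
    T, S = log_probs.shape[0], len(states)
    D = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = log_probs[0, states[0]]
    for t in range(1, T):
        for s in range(S):
            stay = D[t - 1, s]
            advance = D[t - 1, s - 1] if s > 0 else -np.inf
            if advance > stay:                      # enter state s now
                D[t, s] = advance + log_probs[t, states[s]]
                back[t, s] = s - 1
            else:                                   # remain in state s
                D[t, s] = stay + log_probs[t, states[s]]
                back[t, s] = s
    path, s = [S - 1], S - 1                        # backtrace
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    return D[-1, -1], path[::-1]

rng = np.random.default_rng(1)
log_probs = np.log(rng.dirichlet(np.ones(4), size=10))  # 10 frames, 4 classes
score, path = align(log_probs, states=[0, 2, 3])
print(score, path)
```

In CVT, the frame-to-state assignment recovered by the backtrace becomes the frame-level training target for the next network pass.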


Journal ArticleDOI
TL;DR: Recent advances in and perspectives of research on speaker-dependent-feature extraction from speech waves, automatic speaker identification and verification, speaker adaptation in speech recognition, and voice conversion techniques are discussed.

108 citations


PatentDOI
TL;DR: A name recognition system used to provide access to a database based on the voice recognition of a proper name spoken by a person who may not know the correct pronunciation of the name.
Abstract: A name recognition system (FIG. 1) used to provide access to a database based on the voice recognition of a proper name spoken by a person who may not know the correct pronunciation of the name. During an enrollment phase (10), for each name-text entered (11) into a text database (12), text-derived recognition models (22) are created for each of a selected number of pronunciations of a name-text, with each recognition model being constructed from a respective sequence of phonetic features (15) generated by a Boltzmann machine (13). During a name recognition phase (20), the spoken input (24, 25) of a name (by a person who may not know the correct pronunciation) is compared (26) with the recognition models (22), looking for a pattern match; selection of a corresponding name-text is made based on a decision rule (28).

95 citations


Proceedings ArticleDOI
Jay G. Wilpon, L.G. Miller, P. Modi
14 Apr 1991
TL;DR: A hidden Markov model based keyword-spotting algorithm developed previously can recognize key words from a predefined vocabulary list spoken in an unconstrained fashion, and improvements in the feature analysis and modeling techniques used to train the system are explored.
Abstract: A hidden Markov model based keyword-spotting algorithm developed previously can recognize key words from a predefined vocabulary list spoken in an unconstrained fashion. Improvements in the feature analysis used to represent the speech signal and modeling techniques used to train the system are explored. The authors discuss several task domain issues which influence evaluation criteria. They present results from extensive evaluations on three speaker-independent databases: the 20-word vocabulary Stonehenge Road Rally database, distributed by the National Security Agency; a five-word vocabulary used to automate operator-assisted calls; and a three-word Spanish vocabulary that is currently being tested in Spain's telephone network. Currently, recognition accuracies range from 99.9% on the Spanish database to 74% (with 8.8 FA/H/W) on the Stonehenge task.

92 citations
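
A much-simplified illustration of the word-spotting decision: score each window of frames by a log-likelihood ratio of a keyword model against a filler/background model and report the best window. Real systems use HMMs for both models; the unit-variance Gaussians and the data below are stand-ins.

```python
# Simplified word-spotting: slide a window over per-frame log-likelihood
# ratios of a keyword model vs. a filler model. Gaussians are stand-ins
# for the HMMs a real spotter would use; all data are synthetic.
import numpy as np

def gauss_logpdf(x, mu, sigma=1.0):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
frames = np.concatenate([rng.normal(0, 1, 40),   # background speech
                         rng.normal(3, 1, 20),   # the "keyword" region
                         rng.normal(0, 1, 40)])  # background speech

llr = gauss_logpdf(frames, 3.0) - gauss_logpdf(frames, 0.0)

win = 20                                         # assumed keyword length
scores = np.convolve(llr, np.ones(win), mode="valid") / win
best = int(scores.argmax())
print(f"best putative hit starts at frame {best}, score {scores[best]:.2f}")
```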


Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speaker verification system using connected word verification phrases has been implemented and studied and the system has been evaluated on a 20-speaker telephone database of connected digit utterances.
Abstract: A speaker verification system using connected word verification phrases has been implemented and studied. Verification utterances are represented as concatenated speaker-dependent whole-word hidden Markov models (HMMs). Verification phrases are specified as strings of words drawn from a small fixed vocabulary, such as the digits. Phrases can either be individualized or randomized for greater security. Training techniques to create speaker-dependent models for verification are used in which initial word models are created by bootstrapping from existing speaker-independent models. The system has been evaluated on a 20-speaker telephone database of connected digit utterances. Using approximately 66 s of connected digit training utterances per speaker, the verification equal-error rate is approximately 3.5% for 1.1 s test utterances and 0.3% for 4.4 s test utterances. In comparison, the performance of a template-based system using the same amount of training data is 6.7% and 1.5%, respectively.

88 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system, using DECtalk's text-to-sound rules.
Abstract: The authors report on the detection of new words for the speaker-dependent and speaker-independent paradigms. A useful operating point in a speaker-dependent paradigm is defined at 71% detection rate and 1% false alarm rate. The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system. The technique utilizes DECtalk's text-to-sound rules to obtain an initial phonetic transcription for the new word. Since these text-to-sound rules are imperfect, a probabilistic transformation technique is used that produces a phonetic pronunciation network of all possible pronunciations given DECtalk's transcription. The network is used to constrain a phonetic recognition process that results in an improved phonetic transcription for the new word. The resulting transcriptions are sufficient for speech recognition purposes.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Experimental results on a 40-speaker database indicate that the modified neural approach significantly outperforms both a standard multilayer perceptron and a vector quantization based system.
Abstract: A speaker recognition system, using a modified form of feedforward neural network based on radial basis functions (RBFs), is presented. Each person to be recognized has his/her own neural model which is trained to recognize spectral feature vectors representative of his/her speech. Experimental results on a 40-speaker database indicate that the modified neural approach significantly outperforms both a standard multilayer perceptron and a vector quantization based system. The best performance for 4-digit test utterances is obtained from an RBF network with 384 RBF nodes in the hidden layer, giving an 8% true-talker rejection rate for a fixed 1% impostor acceptance rate. Additional advantages include a substantial reduction in training time over an MLP approach, and the ability to readily interpret the resulting model.
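
A minimal RBF-network sketch of this kind of per-speaker model: Gaussian basis functions over spectral vectors with linear output weights fitted by least squares. The data, center count, and kernel width below are illustrative choices, not the paper's.

```python
# Minimal radial-basis-function (RBF) speaker model: Gaussian bases over
# spectral vectors, linear output weights by least squares. All data and
# hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
own = rng.normal(0.5, 1.0, size=(200, 12))      # "target speaker" frames
other = rng.normal(-0.5, 1.0, size=(200, 12))   # impostor frames

X = np.vstack([own, other])
y = np.concatenate([np.ones(200), np.zeros(200)])

centers = X[rng.choice(len(X), 32, replace=False)]  # crude center pick
width = 4.0

def rbf_features(X):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

W, *_ = np.linalg.lstsq(rbf_features(X), y, rcond=None)

test = rng.normal(0.5, 1.0, size=(50, 12))      # more target-speaker frames
score = rbf_features(test) @ W
print(f"mean score: {score.mean():.2f}")        # higher for target speaker
```

Because only the linear output layer is fitted, training reduces to one least-squares solve, which reflects the training-time advantage over an MLP noted in the abstract.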

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Experimental results show that adapting the spectral observation probabilities of each state of the model by the back propagation of errors can correct misclassification errors.
Abstract: Speaker verification is performed by comparing the output probabilities of two Markov models of the same phonetic unit. One of these Markov models is speaker-specific, being built from utterances from the speaker whose identity is to be verified. The second model is built from utterances from a large population of speakers. The performance of the system is improved by treating the pair of models as a connectionist network, an alpha-net, which then allows discriminative training to be carried out. Experimental results show that adapting the spectral observation probabilities of each state of the model by the back propagation of errors can correct misclassification errors. The real-time implementation of the system produced an average digit error rate of 4.5% and only one misclassification in 600 trials using a five-digit sequence.
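
The underlying decision is a likelihood-ratio test between the speaker-specific model and the general-population model of the same unit. A sketch with Gaussian stand-ins for the two Markov models (parameters and threshold are invented):

```python
# Verification by comparing two models of the same unit: accept when the
# speaker-specific model out-scores the general-population model.
# Gaussians stand in for the paper's Markov models.
import numpy as np

def gauss_loglik(x, mu, sigma):
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - np.log(sigma * np.sqrt(2 * np.pi)))

def verify(utterance, speaker_model, world_model, threshold=0.0):
    llr = (gauss_loglik(utterance, *speaker_model)
           - gauss_loglik(utterance, *world_model))
    return llr > threshold, llr

rng = np.random.default_rng(4)
speaker_model = (1.0, 0.8)    # (mean, std) fitted to the claimed speaker
world_model = (0.0, 1.2)      # fitted to a large speaker population

claimant = rng.normal(1.0, 0.8, 100)   # frames from the true speaker
accepted, llr = verify(claimant, speaker_model, world_model)
print(accepted, f"{llr:.1f}")
```

The alpha-net contribution is to train the two models jointly and discriminatively so this ratio separates true speakers from impostors better than independently trained models do.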

Journal ArticleDOI
TL;DR: A novel algorithm simultaneously performing consonant/vowel (C/V) segmentation and pitch detection is proposed and an improvement of 12% in consonant recognition rate is obtained and the number of recognition candidates is reduced.
Abstract: A novel algorithm simultaneously performing consonant/vowel (C/V) segmentation and pitch detection is proposed. Based on this algorithm, a consonant enhancement method and a hierarchical neural network scheme are explored for Mandarin speech recognition. As a result, an improvement of 12% in consonant recognition rate is obtained and the number of recognition candidates is reduced from 1300 to 63. A series of experiments over all Mandarin syllables (about 1300) is demonstrated in the speaker-dependent mode. Comparisons with the decoder timer waveform algorithm are evaluated to show that the performance is satisfactory. An overall recognition rate of 90.14% is obtained.

Proceedings ArticleDOI
11 Jun 1991
TL;DR: The authors summarize a speaker adaptation algorithm based on codebook mapping from one speaker to a standard speaker to be useful in various kinds of speech recognition systems such as hidden-Markov-model-based, feature-based, and neural-network-based systems.
Abstract: The authors summarize a speaker adaptation algorithm based on codebook mapping from one speaker to a standard speaker. This algorithm has been developed to be useful in various kinds of speech recognition systems such as hidden-Markov-model-based, feature-based, and neural-network-based systems. The codebook mapping speaker adaptation algorithm has been much improved by introducing several ideas based on fuzzy vector quantization. This fuzzy codebook mapping algorithm is also applicable to voice conversion between arbitrary speakers.
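
A sketch of the fuzzy codebook-mapping step: rather than replacing an input spectrum with the single target codeword paired to its nearest source codeword, blend the target speaker's codewords using fuzzy c-means-style memberships. The codebooks and fuzziness exponent below are toy values.

```python
# Fuzzy codebook mapping sketch: blend paired target codewords with
# fuzzy memberships instead of a hard nearest-codeword substitution.
# Codebooks and the fuzziness exponent m are illustrative toys.
import numpy as np

src_codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
tgt_codebook = np.array([[0.2, 0.1], [1.3, 0.9], [2.1, 0.4]])  # paired

def fuzzy_map(x, m=2.0, eps=1e-9):
    d = np.linalg.norm(src_codebook - x, axis=1) + eps
    u = (1.0 / d) ** (2.0 / (m - 1.0))
    u /= u.sum()                    # fuzzy c-means-style memberships
    return u @ tgt_codebook         # weighted blend of target codewords

print(fuzzy_map(np.array([0.9, 0.8])))  # lands near the 2nd target entry
```

The soft blend avoids the quantization noise of hard mapping, which is the gain the abstract attributes to the fuzzy variant.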

Journal ArticleDOI
TL;DR: Comparisons of patterns of confusions for the two tasks supported the notion that voices are remembered in terms of a “prototype” and a set of deviations from that prototype, and that over time the deviations are forgotten so that identification responses converge on the most “typical” sounding voices.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A VQ (vector-quantization)-based text-independent speaker recognition method which is robust against utterance variations, and a normalization method, talker variability normalization (TVN), which normalizes parameter variation taking both inter- and intra-speaker variability into consideration.
Abstract: The authors describe a VQ (vector-quantization)-based text-independent speaker recognition method which is robust against utterance variations. Three techniques are introduced to cope with temporal and text-dependent spectral variations. First, either an ergodic hidden Markov model or a voiced/unvoiced decision is used to classify input speech into broad phonetic classes. Second, a new distance measure, the distortion-intersection measure (DIM), is introduced for calculating VQ distortion of input speech compared to speaker-independent codebooks. Third, a normalization method, talker variability normalization (TVN), is introduced. TVN normalizes parameter variation taking both inter- and intra-speaker variability into consideration. The system was tested using utterances of nine speakers recorded over three years. The combination of the three techniques achieves high speaker identification accuracies of 98.5% using only vocal tract information and 99.0% using both vocal tract and pitch information.
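
The base VQ-distortion decision is: quantize the utterance's frames with each speaker's codebook and pick the speaker with the lowest average distortion. A sketch of that plain measure (not the paper's refined DIM), on synthetic data:

```python
# VQ-distortion speaker identification sketch: score an utterance against
# each speaker's codebook by average quantization distortion. This shows
# the plain measure, not the paper's distortion-intersection refinement.
import numpy as np

rng = np.random.default_rng(5)
codebooks = {                                # per-speaker codebooks (toy)
    "spk_a": rng.normal(0.0, 1.0, (16, 12)),
    "spk_b": rng.normal(1.0, 1.0, (16, 12)),
}

def avg_distortion(frames, codebook):
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()              # nearest codeword per frame

utterance = rng.normal(1.0, 1.0, (120, 12))  # frames from speaker b
scores = {s: avg_distortion(utterance, cb) for s, cb in codebooks.items()}
print(min(scores, key=scores.get), scores)
```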


Journal ArticleDOI
TL;DR: Experimental results using utterances of cerebral palsied persons with an array of articulatory abilities are presented, and an ergodic model is found to outperform a standard left-to-right (Bakis) model structure.

Journal ArticleDOI
TL;DR: Recognition results are as good as those obtained in the time delay neural network system developed by Waibel et al. (1989), and suggest that LVQ could be the basis for a high-performance speech recognition system.
Abstract: A shift-tolerant neural network architecture for phoneme recognition is described. The system is based on algorithms for learning vector quantization (LVQ), recently developed by Kohonen (1986, 1988), which pay close attention to approximating optimal decision lines in a discrimination task. Recognition performances in the 98%-99% correct range were obtained for LVQ networks aimed at speaker-dependent recognition of phonemes in small but ambiguous Japanese phonemic classes. A correct recognition rate of 97.7% was achieved by a large LVQ network covering all Japanese consonants. These recognition results are as good as those obtained in the time delay neural network system developed by Waibel et al. (1989), and suggest that LVQ could be the basis for a high-performance speech recognition system.
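
The LVQ1 rule at the heart of this approach: the winning codebook vector moves toward a training vector when their classes agree and away when they disagree. A sketch on toy 2-D data standing in for spectral vectors:

```python
# LVQ1 sketch: attract the winning prototype to same-class inputs and
# repel it from other-class inputs. Toy 2-D data stand in for spectra.
import numpy as np

rng = np.random.default_rng(6)
proto_x = rng.normal(0, 1, (6, 2))           # codebook (prototype) vectors
proto_y = np.array([0, 0, 0, 1, 1, 1])       # their class labels

X = np.vstack([rng.normal(-1, 0.5, (100, 2)),   # class 0 samples
               rng.normal(+1, 0.5, (100, 2))])  # class 1 samples
y = np.array([0] * 100 + [1] * 100)

alpha = 0.05                                 # learning rate
for _ in range(20):                          # training epochs
    for i in rng.permutation(len(X)):
        w = np.argmin(((proto_x - X[i]) ** 2).sum(1))   # winner
        sign = 1.0 if proto_y[w] == y[i] else -1.0
        proto_x[w] += sign * alpha * (X[i] - proto_x[w])

pred = proto_y[((proto_x[None] - X[:, None]) ** 2).sum(-1).argmin(1)]
print(f"training accuracy: {(pred == y).mean():.2f}")
```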

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The proposed voice conversion algorithm was used with two male speakers and, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than thespeech converted frame-by-frame.
Abstract: A voice conversion algorithm that uses speech segments as conversion units is proposed. Input speech is decomposed into speech segments by a speech recognition module, and the segments are replaced by speech segments uttered by another speaker. This algorithm makes it possible to convert not only the static characteristics but also the dynamic characteristics of speaker individuality. The proposed voice conversion algorithm was used with two male speakers. Spectrum distortion between target speech and the converted speech was reduced to one-third the natural spectrum distortion between the two speakers. A listening experiment showed that, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than the speech converted frame-by-frame.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX, and extended it to speaker-dependent speech recognition, which demonstrated a substantial difference between speaker-dependent and -independent systems.
Abstract: The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX. The error rate for the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition. The error rate is reduced to 1.4-2.6% with 600-2400 training sentences for each speaker, which demonstrates a substantial difference between speaker-dependent and -independent systems. Based on speaker-independent models, a study was made of speaker-adaptive speech recognition. With 40 adaptation sentences for each speaker, the error rate can be reduced from 4.3% to 3.1%.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: It is demonstrated that while the static feature gives the best individual performance, multiple linear combinations of feature sets based on regression analysis can reduce error rates.
Abstract: The performance of dynamic features in automatic speaker recognition is examined. Second- and third-order regression analysis examining the performance of the associated feature sets independently, in combination, and in the presence of noise is included. It is shown that each regression order has a clear optimum. These are independent of the analysis order of the static feature from which the dynamic features are derived, and insensitive to low-level noise added to the test speech. It is also demonstrated that while the static feature gives the best individual performance, multiple linear combinations of feature sets based on regression analysis can reduce error rates.
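
The dynamic features examined here are regression coefficients: a least-squares slope fitted to each static coefficient over a short window, applied once for first order and again for second order. A sketch, with the window half-width K as an illustrative choice:

```python
# Regression-based dynamic features: fit a least-squares slope to each
# static coefficient over a +/-K frame window (the usual delta and
# delta-delta construction). K is an illustrative choice.
import numpy as np

def deltas(static, K=2):
    """static: (T, D) static features. Returns first-order regression
    coefficients of the same shape, with edge frames replicated."""
    T = static.shape[0]
    padded = np.pad(static, ((K, K), (0, 0)), mode="edge")
    num = sum(k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
              for k in range(1, K + 1))
    den = 2 * sum(k * k for k in range(1, K + 1))
    return num / den

rng = np.random.default_rng(7)
static = np.cumsum(rng.normal(size=(50, 12)), axis=0)  # smooth-ish features
d1 = deltas(static)                # first-order (velocity) features
d2 = deltas(d1)                    # second-order (acceleration) features
features = np.hstack([static, d1, d2])
print(features.shape)              # (50, 36)
```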

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A combination of a high-performance speaker identification system and an isolated word recognizer is presented, capable of automatically producing speech and speaker identification with a closed set of speakers.
Abstract: A combination of a high-performance speaker identification system and an isolated word recognizer is presented. The front-end text-independent speaker identification system determines the most likely speaker for an input word. The speaker identity is then used to choose the reference word models for the speech recognizer. When used with a closed set of speakers, the combination is capable of automatically producing speech and speaker identification. For an open set of speakers, the speaker recognition system acts as a speaker quantizer which associates the unknown speaker with an acoustically similar speaker. The matching speaker's word models are used in the speech recognizer. The application of this front-end speaker recognizer is described for a DTW and HMM speech recognizer. Results on a combination using a DTW word recognizer are 100% for closed-set experiments.
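
A toy sketch of the two-stage combination: a text-independent speaker scorer picks the closest enrolled speaker, and that speaker's word models are handed to the recognizer. All names and structures below are invented for illustration.

```python
# Sketch of the two-stage combination: a speaker front end selects which
# speaker's word models the recognizer should use. All identifiers and
# data here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(8)
speaker_codebooks = {"alice": rng.normal(0, 1, (8, 6)),
                     "bob": rng.normal(2, 1, (8, 6))}
word_models = {"alice": {"yes": "alice-yes.mdl", "no": "alice-no.mdl"},
               "bob": {"yes": "bob-yes.mdl", "no": "bob-no.mdl"}}

def closest_speaker(frames):
    def distortion(cb):
        return ((frames[:, None] - cb[None]) ** 2).sum(-1).min(1).mean()
    return min(speaker_codebooks, key=lambda s: distortion(speaker_codebooks[s]))

utt = rng.normal(2, 1, (60, 6))          # an utterance from "bob"
spk = closest_speaker(utt)
print(spk, word_models[spk])             # recognizer now uses bob's models
```

In the open-set case the abstract describes, the selected speaker is merely the acoustically closest enrolled one, so the same selection code serves as the "speaker quantizer".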

Journal ArticleDOI
TL;DR: The methods and motivation for VAA data collection and validation procedures, the current contents of the database, and the results of exploratory research on a 1088-speaker subset of the database are described.

Proceedings ArticleDOI
04 Nov 1991
TL;DR: It is concluded that not only has the VQ technique reduced the amount of computation and storage, but it has also created new ideas for solving various problems in speech/speaker recognition.
Abstract: The author reviews major methods of applying the vector quantization (VQ) technique to speech and speaker recognition. These include speech recognition based on the combination of VQ and the DTW/HMM (dynamic time warping/hidden Markov model) technique; VQ-distortion-based recognition; learning VQ algorithms; speaker adaptation by VQ-codebook mapping; and VQ-distortion-based speaker recognition. It is concluded that not only has the VQ technique reduced the amount of computation and storage, but it has also created new ideas for solving various problems in speech/speaker recognition.
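
Codebook training is the common substrate of all these VQ applications. A k-means sketch of the core iteration of the LBG procedure, on synthetic data:

```python
# VQ codebook training by k-means, the core iteration of the LBG
# procedure underlying the reviewed VQ applications. Data are synthetic.
import numpy as np

def train_codebook(frames, size=8, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword ...
        nearest = ((frames[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
        # ... then move each codeword to the centroid of its cell.
        for k in range(size):
            members = frames[nearest == k]
            if len(members):                 # keep empty cells unchanged
                codebook[k] = members.mean(0)
    return codebook

rng = np.random.default_rng(9)
frames = rng.normal(size=(1000, 12))
cb = train_codebook(frames)
dist = ((frames[:, None] - cb[None]) ** 2).sum(-1).min(1).mean()
print(f"average distortion: {dist:.3f}")
```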

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors investigate the use of continuous features derived by perceptual linear predictive (PLP) analysis, examine the effect of adding temporal features, and compare it to the previously studied use of multiframe input.
Abstract: The authors investigate the use of continuous features derived by perceptual linear predictive (PLP) analysis, examine the effect of adding temporal features, and compare it to the previously studied use of multiframe input. Comparisons of the MLP (multilayer perceptron) and conventional Gaussian classifiers are also reported. The speaker-dependent portion of the Resource Management database was used for this test. Additionally, some experiments were performed with a perplexity-2200 speaker-independent recognition task on a subset of the TIMIT database. In each case, the PLP features were used as input to the networks. The experiments show the advantage of continuous PLP features and their first and second temporal derivatives.

PatentDOI
Kazunaga Yoshida, Takao Watanabe
TL;DR: A speech recognition apparatus is adapted to the speech of the particular speaker by converting the reference pattern into a normalized pattern by a neural network unit, internal parameters of which are modified through a learning operation using a normalized feature vector of the training pattern produced by the voice of the particular speaker and normalized on the basis of the reference pattern.
Abstract: A speech recognition apparatus of the speaker adaptation type operates to recognize an inputted speech pattern produced by a particular speaker by using a reference pattern produced by a voice of a standard speaker. The speech recognition apparatus is adapted to the speech of the particular speaker by converting the reference pattern into a normalized pattern by a neural network unit, internal parameters of which are modified through a learning operation using a normalized feature vector of the training pattern produced by the voice of the particular speaker and normalized on the basis of the reference pattern, so that the neural network unit provides an optimum output similar to the corresponding normalized feature vector of the training pattern. In the alternative, the speech recognition apparatus operates to recognize an inputted speech pattern by converting the inputted speech pattern into a normalized speech pattern by the neural network unit, internal parameters of which are modified through a learning operation using a feature vector of the reference pattern normalized on the basis of the training pattern, so that the neural network unit provides an optimum output similar to the corresponding normalized feature vector of the reference pattern and recognizing the normalized speech pattern according to the reference pattern.

Proceedings ArticleDOI
L. Mathan, Laurent Miclet
14 Apr 1991
TL;DR: In this paper, the authors trained multilayer perceptrons to confirm or reject the choice made by a Markov model system during recognition by classifying the trace of the winning model.
Abstract: In isolated-word recognition from everyday speech, a considerable share of the input lies outside the permitted vocabulary, and has to be rejected. The authors trained multilayer perceptrons to confirm or reject the choice made by a Markov model system during recognition by classifying the trace of the winning model. This rejection method is totally independent of the recognition procedure. Results show that performance on a database containing field data is better than with other rejection procedures.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The efficiency of adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system, which provides large-vocabulary, continuous speech recognition.
Abstract: An investigation of speech recognition and language processing is described. The speech recognition part consists of the large phonemic time-delay neural networks (TDNNs) which can automatically spot all 24 Japanese phonemes by simply scanning input speech. The language processing part is made up of a predictive LR parser which predicts subsequent phonemes based on the currently proposed phonemes. This TDNN-LR recognition system provides large-vocabulary and continuous speech recognition. Recognition experiments for ATR's conference registration task were performed using the TDNN-LR method. Speaker-dependent phrase recognition rates of 65.1% for the first choices and 88.8% within the fifth choices were attained. Also, efficiency in the adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system.