
Showing papers in "IEEE Transactions on Speech and Audio Processing in 1993"


Journal ArticleDOI
Kuldip K. Paliwal, B. Atal
TL;DR: It is shown that the split vector quantizer can quantize LPC information in 24 bits/frame with an average spectral distortion of 1 dB and less than 2% of the frames having spectral distortion greater than 2 dB.
Abstract: For low bit rate speech coding applications, it is important to quantize the LPC parameters accurately using as few bits as possible. Though vector quantizers are more efficient than scalar quantizers, their use for accurate quantization of linear predictive coding (LPC) information (using 24-26 bits/frame) is impeded by their prohibitively high complexity. A split vector quantization approach is used here to overcome the complexity problem. An LPC vector consisting of 10 line spectral frequencies (LSFs) is divided into two parts, and each part is quantized separately using vector quantization. Using the localized spectral sensitivity property of the LSF parameters, a weighted LSF distance measure is proposed. With this distance measure, it is shown that the split vector quantizer can quantize LPC information in 24 bits/frame with an average spectral distortion of 1 dB and less than 2% of the frames having spectral distortion greater than 2 dB. The effect of channel errors on the performance of this quantizer is also investigated and results are reported.
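
A minimal NumPy sketch of the split-VQ encoding step, assuming two hypothetical pre-trained codebooks (random 12-bit codebooks here, giving 24 bits/frame) and a caller-supplied sensitivity weight vector; the paper's specific weighting function is not reproduced.

```python
import numpy as np

def weighted_split_vq(lsf, cb_lo, cb_hi, w):
    """Quantize a 10-dim LSF vector by encoding each part separately.

    w: per-LSF weights; larger values mark perceptually sensitive LSFs.
    """
    split = cb_lo.shape[1]
    lo, hi = lsf[:split], lsf[split:]
    # Weighted squared-error nearest-neighbour search in each codebook.
    i_lo = np.argmin((((cb_lo - lo) ** 2) * w[:split]).sum(axis=1))
    i_hi = np.argmin((((cb_hi - hi) ** 2) * w[split:]).sum(axis=1))
    return i_lo, i_hi, np.concatenate([cb_lo[i_lo], cb_hi[i_hi]])

# Illustration: 24 bits/frame as two 12-bit (4096-entry) codebooks.
rng = np.random.default_rng(0)
cb_lo = np.sort(rng.uniform(0, np.pi, (4096, 5)), axis=1)  # placeholder, not trained
cb_hi = np.sort(rng.uniform(0, np.pi, (4096, 5)), axis=1)
lsf = np.sort(rng.uniform(0, np.pi, 10))
_, _, lsf_hat = weighted_split_vq(lsf, cb_lo, cb_hi, np.ones(10))
```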

665 citations


Journal ArticleDOI
TL;DR: The algorithms developed suggest a potentially interesting modification of Widrow's (1975) least-squares method for noise cancellation for the case where the reference signal contains a component of the desired signal.
Abstract: Identification of an unknown system and recovery of the input signals from observations of the outputs of an unknown multiple-input, multiple-output linear system are considered. Attention is focused on the two-channel case, in which the outputs of a 2×2 linear time-invariant system are observed. The approach consists of reconstructing the input signals by assuming that they are statistically uncorrelated and imposing this constraint on the signal estimates. In order to restrict the set of solutions, additional information on the true signal generation and/or on the form of the coupling systems is incorporated. Specific algorithms are developed and tested. As a special case, these algorithms suggest a potentially interesting modification of Widrow's (1975) least-squares method for noise cancellation, where the reference signal contains a component of the desired signal.

366 citations


Journal ArticleDOI
TL;DR: A nontraditional approach to the problem of estimating the parameters of a stochastic linear system is presented, and it is shown how the evolution of the dynamics as a function of the segment length can be modeled using alternative assumptions.
Abstract: A nontraditional approach to the problem of estimating the parameters of a stochastic linear system is presented. The method is based on the expectation-maximization algorithm and can be considered as the continuous analog of the Baum-Welch estimation algorithm for hidden Markov models. The algorithm is used for training the parameters of a dynamical system model that is proposed for better representing the spectral dynamics of speech for recognition. It is assumed that the observed feature vectors of a phone segment are the output of a stochastic linear dynamical system, and it is shown how the evolution of the dynamics as a function of the segment length can be modeled using alternative assumptions. A phoneme classification task using the TIMIT database demonstrates that the approach is the first effective use of an explicit model for statistical dependence between frames of speech.

238 citations


Journal ArticleDOI
TL;DR: It is shown experimentally that as the number of stages is increased above the optimal performance/complexity tradeoff, the quantizer robustness and outlier performance can be improved at the expense of a slight increase in rate.
Abstract: A tree-searched multistage vector quantization (VQ) scheme for linear predictive coding (LPC) parameters which achieves spectral distortion lower than 1 dB with low complexity and good robustness using rates as low as 22 b/frame is presented. The M-L search is used, and it is shown that it achieves performance close to that of the optimal search for a relatively small M. A joint codebook design strategy for multistage VQ which improves convergence speed and the VQ performance measures is presented. The best performance/complexity tradeoffs are obtained with relatively small size codebooks cascaded in a 3-6 stage configuration. It is shown experimentally that as the number of stages is increased above the optimal performance/complexity tradeoff, the quantizer robustness and outlier performance can be improved at the expense of a slight increase in rate. Results for log area ratio (LAR) and line spectral pair (LSP) parameters are presented. A training technique that reduces outliers at the expense of a slight average performance degradation is introduced. The method significantly outperforms the split codebook approach.
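
A hedged sketch of the M-L (keep-the-M-best) tree search through a multistage VQ; the codebooks below are random placeholders rather than the paper's jointly designed ones.

```python
import numpy as np

def ml_search(x, codebooks, M=8):
    """Encode x through the stages, keeping the M best partial paths."""
    hyps = [(x, [])]                                # (residual, index path)
    for cb in codebooks:
        cand = []
        for resid, path in hyps:
            err = ((resid - cb) ** 2).sum(axis=1)   # error to each codeword
            for j in np.argsort(err)[:M]:           # M best extensions
                cand.append((resid - cb[j], path + [j]))
        cand.sort(key=lambda h: (h[0] ** 2).sum())  # rank by residual energy
        hyps = cand[:M]
    resid, path = hyps[0]
    return path, x - resid                          # stage indices, reconstruction

rng = np.random.default_rng(1)
stages = [rng.normal(size=(64, 10)) for _ in range(3)]  # 3 stages x 6 bits each
indices, x_hat = ml_search(rng.normal(size=10), stages, M=8)
```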

201 citations


Journal ArticleDOI
TL;DR: The estimation of formant frequencies and bandwidths from the filter coefficients obtained through linear-predictive-coding analysis of speech is discussed from several viewpoints and a method for locating roots within the unit circle is derived.
Abstract: The estimation of formant frequencies and bandwidths from the filter coefficients obtained through linear-predictive-coding (LPC) analysis of speech is discussed from several viewpoints. A method for locating roots within the unit circle is derived. This algorithm is particularly well suited to computations carried out in fixed-point arithmetic using specialized signal processing hardware.
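
For orientation, the mapping from LPC poles to formants can be illustrated with a general-purpose root finder, as below; the paper's contribution is a fixed-point root-location method suited to DSP hardware, which this sketch does not reproduce.

```python
import numpy as np

def lpc_to_formants(a, fs):
    """a = [1, a1, ..., ap] for A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    Returns formant frequencies and bandwidths (Hz) from the roots of
    A(z) inside the unit circle with positive frequency."""
    r = np.roots(a)
    r = r[(np.imag(r) > 0) & (np.abs(r) < 1.0)]
    freqs = np.angle(r) * fs / (2 * np.pi)   # F = theta * fs / (2*pi)
    bws = -np.log(np.abs(r)) * fs / np.pi    # B = -(fs/pi) * ln|r|
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```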

171 citations


Journal ArticleDOI
TL;DR: A new class of hidden Markov models, more flexible than previously reported fenone-based word models, is proposed for the acoustic representation of words in an automatic speech recognition system, leading to an improved capability of modeling variations in pronunciation.
Abstract: A new class of hidden Markov models is proposed for the acoustic representation of words in an automatic speech recognition system. The models, built from combinations of acoustically based sub-word units called fenones, are derived automatically from one or more sample utterances of a word. Because they are more flexible than previously reported fenone-based word models, they lead to an improved capability of modeling variations in pronunciation. They are therefore particularly useful in the recognition of continuous speech. In addition, their construction is relatively simple, because it can be done using the well-known forward-backward algorithm for parameter estimation of hidden Markov models. Appropriate reestimation formulas are derived for this purpose. Experimental results obtained on a 5000-word vocabulary natural language continuous speech recognition task are presented to illustrate the enhanced power of discrimination of the new models.

170 citations


Journal ArticleDOI
TL;DR: A new minimum recognition error formulation and a generalized probabilistic descent (GPD) algorithm are analyzed and used to accomplish discriminative training of a conventional dynamic-programming-based speech recognizer.
Abstract: A new minimum recognition error formulation and a generalized probabilistic descent (GPD) algorithm are analyzed and used to accomplish discriminative training of a conventional dynamic-programming-based speech recognizer. The objective of discriminative training here is to directly minimize the recognition error rate. To achieve this, a formulation that allows controlled approximation of the exact error rate and renders optimization possible is used. The GPD method is implemented in a dynamic-time-warping (DTW)-based system. A linear discriminant function on the DTW distortion sequence is used to replace the conventional average DTW path distance. A series of speaker-independent recognition experiments using the highly confusible English E-set as the vocabulary showed a recognition rate of 84.4% compared to approximately 60% for traditional template training via clustering. The experimental results verified that the algorithm converges to a solution that achieves minimum error rate.
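
A toy sketch of the minimum-error/GPD machinery: a sigmoid turns the misclassification measure (best rival score minus correct-class score) into a smooth error count whose gradient drives the update. A generic linear discriminant on a fixed feature vector stands in for the paper's discriminant on the DTW distortion sequence.

```python
import numpy as np

def sigmoid(d, gamma=1.0):
    return 1.0 / (1.0 + np.exp(-gamma * d))

def gpd_step(W, x, label, lr=0.1, gamma=1.0):
    """One GPD update. W: (num_classes, dim) discriminant weights."""
    g = W @ x                                            # class scores
    rival = np.argmax(np.where(np.arange(len(g)) == label, -np.inf, g))
    d = g[rival] - g[label]                              # misclassification measure
    loss = sigmoid(d, gamma)                             # smooth 0-1 error
    grad = gamma * loss * (1 - loss)                     # d(loss)/d(d)
    W[label] += lr * grad * x                            # raise correct-class score
    W[rival] -= lr * grad * x                            # lower the best rival
    return loss
```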

165 citations


Journal ArticleDOI
F.K. Soong, Biing-Hwang Juang
TL;DR: A globally optimal scalar quantizer is designed for each differential LSP frequency, which achieves a 1-dB average log spectral distortion, a commonly accepted level for reproducing perceptually transparent spectral information.
Abstract: Two nonuniform aspects of the line spectrum pair (LSP) linear predictive coding (LPC) parameters are investigated, including nonuniform statistical distributions and spectral sensitivities of adjacent LSP frequency differences. Based upon these two nonuniform properties, a globally optimal scalar quantizer is designed for each differential LSP frequency. The design algorithm is dynamic programming based and minimization of a nontrivial data-dependent spectral distortion is adopted as the optimality criterion. At 32 bits/frame, the new LSP quantizer achieves a 1-dB average log spectral distortion, a commonly accepted level for reproducing perceptually transparent spectral information. The quantization performance has also been shown to be robust across different speakers and databases.

151 citations


Journal ArticleDOI
TL;DR: A normalized least-mean-squares (NLMS) adaptive algorithm for an acoustic echo canceller, with twice the convergence speed of the conventional NLMS at the same computational load, is proposed and its fast convergence is demonstrated.
Abstract: A normalized least-mean-squares (NLMS) adaptive algorithm with double the convergence speed, at the same computational load, of the conventional NLMS for an acoustic echo canceller is proposed. This algorithm, called the ES (exponentially weighted stepsize) algorithm, uses a different stepsize (feedback constant) for each weight of an adaptive transversal filter. These stepsizes are time-invariant and weighted proportionally to the expected variation of a room impulse response. The algorithm adjusts coefficients with large errors in large steps, and coefficients with small errors in small steps. A transition formula is derived for the mean-squared coefficient error of the algorithm. The mean stepsize determines the convergence condition, the convergence speed, and the final excess mean-squared error. Modified for a practical multiple DSP structure, the algorithm requires only the same amount of computation as the conventional NLMS. The algorithm is implemented in a commercial acoustic echo canceller, and its fast convergence is demonstrated.
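
A minimal sketch of the ES update, assuming an exponential per-tap stepsize profile with placeholder constants mu0 and decay; the paper derives the profile from the expected envelope of the room impulse response.

```python
import numpy as np

def es_nlms(x, d, n_taps, mu0=1.0, decay=0.999):
    """x: far-end reference; d: microphone signal to be echo-cancelled."""
    w = np.zeros(n_taps)
    mu = mu0 * decay ** np.arange(n_taps)   # time-invariant per-tap stepsizes
    e = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        xvec = x[n - n_taps + 1:n + 1][::-1]      # most recent sample first
        e[n] = d[n] - w @ xvec                    # residual echo
        # NLMS step, but each tap scaled by its own stepsize mu[k].
        w += mu * e[n] * xvec / (xvec @ xvec + 1e-10)
    return w, e
```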

148 citations


Journal ArticleDOI
TL;DR: A shared-distribution hidden Markov model (HMM) is presented for speaker-independent continuous speech recognition that reduced the word error rate on the DARPA Resource Management task by 20% in comparison with the generalized-triphone model.
Abstract: A shared-distribution hidden Markov model (HMM) is presented for speaker-independent continuous speech recognition. The output distributions across different phonetic HMMs are shared with each other when they exhibit acoustic similarity. This sharing provides the freedom to use a larger number of Markov states for each phonetic model. Although an increase in the number of states will increase the total number of free parameters, with distribution sharing one can collapse redundant states while maintaining necessary ones. The shared-distribution model reduced the word error rate on the DARPA Resource Management task by 20% in comparison with the generalized-triphone model.
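
A hedged sketch of the sharing step for discrete output distributions: distributions from different phonetic HMM states are greedily tied when a symmetric, smoothed KL divergence between them is small. Both the divergence and the threshold are illustrative stand-ins for the paper's acoustic-similarity measure.

```python
import numpy as np

def sym_kl(p, q, eps=1e-10):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def tie_distributions(dists, threshold=0.1):
    """dists: (num_states, num_symbols) row-stochastic matrix.
    Returns, for each state, an index into a smaller shared pool."""
    pool, index = [], []
    for d in dists:
        for k, shared in enumerate(pool):
            if sym_kl(d, shared) < threshold:  # acoustically similar: share it
                index.append(k)
                break
        else:
            pool.append(d.copy())              # otherwise keep a new distribution
            index.append(len(pool) - 1)
    return np.array(index), np.vstack(pool)
```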

143 citations


Journal ArticleDOI
Willem Bastiaan Kleijn
TL;DR: The coding method is easily combined with existing LP-based speech coders, such as CELP, for unvoiced signals, and excellent voiced speech quality is obtained at rates between 3.0 and 4.0 kb/s.
Abstract: Voiced speech is interpreted as a concatenation of slowly evolving pitch-cycle waveforms. This signal can be reconstructed by interpolation from a downsampled sequence of pitch-cycle waveforms with a rate of one prototype waveform per 20-30 ms interval. The prototype waveform is described by a set of linear-prediction (LP) filter coefficients describing the formant structure and a prototype excitation waveform, quantized with analysis-by-synthesis procedures. The speech signal is reconstructed by filtering an excitation signal consisting of the concatenation of (infinitesimal) sections of the instantaneous excitation waveforms. To obtain the correct level of periodicity, the short-term and the long-term correlations between the instantaneous excitation waveforms can be controlled explicitly. Thus, distortions such as noise, reverberation, and buzziness can be prevented. The coding method is easily combined with existing LP-based speech coders, such as CELP, for unvoiced signals. Excellent voiced speech quality is obtained at rates between 3.0 and 4.0 kb/s.
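
A bare-bones sketch of the interpolation idea: concatenated pitch cycles cross-fade from one prototype to the next. Real prototype-waveform coding interpolates LP excitation prototypes after phase alignment and controls periodicity explicitly; both steps are omitted here.

```python
import numpy as np

def interpolate_prototypes(p0, p1, n_cycles):
    """Cross-fade from prototype p0 to p1 (equal length = one pitch
    period) over n_cycles concatenated pitch cycles."""
    out = []
    for k in range(n_cycles):
        alpha = k / max(n_cycles - 1, 1)      # evolves 0 -> 1 across cycles
        out.append((1 - alpha) * p0 + alpha * p1)
    return np.concatenate(out)
```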

Journal ArticleDOI
TL;DR: Voiced-unvoiced-silence classification of speech was done using a multilayer feedforward network; results indicated that the network performance was not significantly affected by the size of the training set and that a classification rate as high as 96% was obtained.
Abstract: Voiced-unvoiced-silence classification of speech was done using a multilayer feedforward network. The network performance was evaluated and compared to that of a maximum-likelihood classifier. Results indicated that the network performance was not significantly affected by the size of the training set and that a classification rate as high as 96% was obtained.
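
A sketch of two classic frame features (short-time log energy and zero-crossing rate) on which a voiced/unvoiced/silence network of this kind can be trained; the frame and hop lengths are illustrative, and the paper's exact feature set is not specified here.

```python
import numpy as np

def frame_features(x, frame_len=240, hop=80):
    """Per-frame (log energy, zero-crossing rate) feature vectors."""
    feats = []
    for start in range(0, len(x) - frame_len, hop):
        f = x[start:start + frame_len]
        energy = np.log10(np.mean(f ** 2) + 1e-12)       # short-time log energy
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2   # zero-crossing rate
        feats.append((energy, zcr))
    return np.array(feats)
```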

Journal ArticleDOI
TL;DR: The authors describe a scheme for rapidly obtaining an approximate acoustic match for all the words in the vocabulary in such a way as to ensure that the correct word is, with high probability, one of a small number of words examined in detail.
Abstract: In a large vocabulary speech recognition system using hidden Markov models, calculating the likelihood of an acoustic signal segment for all the words in the vocabulary involves a large amount of computation. In order to run in real time on a modest amount of hardware, it is important that these detailed acoustic likelihood computations be performed only on words which have a reasonable probability of being the word that was spoken. The authors describe a scheme for rapidly obtaining an approximate acoustic match for all the words in the vocabulary in such a way as to ensure that the correct word is, with high probability, one of a small number of words examined in detail. Using fast search methods, they obtain a matching algorithm that is about a hundred times faster than doing a detailed acoustic likelihood computation on all the words in the IBM Office Correspondence isolated word dictation task, which has a vocabulary of 20000 words. Experimental results showing the effectiveness of such a fast match for a number of talkers are given.
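
Schematically, the fast match is a two-pass search, as in the sketch below; cheap_score and detailed_score are hypothetical stand-ins for the paper's approximate and detailed acoustic matches, and the shortlist size is illustrative.

```python
def recognize(acoustics, vocabulary, cheap_score, detailed_score, shortlist=100):
    """Rank the whole vocabulary with the cheap approximate match, then
    run the expensive detailed match only on the shortlist."""
    ranked = sorted(vocabulary, key=lambda w: -cheap_score(acoustics, w))
    return max(ranked[:shortlist], key=lambda w: detailed_score(acoustics, w))
```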

Journal ArticleDOI
TL;DR: It was found that speaker-adaptive systems outperform both speaker-independent and speaker-dependent systems, suggesting that the most effective system is one that begins with speaker- independent training and continues to adapt to users.
Abstract: The DARPA Resource Management task is used as a domain for investigating the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The error rate of the speaker-independent recognition system, SPHINX, was reduced substantially by incorporating between-word triphone models, additional dynamic features, and sex-dependent, semicontinuous hidden Markov models. The error rate for speaker-independent recognition was 4.3%. On speaker-dependent data, the error rate was further reduced to 2.6-1.4% with 600-2400 training sentences for each speaker. Using speaker-independent models, the authors studied speaker-adaptive recognition. Both codebooks and output distributions were considered for adaptation. It was found that speaker-adaptive systems outperform both speaker-independent and speaker-dependent systems, suggesting that the most effective system is one that begins with speaker-independent training and continues to adapt to users.

Journal ArticleDOI
Neri Merhav, Chin-Hui Lee
TL;DR: A generalized likelihood ratio test is developed and shown to be optimal in the sense of achieving the highest asymptotic exponential rate of decay of the error probability for the worst-case mismatch situation.
Abstract: A minimax approach for robust classification of parametric information sources is studied and applied to isolated-word speech recognition based on hidden Markov modeling. The goal is to reduce the sensitivity of speech recognition systems to a possible mismatch between the training and testing conditions. To this end, a generalized likelihood ratio test is developed and shown to be optimal in the sense of achieving the highest asymptotic exponential rate of decay of the error probability for the worst-case mismatch situation. The proposed approach is compared to the standard approach, where no mismatch is assumed, in recognition of noisy speech and in other realistic mismatch situations.

Journal ArticleDOI
TL;DR: The first successfully implemented real-time Mandarin dictation machine, which recognizes Mandarin speech with very large vocabulary and almost unlimited texts for the input of Chinese characters into computers, is described.
Abstract: The first successfully implemented real-time Mandarin dictation machine, which recognizes Mandarin speech with very large vocabulary and almost unlimited texts for the input of Chinese characters into computers, is described. The machine is speaker-dependent, and the input speech is in the form of sequences of isolated syllables. The machine can be decomposed into two subsystems. The first subsystem recognizes the syllables using hidden Markov models. Because every syllable can represent many different homonym characters and form different multisyllabic words with syllables on its right or left, the second subsystem is needed to identify the exact characters from the syllables and correct the errors in syllable recognition. The real-time implementation is on an IBM PC/AT, connected to three sets of specially designed hardware boards on which seven TMS 320C25 chips operate in parallel. The preliminary test results indicate that it takes only about 0.45 s to dictate a syllable (or character) with an accuracy on the order of 90%.

Journal ArticleDOI
TL;DR: A method for combining phonetic and fenonic models is presented and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported.
Abstract: A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word (i.e., the sequence of fenones used to represent the word) is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.

Journal ArticleDOI
TL;DR: Although it has not been proved that the corrective training algorithm converges, experimental evidence suggests that it does, and that it leads to fewer recognition errors than can be obtained with conventional training methods.
Abstract: The problem of estimating the parameter values of hidden Markov word models for speech recognition is addressed. It is argued that maximum-likelihood estimation of the parameters via the forward-backward algorithm may not lead to values which maximize recognition accuracy. An alternative estimation procedure called corrective training, which is aimed at minimizing the number of recognition errors, is described. Corrective training is similar to a well-known error-correcting training procedure for linear classifiers and works by iteratively adjusting the parameter values so as to make correct words more probable and incorrect words less probable. There are strong parallels between corrective training and maximum mutual information estimation; the relationship of these two techniques is discussed and a comparison is made of their performance. Although it has not been proved that the corrective training algorithm converges, experimental evidence suggests that it does, and that it leads to fewer recognition errors than can be obtained with conventional training methods.
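
A toy sketch of the corrective-training idea for a discrete-output HMM: output probabilities along the correct word's state alignment are boosted and those along a competing incorrect word's alignment attenuated, then renormalized. The step size and update form are placeholders, not the paper's exact adjustment rule.

```python
import numpy as np

def corrective_update(B, correct_path, wrong_path, obs, step=0.05):
    """B: (num_states, num_symbols) output probability matrix."""
    for (s_c, s_w), o in zip(zip(correct_path, wrong_path), obs):
        B[s_c, o] += step                            # correct alignment: more probable
        B[s_w, o] = max(B[s_w, o] - step, 1e-6)      # rival alignment: less probable
    B /= B.sum(axis=1, keepdims=True)                # restore row-stochasticity
    return B
```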

Journal ArticleDOI
TL;DR: In this paper, a waveform substitution technique using interpolation based on the slowly varying speech parameters of short-time energy and zero-crossing information is developed for a packetized speech communication system.
Abstract: A waveform substitution technique using interpolation based on the slowly varying speech parameters of short-time energy and zero-crossing information is developed for a packetized speech communication system. The system uses 64-kb/s conventional pulse code modulation (PCM) for encoding and takes advantage of active talkspurts and silence intervals to increase the efficiency of utilizing a digital link. The short-time energy and information on the zero-crossings needed for the purpose of determining talkspurts are transmitted in a preceding packet. Hence, when a packet is pronounced lost, its envelope and frequency characteristics are obtained from a previous packet and used to synthesize a substitution waveform which is free of annoying sounds that are due to abrupt changes in amplitude.
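
A minimal sketch of the substitution step, assuming the short-time energy of the lost packet arrived as side information in the preceding packet; the paper's use of zero-crossing information to match frequency characteristics is omitted.

```python
import numpy as np

def substitute_packet(prev_packet, target_energy):
    """Repeat the previous packet, scaled so its mean-square energy
    matches the energy transmitted for the lost packet."""
    cur_energy = np.mean(prev_packet.astype(float) ** 2) + 1e-12
    gain = np.sqrt(target_energy / cur_energy)
    return gain * prev_packet   # avoids an abrupt amplitude change
```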

Journal ArticleDOI
TL;DR: Improved tone concatenation rules are presented, and preliminary subjective tests indicate that these rules actually give better synthesized speech for a formant-based Chinese text-to-speech system.
Abstract: A set of improved tone concatenation rules to be used in a formant-based Mandarin Chinese text-to-speech system is presented. This system concatenates prestored syllables superimposed by additional tone patterns to obtain speech sentences for unlimited text, with the acoustic properties of each syllable modified by a set of synthesis rules. The tone concatenation rules are the most important among these synthesis rules, because they tell how the tone patterns for the syllables should be modified in an arbitrary sentence under various conditions of concatenating syllables of different tones on both sides. The improved tone concatenation rules are obtained empirically by carefully analyzing the tone pattern behavior under various tone concatenation conditions for many sentences in a database. A total of 14 representative tone patterns are defined for the five tones, and different rules about which pattern should be used under what kind of tone concatenation conditions are organized in detail. Preliminary subjective tests indicate that these rules actually give better synthesized speech for a formant-based Chinese text-to-speech system.

Journal ArticleDOI
TL;DR: It is shown that by combining both LVQ's discriminative power and the HMM's capability of modeling temporal variations of speech in a hybrid algorithm, the performance of the original HMM-based speech recognizer is significantly improved.
Abstract: A hybrid speech recognition algorithm based on the combination of hidden Markov models (HMMs) and learning vector quantization (LVQ) is presented. The LVQ training algorithms are capable of producing highly discriminative reference vectors for classifying static patterns, i.e., vectors with a fixed dimension. The HMM formulation has also been successfully applied to the recognition of dynamic speech patterns that are of variable duration. It is shown that by combining both LVQ's discriminative power and the HMM's capability of modeling temporal variations of speech in a hybrid algorithm, the performance of the original HMM-based speech recognizer is significantly improved. For a highly confusable vocabulary consisting of the nine American English E-set letters used in a multispeaker, isolated-word test mode, the average word accuracy of the baseline HMM recognizer is 67%. When LVQ is incorporated in the hybrid system, the word accuracy increases to 83%.
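
The discriminative ingredient of the hybrid is the LVQ codebook update; a minimal LVQ1 step is sketched below (the paper may use other LVQ variants, such as LVQ2).

```python
import numpy as np

def lvq1_step(codebook, labels, x, y, lr=0.05):
    """Move the nearest reference vector toward x if its class label
    matches y, away from x otherwise."""
    i = np.argmin(((codebook - x) ** 2).sum(axis=1))   # nearest reference
    sign = 1.0 if labels[i] == y else -1.0
    codebook[i] += sign * lr * (x - codebook[i])
    return i
```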

Journal ArticleDOI
TL;DR: A new class of A* algorithms for Viterbi phonetic decoding subject to lexical constraints is presented and can be made to run substantially faster than the Viterbi algorithm in an isolated word recognizer having a vocabulary of 1600 words.
Abstract: A new class of A* algorithms for Viterbi phonetic decoding subject to lexical constraints is presented. This type of algorithm can be made to run substantially faster than the Viterbi algorithm in an isolated word recognizer having a vocabulary of 1600 words. In addition, multiple recognition hypotheses can be generated on demand, and the search can be constrained with respect to conditions on phone durations in such a way that computational requirements are substantially reduced. Results are presented on a 60000-word recognition task.

Journal ArticleDOI
Yunxin Zhao
TL;DR: A large-vocabulary, speaker-independent, continuous speech recognition system based on hidden Markov modeling (HMM) of phoneme-sized acoustic units with continuous mixture Gaussian densities is described and evaluated on the TIMIT database for a vocabulary size of 853.
Abstract: The author describes a large vocabulary, speaker-independent, continuous speech recognition system which is based on hidden Markov modeling (HMM) of phoneme-sized acoustic units using continuous mixture Gaussian densities. A bottom-up merging algorithm is developed for estimating the parameters of the mixture Gaussian densities, where the resultant number of mixture components is proportional to both the sample size and dispersion of training data. A compression procedure is developed to construct a word transcription dictionary from the acoustic-phonetic labels of sentence utterances. A modified word-pair grammar using context-sensitive grammatical parts is incorporated to constrain task difficulty. The Viterbi beam search is used for decoding. The segmental K-means algorithm is implemented as a baseline for evaluating the bottom-up merging technique. The system has been evaluated on the TIMIT database (1990) for a vocabulary size of 853. For test set perplexities of 24, 104, and 853, the decoding word accuracies are 90.9%, 86.0%, and 62.9%, respectively. For the perplexity of 104, the decoding accuracy achieved by using the merging algorithm is 4.1% higher than that using the segmental K-means (22.8% error reduction), and the decoding accuracy using the compressed dictionary is 3.0% higher than that using a standard dictionary (18.1% error reduction).
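
A hedged sketch of one bottom-up merging step for a diagonal-covariance Gaussian mixture: the two closest components are replaced by their moment-matched merge. The closeness criterion here is a simple weighted distance between means and may differ from the paper's.

```python
import numpy as np

def merge_closest(weights, means, variances):
    """weights: (n,); means, variances: (n, d) diagonal Gaussians."""
    n = len(weights)
    best, pair = np.inf, None
    for i in range(n):
        for j in range(i + 1, n):
            d = ((means[i] - means[j]) ** 2).sum() * weights[i] * weights[j]
            if d < best:
                best, pair = d, (i, j)
    i, j = pair
    w = weights[i] + weights[j]
    mu = (weights[i] * means[i] + weights[j] * means[j]) / w
    # Moment-matched variance of the merged pair.
    var = (weights[i] * (variances[i] + (means[i] - mu) ** 2)
           + weights[j] * (variances[j] + (means[j] - mu) ** 2)) / w
    keep = [k for k in range(n) if k not in (i, j)]
    return (np.append(weights[keep], w),
            np.vstack([means[keep], mu]),
            np.vstack([variances[keep], var]))
```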

Journal ArticleDOI
Li Deng
TL;DR: The concept of two-level (global and local) hierarchical nonstationarity is introduced to describe the elastic and dynamic nature of the speech signal, with potential uses in automatically uncovering relationally invariant properties of the speech signal and in speech recognition.
Abstract: The concept of two-level (global and local) hierarchical nonstationarity is introduced to describe the elastic and dynamic nature of the speech signal. A doubly stochastic process model is developed to implement this concept. In the model, the global nonstationarity is embodied through an underlying Markov chain that governs evolution of the parameters in a set of output stochastic processes. The local nonstationarity is realized by utilizing state-conditioned, time-varying first- and second-order statistics in the output data-generation process models. For potential uses in automatic uncovering of relationally invariant properties from the speech signal and in speech recognition, the local nonstationarity is represented in a parametric form. Preliminary experiments on fitting the models to speech data demonstrate the superior performance of the proposed model over several traditional types of hidden Markov models.

Journal ArticleDOI
TL;DR: An algorithm for multichannel adaptive IIR (infinite impulse response) filtering is presented and applied to the active control of broadband random noise in a small reverberant room, and it is found that for the present application FIR filters are sufficient when the primary noise source is a loudspeaker.
Abstract: An algorithm for multichannel adaptive IIR (infinite impulse response) filtering is presented and applied to the active control of broadband random noise in a small reverberant room. Assuming complete knowledge of the primary noise, the theoretically optimal reductions of acoustic energy are initially calculated by means of a frequency-domain model. These results are contrasted with results of a causality-constrained theoretical time-domain optimization, which are then compared with experimental results, the latter two results showing good agreement. The experimental performances of adaptive multichannel FIR (finite impulse response) and IIR filters are then compared for a four-secondary-source, eight-error-microphone active control system, and it is found that for the present application FIR filters are sufficient when the primary noise source is a loudspeaker. Some experiments are then presented with the primary noise field generated by a panel excited by a loudspeaker in an adjoining room. These results show that far better performances are provided by IIR and FIR filters when the primary source has a lightly damped dynamic behavior which the active controller must model.
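
For orientation, the standard single-channel filtered-x LMS update for the adaptive FIR case is sketched below; the paper's multichannel IIR algorithm generalizes this. Here s_hat is an assumed FIR estimate of the secondary path from the control source to the error microphone.

```python
import numpy as np

def fxlms(x, d, s_hat, n_taps, mu=0.01):
    """x: reference signal; d: primary noise at the error microphone
    (same length as x); s_hat: secondary-path FIR estimate."""
    w = np.zeros(n_taps)
    xf = np.convolve(x, s_hat)[:len(x)]   # reference filtered through s_hat
    y_hist = np.zeros(len(s_hat))         # recent controller outputs
    e = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        x_win = x[n - n_taps + 1:n + 1][::-1]
        y = w @ x_win                     # anti-noise sample
        y_hist = np.roll(y_hist, 1)
        y_hist[0] = y
        e[n] = d[n] + s_hat @ y_hist      # residual measured at the mic
        xf_win = xf[n - n_taps + 1:n + 1][::-1]
        w -= mu * e[n] * xf_win           # LMS step on the filtered reference
    return w, e
```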

Journal ArticleDOI
TL;DR: It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features.
Abstract: Several recently proposed automatic speech recognition (ASR) front-ends are experimentally compared in speaker-dependent, speaker-independent (or cross-speaker) recognition. The perceptually based linear predictive (PLP) front-end, with the root-power sums (RPS) distance measure, yields generally the highest accuracies, especially in cross-speaker recognition. It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features. For a digit vocabulary and five reference templates obtained with a clustering algorithm, the optimization improves recognition accuracy from 97% to 98.1%, with respect to the PLP-RPS front-end.
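
A sketch of the root-power sums distance under the common reading that it is an index-weighted cepstral distance, which emphasizes spectral peaks relative to the overall tilt; treat the exact weighting as an assumption rather than the paper's definition.

```python
import numpy as np

def rps_distance(c1, c2):
    """Index-weighted cepstral distance between two cepstral vectors
    (coefficients c_1..c_p, excluding c_0)."""
    k = np.arange(1, len(c1) + 1)   # cepstral indices 1..p as weights
    return np.sqrt(np.sum((k * (c1 - c2)) ** 2))
```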

Journal ArticleDOI
TL;DR: An estimation algorithm for noise robust speech recognition, the minimum mean log spectral distance (MMLSD), is presented and is highly efficient with a quasi-stationary environmental noise, recorded with a desktop microphone, and requires almost no tuning to differences between this noise and the computer-generated white noise.
Abstract: An estimation algorithm for noise robust speech recognition, the minimum mean log spectral distance (MMLSD), is presented. The estimation is matched to the recognizer by seeking to minimize the average distortion as measured by a Euclidean distance between filterbank log-energy vectors, approximating the weighted-cepstral distance used by the recognizer. The estimation is computed using a clean speech spectral probability distribution, estimated from a database, and a stationary ARMA model for the noise. When trained on clean speech and tested with additive white noise at 10-dB SNR, the recognition accuracy with the MMLSD algorithm is comparable to that achieved with training the recognizer at the same constant 10-dB SNR. The algorithm is also highly efficient with a quasi-stationary environmental noise, recorded with a desktop microphone, and requires almost no tuning to differences between this noise and the computer-generated white noise.

Journal ArticleDOI
TL;DR: An estimation algorithm to improve the noise robustness of filterbank-based speech recognition systems is presented, based on a minimum mean square error (MMSE) estimation of the filter log-energies, introducing a significant improvement over related published algorithms by conditioning the estimate on the total frame energy.
Abstract: An estimation algorithm to improve the noise robustness of filterbank-based speech recognition systems is presented. The algorithm is based on a minimum mean square error (MMSE) estimation of the filter log-energies, introducing a significant improvement over related published algorithms by conditioning the estimate on the total frame energy. The algorithm was evaluated with DECIPHER, SRI's continuous-speech speaker-independent recognizer, on two types of noisy speech: a standard database with added white Gaussian noise, and recordings made in a noisy environment. With white noise the recognition accuracy obtained while training on clean speech and testing in noise approached that obtained with training and testing in noise. In the noisy environment, the estimation algorithm boosted the recognition system's performance with a table-mounted microphone almost to the level achieved with a close-talking microphone.

Journal ArticleDOI
TL;DR: The authors establish a theory for lossless pole-zero modeling of speech signals for the description of nasal sounds based on a generalized vocal tract tube model, which consists of the main vocal tract, the oral tract, and the nasal tract.
Abstract: The authors establish a theory for lossless pole-zero modeling of speech signals for the description of nasal sounds. The theory is based on a generalized vocal tract tube model which consists of the main vocal tract, the oral tract, and the nasal tract. A pole-zero type transfer function, which turns out to be a generalized version of the existing all-pole type transfer function, is derived. Fundamental properties of the generalized vocal tract model are investigated, employing the concept of discrete-time reactance. A procedure for evaluating the reflection coefficients for the model is outlined. The assumption of losslessness in the modeling leads to the following two properties. First, the combination of two lattice structures representing the oral and the nasal tracts forms one larger lattice structure when viewed at their joint point called the branch boundary. Second, the oral and the nasal tract render respective discrete-time reactances whose convex combination generates the discrete-time reactance at the branch boundary. The first property enables the combined reactance to be computed, and the second helps separate it into its components.

Journal ArticleDOI
TL;DR: Tuning curves are compared here for two cochlear models, one based on a cascade of low-pass filter sections and the other based onA cascade of filter sections derived from a one-dimensional transmission line, and the resultant simulated tuning curves and neural outputs indicate that the modified transmission-line approximation yields a more accurate co chlear model.
Abstract: Most practical models of cochlear mechanics are based on approximations to the wave equation in the cochlea. These approximations engender compromises in the accuracy with which the cochlear motion can be reproduced. Tuning curves are compared here for two cochlear models, one based on a cascade of low-pass filter sections and the other based on a cascade of filter sections derived from a one-dimensional transmission line. The filters in the two simulations are designed to give comparable latency in the neural response to inputs at different frequencies, and the simulations include an active gain-control mechanism to adjust the characteristics of each section with changes in the input signal level. The resultant simulated tuning curves and neural outputs indicate that the modified transmission-line approximation yields a more accurate cochlear model.