
Showing papers in "IEEE Transactions on Speech and Audio Processing in 1993"


Journal ArticleDOI
Kuldip K. Paliwal, B. Atal
TL;DR: It is shown that the split vector quantizer can quantize LPC information in 24 bits/frame with an average spectral distortion of 1 dB and less than 2% of the frames having spectral distortion greater than 2 dB.
Abstract: For low bit rate speech coding applications, it is important to quantize the LPC parameters accurately using as few bits as possible. Though vector quantizers are more efficient than scalar quantizers, their use for accurate quantization of linear predictive coding (LPC) information (using 24-26 bits/frame) is impeded by their prohibitively high complexity. A split vector quantization approach is used here to overcome the complexity problem. An LPC vector consisting of 10 line spectral frequencies (LSFs) is divided into two parts, and each part is quantized separately using vector quantization. Using the localized spectral sensitivity property of the LSF parameters, a weighted LSF distance measure is proposed. With this distance measure, it is shown that the split vector quantizer can quantize LPC information in 24 bits/frame with an average spectral distortion of 1 dB and less than 2% of the frames having spectral distortion greater than 2 dB. The effect of channel errors on the performance of this quantizer is also investigated and results are reported.
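
A minimal NumPy sketch of the split-VQ encoding step, assuming two hypothetical pre-trained codebooks (random 12-bit codebooks here, giving 24 bits/frame) and a caller-supplied sensitivity weight vector; the paper's specific weighting function is not reproduced.

```python
import numpy as np

def weighted_split_vq(lsf, cb_lo, cb_hi, w):
    """Quantize a 10-dim LSF vector by encoding each part separately.

    w: per-LSF weights; larger values mark perceptually sensitive LSFs.
    """
    split = cb_lo.shape[1]
    lo, hi = lsf[:split], lsf[split:]
    # Weighted squared-error nearest-neighbour search in each codebook.
    i_lo = np.argmin((((cb_lo - lo) ** 2) * w[:split]).sum(axis=1))
    i_hi = np.argmin((((cb_hi - hi) ** 2) * w[split:]).sum(axis=1))
    return i_lo, i_hi, np.concatenate([cb_lo[i_lo], cb_hi[i_hi]])

# Illustration: 24 bits/frame as two 12-bit (4096-entry) codebooks.
rng = np.random.default_rng(0)
cb_lo = np.sort(rng.uniform(0, np.pi, (4096, 5)), axis=1)  # placeholder, not trained
cb_hi = np.sort(rng.uniform(0, np.pi, (4096, 5)), axis=1)
lsf = np.sort(rng.uniform(0, np.pi, 10))
_, _, lsf_hat = weighted_split_vq(lsf, cb_lo, cb_hi, np.ones(10))
```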

665 citations


Journal ArticleDOI
TL;DR: The algorithms developed suggest a potentially interesting modification of Widrow's (1975) least-squares method for noise cancellation for the case where the reference signal contains a component of the desired signal.
Abstract: Identification of an unknown system and recovery of the input signals from observations of the outputs of an unknown multiple-input, multiple-output linear system are considered. Attention is focused on the two-channel case, in which the outputs of a 2×2 linear time-invariant system are observed. The approach consists of reconstructing the input signals by assuming that they are statistically uncorrelated and imposing this constraint on the signal estimates. In order to restrict the set of solutions, additional information on the true signal generation and/or on the form of the coupling systems is incorporated. Specific algorithms are developed and tested. As a special case, these algorithms suggest a potentially interesting modification of Widrow's (1975) least-squares method for noise cancellation, where the reference signal contains a component of the desired signal.

366 citations


Journal ArticleDOI
TL;DR: A nontraditional approach to the problem of estimating the parameters of a stochastic linear system is presented, and it is shown how the evolution of the dynamics as a function of the segment length can be modeled using alternative assumptions.
Abstract: A nontraditional approach to the problem of estimating the parameters of a stochastic linear system is presented. The method is based on the expectation-maximization algorithm and can be considered as the continuous analog of the Baum-Welch estimation algorithm for hidden Markov models. The algorithm is used for training the parameters of a dynamical system model that is proposed for better representing the spectral dynamics of speech for recognition. It is assumed that the observed feature vectors of a phone segment are the output of a stochastic linear dynamical system, and it is shown how the evolution of the dynamics as a function of the segment length can be modeled using alternative assumptions. A phoneme classification task using the TIMIT database demonstrates that the approach is the first effective use of an explicit model for statistical dependence between frames of speech.

238 citations


Journal ArticleDOI
TL;DR: It is shown experimentally that as the number of stages is increased above the optimal performance/complexity tradeoff, the quantizer robustness and outlier performance can be improved at the expense of a slight increase in rate.
Abstract: A tree-searched multistage vector quantization (VQ) scheme for linear predictive coding (LPC) parameters which achieves spectral distortion lower than 1 dB with low complexity and good robustness using rates as low as 22 b/frame is presented. The M-L search is used, and it is shown that it achieves performance close to that of the optimal search for a relatively small M. A joint codebook design strategy for multistage VQ which improves convergence speed and the VQ performance measures is presented. The best performance/complexity tradeoffs are obtained with relatively small size codebooks cascaded in a 3-6 stage configuration. It is shown experimentally that as the number of stages is increased above the optimal performance/complexity tradeoff, the quantizer robustness and outlier performance can be improved at the expense of a slight increase in rate. Results for log area ratio (LAR) and line spectral pair (LSP) parameters are presented. A training technique that reduces outliers at the expense of a slight average performance degradation is introduced. The method significantly outperforms the split codebook approach.
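
A hedged sketch of the M-L (keep-the-M-best) tree search through a multistage VQ; the codebooks below are random placeholders rather than the paper's jointly designed ones.

```python
import numpy as np

def ml_search(x, codebooks, M=8):
    """Encode x through the stages, keeping the M best partial paths."""
    hyps = [(x, [])]                                # (residual, index path)
    for cb in codebooks:
        cand = []
        for resid, path in hyps:
            err = ((resid - cb) ** 2).sum(axis=1)   # error to each codeword
            for j in np.argsort(err)[:M]:           # M best extensions
                cand.append((resid - cb[j], path + [j]))
        cand.sort(key=lambda h: (h[0] ** 2).sum())  # rank by residual energy
        hyps = cand[:M]
    resid, path = hyps[0]
    return path, x - resid                          # stage indices, reconstruction

rng = np.random.default_rng(1)
stages = [rng.normal(size=(64, 10)) for _ in range(3)]  # 3 stages x 6 bits each
indices, x_hat = ml_search(rng.normal(size=10), stages, M=8)
```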

201 citations


Journal ArticleDOI
TL;DR: The estimation of formant frequencies and bandwidths from the filter coefficients obtained through linear-predictive-coding analysis of speech is discussed from several viewpoints and a method for locating roots within the unit circle is derived.
Abstract: The estimation of formant frequencies and bandwidths from the filter coefficients obtained through linear-predictive-coding (LPC) analysis of speech is discussed from several viewpoints. A method for locating roots within the unit circle is derived. This algorithm is particularly well suited to computations carried out in fixed-point arithmetic using specialized signal processing hardware.
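
For orientation, the mapping from LPC poles to formants can be illustrated with a general-purpose root finder, as below; the paper's contribution is a fixed-point root-location method suited to DSP hardware, which this sketch does not reproduce.

```python
import numpy as np

def lpc_to_formants(a, fs):
    """a = [1, a1, ..., ap] for A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    Returns formant frequencies and bandwidths (Hz) from the roots of
    A(z) inside the unit circle with positive frequency."""
    r = np.roots(a)
    r = r[(np.imag(r) > 0) & (np.abs(r) < 1.0)]
    freqs = np.angle(r) * fs / (2 * np.pi)   # F = theta * fs / (2*pi)
    bws = -np.log(np.abs(r)) * fs / np.pi    # B = -(fs/pi) * ln|r|
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```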

171 citations


Journal ArticleDOI
TL;DR: A new class of hidden Markov models, more flexible than previously reported fenone-based word models, is proposed for the acoustic representation of words in an automatic speech recognition system, leading to an improved capability of modeling variations in pronunciation.
Abstract: A new class of hidden Markov models is proposed for the acoustic representation of words in an automatic speech recognition system. The models, built from combinations of acoustically based sub-word units called fenones, are derived automatically from one or more sample utterances of a word. Because they are more flexible than previously reported fenone-based word models, they lead to an improved capability of modeling variations in pronunciation. They are therefore particularly useful in the recognition of continuous speech. In addition, their construction is relatively simple, because it can be done using the well-known forward-backward algorithm for parameter estimation of hidden Markov models. Appropriate reestimation formulas are derived for this purpose. Experimental results obtained on a 5000-word vocabulary natural language continuous speech recognition task are presented to illustrate the enhanced power of discrimination of the new models.

170 citations


Journal ArticleDOI
TL;DR: A new minimum recognition error formulation and a generalized probabilistic descent (GPD) algorithm are analyzed and used to accomplish discriminative training of a conventional dynamic-programming-based speech recognizer.
Abstract: A new minimum recognition error formulation and a generalized probabilistic descent (GPD) algorithm are analyzed and used to accomplish discriminative training of a conventional dynamic-programming-based speech recognizer. The objective of discriminative training here is to directly minimize the recognition error rate. To achieve this, a formulation that allows controlled approximation of the exact error rate and renders optimization possible is used. The GPD method is implemented in a dynamic-time-warping (DTW)-based system. A linear discriminant function on the DTW distortion sequence is used to replace the conventional average DTW path distance. A series of speaker-independent recognition experiments using the highly confusible English E-set as the vocabulary showed a recognition rate of 84.4% compared to approximately 60% for traditional template training via clustering. The experimental results verified that the algorithm converges to a solution that achieves minimum error rate.
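
A toy sketch of the minimum-error/GPD machinery: a sigmoid turns the misclassification measure (best rival score minus correct-class score) into a smooth error count whose gradient drives the update. A generic linear discriminant on a fixed feature vector stands in for the paper's discriminant on the DTW distortion sequence.

```python
import numpy as np

def sigmoid(d, gamma=1.0):
    return 1.0 / (1.0 + np.exp(-gamma * d))

def gpd_step(W, x, label, lr=0.1, gamma=1.0):
    """One GPD update. W: (num_classes, dim) discriminant weights."""
    g = W @ x                                            # class scores
    rival = np.argmax(np.where(np.arange(len(g)) == label, -np.inf, g))
    d = g[rival] - g[label]                              # misclassification measure
    loss = sigmoid(d, gamma)                             # smooth 0-1 error
    grad = gamma * loss * (1 - loss)                     # d(loss)/d(d)
    W[label] += lr * grad * x                            # raise correct-class score
    W[rival] -= lr * grad * x                            # lower the best rival
    return loss
```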

165 citations


Journal ArticleDOI
F.K. Soong, Biing-Hwang Juang
TL;DR: A globally optimal scalar quantizer is designed for each differential LSP frequency, which achieves a 1-dB average log spectral distortion, a commonly accepted level for reproducing perceptually transparent spectral information.
Abstract: Two nonuniform aspects of the line spectrum pair (LSP) linear predictive coding (LPC) parameters are investigated, including nonuniform statistical distributions and spectral sensitivities of adjacent LSP frequency differences. Based upon these two nonuniform properties, a globally optimal scalar quantizer is designed for each differential LSP frequency. The design algorithm is dynamic programming based and minimization of a nontrivial data-dependent spectral distortion is adopted as the optimality criterion. At 32 bits/frame, the new LSP quantizer achieves a 1-dB average log spectral distortion, a commonly accepted level for reproducing perceptually transparent spectral information. The quantization performance has also been shown to be robust across different speakers and databases.

151 citations


Journal ArticleDOI
TL;DR: A normalized least-mean-squares (NLMS) adaptive algorithm for an acoustic echo canceller, with twice the convergence speed of the conventional NLMS at the same computational load, is proposed and its fast convergence is demonstrated.
Abstract: A normalized least-mean-squares (NLMS) adaptive algorithm with double the convergence speed, at the same computational load, of the conventional NLMS for an acoustic echo canceller is proposed. This algorithm, called the ES (exponentially weighted stepsize) algorithm, uses a different stepsize (feedback constant) for each weight of an adaptive transversal filter. These stepsizes are time-invariant and weighted proportionally to the expected variation of a room impulse response. The algorithm adjusts coefficients with large errors in large steps, and coefficients with small errors in small steps. A transition formula is derived for the mean-squared coefficient error of the algorithm. The mean stepsize determines the convergence condition, the convergence speed, and the final excess mean-squared error. Modified for a practical multiple DSP structure, the algorithm requires only the same amount of computation as the conventional NLMS. The algorithm is implemented in a commercial acoustic echo canceller, and its fast convergence is demonstrated.
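
A minimal sketch of the ES update, assuming an exponential per-tap stepsize profile with placeholder constants mu0 and decay; the paper derives the profile from the expected envelope of the room impulse response.

```python
import numpy as np

def es_nlms(x, d, n_taps, mu0=1.0, decay=0.999):
    """x: far-end reference; d: microphone signal to be echo-cancelled."""
    w = np.zeros(n_taps)
    mu = mu0 * decay ** np.arange(n_taps)   # time-invariant per-tap stepsizes
    e = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        xvec = x[n - n_taps + 1:n + 1][::-1]      # most recent sample first
        e[n] = d[n] - w @ xvec                    # residual echo
        # NLMS step, but each tap scaled by its own stepsize mu[k].
        w += mu * e[n] * xvec / (xvec @ xvec + 1e-10)
    return w, e
```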

148 citations


Journal ArticleDOI
TL;DR: A shared-distribution hidden Markov model (HMM) is presented for speaker-independent continuous speech recognition that reduced the word error rate on the DARPA Resource Management task by 20% in comparison with the generalized-triphone model.
Abstract: A shared-distribution hidden Markov model (HMM) is presented for speaker-independent continuous speech recognition. The output distributions across different phonetic HMMs are shared with each other when they exhibit acoustic similarity. This sharing provides the freedom to use a larger number of Markov states for each phonetic model. Although an increase in the number of states will increase the total number of free parameters, with distribution sharing one can collapse redundant states while maintaining necessary ones. The shared-distribution model reduced the word error rate on the DARPA Resource Management task by 20% in comparison with the generalized-triphone model.
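
A hedged sketch of the sharing step for discrete output distributions: distributions from different phonetic HMM states are greedily tied when a symmetric, smoothed KL divergence between them is small. Both the divergence and the threshold are illustrative stand-ins for the paper's acoustic-similarity measure.

```python
import numpy as np

def sym_kl(p, q, eps=1e-10):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def tie_distributions(dists, threshold=0.1):
    """dists: (num_states, num_symbols) row-stochastic matrix.
    Returns, for each state, an index into a smaller shared pool."""
    pool, index = [], []
    for d in dists:
        for k, shared in enumerate(pool):
            if sym_kl(d, shared) < threshold:  # acoustically similar: share it
                index.append(k)
                break
        else:
            pool.append(d.copy())              # otherwise keep a new distribution
            index.append(len(pool) - 1)
    return np.array(index), np.vstack(pool)
```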

143 citations


Journal ArticleDOI
Willem Bastiaan Kleijn
TL;DR: The coding method is easily combined with existing LP-based speech coders, such as CELP, for unvoiced signals, and excellent voiced speech quality is obtained at rates between 3.0 and 4.0 kb/s.
Abstract: Voiced speech is interpreted as a concatenation of slowly evolving pitch-cycle waveforms. This signal can be reconstructed by interpolation from a downsampled sequence of pitch-cycle waveforms with a rate of one prototype waveform per 20-30 ms interval. The prototype waveform is described by a set of linear-prediction (LP) filter coefficients describing the formant structure and a prototype excitation waveform, quantized with analysis-by-synthesis procedures. The speech signal is reconstructed by filtering an excitation signal consisting of the concatenation of (infinitesimal) sections of the instantaneous excitation waveforms. To obtain the correct level of periodicity, the short-term and the long-term correlations between the instantaneous excitation waveforms can be controlled explicitly. Thus, distortions such as noise, reverberation, and buzziness can be prevented. The coding method is easily combined with existing LP-based speech coders, such as CELP, for unvoiced signals. Excellent voiced speech quality is obtained at rates between 3.0 and 4.0 kb/s.
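
A bare-bones sketch of the interpolation idea: concatenated pitch cycles cross-fade from one prototype to the next. Real prototype-waveform coding interpolates LP excitation prototypes after phase alignment and controls periodicity explicitly; both steps are omitted here.

```python
import numpy as np

def interpolate_prototypes(p0, p1, n_cycles):
    """Cross-fade from prototype p0 to p1 (equal length = one pitch
    period) over n_cycles concatenated pitch cycles."""
    out = []
    for k in range(n_cycles):
        alpha = k / max(n_cycles - 1, 1)      # evolves 0 -> 1 across cycles
        out.append((1 - alpha) * p0 + alpha * p1)
    return np.concatenate(out)
```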

Journal ArticleDOI
TL;DR: Voiced-unvoiced-silence classification of speech was done using a multilayer feedforward network; results indicated that the network performance was not significantly affected by the size of the training set and that a classification rate as high as 96% was obtained.
Abstract: Voiced-unvoiced-silence classification of speech was done using a multilayer feedforward network. The network performance was evaluated and compared to that of a maximum-likelihood classifier. Results indicated that the network performance was not significantly affected by the size of the training set and that a classification rate as high as 96% was obtained.
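
A sketch of two classic frame features (short-time log energy and zero-crossing rate) on which a voiced/unvoiced/silence network of this kind can be trained; the frame and hop lengths are illustrative, and the paper's exact feature set is not specified here.

```python
import numpy as np

def frame_features(x, frame_len=240, hop=80):
    """Per-frame (log energy, zero-crossing rate) feature vectors."""
    feats = []
    for start in range(0, len(x) - frame_len, hop):
        f = x[start:start + frame_len]
        energy = np.log10(np.mean(f ** 2) + 1e-12)       # short-time log energy
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2   # zero-crossing rate
        feats.append((energy, zcr))
    return np.array(feats)
```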

Journal ArticleDOI
TL;DR: The authors describe a scheme for rapidly obtaining an approximate acoustic match for all the words in the vocabulary in such a way as to ensure that the correct word is, with high probability, one of a small number of words examined in detail.
Abstract: In a large vocabulary speech recognition system using hidden Markov models, calculating the likelihood of an acoustic signal segment for all the words in the vocabulary involves a large amount of computation. In order to run in real time on a modest amount of hardware, it is important that these detailed acoustic likelihood computations be performed only on words which have a reasonable probability of being the word that was spoken. The authors describe a scheme for rapidly obtaining an approximate acoustic match for all the words in the vocabulary in such a way as to ensure that the correct word is, with high probability, one of a small number of words examined in detail. Using fast search methods, they obtain a matching algorithm that is about a hundred times faster than doing a detailed acoustic likelihood computation on all the words in the IBM Office Correspondence isolated word dictation task, which has a vocabulary of 20000 words. Experimental results showing the effectiveness of such a fast match for a number of talkers are given.
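
Schematically, the fast match is a two-pass search, as in the sketch below; cheap_score and detailed_score are hypothetical stand-ins for the paper's approximate and detailed acoustic matches, and the shortlist size is illustrative.

```python
def recognize(acoustics, vocabulary, cheap_score, detailed_score, shortlist=100):
    """Rank the whole vocabulary with the cheap approximate match, then
    run the expensive detailed match only on the shortlist."""
    ranked = sorted(vocabulary, key=lambda w: -cheap_score(acoustics, w))
    return max(ranked[:shortlist], key=lambda w: detailed_score(acoustics, w))
```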

Journal ArticleDOI
TL;DR: It was found that speaker-adaptive systems outperform both speaker-independent and speaker-dependent systems, suggesting that the most effective system is one that begins with speaker- independent training and continues to adapt to users.
Abstract: The DARPA Resource Management task is used as a domain for investigating the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The error rate of the speaker-independent recognition system, SPHINX, was reduced substantially by incorporating between-word triphone models, additional dynamic features, and sex-dependent, semicontinuous hidden Markov models. The error rate for speaker-independent recognition was 4.3%. On speaker-dependent data, the error rate was further reduced to 2.6-1.4% with 600-2400 training sentences for each speaker. Using speaker-independent models, the authors studied speaker-adaptive recognition. Both codebooks and output distributions were considered for adaptation. It was found that speaker-adaptive systems outperform both speaker-independent and speaker-dependent systems, suggesting that the most effective system is one that begins with speaker-independent training and continues to adapt to users.

Journal ArticleDOI
Neri Merhav, Chin-Hui Lee
TL;DR: A generalized likelihood ratio test is developed and shown to be optimal in the sense of achieving the highest asymptotic exponential rate of decay of the error probability for the worst-case mismatch situation.
Abstract: A minimax approach for robust classification of parametric information sources is studied and applied to isolated-word speech recognition based on hidden Markov modeling. The goal is to reduce the sensitivity of speech recognition systems to a possible mismatch between the training and testing conditions. To this end, a generalized likelihood ratio test is developed and shown to be optimal in the sense of achieving the highest asymptotic exponential rate of decay of the error probability for the worst-case mismatch situation. The proposed approach is compared to the standard approach, where no mismatch is assumed, in recognition of noisy speech and in other realistic mismatch situations.

Journal ArticleDOI
TL;DR: The first successfully implemented real-time Mandarin dictation machine, which recognizes Mandarin speech with very large vocabulary and almost unlimited texts for the input of Chinese characters into computers, is described.
Abstract: The first successfully implemented real-time Mandarin dictation machine, which recognizes Mandarin speech with very large vocabulary and almost unlimited texts for the input of Chinese characters into computers, is described. The machine is speaker-dependent, and the input speech is in the form of sequences of isolated syllables. The machine can be decomposed into two subsystems. The first subsystem recognizes the syllables using hidden Markov models. Because every syllable can represent many different homonym characters and form different multisyllabic words with syllables on its right or left, the second subsystem is needed to identify the exact characters from the syllables and correct the errors in syllable recognition. The real-time implementation is on an IBM PC/AT, connected to three sets of specially designed hardware boards on which seven TMS 320C25 chips operate in parallel. The preliminary test results indicate that it takes only about 0.45 s to dictate a syllable (or character) with an accuracy on the order of 90%.

Journal ArticleDOI
TL;DR: A method for combining phonetic and fenonic models is presented and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported.
Abstract: A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word (i.e., the sequence of fenones used to represent the word) is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.

Journal ArticleDOI
TL;DR: Although it has not been proved that the corrective training algorithm converges, experimental evidence suggests that it does, and that it leads to fewer recognition errors than can be obtained with conventional training methods.
Abstract: The problem of estimating the parameter values of hidden Markov word models for speech recognition is addressed. It is argued that maximum-likelihood estimation of the parameters via the forward-backward algorithm may not lead to values which maximize recognition accuracy. An alternative estimation procedure called corrective training, which is aimed at minimizing the number of recognition errors, is described. Corrective training is similar to a well-known error-correcting training procedure for linear classifiers and works by iteratively adjusting the parameter values so as to make correct words more probable and incorrect words less probable. There are strong parallels between corrective training and maximum mutual information estimation; the relationship of these two techniques is discussed and a comparison is made of their performance. Although it has not been proved that the corrective training algorithm converges, experimental evidence suggests that it does, and that it leads to fewer recognition errors than can be obtained with conventional training methods.
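
A toy sketch of the corrective-training idea for a discrete-output HMM: output probabilities along the correct word's state alignment are boosted and those along a competing incorrect word's alignment attenuated, then renormalized. The step size and update form are placeholders, not the paper's exact adjustment rule.

```python
import numpy as np

def corrective_update(B, correct_path, wrong_path, obs, step=0.05):
    """B: (num_states, num_symbols) output probability matrix."""
    for (s_c, s_w), o in zip(zip(correct_path, wrong_path), obs):
        B[s_c, o] += step                            # correct alignment: more probable
        B[s_w, o] = max(B[s_w, o] - step, 1e-6)      # rival alignment: less probable
    B /= B.sum(axis=1, keepdims=True)                # restore row-stochasticity
    return B
```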

Journal ArticleDOI
TL;DR: In this paper, a waveform substitution technique using interpolation based on the slowly varying speech parameters of short-time energy and zero-crossing information is developed for a packetized speech communication system.
Abstract: A waveform substitution technique using interpolation based on the slowly varying speech parameters of short-time energy and zero-crossing information is developed for a packetized speech communication system. The system uses 64-kb/s conventional pulse code modulation (PCM) for encoding and takes advantage of active talkspurts and silence intervals to increase the efficiency of utilizing a digital link. The short-time energy and information on the zero-crossings needed for the purpose of determining talkspurts are transmitted in a preceding packet. Hence, when a packet is pronounced lost, its envelope and frequency characteristics are obtained from a previous packet and used to synthesize a substitution waveform which is free of annoying sounds that are due to abrupt changes in amplitude.
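
A minimal sketch of the substitution step, assuming the short-time energy of the lost packet arrived as side information in the preceding packet; the paper's use of zero-crossing information to match frequency characteristics is omitted.

```python
import numpy as np

def substitute_packet(prev_packet, target_energy):
    """Repeat the previous packet, scaled so its mean-square energy
    matches the energy transmitted for the lost packet."""
    cur_energy = np.mean(prev_packet.astype(float) ** 2) + 1e-12
    gain = np.sqrt(target_energy / cur_energy)
    return gain * prev_packet   # avoids an abrupt amplitude change
```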

Journal ArticleDOI
TL;DR: Improved tone concatenation rules are presented, and preliminary subjective tests indicate that these rules actually give better synthesized speech for a formant-based Chinese text-to-speech system.
Abstract: A set of improved tone concatenation rules to be used in a formant-based Mandarin Chinese text-to-speech system is presented. This system concatenates prestored syllables superimposed by additional tone patterns to obtain speech sentences for unlimited text, with the acoustic properties of each syllable modified by a set of synthesis rules. The tone concatenation rules are the most important among these synthesis rules, because they tell how the tone patterns for the syllables should be modified in an arbitrary sentence under various conditions of concatenating syllables of different tones on both sides. The improved tone concatenation rules are obtained empirically by carefully analyzing the tone pattern behavior under various tone concatenation conditions for many sentences in a database. A total of 14 representative tone patterns are defined for the five tones, and different rules about which pattern should be used under what kind of tone concatenation conditions are organized in detail. Preliminary subjective tests indicate that these rules actually give better synthesized speech for a formant-based Chinese text-to-speech system.

Journal ArticleDOI
TL;DR: It is shown that by combining both LVQ's discriminative power and the HMM's capability of modeling temporal variations of speech in a hybrid algorithm, the performance of the original HMM-based speech recognizer is significantly improved.
Abstract: A hybrid speech recognition algorithm based on the combination of hidden Markov models (HMMs) and learning vector quantization (LVQ) is presented. The LVQ training algorithms are capable of producing highly discriminative reference vectors for classifying static patterns, i.e., vectors with a fixed dimension. The HMM formulation has also been successfully applied to the recognition of dynamic speech patterns that are of variable duration. It is shown that by combining both LVQ's discriminative power and the HMM's capability of modeling temporal variations of speech in a hybrid algorithm, the performance of the original HMM-based speech recognizer is significantly improved. For a highly confusable vocabulary consisting of the nine American English E-set letters used in a multispeaker, isolated-word test mode, the average word accuracy of the baseline HMM recognizer is 67%. When LVQ is incorporated in the hybrid system, the word accuracy increases to 83%.
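
The discriminative ingredient of the hybrid is the LVQ codebook update; a minimal LVQ1 step is sketched below (the paper may use other LVQ variants, such as LVQ2).

```python
import numpy as np

def lvq1_step(codebook, labels, x, y, lr=0.05):
    """Move the nearest reference vector toward x if its class label
    matches y, away from x otherwise."""
    i = np.argmin(((codebook - x) ** 2).sum(axis=1))   # nearest reference
    sign = 1.0 if labels[i] == y else -1.0
    codebook[i] += sign * lr * (x - codebook[i])
    return i
```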

Journal ArticleDOI
TL;DR: A new class of A* algorithms for Viterbi phonetic decoding subject to lexical constraints is presented and can be made to run substantially faster than the Viterbi algorithm in an isolated word recognizer having a vocabulary of 1600 words.
Abstract: A new class of A* algorithms for Viterbi phonetic decoding subject to lexical constraints is presented. This type of algorithm can be made to run substantially faster than the Viterbi algorithm in an isolated word recognizer having a vocabulary of 1600 words. In addition, multiple recognition hypotheses can be generated on demand, and the search can be constrained with respect to conditions on phone durations in such a way that computational requirements are substantially reduced. Results are presented on a 60000-word recognition task.

Journal ArticleDOI
Yunxin Zhao
TL;DR: A large-vocabulary, speaker-independent, continuous speech recognition system based on hidden Markov modeling (HMM) of phoneme-sized acoustic units with continuous mixture Gaussian densities is described and evaluated on the TIMIT database for a vocabulary size of 853.
Abstract: The author describes a large vocabulary, speaker-independent, continuous speech recognition system which is based on hidden Markov modeling (HMM) of phoneme-sized acoustic units using continuous mixture Gaussian densities. A bottom-up merging algorithm is developed for estimating the parameters of the mixture Gaussian densities, where the resultant number of mixture components is proportional to both the sample size and dispersion of training data. A compression procedure is developed to construct a word transcription dictionary from the acoustic-phonetic labels of sentence utterances. A modified word-pair grammar using context-sensitive grammatical parts is incorporated to constrain task difficulty. The Viterbi beam search is used for decoding. The segmental K-means algorithm is implemented as a baseline for evaluating the bottom-up merging technique. The system has been evaluated on the TIMIT database (1990) for a vocabulary size of 853. For test set perplexities of 24, 104, and 853, the decoding word accuracies are 90.9%, 86.0%, and 62.9%, respectively. For the perplexity of 104, the decoding accuracy achieved by using the merging algorithm is 4.1% higher than that using the segmental K-means (22.8% error reduction), and the decoding accuracy using the compressed dictionary is 3.0% higher than that using a standard dictionary (18.1% error reduction).
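
A hedged sketch of one bottom-up merging step for a diagonal-covariance Gaussian mixture: the two closest components are replaced by their moment-matched merge. The closeness criterion here is a simple weighted distance between means and may differ from the paper's.

```python
import numpy as np

def merge_closest(weights, means, variances):
    """weights: (n,); means, variances: (n, d) diagonal Gaussians."""
    n = len(weights)
    best, pair = np.inf, None
    for i in range(n):
        for j in range(i + 1, n):
            d = ((means[i] - means[j]) ** 2).sum() * weights[i] * weights[j]
            if d < best:
                best, pair = d, (i, j)
    i, j = pair
    w = weights[i] + weights[j]
    mu = (weights[i] * means[i] + weights[j] * means[j]) / w
    # Moment-matched variance of the merged pair.
    var = (weights[i] * (variances[i] + (means[i] - mu) ** 2)
           + weights[j] * (variances[j] + (means[j] - mu) ** 2)) / w
    keep = [k for k in range(n) if k not in (i, j)]
    return (np.append(weights[keep], w),
            np.vstack([means[keep], mu]),
            np.vstack([variances[keep], var]))
```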

Journal ArticleDOI
Li Deng
TL;DR: The concept of two-level (global and local) hierarchical nonstationarity is introduced to describe the elastic and dynamic nature of the speech signal, with potential uses in automatically uncovering relationally invariant properties of the speech signal and in speech recognition.
Abstract: The concept of two-level (global and local) hierarchical nonstationarity is introduced to describe the elastic and dynamic nature of the speech signal. A doubly stochastic process model is developed to implement this concept. In the model, the global nonstationarity is embodied through an underlying Markov chain that governs evolution of the parameters in a set of output stochastic processes. The local nonstationarity is realized by utilizing state-conditioned, time-varying first- and second-order statistics in the output data-generation process models. For potential uses in automatic uncovering of relationally invariant properties from the speech signal and in speech recognition, the local nonstationarity is represented in a parametric form. Preliminary experiments on fitting the models to speech data demonstrate the superior performance of the proposed model over several traditional types of hidden Markov models.

Journal ArticleDOI
TL;DR: An algorithm for multichannel adaptive IIR (infinite impulse response) filtering is presented and applied to the active control of broadband random noise in a small reverberant room, and it is found that for the present application FIR filters are sufficient when the primary noise source is a loudspeaker.
Abstract: An algorithm for multichannel adaptive IIR (infinite impulse response) filtering is presented and applied to the active control of broadband random noise in a small reverberant room. Assuming complete knowledge of the primary noise, the theoretically optimal reductions of acoustic energy are initially calculated by means of a frequency-domain model. These results are contrasted with results of a causality-constrained theoretical time-domain optimization, which are then compared with experimental results, the latter two results showing good agreement. The experimental performances of adaptive multichannel FIR (finite impulse response) and IIR filters are then compared for a four-secondary-source, eight-error-microphone active control system, and it is found that for the present application FIR filters are sufficient when the primary noise source is a loudspeaker. Some experiments are then presented with the primary noise field generated by a panel excited by a loudspeaker in an adjoining room. These results show that far better performances are provided by IIR and FIR filters when the primary source has a lightly damped dynamic behavior which the active controller must model.
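
For orientation, the standard single-channel filtered-x LMS update for the adaptive FIR case is sketched below; the paper's multichannel IIR algorithm generalizes this. Here s_hat is an assumed FIR estimate of the secondary path from the control source to the error microphone.

```python
import numpy as np

def fxlms(x, d, s_hat, n_taps, mu=0.01):
    """x: reference signal; d: primary noise at the error microphone
    (same length as x); s_hat: secondary-path FIR estimate."""
    w = np.zeros(n_taps)
    xf = np.convolve(x, s_hat)[:len(x)]   # reference filtered through s_hat
    y_hist = np.zeros(len(s_hat))         # recent controller outputs
    e = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        x_win = x[n - n_taps + 1:n + 1][::-1]
        y = w @ x_win                     # anti-noise sample
        y_hist = np.roll(y_hist, 1)
        y_hist[0] = y
        e[n] = d[n] + s_hat @ y_hist      # residual measured at the mic
        xf_win = xf[n - n_taps + 1:n + 1][::-1]
        w -= mu * e[n] * xf_win           # LMS step on the filtered reference
    return w, e
```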

Journal ArticleDOI
TL;DR: It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features.
Abstract: Several recently proposed automatic speech recognition (ASR) front-ends are experimentally compared in speaker-dependent, speaker-independent (or cross-speaker) recognition. The perceptually based linear predictive (PLP) front-end, with the root-power sums (RPS) distance measure, yields generally the highest accuracies, especially in cross-speaker recognition. It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features. For a digit vocabulary and five reference templates obtained with a clustering algorithm, the optimization improves recognition accuracy from 97% to 98.1%, with respect to the PLP-RPS front-end.
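
A sketch of the root-power sums distance under the common reading that it is an index-weighted cepstral distance, which emphasizes spectral peaks relative to the overall tilt; treat the exact weighting as an assumption rather than the paper's definition.

```python
import numpy as np

def rps_distance(c1, c2):
    """Index-weighted cepstral distance between two cepstral vectors
    (coefficients c_1..c_p, excluding c_0)."""
    k = np.arange(1, len(c1) + 1)   # cepstral indices 1..p as weights
    return np.sqrt(np.sum((k * (c1 - c2)) ** 2))
```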

Journal ArticleDOI
TL;DR: An estimation algorithm for noise robust speech recognition, the minimum mean log spectral distance (MMLSD), is presented and is highly efficient with a quasi-stationary environmental noise, recorded with a desktop microphone, and requires almost no tuning to differences between this noise and the computer-generated white noise.
Abstract: An estimation algorithm for noise robust speech recognition, the minimum mean log spectral distance (MMLSD), is presented. The estimation is matched to the recognizer by seeking to minimize the average distortion as measured by a Euclidean distance between filterbank log-energy vectors, approximating the weighted-cepstral distance used by the recognizer. The estimation is computed using a clean speech spectral probability distribution, estimated from a database, and a stationary ARMA model for the noise. When trained on clean speech and tested with additive white noise at 10-dB SNR, the recognition accuracy with the MMLSD algorithm is comparable to that achieved with training the recognizer at the same constant 10-dB SNR. The algorithm is also highly efficient with a quasi-stationary environmental noise, recorded with a desktop microphone, and requires almost no tuning to differences between this noise and the computer-generated white noise.

Journal ArticleDOI
TL;DR: An estimation algorithm to improve the noise robustness of filterbank-based speech recognition systems is presented, based on a minimum mean square error (MMSE) estimation of the filter log-energies, introducing a significant improvement over related published algorithms by conditioning the estimate on the total frame energy.
Abstract: An estimation algorithm to improve the noise robustness of filterbank-based speech recognition systems is presented. The algorithm is based on a minimum mean square error (MMSE) estimation of the filter log-energies, introducing a significant improvement over related published algorithms by conditioning the estimate on the total frame energy. The algorithm was evaluated with DECIPHER, SRI's continuous-speech speaker-independent recognizer, on two types of noisy speech: a standard database with added white Gaussian noise, and recordings made in a noisy environment. With white noise the recognition accuracy obtained while training on clean speech and testing in noise approached that obtained with training and testing in noise. In the noisy environment, the estimation algorithm boosted the recognition system's performance with a table-mounted microphone almost to the level achieved with a close-talking microphone.

Journal ArticleDOI
TL;DR: The authors establish a theory for lossless pole-zero modeling of speech signals for the description of nasal sounds based on a generalized vocal tract tube model, which consists of the main vocal tract, the oral tract, and the nasal tract.
Abstract: The authors establish a theory for lossless pole-zero modeling of speech signals for the description of nasal sounds. The theory is based on a generalized vocal tract tube model which consists of the main vocal tract, the oral tract, and the nasal tract. A pole-zero type transfer function, which turns out to be a generalized version of the existing all-pole type transfer function, is derived. Fundamental properties of the generalized vocal tract model are investigated, employing the concept of discrete-time reactance. A procedure for evaluating the reflection coefficients for the model is outlined. The assumption of losslessness in the modeling leads to the following two properties. First, the combination of two lattice structures representing the oral and the nasal tracts forms one larger lattice structure when viewed at their joint point called the branch boundary. Second, the oral and the nasal tract render respective discrete-time reactances whose convex combination generates the discrete-time reactance at the branch boundary. The first property enables the combined reactance to be computed, and the second helps separate it into its components.

Journal ArticleDOI
TL;DR: Tuning curves are compared here for two cochlear models, one based on a cascade of low-pass filter sections and the other based onA cascade of filter sections derived from a one-dimensional transmission line, and the resultant simulated tuning curves and neural outputs indicate that the modified transmission-line approximation yields a more accurate co chlear model.
Abstract: Most practical models of cochlear mechanics are based on approximations to the wave equation in the cochlea. These approximations engender compromises in the accuracy with which the cochlear motion can be reproduced. Tuning curves are compared here for two cochlear models, one based on a cascade of low-pass filter sections and the other based on a cascade of filter sections derived from a one-dimensional transmission line. The filters in the two simulations are designed to give comparable latency in the neural response to inputs at different frequencies, and the simulations include an active gain-control mechanism to adjust the characteristics of each section with changes in the input signal level. The resultant simulated tuning curves and neural outputs indicate that the modified transmission-line approximation yields a more accurate cochlear model.