
Showing papers in "IEEE Transactions on Speech and Audio Processing in 1995"


Journal ArticleDOI
TL;DR: The individual Gaussian components of a GMM are shown to represent general speaker-dependent spectral shapes that are effective for modeling speaker identity; the GMM is also shown to outperform other speaker modeling techniques on an identical 16-speaker telephone speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.

3,134 citations
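The GMM scoring step the abstract describes can be sketched as follows. This is an illustrative diagonal-covariance implementation with invented toy speaker models (`spk_a`, `spk_b`), not the paper's trained 49-speaker system.

```python
# Hypothetical sketch of GMM-based speaker identification: each speaker is a
# diagonal-covariance Gaussian mixture; a test utterance is assigned to the
# speaker whose model gives the highest log-likelihood over all frames.
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM."""
    # frames: (T, D); weights: (M,); means, variances: (M, D)
    diff = frames[:, None, :] - means[None, :, :]                    # (T, M, D)
    exponent = -0.5 * np.sum(diff**2 / variances, axis=2)            # (T, M)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)  # (M,)
    log_comp = np.log(weights) + log_norm + exponent                 # (T, M)
    # log-sum-exp over mixture components, then sum over frames
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze(1) + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def identify(frames, speaker_models):
    """Return the speaker id whose GMM scores the frames highest."""
    scores = {sid: gmm_loglik(frames, *model) for sid, model in speaker_models.items()}
    return max(scores, key=scores.get)

# Toy 1-D example: two 2-component speaker models with well-separated means.
rng = np.random.default_rng(0)
models = {
    "spk_a": (np.array([0.5, 0.5]), np.array([[0.0], [1.0]]), np.array([[0.1], [0.1]])),
    "spk_b": (np.array([0.5, 0.5]), np.array([[5.0], [6.0]]), np.array([[0.1], [0.1]])),
}
test_frames = rng.normal(5.0, 0.3, size=(50, 1))  # frames drawn near spk_b's modes
best = identify(test_frames, models)
```

In the paper, the mixture parameters would come from EM training on each speaker's enrollment speech; here they are hard-coded purely to make the scoring rule concrete.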


Journal ArticleDOI
TL;DR: The popular spectral subtraction speech enhancement approach is shown to be a signal subspace approach which is optimal in an asymptotic (large sample) linear minimum mean square error sense, assuming the signal and noise are stationary.
Abstract: A comprehensive approach for nonparametric speech enhancement is developed. The underlying principle is to decompose the vector space of the noisy signal into a signal-plus-noise subspace and a noise subspace. Enhancement is performed by removing the noise subspace and estimating the clean signal from the remaining signal subspace. The decomposition can theoretically be performed by applying the Karhunen-Loeve transform (KLT) to the noisy signal. Linear estimation of the clean signal is performed using two perceptually meaningful estimation criteria. First, signal distortion is minimized while the residual noise energy is maintained below some given threshold. This criterion results in a Wiener filter with adjustable input noise level. Second, signal distortion is minimized for a fixed spectrum of the residual noise. This criterion enables masking of the residual noise by the speech signal. It results in a filter whose structure is similar to that obtained in the first case, except that now the gain function which modifies the KLT coefficients is solely dependent on the desired spectrum of the residual noise. The popular spectral subtraction speech enhancement approach is shown to be a particular case of the proposed approach. It is proven to be a signal subspace approach which is optimal in an asymptotic (large sample) linear minimum mean square error sense, assuming the signal and noise are stationary. Our listening tests indicate that 14 out of 16 listeners strongly preferred the proposed approach over the spectral subtraction approach.

968 citations
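The core decomposition can be sketched in a few lines. This is a minimal illustration of the signal-subspace idea, not the paper's exact estimator: the empirical covariance eigendecomposition stands in for the KLT, the noise power `sigma2` is assumed known, and a Wiener-like gain is applied in the retained subspace.

```python
# Sketch of signal-subspace enhancement: eigendecompose the noisy covariance,
# null the noise-dominated directions, apply a Wiener-style gain to the rest.
import numpy as np

def subspace_enhance(noisy_vectors, sigma2):
    # noisy_vectors: (N, K) stack of K-dimensional signal frames
    R = noisy_vectors.T @ noisy_vectors / len(noisy_vectors)  # empirical covariance
    eigval, eigvec = np.linalg.eigh(R)                        # KLT-like basis
    # gain is zero wherever the eigenvalue is at or below the noise power
    gain = np.maximum(eigval - sigma2, 0.0) / np.maximum(eigval, 1e-12)
    H = eigvec @ np.diag(gain) @ eigvec.T                     # estimation filter
    return noisy_vectors @ H.T

rng = np.random.default_rng(1)
t = np.arange(400)
clean = np.sin(2 * np.pi * 0.05 * t)                          # period-20 tone
noisy = clean + rng.normal(0.0, 0.5, size=t.size)
frames = noisy.reshape(-1, 20)                                # 20-sample frames
enhanced = subspace_enhance(frames, sigma2=0.25)
```

With the tone's energy concentrated in one eigendirection, most noise directions are strongly attenuated while the signal direction passes nearly unchanged.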


Journal ArticleDOI
TL;DR: A constrained estimation technique for Gaussian mixture densities for speech recognition; with adaptation, nonnative speakers approach the speaker-independent accuracy achieved for native speakers, and native speakers reach the accuracy of speaker-dependent systems that use six times as much training data.
Abstract: A trend in automatic speech recognition systems is the use of continuous mixture-density hidden Markov models (HMMs). Despite the good recognition performance that these systems achieve on average in large vocabulary applications, there is a large variability in performance across speakers. Performance degrades dramatically when the user is radically different from the training population. A popular technique that can improve the performance and robustness of a speech recognition system is adapting speech models to the speaker, and more generally to the channel and the task. In continuous mixture-density HMMs the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, the authors propose a constrained estimation technique for Gaussian mixture densities. The algorithm is evaluated on the large-vocabulary Wall Street Journal corpus for both native and nonnative speakers of American English. For nonnative speakers, the recognition error rate is approximately halved with only a small amount of adaptation data, and it approaches the speaker-independent accuracy achieved for native speakers. For native speakers, the recognition performance after adaptation improves to the accuracy of speaker-dependent systems that use six times as much training data.

439 citations


Journal ArticleDOI
TL;DR: A new mixed excitation LPC vocoder model is presented that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal so that the synthesizer can mimic more characteristics of natural human speech.
Abstract: Traditional pitch-excited linear predictive coding (LPC) vocoders use a fully parametric model to efficiently encode the important information in human speech. These vocoders can produce intelligible speech at low data rates (800-2400 b/s), but they often sound synthetic and generate annoying artifacts such as buzzes, thumps, and tonal noises. These problems increase dramatically if acoustic background noise is present at the speech input. This paper presents a new mixed excitation LPC vocoder model that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal so that the synthesizer can mimic more characteristics of natural human speech. The new model also eliminates the traditional requirement for a binary voicing decision so that the vocoder performs well even in the presence of acoustic background noise. A 2400-b/s LPC vocoder based on this model has been developed and implemented in simulations and in a real-time system. Formal subjective testing of this coder confirms that it produces natural sounding speech even in a difficult noise environment. In fact, diagnostic acceptability measure (DAM) test scores show that the performance of the 2400-b/s mixed excitation LPC vocoder is close to that of the government standard 4800-b/s CELP coder.

352 citations
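The mixed-excitation idea can be illustrated with a toy synthesis loop. Everything here (the voicing weight, pitch period, and one-pole "vocal tract") is invented for illustration; the actual coder uses per-band voicing strengths and a full LPC model.

```python
# Illustrative mixed excitation: blend a pitch pulse train with white noise
# according to a voicing strength, then drive an all-pole synthesis filter.
import numpy as np

def synthesize(lpc, excitation, gain=1.0):
    """Direct-form all-pole synthesis: s[n] = g*e[n] + sum_k (-a[k]) * s[n-k]."""
    order = len(lpc)
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, order + 1):
            if n - k >= 0:
                acc -= lpc[k - 1] * out[n - k]
        out[n] = acc
    return out

rng = np.random.default_rng(2)
n_samples, pitch_period, voicing = 240, 60, 0.8
pulses = np.zeros(n_samples)
pulses[::pitch_period] = 1.0                       # periodic (voiced) component
noise = rng.normal(0.0, 1.0 / np.sqrt(pitch_period), size=n_samples)
# mixed excitation: mostly periodic, with a noise component to reduce buzziness
excitation = voicing * pulses + (1.0 - voicing) * noise
a = np.array([-0.9])                               # a stable first-order all-pole filter
speech = synthesize(a, excitation)
```

Replacing the hard voiced/unvoiced switch of a classic LPC vocoder with this continuous blend is the structural change the abstract credits for robustness to background noise.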


Journal ArticleDOI
TL;DR: This paper presents a complete description of the original postfiltering algorithm and the underlying ideas that motivated its development; the postfilter achieves noticeable noise reduction while introducing only minimal distortion in speech.
Abstract: An adaptive postfiltering algorithm for enhancing the perceptual quality of coded speech is presented. The postfilter consists of a long-term postfilter section in cascade with a short-term postfilter section and includes spectral tilt compensation and automatic gain control. The long-term section emphasizes pitch harmonics and attenuates the spectral valleys between pitch harmonics. The short-term section, on the other hand, emphasizes speech formants and attenuates the spectral valleys between formants. Both filter sections have poles and zeros. Unlike earlier postfilters that often introduced a substantial amount of muffling to the output speech, our postfilter significantly reduces this effect by minimizing the spectral tilt in its frequency response. As a result, this postfilter achieves noticeable noise reduction while introducing only minimal distortion in speech. The complexity of the postfilter is quite low. Variations of this postfilter are now being used in several national and international speech coding standards. This paper presents for the first time a complete description of our original postfiltering algorithm and the underlying ideas that motivated its development.

278 citations
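The short-term section's pole-zero structure is commonly written as H(z) = A(z/beta)/A(z/alpha) * (1 - mu z^-1), and computing its coefficients from the LPC polynomial is a one-liner per term. The alpha, beta, mu values below are typical textbook choices, not the values of any particular standard.

```python
# Sketch of short-term postfilter coefficients: bandwidth-expanded copies of the
# LPC polynomial A(z) give the pole/zero sections, and a first-order zero
# (1 - mu z^-1) compensates the overall spectral tilt.
import numpy as np

def short_term_postfilter_coeffs(lpc_a, alpha=0.8, beta=0.5, mu=0.5):
    # lpc_a: prediction coefficients [a1..ap] of A(z) = 1 - sum_k a_k z^-k
    p = len(lpc_a)
    k = np.arange(1, p + 1)
    num = np.concatenate(([1.0], -lpc_a * beta**k))   # A(z/beta): expanded zeros
    den = np.concatenate(([1.0], -lpc_a * alpha**k))  # A(z/alpha): expanded poles
    tilt = np.array([1.0, -mu])                       # tilt-compensation zero
    return np.convolve(num, tilt), den

num, den = short_term_postfilter_coeffs(np.array([1.2, -0.6]))
```

Because alpha > beta, the poles sit closer to the formant resonances than the zeros, so formants are emphasized and inter-formant valleys (where coding noise is audible) are attenuated, while the tilt zero keeps the response from sounding muffled.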


Journal ArticleDOI
TL;DR: This paper presents an analysis of the filtered-X LMS algorithm using stochastic methods; the derived bounds and predicted dynamic behavior are found to correspond very well to simulation results.
Abstract: The presence of a transfer function in the auxiliary-path following the adaptive filter and/or in the error-path, as in the case of active noise control, has been shown to generally degrade the performance of the LMS algorithm. Thus, the convergence rate is lowered, the residual power is increased, and the algorithm can even become unstable. To ensure convergence of the algorithm, the input to the error correlator has to be filtered by a copy of the auxiliary-error-path transfer function. This paper presents an analysis of the filtered-X LMS algorithm using stochastic methods. The influence of off-line and on-line estimation of the error-path filter on the algorithm is also investigated. Some derived bounds and predicted dynamic behavior are found to correspond very well to simulation results.

262 citations
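The structural point of the abstract, that the reference input must be filtered by a copy of the auxiliary/error-path transfer function before entering the weight update, can be seen in a minimal filtered-X LMS loop. The paths, filter length, and step size below are an assumed toy setup, not the paper's simulations.

```python
# Minimal filtered-X LMS sketch: the error-path model S(z) filters the
# reference before it reaches the LMS correlator, as the abstract describes.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)                    # reference noise
primary = np.array([0.0, 0.5, -0.3])      # unknown primary path P(z)
secondary = np.array([1.0, 0.4])          # secondary/error path S(z), assumed known
d = np.convolve(x, primary)[:n]           # disturbance at the error sensor

L = 8                                     # adaptive FIR filter length
w = np.zeros(L)
xbuf = np.zeros(L)                        # recent reference samples
fxbuf = np.zeros(L)                       # recent *filtered* reference samples
ybuf = np.zeros(len(secondary))           # recent controller outputs
mu = 0.01
err = np.zeros(n)
for i in range(n):
    xbuf = np.roll(xbuf, 1); xbuf[0] = x[i]
    y = w @ xbuf                          # controller output
    ybuf = np.roll(ybuf, 1); ybuf[0] = y
    err[i] = d[i] - secondary @ ybuf      # residual at the error microphone
    xf = secondary @ xbuf[:len(secondary)]  # reference filtered by S-hat
    fxbuf = np.roll(fxbuf, 1); fxbuf[0] = xf
    w += mu * err[i] * fxbuf              # filtered-X LMS update
```

Omitting the `xf` filtering (plain LMS on `xbuf`) is exactly the configuration the abstract says can diverge; with the filtered reference the residual power decays toward zero.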


Journal ArticleDOI
TL;DR: An algorithm for reduction of broadband noise in speech based on signal subspaces is formulated by means of the quotient singular value decomposition (QSVD), which makes a prewhitening operation an integral part of the algorithm.
Abstract: We consider an algorithm for reduction of broadband noise in speech based on signal subspaces. The algorithm is formulated by means of the quotient singular value decomposition (QSVD). With this formulation, a prewhitening operation becomes an integral part of the algorithm. We demonstrate that this is essential in connection with updating issues in real-time recursive applications. We also illustrate by examples that we are able to achieve a satisfactory quality of the reconstructed signal.

239 citations
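When the noise matrix has full rank, the QSVD of a pair (noisy matrix A, noise matrix B) can be sketched as an ordinary SVD after prewhitening with a factor of B. This reduction is illustrative only; the paper's contribution concerns recursive updating of the decomposition, which this one-shot sketch does not attempt.

```python
# Sketch of QSVD-style noise reduction: prewhiten with the noise matrix,
# truncate the SVD to the signal subspace, then de-whiten.
import numpy as np

def prewhitened_truncation(A, B, rank):
    _, R = np.linalg.qr(B)                   # R^T R = B^T B: noise covariance factor
    Aw = A @ np.linalg.inv(R)                # prewhiten the noisy data
    U, s, Vt = np.linalg.svd(Aw, full_matrices=False)
    s[rank:] = 0.0                           # discard the noise subspace
    return (U * s) @ Vt @ R                  # de-whiten back to the original basis

rng = np.random.default_rng(6)
u, v = rng.normal(size=40), rng.normal(size=10)
S = 8.0 * np.outer(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))  # rank-1 "signal"
A = S + 0.3 * rng.normal(size=(40, 10))      # noisy observation
B = 0.3 * rng.normal(size=(100, 10))         # separate noise-only observation
S_hat = prewhitened_truncation(A, B, rank=1)
```

The prewhitening step is what lets a plain SVD truncation behave sensibly under colored noise: the noise statistics captured in B are equalized before the subspace split is made.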


Journal ArticleDOI
TL;DR: A new method for determining the instants of significant excitation in speech signals, based on the global phase characteristics of minimum phase signals, is proposed; it works well for all types of voiced speech from both male and female speakers, but only under noise-free conditions.
Abstract: A new method for determining the instants of significant excitation in speech signals is proposed. In the paper, significant excitation refers primarily to the instant of glottal closure within a pitch period in voiced speech. The method is based on the global phase characteristics of minimum phase signals. The average slope of the unwrapped phase of the short-time Fourier transform of linear prediction residual is calculated as a function of time. Instants where the phase slope function makes a positive zero-crossing are identified as significant excitations. The method is discussed in a source-filter context of speech production. The method is not sensitive to the characteristics of the filter. The influence of the type, length, and position of the analysis window is discussed. The method works well for all types of voiced speech from both male and female speakers, but in all cases under noise-free conditions only.

209 citations


Journal ArticleDOI
TL;DR: Five techniques for reducing acoustic feedback in hearing aids were investigated; a novel method for feedback cancellation with adaptation during quiet intervals provided the best overall performance.
Abstract: Five techniques for reducing acoustic feedback in hearing aids were investigated: an adaptive notch filter, three previously described methods for adaptive feedback cancellation, and a novel method for feedback cancellation with adaptation during quiet intervals. Through real-time implementations, these techniques were assessed for added stable gain and sound quality. Test results showed the novel system to provide the best overall performance.

186 citations


Journal ArticleDOI
TL;DR: A theoretical analysis of high-rate vector quantization (VQ) systems that use suboptimal, mismatched distortion measures is presented, and the application of the analysis to the problem of quantizing the linear predictive coding (LPC) parameters in speech coding systems is described.
Abstract: The paper presents a theoretical analysis of high-rate vector quantization (VQ) systems that use suboptimal, mismatched distortion measures, and describes the application of the analysis to the problem of quantizing the linear predictive coding (LPC) parameters in speech coding systems. First, it is shown that in many high-rate VQ systems the quantization distortion approaches a simple quadratically weighted error measure, where the weighting matrix is a "sensitivity matrix" that is an extension of the concept of the scalar sensitivity. The approximate performance of VQ systems that train and quantize using mismatched distortion measures is derived, and is used to construct better distortion measures. Second, these results are used to determine the performance of LPC vector quantizers, as measured by the log spectral distortion (LSD) measure, which have been trained using other error measures, such as mean-squared (MSE) or weighted mean-squared error (WMSE) measures of LPC parameters, reflection coefficients and transforms thereof, and line spectral pair (LSP) frequencies. Computationally efficient algorithms for computing the sensitivity matrices of these parameters are described. In particular, it is shown that the sensitivity matrix for the LSP frequencies is diagonal, implying that a WMSE measure of LSP frequencies converges to the LSD measure in high-rate VQ systems. Experimental results to support the theoretical performance estimates are provided.

182 citations


Journal ArticleDOI
TL;DR: The paper reviews the need for multidisciplinary research, for development of shared corpora and related resources, for computational support, and for rapid communication among researchers, together with the expected benefits of this technology.
Abstract: A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: (1) robust speech recognition; (2) automatic training and adaptation; (3) spontaneous speech; (4) dialogue models; (5) natural language response generation; (6) speech synthesis and speech generation; (7) multilingual systems; and (8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support, and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.

Journal ArticleDOI
TL;DR: A model of spectral shape analysis in the central auditory system is developed based on neurophysiological mappings in the primary auditory cortex and on results from psychoacoustical experiments in human subjects, showing that this representation is equivalent to performing an affine wavelet transform of the spectral pattern.
Abstract: A model of spectral shape analysis in the central auditory system is developed based on neurophysiological mappings in the primary auditory cortex and on results from psychoacoustical experiments in human subjects. The model suggests that the auditory system analyzes an input spectral pattern along three independent dimensions: a logarithmic frequency axis, a local symmetry axis, and a local spectral bandwidth axis. It is shown that this representation is equivalent to performing an affine wavelet transform of the spectral pattern and preserving both the magnitude (a measure of the scale or local bandwidth of the spectrum) and phase (a measure of the local symmetry of the spectrum). Such an analysis is in the spirit of the cepstral analysis commonly used in speech recognition systems, the major difference being that the double Fourier-like transformation that the auditory system employs is carried out in a local fashion. Examples of such a representation for various speech and synthetic signals are discussed, together with its potential significance and applications for speech and audio processing.

Journal ArticleDOI
TL;DR: MFB cepstra significantly outperform LPC cepstra under noisy conditions; techniques using an optimal linear combination of features for data reduction were also evaluated.
Abstract: This paper compares the word error rate of a speech recognizer using several signal processing front ends based on auditory properties. Front ends were compared with a control mel filter bank (MFB) based cepstral front end in clean speech and with speech degraded by noise and spectral variability, using the TI-105 isolated word database. MFB recognition error rates ranged from 0.5 to 26.9% in noise, depending on the SNR, and auditory models provided error rates as much as four percentage points lower. With speech degraded by linear filtering, MFB error rates ranged from 0.5 to 3.1%, and the reduction in error rates provided by auditory models was less than 0.5 percentage points. Some earlier studies that demonstrated considerably more improvement with auditory models used linear predictive coding (LPC) based control front ends. This paper shows that MFB cepstra significantly outperform LPC cepstra under noisy conditions. Techniques using an optimal linear combination of features for data reduction were also evaluated.
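The MFB cepstral front end used as the control can be sketched compactly: triangular filters spaced on a mel scale applied to the power spectrum, then a log and a DCT. The filter count, sample rate, and frame length below are common defaults, not the exact TI-105 configuration.

```python
# Minimal mel filter bank (MFB) cepstrum: power spectrum -> triangular mel
# filters -> log energies -> DCT-II.
import numpy as np

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mfb_cepstrum(frame, sr=8000, n_filt=20, n_cep=12):
    spec = np.abs(np.fft.rfft(frame))**2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    fb = np.zeros(n_filt)
    for i in range(n_filt):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)      # rising edge
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)    # falling edge
        fb[i] = np.sum(np.minimum(up, down) * spec)            # triangular filter
    logfb = np.log(np.maximum(fb, 1e-10))
    # DCT-II to decorrelate the log filter-bank energies
    k = np.arange(n_cep)[:, None]
    n = np.arange(n_filt)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filt))
    return dct @ logfb

frame = np.hanning(256) * np.sin(2 * np.pi * 1000 * np.arange(256) / 8000)
cep = mfb_cepstrum(frame)
```

The log compression and broad mel filters are what give MFB cepstra their noise robustness relative to LPC cepstra in the comparison above.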

Journal ArticleDOI
TL;DR: A discussion is given of two techniques for designing inverse filters for use in multichannel sound reproduction systems and the theory presented reconciles the two approaches and derives explicit conditions which must be fulfilled if an exact inverse is to exist.
Abstract: A discussion is given of two techniques for designing inverse filters for use in multichannel sound reproduction systems. The first is the multiple-input/output inverse filtering theorem, which uses direct inversion of a matrix containing the coefficients of filters used to specify the electroacoustic transmission paths. The second is an adaptive technique based on the multiple error LMS algorithm. The theory presented reconciles the two approaches and furthermore, derives explicit conditions which must be fulfilled if an exact inverse is to exist. A formula is derived which gives the number of coefficients required in the inverse filters in terms of the number of coefficients used to represent the transmission paths. Some numerical examples are also presented which illustrate the dependence of the mean square error on both the choice of modeling delay and the number of coefficients in the inverse filters. Finally, the results of some simulations are given which demonstrate the acoustical possibilities associated with these filtering techniques.

Journal ArticleDOI
TL;DR: A theoretical framework for Bayesian adaptive training of the parameters of a discrete hidden Markov model and a semi-continuous HMM with Gaussian mixture state observation densities is presented and the proposed MAP algorithms are shown to be effective especially in the cases in which the training or adaptation data are limited.
Abstract: A theoretical framework for Bayesian adaptive training of the parameters of a discrete hidden Markov model (DHMM) and of a semi-continuous HMM (SCHMM) with Gaussian mixture state observation densities is presented. In addition to formulating the forward-backward MAP (maximum a posteriori) and the segmental MAP algorithms for estimating the above HMM parameters, a computationally efficient segmental quasi-Bayes algorithm for estimating the state-specific mixture coefficients in SCHMM is developed. For estimating the parameters of the prior densities, a new empirical Bayes method based on the moment estimates is also proposed. The MAP algorithms and the prior parameter specification are directly applicable to training speaker adaptive HMMs. Practical issues related to the use of the proposed techniques for HMM-based speaker adaptation are studied. The proposed MAP algorithms are shown to be effective especially in the cases in which the training or adaptation data are limited.
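The reason MAP estimation helps with limited adaptation data can be seen in its simplest instance: for a Gaussian mean with a conjugate Gaussian prior, the MAP estimate interpolates between the prior mean and the sample mean of the adaptation data. The relevance factor `tau` below is illustrative; the paper's full algorithms additionally weight frames by state and mixture occupancy.

```python
# MAP update of a Gaussian mean with a conjugate prior: a weighted average of
# the prior mean and the adaptation-data mean, with weight tau on the prior.
import numpy as np

def map_mean(prior_mean, data, tau=10.0):
    n = len(data)
    return (tau * prior_mean + n * np.mean(data, axis=0)) / (tau + n)

prior = np.array([0.0, 0.0])
rng = np.random.default_rng(4)
adapt = rng.normal([2.0, -1.0], 0.1, size=(5, 2))    # few adaptation frames
few = map_mean(prior, adapt)                         # stays close to the prior
lots = map_mean(prior, rng.normal([2.0, -1.0], 0.1, size=(500, 2)))
```

With little data the estimate shrinks toward the prior (avoiding the unreliable maximum-likelihood estimate); with ample data it converges to the data mean, which matches the abstract's claim that MAP is most valuable when adaptation data are limited.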

Journal ArticleDOI
TL;DR: This work shows that significant improvements in performance are obtained as compared to an earlier system proposed by Jayant and Christensen (1981) for packetized speech systems, and that for a first-order Gauss-Markov source significant performance improvements can be obtained by using a second-order predictor instead of a first-order predictor.
Abstract: Speech quality in packetized speech systems can degrade substantially when packets are lost. We consider the problem of DPCM system design for packetized speech systems. The problem is formulated as a multiple description problem and the problem of optimal selection of the encoder and decoder filters is addressed. We show that significant improvements in performance are obtained as compared to an earlier system proposed by Jayant and Christensen (1981). Further, we show that for a first-order Gauss-Markov source significant performance improvements can be obtained by using a second-order predictor instead of a first-order predictor.

Journal ArticleDOI
Dennis R. Morgan1
TL;DR: The paper establishes a theoretical basis for the slow asymptotic convergence and suggests postfiltering as a remedy that would be useful for the full-band LMS AEC and may also be applicable to subband designs.
Abstract: In most acoustic echo canceler (AEC) applications, an adaptive finite impulse response (FIR) filter is employed with coefficients that are computed using the LMS algorithm. The paper establishes a theoretical basis for the slow asymptotic convergence that is often noted in practice for such applications. The analytical approach expresses the mean-square error trajectory in terms of eigenmodes and then applies the asymptotic theory of Toeplitz matrices to obtain a solution that is based on a general characterization of the actual room impulse response. The method leads to good approximations even for a moderate number of taps (N>16) and applies to both full-band and subband designs. Explicit mathematical expressions of the mean-square error convergence are derived for bandlimited white noise, a first-order Markov process, and, more generally, pth-order rational spectra and a direct power-law model, which relates to lowpass FIR filters. These expressions show that the asymptotic convergence is generally slow, being at best of order 1/t for bandlimited white noise. It is argued that input filter design cannot do much to improve slow convergence. However, the theory suggests postfiltering as a remedy that would be useful for the full-band LMS AEC and may also be applicable to subband designs.

Journal ArticleDOI
TL;DR: The paper presents the development and analysis of a narrowband adaptive noise equalizer (ANE), which can either amplify or attenuate narrowband noise.
Abstract: The paper presents the development and analysis of a narrowband adaptive noise equalizer (ANE), which can either amplify or attenuate narrowband noise. The output of the ANE system contains residual narrowband components, the amplitudes of which can be linearly and arbitrarily controlled by adjusting the gain parameter of the equalizer, thus providing the desired noise shaping capability. The characteristics of the ANE system are analyzed and applied to active noise control. >

Journal ArticleDOI
TL;DR: The a posteriori probability for the location of bursts of noise additively superimposed on a Gaussian AR process is derived to give a sequentially based restoration algorithm suitable for real-time applications.
Abstract: In this paper we derive the a posteriori probability for the location of bursts of noise additively superimposed on a Gaussian AR process. The theory is developed to give a sequentially based restoration algorithm suitable for real-time applications. The algorithm is particularly appropriate for digital audio restoration, where clicks and scratches may be modelled as additive bursts of noise. Experiments are carried out on both real audio data and synthetic AR processes and significant improvements are demonstrated over existing restoration techniques.
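The intuition behind AR-based click detection can be sketched without the paper's Bayesian machinery: a click that violates the AR model shows up as an outlier in the prediction residual, so thresholding the residual locates candidate bursts. The model order, threshold factor, and synthetic AR(2) process below are all illustrative choices.

```python
# Simplified click localization: fit an AR model by least squares, then flag
# samples whose prediction residual exceeds a robust threshold.
import numpy as np

def ar_residual(x, order):
    """Fit AR coefficients by least squares and return the prediction residual."""
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    a, *_ = np.linalg.lstsq(X, x[order:], rcond=None)
    return x[order:] - X @ a

def detect_clicks(x, order=4, k=5.0):
    resid = ar_residual(x, order)
    sigma = np.median(np.abs(resid)) / 0.6745      # robust noise-scale estimate
    return order + np.flatnonzero(np.abs(resid) > k * sigma)

rng = np.random.default_rng(5)
n = 2000
x = np.zeros(n)
for i in range(2, n):                              # a synthetic AR(2) process
    x[i] = 1.6 * x[i - 1] - 0.8 * x[i - 2] + rng.normal(0, 0.1)
x[700] += 3.0                                      # an additive "click"
hits = detect_clicks(x)
```

The paper's a posteriori formulation replaces this hard threshold with a proper probability of burst location, which is what enables principled restoration rather than just detection.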

Journal ArticleDOI
TL;DR: A new recursion is introduced that reduces the complexity of training a semi-Markov model with continuous output distributions; the cost of training is shown to be proportional to M^2+D, compared with M^2*D for the standard recursion.
Abstract: Introduces a new recursion that reduces the complexity of training a semi-Markov model with continuous output distributions. It is shown that the cost of training is proportional to M^2+D, compared to M^2*D with the standard recursion, where M is the observation vector length and D is the maximum allowed duration.

Journal ArticleDOI
TL;DR: Two approaches using HCNN and HSMLP to model the intonation pattern as a hidden Markov chain for assisting tone recognition are proposed, and the effectiveness of these schemes was confirmed by simulations on a speaker-independent tone recognition task.
Abstract: Several neural network-based tone recognition schemes for continuous Mandarin speech are discussed. A basic MLP tone recognizer using recognition features extracted from the syllable being processed is first introduced. Then, some additional features extracted from neighboring syllables are added to compensate for the coarticulation effect. It is then further improved to compensate for the effect of sandhi rules of tone pronunciation by including tone information of neighboring syllables. The recognition criterion is now changed to find the best tone sequence that minimizes the total risk that simultaneously considers tone recognition of all syllables in the input utterance. Last, two approaches using HCNN and HSMLP, respectively, to model the intonation pattern as a hidden Markov chain for assisting tone recognition are proposed. The effectiveness of these schemes was confirmed by simulations on a speaker-independent tone recognition task. A recognition rate of 86.72% was achieved.

Journal ArticleDOI
TL;DR: The paper presents an efficient method for tone recognition of isolated Cantonese syllables: suprasegmental feature parameters are extracted from the voiced portion of a monosyllabic utterance, and a three-layer feedforward neural network is used to classify these feature vectors.
Abstract: Tone identification is essential for the recognition of the Chinese language, specifically for Cantonese, which is well known for being very rich in tones. The paper presents an efficient method for tone recognition of isolated Cantonese syllables. Suprasegmental feature parameters are extracted from the voiced portion of a monosyllabic utterance and a three-layer feedforward neural network is used to classify these feature vectors. Using a phonologically complete vocabulary of 234 distinct syllables, the recognition accuracies for the single-speaker and multispeaker cases are 89.0% and 87.6%, respectively.

Journal ArticleDOI
TL;DR: The paper shows, both theoretically and experimentally, that whatever the noise estimation technique is, it is better to add this noise estimate to the reference clean models than to subtract it from the noisy data.
Abstract: The paper compares, on a database recorded in a car, a number of signal analysis and speech enhancement techniques as well as some approaches to adapt speech recognition systems. It is shown that a new nonlinear spectral subtraction associated with Mel frequency cepstral coefficients (MFCC) is an adequate compromise for low-cost integration. The Lombard effect is analyzed and simulated. Such a simulation is used to derive realistic training utterances from noise-free utterances. Adapting a continuous-density hidden Markov model (CDHMM) to these artificially generated training samples yields a very high performance with respect to that achieved within the ESPRIT adverse environment recognition of speech (ARS) project, i.e., an average of 1% error for all driving conditions. Finally, the paper shows, both theoretically and experimentally, that whatever the noise estimation technique is, it is better to add this noise estimate to the reference clean models than to subtract it from the noisy data.
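The subtractive baseline that the paper argues against can be sketched in its plainest form; the paper's version is a nonlinear variant operating with MFCCs, and `alpha`, the floor, and the assumed-known noise magnitude below are generic choices, not theirs.

```python
# Bare-bones magnitude spectral subtraction: subtract a noise magnitude
# estimate from the noisy spectrum, flooring to avoid negative magnitudes
# (the flooring is the usual source of "musical noise" artifacts).
import numpy as np

def spectral_subtract(noisy, noise_mag, alpha=1.0, floor=0.05):
    X = np.fft.rfft(noisy)
    mag, phase = np.abs(X), np.angle(X)
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)  # flooring step
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

rng = np.random.default_rng(7)
n = 256
t = np.arange(n)
clean = np.sin(2 * np.pi * 32 * t / n)
noisy = clean + 0.1 * rng.normal(size=n)
# expected magnitude of the noise spectrum (Rayleigh mean), assumed known here
noise_mag = 0.1 * np.sqrt(np.pi * n) / 2 * np.ones(n // 2 + 1)
out = spectral_subtract(noisy, noise_mag)
```

The paper's finding points the other way: rather than subtracting the noise estimate from the noisy data as above, it is better to add the same estimate to the clean reference models so that models and data are compared under matched noisy conditions.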

Journal ArticleDOI
TL;DR: Motivated by the need to separate vocal tract information from pitch information even under noisy conditions, the authors propose three LP analysis approaches, the first of which leads to the weighted least absolute value solution; it is revealed that the most robust method depends on the type of noise.
Abstract: Various linear predictive (LP) analysis methods are studied and compared from the points of view of robustness to noise and of application to speaker identification. The key to the success of the LP techniques is in separating the vocal tract information from the pitch information present in a speech signal even under noisy conditions. In addition to considering the conventional, one-shot weighted least-squares methods, the authors propose three other approaches with the above point as a motivation. The first is an iterative approach that leads to the weighted least absolute value solution. The second is an extension of the one-shot least-squares approach and achieves an iterative update of the weights. The update is a function of the residual and is based on minimizing a Mahalanobis distance. Third, the weighted total least-squares formulation is considered. A study of the deviations in the LP parameters is done when noise (white Gaussian and impulsive) is added to the speech. It is revealed that the most robust method depends on the type of noise. Closed-set speaker identification experiments with 20 speakers are conducted using a vector quantizer classifier trained on clean speech. The relative performance of the various LP approaches depends on the type of speech material used for testing.

Journal ArticleDOI
TL;DR: In adaptive noise cancelling, linear digital filters have been used to minimize the mean squared difference between filter outputs and the desired signal, but for non-Gaussian probability density functions of the involved signals, nonlinear filters can further reduce themean squared difference, thereby improving the signal-to-noise ratio at the system output.
Abstract: In adaptive noise cancelling, linear digital filters have been used to minimize the mean squared difference between filter outputs and the desired signal. However, for non-Gaussian probability density functions of the involved signals, nonlinear filters can further reduce the mean squared difference, thereby improving the signal-to-noise ratio at the system output. This is illustrated with a two-microphone beamformer for cancelling directional interference. In the case of a single uniformly distributed interference, we establish the optimum nonlinear performance limit. To approximate optimum performance, we realize two nonlinear filter architectures, the Volterra filter and the multilayer perceptron. The Volterra filter is also examined for speech interference. The beamformer is adapted to minimize the mean squared difference, but performance is measured with the intelligibility weighted gain. This criterion requires the signal-to-noise ratio at the beamformer output. For the nonlinear processor, this can only be determined when no target components exist in the reference channel of the noise canceller so that the target is transmitted without distortion. Under these ideal conditions and at equal filter lengths, the quadratic Volterra filter improves the intelligibility-weighted gain by maximally 2 dB relative to the linear filter.

Journal ArticleDOI
Jianing Dai1
TL;DR: The Markov chain model (MCM) offers a substantial reduction in computation, but at the expense of a significant increase in memory requirement when compared to the hidden Markov model (HMM).
Abstract: The paper describes how Markov chains may be applied to speech recognition. In this application, a spectral vector is modeled by a state of the Markov chain, and an utterance is represented by a sequence of states. The Markov chain model (MCM) offers a substantial reduction in computation, but at the expense of a significant increase in memory requirement when compared to the hidden Markov model (HMM). Experiments on isolated word recognition show that the MCM achieves results comparable to those of the HMMs evaluated for comparison.
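Because the states of an MCM are observed directly (each spectral vector is quantized to a codebook state), scoring an utterance requires no forward recursion, only table lookups, which is the source of the computational saving over an HMM. A sketch with illustrative probabilities:

```python
import numpy as np

def mcm_log_likelihood(states, log_pi, log_A):
    """Log-likelihood of an observed state sequence under a Markov chain:
    log pi[s_0] + sum_t log A[s_{t-1}, s_t].
    The states are observed, so no forward/Baum-Welch recursion is needed."""
    ll = log_pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        ll += log_A[prev, cur]
    return ll

# Illustrative two-state word model; recognition picks the word model
# whose chain assigns the quantized utterance the highest likelihood.
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1],
                [0.2, 0.8]])
ll = mcm_log_likelihood([0, 0, 1], log_pi, log_A)
```
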

Journal ArticleDOI
TL;DR: The novel approach is to construct a word production model using a previously suggested source generator framework, by employing knowledge of the statistical nature of duration and spectral variation of speech under stress, used in turn to produce simulated stressed speech training tokens from neutral speech tokens.
Abstract: It is known that speech recognition performance degrades if systems are not trained and tested under similar speaking conditions. This is particularly true if a speaker is exposed to demanding workload stress or noise. For recognition systems to be successful in applications susceptible to stress, speech recognizers should address the adverse conditions experienced by the user. The authors consider the problem of improved training of speech recognizers for various stressed speaking conditions (e.g., slow, loud, and Lombard effect speaking styles). The main objective is to devise a training procedure that produces a hidden Markov model recognizer that better characterizes a given stressed speaking style, without the need for directly collecting such stressed data. The novel approach is to construct a word production model using a previously suggested source generator framework [Hansen 1994], by employing knowledge of the statistical nature of duration and spectral variation of speech under stress. This model is used in turn to produce simulated stressed speech training tokens from neutral speech tokens. The token generation training method is shown to improve isolated word recognition by 24% for Lombard speech when compared to a neutral-trained isolated word recognizer. Further results are reported for isolated and keyword recognition scenarios.
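The token-generation idea (perturbing neutral tokens with the statistical duration and spectral changes typical of stressed speech) can be caricatured as below. The duration scale and spectral tilt are invented illustrative values, not the paper's learned statistics, and real token generation would draw them from measured distributions:

```python
import numpy as np

def simulate_stressed_token(frames, dur_scale=1.2, tilt_db_per_band=0.5):
    """Crude stand-in for stressed-token generation: stretch duration by
    repeating frames and add a linear tilt to log-spectral bands.
    dur_scale and tilt_db_per_band are illustrative, not from the paper."""
    frames = np.asarray(frames, dtype=float)          # (T, n_bands) log-spectra
    T = len(frames)
    # Duration perturbation: nearest-frame resampling to length T * dur_scale
    idx = np.minimum((np.arange(int(T * dur_scale)) / dur_scale).astype(int),
                     T - 1)
    stretched = frames[idx]
    # Spectral perturbation: a fixed tilt across bands
    tilt = tilt_db_per_band * np.arange(frames.shape[1])
    return stretched + tilt
```
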

Journal ArticleDOI
TL;DR: In this paper, a constrained-iterative feature-estimation algorithm is considered and shown to produce improved feature characterization in a variety of actual noise conditions, and an objective measure based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage.
Abstract: The introduction of acoustic background distortion into speech causes recognition algorithms to fail. In order to improve the environmental robustness of speech recognition in adverse conditions, a novel constrained-iterative feature-estimation algorithm is considered and shown to produce improved feature characterization in a variety of actual noise conditions. In addition, an objective-measure-based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage. The four measures considered are (i) NIST SNR; (ii) Itakura-Saito log-likelihood; (iii) log-area-ratio; (iv) the weighted-spectral slope measure. A continuous-distribution, monophone-based, hidden Markov model recognition algorithm is used for objective-measure-based MAP estimator analysis and recognition evaluation. Evaluations were based on speech data from the Credit Card corpus (CC-DATA). It is shown that feature enhancement provides a consistent level of recognition improvement for broadband and low-frequency colored noise sources. As the stationarity assumption for a given noise source breaks down, the ability of feature enhancement to improve recognition performance decreases. Finally, the log-likelihood-based MAP estimator was found to be the best predictor of recognition performance, while the NIST SNR-based MAP estimator was found to be the poorest across the 27 noise conditions considered.
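As a concrete example of the kind of objective measure listed above, a segmental SNR can be computed as follows. This is a frame-averaged relative of the NIST SNR (whose exact histogram-based definition is not reproduced here); the frame length and clamping thresholds are conventional illustrative values:

```python
import numpy as np

def segmental_snr(clean, degraded, frame_len=256,
                  floor_db=-10.0, ceil_db=35.0):
    """Frame-averaged (segmental) SNR in dB. Each frame's SNR is clamped
    to [floor_db, ceil_db] so silent frames do not dominate the average."""
    clean = np.asarray(clean, dtype=float)
    degraded = np.asarray(degraded, dtype=float)
    snrs = []
    for k in range(len(clean) // frame_len):
        s = clean[k * frame_len:(k + 1) * frame_len]
        e = s - degraded[k * frame_len:(k + 1) * frame_len]
        snr = 10.0 * np.log10((np.sum(s ** 2) + 1e-12) /
                              (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, floor_db, ceil_db))
    return float(np.mean(snrs))
```

A measure like this, computed between the enhanced features and a reference, is what the MAP estimator in the abstract maps to predicted recognition performance.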

Journal ArticleDOI
TL;DR: The results suggest that the combination of a flexible source generator framework to address stressed speaking conditions, and a feature enhancement algorithm that adapts based on speech-specific constraints, can be effective in reducing the consequences of stress and noise for robust automatic recognition.
Abstract: Studies have shown that recognizers are sensitive to noisy, stressful environments, with the severity depending on the speaker's task and environmental conditions. The focus of the study is to achieve robust recognition in diverse environmental conditions through the formulation of feature enhancement and stress equalization algorithms under the framework of source generator theory. The generator framework is considered as a means of modeling production variation under stressful speaking conditions. A multi-dimensional stress equalization procedure is formulated that produces recognition features less sensitive to varying factors caused by stress. A feature enhancement algorithm is employed based on iterative techniques previously derived for enhancement of speech in varying background noise environments. Combined stress equalization and feature enhancement reduces average word error rates across 10 noisy stressful conditions (e.g., noisy loud, angry, and Lombard effect stress conditions) by 38.7%. The results suggest that the combination of a flexible source generator framework to address stressed speaking conditions, and a feature enhancement algorithm that adapts based on speech-specific constraints, can be effective in reducing the consequences of stress and noise for robust automatic recognition.
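The paper's multi-dimensional stress equalization procedure is not specified in the abstract. As a loose, generic stand-in, per-utterance mean/variance normalization illustrates the general idea of removing condition-dependent shifts from recognition features; this is a standard technique, not the paper's algorithm:

```python
import numpy as np

def mean_variance_equalize(features):
    """Per-utterance mean/variance normalization of a (T, D) feature
    matrix: each feature dimension is shifted to zero mean and scaled to
    unit variance, removing gross per-condition offsets."""
    f = np.asarray(features, dtype=float)
    return (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-8)
```
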

Journal ArticleDOI
TL;DR: The dual-channel enhancement scheme is shown to follow the iterative expectation-maximization (EM) algorithm, resulting in a two-step dual-channel Wiener filtering scheme, and objective measures classified over individual phonemes for a subset of sentences from the TIMIT speech database show a consistent and superior improvement in quality.
Abstract: A new frequency-domain, constrained iterative algorithm is proposed for dual-channel speech enhancement. The dual-channel enhancement scheme is shown to follow the iterative expectation-maximization (EM) algorithm, resulting in a two-step dual-channel Wiener filtering scheme. A new technique for applying constraints during the EM iterations is developed so as to take advantage of the auditory properties of speech perception. An overriding goal is to enhance quality and at least maintain intelligibility of the estimated speech signal. Constraints are applied over time and iteration on mel-cepstral parameters which parametrize an auditory based spectrum. These constraints also adapt to changing speech characteristics over time with the aid of an adaptive boundary detector. Performance is demonstrated in three areas for speech degraded by additive white Gaussian noise, aircraft cockpit noise, and computer cooling-fan noise. First, global objective speech quality measures show improved quality when compared to unconstrained dual-channel Wiener filtering and a traditional LMS-based adaptive noise cancellation technique, over a range of signal-to-noise ratios and cross-talk levels. Second, time waveforms and frame-to-frame quality measures show good improvement, especially in unvoiced and transitional regions of speech. Informal listening tests confirm improvement in quality as measured by objective measures. Finally, objective measures classified over individual phonemes for a subset of sentences from the TIMIT speech database show a consistent and superior improvement in quality.
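Stripped of the paper's mel-cepstral constraints and auditory weighting, the two-step EM-style Wiener filtering loop alternates between estimating the clean-speech power spectrum and rebuilding the filter gain from it. A minimal per-frame sketch, assuming the noise PSD is known:

```python
import numpy as np

def wiener_gain(clean_psd, noise_psd):
    """Frequency-domain Wiener filter gain H = S / (S + N)."""
    return clean_psd / (clean_psd + noise_psd)

def iterative_wiener(noisy_psd, noise_psd, n_iter=3):
    """Two-step EM-style iteration: (E) re-estimate the clean-speech PSD
    from the current filter output, (M) rebuild the Wiener gain from it.
    The perceptual constraints of the paper are omitted in this sketch."""
    # Initial clean-PSD estimate by spectral subtraction, floored at a
    # small positive value
    clean_psd = np.maximum(noisy_psd - noise_psd, 1e-12)
    for _ in range(n_iter):
        H = wiener_gain(clean_psd, noise_psd)
        clean_psd = (H ** 2) * noisy_psd    # PSD of the filtered signal
    return H
```

In the constrained algorithm of the abstract, the clean-PSD re-estimate would additionally be smoothed across time and iteration in the mel-cepstral domain before the gain is rebuilt.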