
Showing papers in "IEEE Transactions on Speech and Audio Processing in 1999"


Journal ArticleDOI
TL;DR: This paper addresses the problem of single channel speech enhancement at very low signal-to-noise ratios (SNRs) (<10 dB) with a new computationally efficient algorithm developed based on masking properties of the human auditory system, resulting in improved results over classical subtractive-type algorithms.
Abstract: This paper addresses the problem of single channel speech enhancement at very low signal-to-noise ratios (SNRs) (<10 dB). The proposed approach is based on the introduction of an auditory model in a subtractive-type enhancement process. Single channel subtractive-type algorithms are characterized by a tradeoff between the amount of noise reduction, the speech distortion, and the level of musical residual noise, which can be modified by varying the subtraction parameters. Classical algorithms are usually limited to the use of fixed optimized parameters, which are difficult to choose for all speech and noise conditions. A new computationally efficient algorithm is developed based on masking properties of the human auditory system. It allows for an automatic adaptation in time and frequency of the parametric enhancement system, and finds the best tradeoff based on a criterion correlated with perception. This leads to a significant reduction of the unnatural structure of the residual noise. Objective and subjective evaluation of the proposed system is performed with several noise types from the Noisex-92 database, having different time-frequency distributions. The application of objective measures, the study of the speech spectrograms, as well as subjective listening tests, confirm that the enhanced speech is more pleasant to a human listener. Finally, the proposed enhancement algorithm is tested as a front-end processor for speech recognition in noise, resulting in improved results over classical subtractive-type algorithms.

631 citations
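
The tradeoff the abstract describes is easy to see in code. Below is a minimal, hypothetical sketch of parametric spectral subtraction in which the over-subtraction factor is relaxed in bins where residual noise would be masked anyway; the paper derives its threshold from a genuine auditory masking model, whereas this sketch substitutes a crude smoothed-spectrum stand-in, so the constants, the threshold, and the function name are illustrative only.

```python
import numpy as np

def spectral_subtract_masked(noisy, noise_psd, frame=256, hop=128):
    """Parametric spectral subtraction with a masking-adaptive
    over-subtraction factor (illustrative simplification of the
    paper's auditory-model criterion). noise_psd has rfft length."""
    win = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        seg = noisy[start:start + frame] * win
        spec = np.fft.rfft(seg)
        psd = np.abs(spec) ** 2
        # Crude stand-in for a masking threshold: a smoothed noisy PSD.
        mask = np.convolve(psd, np.ones(5) / 5, mode='same')
        # High masking threshold -> residual noise inaudible -> subtract less.
        alpha = 1.0 + 4.0 * (1.0 - mask / (mask + noise_psd))
        beta = 0.01                                   # spectral floor
        clean_psd = np.maximum(psd - alpha * noise_psd, beta * noise_psd)
        clean = np.sqrt(clean_psd) * np.exp(1j * np.angle(spec))
        out[start:start + frame] += np.fft.irfft(clean, frame) * win
    return out
```

The adaptation is visible in `alpha`: where masking is strong relative to the noise, less is subtracted, trading noise reduction for lower speech distortion.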


Journal ArticleDOI
TL;DR: A new form of covariance matrix which allows a few "full" covariance matrices to be shared over many distributions, whilst each distribution maintains its own "diagonal" covariance matrix, is introduced.
Abstract: There is normally a simple choice made in the form of the covariance matrix to be used with continuous-density HMMs. Either a diagonal covariance matrix is used, with the underlying assumption that elements of the feature vector are independent, or a full or block-diagonal matrix is used, where all or some of the correlations are explicitly modeled. Unfortunately when using full or block-diagonal covariance matrices there tends to be a dramatic increase in the number of parameters per Gaussian component, limiting the number of components which may be robustly estimated. This paper introduces a new form of covariance matrix which allows a few "full" covariance matrices to be shared over many distributions, whilst each distribution maintains its own "diagonal" covariance matrix. In contrast to other schemes which have hypothesized a similar form, this technique fits within the standard maximum-likelihood criterion used for training HMMs. The new form of covariance matrix is evaluated on a large-vocabulary speech-recognition task. In initial experiments the performance of the standard system was achieved using approximately half the number of parameters. Moreover, a 10% reduction in word error rate compared to a standard system can be achieved with less than a 1% increase in the number of parameters and little increase in recognition time.

624 citations
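
A sketch may help show why the shared-full/private-diagonal factorization is cheap at decode time. Assuming the semi-tied form Sigma = inv(A) diag(v) inv(A).T with one shared transform A, the per-frame cost is one matrix-vector product plus a diagonal Gaussian per component; names and the calling convention below are illustrative, not the paper's.

```python
import numpy as np

def semitied_loglik(x, mu, var, A, log_det_A):
    """Log-likelihood of one Gaussian under a semi-tied covariance
    Sigma = inv(A) @ diag(var) @ inv(A).T, where the full transform A
    is shared by many components. The frame is transformed once
    (z = A @ x, reusable across all components tied to A); each
    component then costs only a diagonal evaluation."""
    d = len(x)
    z = A @ x                      # shared per-frame transform
    zm = A @ mu                    # cacheable per component
    return (log_det_A
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum((z - zm) ** 2 / var))
```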


Journal ArticleDOI
TL;DR: A closed-form weighted-equation-error method is derived that computes the optimal mapping coefficient as a function of sampling rate, and the solution is shown to be generally indistinguishable from the optimal least-squares solution.
Abstract: Use of a bilinear conformal map to achieve a frequency warping nearly identical to that of the Bark frequency scale is described. Because the map takes the unit circle to itself, its form is that of the transfer function of a first-order allpass filter. Since it is a first-order map, it preserves the model order of rational systems, making it a valuable frequency warping technique for use in audio filter design. A closed-form weighted-equation-error method is derived that computes the optimal mapping coefficient as a function of sampling rate, and the solution is shown to be generally indistinguishable from the optimal least-squares solution. The optimal Chebyshev mapping is also found to be essentially identical to the optimal least-squares solution. The expression 0.8517[arctan(0.06583 fs)]^(1/2) - 0.1916 is shown to accurately approximate the optimal allpass coefficient as a function of sampling rate fs in kHz for sampling rates greater than 1 kHz. A filter design example is included that illustrates improvements due to carrying out the design over a Bark scale. Corresponding results are also given and compared for approximating the related "equivalent rectangular bandwidth (ERB) scale" of Moore and Glasberg (Acta Acustica, vol. 82, pp. 335-345, 1996) using a first-order allpass transformation. Due to the higher frequency resolution called for by the ERB scale, particularly at low frequencies, the first-order conformal map is less able to follow the desired mapping, and the error is two to three times greater than in the Bark-scale case, depending on the sampling rate.

432 citations
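
The closed-form fit quoted above can be evaluated directly; the sketch below computes the allpass coefficient and the induced frequency mapping of the first-order bilinear warping (the mapping formula is the standard identity for the allpass substitution, not code from the paper).

```python
import numpy as np

def bark_warp_coeff(fs_khz):
    """Closed-form fit quoted in the abstract (fs in kHz, fs > 1 kHz)."""
    return 0.8517 * np.sqrt(np.arctan(0.06583 * fs_khz)) - 0.1916

def warped_frequency(omega, rho):
    """Frequency mapping of the first-order allpass substitution:
    omega in rad/sample -> warped frequency in rad/sample."""
    return omega + 2.0 * np.arctan(rho * np.sin(omega)
                                   / (1.0 - rho * np.cos(omega)))

rho = bark_warp_coeff(44.1)              # about 0.756 at 44.1 kHz
omega = np.linspace(0.0, np.pi, 8)
print(rho)
print(warped_frequency(omega, rho))      # low frequencies get stretched
```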


Journal ArticleDOI
TL;DR: This paper examines the problem of phasiness in the context of time-scale modification and provides new insights into its causes, and two extensions to the standard phase vocoder algorithm are introduced, and the resulting sound quality is shown to be significantly improved.
Abstract: The phase vocoder is a well established tool for time scaling and pitch shifting speech and audio signals via modification of their short-time Fourier transforms (STFTs). In contrast to time-domain time-scaling and pitch-shifting techniques, the phase vocoder is generally considered to yield high quality results, especially for large modification factors and/or polyphonic signals. However, the phase vocoder is also known for introducing a characteristic perceptual artifact, often described as "phasiness", "reverberation", or "loss of presence". This paper examines the problem of phasiness in the context of time-scale modification and provides new insights into its causes. Two extensions to the standard phase vocoder algorithm are introduced, and the resulting sound quality is shown to be significantly improved. Moreover, the modified phase vocoder is shown to provide a factor-of-two decrease in computational cost.

355 citations
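
For reference, a bare-bones time-scale modification with the standard phase vocoder looks like the sketch below; it implements only the conventional magnitude/phase-propagation loop, i.e., exactly the baseline whose "phasiness" the paper's two extensions (not reproduced here) are designed to remove. Frame size and hop are illustrative.

```python
import numpy as np

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Standard phase vocoder time-scale modification
    (rate > 1 slows the signal down). No phase locking."""
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft, hop)
    stft = np.array([np.fft.rfft(win * x[s:s + n_fft]) for s in starts])

    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft  # bin freqs, rad/sample
    y = np.zeros(int(len(x) * rate) + n_fft)
    phase = np.angle(stft[0])
    pos, t = 0, 0.0
    while t < len(stft) - 1:
        i = int(t)
        # Instantaneous frequency from the heterodyned phase increment.
        dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - omega * hop
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))   # wrap to [-pi, pi]
        inst_freq = omega + dphi / hop
        frame = np.fft.irfft(np.abs(stft[i]) * np.exp(1j * phase), n_fft)
        y[pos:pos + n_fft] += win * frame
        phase += inst_freq * hop        # advance synthesis phase by one hop
        pos += hop
        t += 1.0 / rate                 # step through analysis frames
    return y[:pos + n_fft]
```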


Journal ArticleDOI
TL;DR: An automatic technique for estimating and modeling the glottal flow derivative source waveform from speech, and applying the model parameters to speaker identification, is presented.
Abstract: An automatic technique for estimating and modeling the glottal flow derivative source waveform from speech, and applying the model parameters to speaker identification, is presented. The estimate of the glottal flow derivative is decomposed into coarse structure, representing the general flow shape, and fine structure, comprising aspiration and other perturbations in the flow, from which model parameters are obtained. The glottal flow derivative is estimated using an inverse filter determined within a time interval of vocal-fold closure that is identified through differences in formant frequency modulation during the open and closed phases of the glottal cycle. This formant motion is predicted by Ananthapadmanabha and Fant (1982) to be a result of time-varying and nonlinear source/vocal tract coupling within a glottal cycle. The glottal flow derivative estimate is modeled using the Liljencrants-Fant (1986) model to capture its coarse structure, while the fine structure of the flow derivative is represented through energy and perturbation measures. The model parameters are used in a Gaussian mixture model speaker identification (SID) system. Both coarse- and fine-structure glottal features are shown to contain significant speaker-dependent information. For a large TIMIT database subset, averaging over male and female SID scores, the coarse-structure parameters achieve about 60% accuracy, the fine-structure parameters give about 40% accuracy, and their combination yields about 70% correct identification. Finally, in preliminary experiments on the counterpart telephone-degraded NTIMIT database, about a 5% error reduction in SID scores is obtained when source features are combined with traditional mel-cepstral measures.

332 citations


Journal ArticleDOI
TL;DR: This paper presents a new approach to an auditory model for robust speech recognition in noisy environments that consists of cochlear bandpass filters and nonlinear operations in which frequency information of the signal is obtained by zero-crossing intervals.
Abstract: This paper presents a new approach to an auditory model for robust speech recognition in noisy environments. The proposed model consists of cochlear bandpass filters and nonlinear operations in which frequency information of the signal is obtained by zero-crossing intervals. Intensity information is also incorporated by a peak detector and a compressive nonlinearity. The robustness of the zero-crossings in spectral estimation is verified by analyzing the variance of the level-crossing intervals as a function of the crossing level values. Compared with other auditory models, the proposed auditory model is computationally efficient, free from many unknown parameters, and able to serve as a robust front-end for speech recognition in noisy environments. Experimental results of speech recognition demonstrate the robustness of the proposed method in various types of noisy environments.

273 citations
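
A minimal sketch of the zero-crossing idea: per band, each interval between upward zero crossings votes at its inverse-interval frequency, weighted by a compressed peak amplitude. The Butterworth bank, bin edges, and log compression below are stand-ins for the paper's cochlear filters and nonlinearities.

```python
import numpy as np
from scipy.signal import butter, lfilter

def zc_spectrum(x, fs, bands, n_bins=20):
    """Zero-crossing-interval spectral estimate in the spirit of the
    abstract. `bands` is a list of (low, high) band edges in Hz."""
    edges = np.logspace(np.log10(100.0), np.log10(0.45 * fs), n_bins + 1)
    hist = np.zeros(n_bins)
    for lo, hi in bands:
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        y = lfilter(b, a, x)
        up = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # upward crossings
        for i in range(len(up) - 1):
            freq = fs / (up[i + 1] - up[i])             # inverse interval
            peak = np.max(np.abs(y[up[i]:up[i + 1]]))   # peak detector
            k = np.searchsorted(edges, freq) - 1
            if 0 <= k < n_bins:
                hist[k] += np.log1p(peak)               # compressive weight
    return hist
```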


Journal ArticleDOI
TL;DR: This paper develops a topic-dependent, sentence-level mixture language model which takes advantage of the topic constraints in a sentence or article, and introduces topic- dependent dynamic adaptation techniques in the framework of the mixture model, using n-gram caches and content word unigram caches.
Abstract: Standard statistical language models use n-grams to capture local dependencies, or use dynamic modeling techniques to track dependencies within an article. In this paper, we investigate a new statistical language model that captures topic-related dependencies of words within and across sentences. First, we develop a topic-dependent, sentence-level mixture language model which takes advantage of the topic constraints in a sentence or article. Second, we introduce topic-dependent dynamic adaptation techniques in the framework of the mixture model, using n-gram caches and content word unigram caches. Experiments with the static (or unadapted) mixture model on the North American Business (NAB) task show a 21% reduction in perplexity and a 3-4% improvement in recognition accuracy over a general n-gram model, giving a larger gain than that obtained with supervised dynamic cache modeling. Further experiments on the Switchboard corpus also showed a small improvement in performance with the sentence-level mixture model. Cache modeling techniques introduced in the mixture framework contributed a further 14% reduction in perplexity and a small improvement in recognition accuracy on the NAB task for both supervised and unsupervised adaptation.

196 citations
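
The defining property of the sentence-level mixture is that topic weights combine whole-sentence likelihoods rather than per-word probabilities. A toy sketch, assuming each topic model is a dictionary from (history, word) to probability (a stand-in for the paper's topic-dependent n-grams):

```python
import numpy as np

def sentence_mixture_logprob(sentence, topic_models, weights):
    """Sentence-level mixture LM: log sum_t w_t prod_i P_t(w_i | h_i),
    i.e. mixing at the sentence level, not word-level interpolation."""
    log_joint = np.log(np.asarray(weights, dtype=float))
    for hist, word in sentence:
        for t, model in enumerate(topic_models):
            log_joint[t] += np.log(model.get((hist, word), 1e-7))
    return np.logaddexp.reduce(log_joint)

# Toy usage: two "topics", one bigram probability each.
lm0 = {(('the',), 'market'): 0.2}
lm1 = {(('the',), 'market'): 0.01}
s = [(('the',), 'market')]
print(sentence_mixture_logprob(s, [lm0, lm1], [0.5, 0.5]))
```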


Journal ArticleDOI
TL;DR: This contribution presents a detailed analysis of a widely used set of parameters, the mel frequency cepstral coefficients (MFCCs), and suggests a new parameterization approach taking into account the whole energy zone in the spectrum.
Abstract: The focus of a continuous speech recognition process is to match an input signal with a set of words or sentences according to some optimality criteria. The first step of this process is parameterization, whose major task is data reduction by converting the input signal into parameters while preserving virtually all of the speech signal information dealing with the text message. This contribution presents a detailed analysis of a widely used set of parameters, the mel frequency cepstral coefficients (MFCCs), and suggests a new parameterization approach taking into account the whole energy zone in the spectrum. Results obtained with the proposed new coefficients give confidence in their use in a large-vocabulary speaker-independent continuous-speech recognition system.

194 citations
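
For context, the MFCC pipeline the paper analyzes is, per frame: power spectrum, mel-spaced triangular filter bank, log, DCT. A compact textbook-style sketch (the filter-bank size and coefficient count are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

def mfcc(frame, fs, n_filters=24, n_ceps=13):
    """Textbook MFCC for one pre-emphasized, windowed frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    n_fft = len(frame)

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters centered on mel-uniform points.
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    log_energy = np.log(fbank @ spec + 1e-10)
    # Type-II DCT decorrelates the log filter-bank energies.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_energy
```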


Journal ArticleDOI
Sassan Ahmadi1, Andreas Spanias1
TL;DR: An improved cepstrum-based voicing detection and pitch determination algorithm is presented and is shown to be robust to additive noise and performance analysis on a large database indicates considerable improvement relative to the conventional cepStrum method.
Abstract: An improved cepstrum-based voicing detection and pitch determination algorithm is presented. Voicing decisions are made using a multifeature voiced/unvoiced classification algorithm based on statistical analysis of cepstral peak, zero-crossing rate, and energy of short-time segments of the speech signal. Pitch frequency information is extracted by a modified cepstrum-based method and then carefully refined using pitch tracking, correction, and smoothing algorithms. Performance analysis on a large database indicates considerable improvement relative to the conventional cepstrum method. The proposed algorithm is also shown to be robust to additive noise.

192 citations
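
The conventional baseline being improved here is short enough to sketch: take the real cepstrum of a frame (longer than the largest pitch lag), pick the peak in the expected lag range, and gate it with zero-crossing-rate and energy features. Thresholds below are illustrative; the paper replaces them with a statistically trained multi-feature classifier plus tracking, correction, and smoothing.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Cepstrum pitch baseline: peak pick in the pitch-lag range,
    gated by cepstral peak height, zero-crossing rate, and energy."""
    win = frame * np.hamming(len(frame))
    spec = np.fft.rfft(win)
    cep = np.fft.irfft(np.log(np.abs(spec) + 1e-10))   # real cepstrum
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    peak_q = qmin + np.argmax(cep[qmin:qmax])
    cep_peak = cep[peak_q]
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    energy = np.mean(frame ** 2)
    voiced = cep_peak > 0.08 and zcr < 0.3 and energy > 1e-4
    return fs / peak_q if voiced else 0.0    # 0.0 flags an unvoiced frame
```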


Journal ArticleDOI
TL;DR: A new objective estimation approach that uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizing blocks that reflects the magnitude of a perceived distance between two perceptually transformed signals.
Abstract: Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts have found limited success, primarily in analog and higher-rate, error-free digital environments where speech waveforms are preserved or nearly preserved. The objective estimation of the perceived quality of highly compressed digital speech, possibly with bit errors or frame erasures has remained an open question. We report our findings regarding two essential components of objective estimators of perceived speech quality: perceptual transformations and distance measures. A perceptual transformation modifies a representation of an audio signal in a way that is approximately equivalent to the human hearing process. A distance measure reflects the magnitude of a perceived distance between two perceptually transformed signals. We then describe a new objective estimation approach that uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizing blocks. Each measuring normalizing block integrates two perceptually transformed signals over some time or frequency interval to determine the average difference across that interval. This difference is then normalized out of one signal, and is further processed to generate one or more measurements.

147 citations
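
A single measuring normalizing block, as described, integrates the difference between two perceptually transformed signals over an interval and removalizes it out of one of them. A toy one-dimensional sketch follows; the real estimator applies a hierarchy of such blocks over multiple time and frequency scales, and the band splitting here is purely illustrative.

```python
import numpy as np

def mnb_step(ref, test, n_bands):
    """One measuring normalizing block over a frequency axis: measure
    the mean difference on each of n_bands intervals, normalize it
    out of `test`, and return the measurements."""
    test = test.astype(float).copy()
    measurements = []
    for band in np.array_split(np.arange(len(ref)), n_bands):
        diff = float(np.mean(test[band] - ref[band]))
        test[band] -= diff                 # remove what was measured
        measurements.append(diff)
    return test, np.array(measurements)
```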


Journal ArticleDOI
TL;DR: A way to objectively evaluate DTD algorithms based on the standard statistical methods of detection theory is developed and a receiver operating characteristic (ROC) is derived to characterize DTD performance.
Abstract: Echo cancelers commonly employ a doubletalk detector (DTD), which is essential to keep the adaptive filter from diverging in the presence of near-end speech and other disruptive noise. There have been numerous algorithms to detect doubletalk in an acoustic echo canceler (AEC). In those applications, typically, the threshold is chosen only by some heuristic method and the performance evaluation is very subjective. In this study, we develop a way to objectively evaluate DTD algorithms based on the standard statistical methods of detection theory. A receiver operating characteristic (ROC) is derived to characterize DTD performance. Several DTD algorithms are examined and simulated under typical real-world operating conditions using measured room responses and signals taken from a digital speech database. The DTD methods are then evaluated and compared using the ROC metric.
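
The evaluation recipe is generic: run any doubletalk detector over labeled data at a sweep of thresholds and plot probability of detection against probability of false alarm. A sketch using the classic Geigel detector as the example DTD (the detector choice and window length are illustrative, not the paper's):

```python
import numpy as np

def geigel_dtd(far_end, mic, threshold, window=256):
    """Geigel doubletalk detector: flag doubletalk when the mic
    magnitude exceeds threshold times the recent far-end peak."""
    decisions = np.zeros(len(mic), dtype=bool)
    for n in range(window, len(mic)):
        recent_peak = np.max(np.abs(far_end[n - window:n]))
        decisions[n] = np.abs(mic[n]) > threshold * recent_peak
    return decisions

def roc_point(decisions, truth):
    """One (P_f, P_d) point; sweeping the threshold traces the ROC."""
    p_detect = np.mean(decisions[truth])    # hits given doubletalk
    p_false = np.mean(decisions[~truth])    # false alarms otherwise
    return p_false, p_detect
```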

Journal ArticleDOI
TL;DR: From the theoretical and experimental study, it is seen that the recognition rates increase as the number of speakers in the training set increases, and it is shown that the common vector obtained from Criterion 2 represents the common properties of a spoken word better than the common or average vector obtainedfrom Criterion 1.
Abstract: A voice signal contains the psychological and physiological properties of the speaker as well as dialect differences, acoustical environment effects, and phase differences. For these reasons, the same word uttered by different speakers can be very different. In this paper, two theories are developed by considering two optimization criteria applied to both the training set and the test set. The first theory is well known; it uses what is called Criterion 1 here and yields the average of all vectors belonging to the words in the training set. The second theory is a novel approach that uses what is called Criterion 2 here to extract the common properties of all vectors belonging to the words in the training set. It is shown that Criterion 2 is superior to Criterion 1 as far as the training set is concerned. In Criterion 2, the individual differences are obtained by subtracting a reference vector from the other vectors, and the individual difference vectors are used to obtain an orthogonal vector basis via the Gram-Schmidt orthogonalization method. The common vector is obtained by subtracting from any vector of the training set its projections onto the orthogonal basis vectors. It is proved that this common vector is unique for any word class in the training set and independent of the chosen reference vector. This common vector is used in isolated word recognition, and it is also shown that Criterion 2 is superior to Criterion 1 for the test set. From the theoretical and experimental study, it is seen that the recognition rates increase as the number of speakers in the training set increases. This means that the common vector obtained from Criterion 2 represents the common properties of a spoken word better than the average vector obtained from Criterion 1.
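
The Criterion 2 construction is pure linear algebra and can be sketched directly: form difference vectors against a reference utterance, Gram-Schmidt them into an orthonormal basis, and project that subspace out of a training vector. This is a minimal reading of the procedure described above, with illustrative names:

```python
import numpy as np

def common_vector(word_vectors):
    """Criterion 2: difference vectors against a reference utterance,
    Gram-Schmidt orthonormalization, then removal of those directions.
    The paper proves the result is independent of the reference."""
    vecs = [np.asarray(v, dtype=float) for v in word_vectors]
    ref = vecs[0]
    basis = []
    for v in vecs[1:]:
        d = v - ref                        # individual difference vector
        for b in basis:                    # Gram-Schmidt step
            d = d - np.dot(d, b) * b
        norm = np.linalg.norm(d)
        if norm > 1e-10:
            basis.append(d / norm)
    common = ref.copy()
    for b in basis:                        # project differences out
        common -= np.dot(common, b) * b
    return common
```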

Journal ArticleDOI
TL;DR: A fast, exact implementation of the filtered-X least mean square adaptive filter for which the system's complexity scales according to the number of filter coefficients within the system is developed.
Abstract: In some situations where active noise control could be used, the well-known multichannel version of the filtered-X least mean square (LMS) adaptive filter is too computationally complex to implement. We develop a fast, exact implementation of this adaptive filter for which the system's complexity scales according to the number of filter coefficients within the system. In addition, we extend computationally efficient methods for effectively removing the delays of the secondary paths within the coefficient updates to the multichannel case, thus yielding fast implementations of the LMS adaptive algorithm for multichannel active noise control. Examples illustrate both the equivalence of the algorithms to their original counterparts and the computational gains provided by the new algorithms.
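
As a baseline for what the paper accelerates, here is the plain single-channel filtered-X LMS recursion, in which the reference is filtered through the secondary-path estimate before entering the coefficient update. The secondary path is assumed known, and the lengths and step size are illustrative; the paper's contribution, an exact fast multichannel version, is not reproduced here.

```python
import numpy as np

def fxlms(x, d, s_hat, n_taps=64, mu=1e-4):
    """Single-channel filtered-X LMS for active noise control.
    x: reference signal, d: primary noise at the error mic,
    s_hat: secondary-path impulse response estimate."""
    w = np.zeros(n_taps)                      # control filter
    xf = np.convolve(x, s_hat)[:len(x)]       # filtered reference
    y = np.zeros(len(x))                      # anti-noise output
    e = np.zeros(len(x))
    L = len(s_hat)
    for n in range(n_taps, len(x)):
        y[n] = w @ x[n - n_taps + 1:n + 1][::-1]
        yh = y[max(0, n - L + 1):n + 1][::-1]
        e[n] = d[n] + s_hat[:len(yh)] @ yh    # residual at the error mic
        w -= mu * e[n] * xf[n - n_taps + 1:n + 1][::-1]   # FxLMS update
    return w, e
```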

Journal ArticleDOI
TL;DR: A noisy-vowel corpus is used and four possible models for audiovisual speech recognition are proposed, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.
Abstract: Audiovisual speech recognition involves fusion of the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams for taking a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition: direct identification in the first case, separate identification in the second, and two variants of the third, early-integration case, namely dominant recoding and motor recoding. However, no systematic comparison of these models is available in the literature. We propose an implementation of these four models and submit them to a benchmark test. To this end, we use a noisy-vowel corpus tested on two recognition paradigms in which the systems are tested at noise levels higher than those used for learning. In one of these paradigms, the signal-to-noise ratio (SNR) value is provided to the recognition systems; in the other it is not. We also introduce a new criterion for evaluating performances, based on transmitted information on individual phonetic features. In light of the compared performances of the four models with the two recognition paradigms, we discuss the advantages and drawbacks of these models, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.

Journal ArticleDOI
TL;DR: Examination of the spectrogram displays for the enhanced speech shows that the H∞ filtering approach tends to be more effective where the assumptions on the noise statistics are less valid, and the proposed approach is straightforward to implement.
Abstract: This paper presents a new approach to speech enhancement based on H∞ filtering. This approach differs from the traditional modified Wiener/Kalman filtering approach in the following two aspects: (1) no a priori knowledge of the noise source statistics is required; the only assumption made is that the noise signals have finite energy; (2) the estimation criterion for the filter design is to minimize the worst possible amplification of the estimation error signals in terms of the modeling errors and additive noise. Since most additive noise in speech is non-Gaussian, this estimation approach is highly robust and more appropriate for practical speech enhancement. The proposed approach is straightforward to implement, as detailed in this paper. Experimental results show consistently superior enhancement performance of the H∞ filtering algorithm over its Kalman filtering counterpart, measured by the global signal-to-noise ratio (SNR). Examination of the spectrogram displays for the enhanced speech shows that the H∞ filtering approach tends to be more effective where the assumptions on the noise statistics are less valid.

Journal ArticleDOI
TL;DR: A (single) speech model is proposed which satisfactorily describes voiced and unvoiced speech and silence, and also allows for exploitation of the long term characteristics of noise, and a mathematically equivalent algorithm is devised, by exploiting the sparsity of the matrices concerned.
Abstract: In this work, we are concerned with optimal estimation of clean speech from its noisy version based on a speech model we propose. We first propose a (single) speech model which satisfactorily describes voiced and unvoiced speech and silence (i.e., pauses between speech utterances), and also allows for exploitation of the long term characteristics of noise. We then reformulate the model equations so as to facilitate subsequent application of the well-established Kalman filter for computing the optimal estimate of the clean speech in the minimum-mean-square-error sense. Since the standard algorithm for Kalman-filtering involves multiplications of very large matrices and thus demands high computational cost, we devise a mathematically equivalent algorithm which is computationally much more efficient, by exploiting the sparsity of the matrices concerned. Next, we present the methods we use for estimating the model parameters and give a complete description of the enhancement process. Performance assessment based on spectrogram plots, objective measures and informal subjective listening tests all indicate that our method gives consistently good results. As far as signal-to-noise ratio is concerned, the improvements over existing methods can be as high as 4 dB.
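
The Kalman machinery underneath the proposal is standard and worth sketching: put an AR(p) speech model in companion (state-space) form and filter the noisy samples. This sketch uses a fixed AR model and dense matrices; the paper's contributions (a combined voiced/unvoiced/silence speech model, parameter estimation, and a sparsity-exploiting fast algorithm) sit on top of this baseline.

```python
import numpy as np

def kalman_enhance(noisy, ar, q, r):
    """Kalman filtering of noisy speech with a fixed AR(p) model.
    q: excitation variance, r: observation-noise variance."""
    p = len(ar)
    F = np.zeros((p, p))                # state transition (companion form)
    F[0, :] = ar
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros(p)
    H[0] = 1.0                          # we observe the current sample
    Q = np.zeros((p, p))
    Q[0, 0] = q
    x = np.zeros(p)
    P = np.eye(p)
    out = np.zeros(len(noisy))
    for n, yn in enumerate(noisy):
        x = F @ x                       # predict
        P = F @ P @ F.T + Q
        s = H @ P @ H + r               # innovation variance
        k = P @ H / s                   # Kalman gain
        x = x + k * (yn - H @ x)        # update with the new sample
        P = P - np.outer(k, H @ P)
        out[n] = x[0]                   # MMSE estimate of the clean sample
    return out
```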

Journal ArticleDOI
TL;DR: Two approaches are concentrated on extracting features that are robust against channel variations and transforming the speaker models to compensate for channel effects, which resulted in a 38% relative improvement on the closed-set 30-s training 5-s testing condition of the NIST'95 Evaluation task.
Abstract: This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: (1) extracting features that are robust against channel variations and (2) transforming the speaker models to compensate for channel effects. First, an experimental study shows that optimizing the front end processing of the speech signal can significantly improve speaker recognition performance. A new filterbank design is introduced to improve the robustness of the speech spectrum computation in the front-end unit. Next, a new feature based on spectral slopes is described. Its ability to discriminate between speakers is shown to be superior to that of the traditional cepstrum. This feature can be used alone or combined with the cepstrum. The second part of the paper presents two model transformation methods that further reduce channel effects. These methods make use of a locally collected stereo database to estimate a speaker-independent variance transformation for each speech feature used by the classifier. The transformations constructed on this stereo database can then be applied to speaker models derived from other databases. Combined, the methods developed in this paper resulted in a 38% relative improvement on the closed-set 30-s training 5-s testing condition of the NIST'95 Evaluation task, after cepstral mean removal.

Journal ArticleDOI
TL;DR: In this paper, a corpus of sustained vowel sounds is examined to ensure dynamical invariance and the sounds are assessed by a range of invariant geometric features developed for the analysis of chaotic systems such as correlation dimension, Lyapunov exponents and short-term predictability.
Abstract: This paper addresses suggestions in the literature that the generation of speech is a nonlinear process. This has sparked great interest in the area of nonlinear analysis of speech, with a number of studies being conducted to investigate whether low dimensional chaotic attractors exist for speech. This paper examines a corpus of sustained vowel sounds which were recorded for this study to ensure dynamical invariance. The sounds are assessed by a range of invariant geometric features developed for the analysis of chaotic systems, such as correlation dimension, Lyapunov exponents, and short-term predictability. The results presented suggest that although voiced speech is well characterized by a small number of dimensions, it is not necessarily chaotic. Finally, a synthesis technique for voiced sounds is developed, inspired by the technique for estimating the Lyapunov exponents.
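
Of the invariants mentioned, the correlation dimension is the easiest to sketch, via the Grassberger-Procaccia correlation sum on a delay embedding. The O(N^2) distance matrix below is written for clarity, so keep test segments short; the embedding parameters and radii are illustrative.

```python
import numpy as np

def correlation_dimension(x, m=5, tau=5):
    """Grassberger-Procaccia estimate: D2 is the slope of
    log C(r) versus log r over the scaling region."""
    N = len(x) - (m - 1) * tau
    emb = np.stack([x[i * tau:i * tau + N] for i in range(m)], axis=1)
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    dist = d[np.triu_indices(N, k=1)]        # pairwise distances
    lo = max(np.percentile(dist, 5), 1e-6)
    hi = np.percentile(dist, 50)
    radii = np.logspace(np.log10(lo), np.log10(hi), 10)
    C = np.array([np.mean(dist < r) for r in radii])   # correlation sum
    slope = np.polyfit(np.log(radii), np.log(C + 1e-12), 1)[0]
    return slope
```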

Journal ArticleDOI
TL;DR: This paper gives an analytical description of an adaptive microphone array that facilitates a simple built-in calibration to the environment and instrumentation, and provides speech enhancement and acoustic echo-cancellation.
Abstract: This paper gives an analytical description of an adaptive microphone array that facilitates a simple built-in calibration to the environment and instrumentation. This method, suggested for use in hands-free mobile telephones and speech recognition systems for cars, provides speech enhancement and acoustic echo-cancellation. The scheme offers several advantages, such as a simple calibration procedure, suppression of directional sources, versatile robust beamforming, and reduced target signal distortion. The analysis employs noncausal Wiener filters yielding compact and effective theoretical suppression limits.

Journal ArticleDOI
TL;DR: This study proposes a new approach which combines stress classification and speech recognition functions into one algorithm by generalizing the one-dimensional (1-D) hidden Markov model to an N-channel hiddenMarkov model (N-channel HMM).
Abstract: Robust speech recognition systems must address variations due to perceptually induced stress in order to maintain acceptable levels of performance in adverse conditions. One approach for addressing these variations is to utilize front-end stress classification to direct a stress dependent recognition algorithm which separately models each speech production domain. This study proposes a new approach which combines stress classification and speech recognition functions into one algorithm. This is accomplished by generalizing the one-dimensional (1-D) hidden Markov model to an N-channel hidden Markov model (N-channel HMM). Here, each stressed speech production style under consideration is allocated a dimension in the N-channel HMM to model each perceptually induced stress condition. It is shown that this formulation better integrates perceptually induced stress effects for stress independent recognition. This is due to the sub-phoneme (state level) stress classification that is implicitly performed by the algorithm. The proposed N-channel stress independent HMM method is compared to a previously established one-channel stress dependent isolated word recognition system yielding a 73.8% reduction in error rate. In addition, an 82.7% reduction in error rate is observed compared to the common one-channel neutral trained recognition approach.

Journal ArticleDOI
TL;DR: It is shown that the frame scores from the GVQ-trained codebooks are less correlated; therefore, the sentence-level speaker identification rate increases more quickly with the length of the test sentences.
Abstract: A novel method, referred to as group vector quantization (GVQ), is proposed to train VQ codebooks for closed-set speaker identification. In GVQ training, speaker codebooks are optimized for vector groups rather than for individual vectors. An evaluation experiment has been conducted to compare the codebooks trained by the Linde-Buzo-Gray (LBG), the learning vector quantization (LVQ), and the GVQ algorithms. It is shown that the frame scores from the GVQ-trained codebooks are less correlated; therefore, the sentence-level speaker identification rate increases more quickly with the length of the test sentences.

Journal ArticleDOI
TL;DR: Analysis of the zeros for HRTFs on the horizontal plane showed that the nonminimum-phase zero variation was well formulated using a simple pinna-reflection model, and the common-acoustical-pole and zero (CAPZ) model is thus effective for modeling and analyzing HRTFs.
Abstract: Use of a common-acoustical-pole and zero model is proposed for modeling head-related transfer functions (HRTFs) for various directions of sound incidence. The HRTFs are expressed using the common acoustical poles, which do not depend on the source directions, and the zeros, which do. The common acoustical poles are estimated as they are common to HRTFs for various source directions; the estimated values of the poles agree well with the resonance frequencies of the ear canal. Because this model uses only the zeros to express the HRTF variations due to changes in source direction, it requires fewer parameters (the order of the zeros) that depend on the source direction than do the conventional all-zero or pole/zero models. Furthermore, the proposed model can extract the zeros that are missed in the conventional models because of pole-zero cancellation. As a result, the directional dependence of the zeros can be traced well. Analysis of the zeros for HRTFs on the horizontal plane showed that the nonminimum-phase zero variation was well formulated using a simple pinna-reflection model. The common-acoustical-pole and zero (CAPZ) model is thus effective for modeling and analyzing HRTFs.

Journal ArticleDOI
TL;DR: The experimental results show that the adopted prior distribution and the proposed techniques help to improve the performance robustness under the examined mismatch conditions.
Abstract: We study a category of robust speech recognition problem in which mismatches exist between training and testing conditions, and no accurate knowledge of the mismatch mechanism is available. The only available information is the test data along with a set of pretrained Gaussian mixture continuous density hidden Markov models (CDHMMs). We investigate the problem from the viewpoint of Bayesian prediction. A simple prior distribution, namely constrained uniform distribution, is adopted to characterize the uncertainty of the mean vectors of the CDHMMs. Two methods, namely a model compensation technique based on Bayesian predictive density and a robust decision strategy called Viterbi Bayesian predictive classification are studied. The proposed methods are compared with the conventional Viterbi decoding algorithm in speaker-independent recognition experiments on isolated digits and TI connected digit strings (TIDIGITS), where the mismatches between training and testing conditions are caused by: (1) additive Gaussian white noise, (2) each of 25 types of actual additive ambient noises, and (3) gender difference. The experimental results show that the adopted prior distribution and the proposed techniques help to improve the performance robustness under the examined mismatch conditions.

Journal ArticleDOI
TL;DR: The key result developed here is an explicit expression for the cross-covariance between the log-periodograms of the clean and noisy signals that is used to show that the covariance matrix of cepstral components representing N signal samples is a fixed, signal-independent matrix which approaches a diagonal matrix at a rate of 1/N.
Abstract: Explicit expressions for the second-order statistics of cepstral components representing clean and noisy signal waveforms are derived. The noise is assumed additive to the signal, and the spectral components of each process are assumed statistically independent complex Gaussian random variables. The key result developed here is an explicit expression for the cross-covariance between the log-periodograms of the clean and noisy signals. In the absence of noise, this expression is used to show that the covariance matrix of cepstral components representing N signal samples is a fixed, signal-independent matrix which approaches a diagonal matrix at a rate of 1/N. In addition, the cross-covariance expression is used to develop an explicit linear minimum mean square error estimator for the clean cepstral components given noisy cepstral components. Recognition results on the English digits using the fixed covariance and linear estimator are presented.

Journal ArticleDOI
TL;DR: By combining an interframe quantizer and a memoryless "safety-net" quantizer, it is demonstrated that the advantages of both quantization strategies can be utilized, and the performance for both noiseless and noisy channels improves.
Abstract: In linear predictive speech coding algorithms, transmission of linear predictive coding (LPC) parameters-often transformed to the line spectrum frequencies (LSF) representation-consumes a large part of the total bit rate of the coder. Typically, the LSF parameters are highly correlated from one frame to the next, and a considerable reduction in bit rate can be achieved by exploiting this interframe correlation. However, interframe coding leads to error propagation if the channel is noisy, which possibly cancels the achievable gain. In this paper, several algorithms for exploiting interframe correlation of LSF parameters are compared. Especially, performance for transmission over noisy channels is examined, and methods to improve noisy channel performance are proposed. By combining an interframe quantizer and a memoryless "safety-net" quantizer, we demonstrate that the advantages of both quantization strategies can be utilized, and the performance for both noiseless and noisy channels improves. The results indicate that the best interframe method performs as well as a memoryless quantizing scheme, with 4 bits less per frame. Subjective listening tests have been employed that verify the results from the objective measurements.
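
The safety-net mechanism costs one bit per frame: quantize with both the interframe (predictive) scheme and the memoryless scheme, keep whichever reconstruction is better, and transmit the choice. A minimal sketch with nearest-neighbor codebook search; the scalar AR(1) predictor and plain codebooks are illustrative stand-ins for the paper's quantizers.

```python
import numpy as np

def encode_lsf(lsf, prev_lsf, pred_coeff, pred_cb, safety_cb):
    """Safety-net quantization of one LSF frame. pred_cb quantizes
    the prediction residual; safety_cb quantizes the LSFs directly."""
    pred = pred_coeff * prev_lsf                       # interframe prediction
    resid = lsf - pred
    i_p = np.argmin(np.sum((pred_cb - resid) ** 2, axis=1))
    rec_p = pred + pred_cb[i_p]
    i_m = np.argmin(np.sum((safety_cb - lsf) ** 2, axis=1))
    rec_m = safety_cb[i_m]
    if np.sum((rec_p - lsf) ** 2) <= np.sum((rec_m - lsf) ** 2):
        return 'pred', i_p, rec_p      # mode flag costs one bit per frame
    return 'safety', i_m, rec_m        # memoryless mode stops error propagation
```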

Journal ArticleDOI
TL;DR: A measure for the strength of the excitation based on the Frobenius norm of the differenced signal is proposed and illustrated for speech under different types of degradations and for speech from different speakers.
Abstract: We study the robustness of a group-delay-based method for determining the instants of significant excitation in speech signals. These instants correspond to the instants of glottal closure for voiced speech. The method uses the properties of the global phase characteristics of minimum phase signals. Robustness of the method against noise and distortion is due to the fact that the average phase characteristics of a signal are determined mainly by the strength of the excitation impulse. The strength of excitation is determined by the energy of the residual error signal around the instant of excitation. We propose a measure for the strength of the excitation based on the Frobenius norm of the differenced signal. The robustness of the group-delay-based method is illustrated for speech under different types of degradations and for speech from different speakers.
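
A minimal reading of the proposed strength measure: difference the signal and take a short-window norm around each sample, with peaks marking instants of significant excitation. Treating the Frobenius norm of the windowed data as a scaled RMS energy is a simplification made here for illustration, and the window length is arbitrary.

```python
import numpy as np

def excitation_strength(x, half_window=40):
    """Strength-of-excitation contour: windowed norm of the
    first-differenced signal; peaks suggest glottal closure instants."""
    dx = np.diff(x)
    strength = np.zeros(len(dx))
    for n in range(half_window, len(dx) - half_window):
        seg = dx[n - half_window:n + half_window]
        strength[n] = np.linalg.norm(seg) / np.sqrt(len(seg))
    return strength
```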

Journal ArticleDOI
TL;DR: It is shown that the F-ratio tests indicate better separability of vowels by using scale-transform-based features than mel-transform-based features.
Abstract: In this paper, we study the scale transform of the spectral envelope of speech utterances by different speakers. This study is motivated by the hypothesis that the formant frequencies of different speakers are approximately related by a scaling constant for a given vowel. The scale transform has the fundamental property that the magnitudes of the scale transform of a function X(f) and of its scaled version √α·X(αf) are the same. The methods presented here are useful in reducing variations in acoustic features. We show that the F-ratio tests indicate better separability of vowels by using scale-transform-based features than mel-transform-based features. The data used in the comparison of the different features consist of 200 utterances of four vowels that are extracted from the TIMIT database.
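
The invariance property is easy to verify numerically: warping the frequency axis logarithmically turns scaling into a shift, which the Fourier magnitude then discards. A sketch of the scale-transform magnitude via exponential resampling, with a toy check of the invariance (grid sizes and the test envelope are illustrative):

```python
import numpy as np

def scale_transform_mag(X, f, n_out=256):
    """|Scale transform| via exponential resampling. Requires f[1] > 0."""
    u = np.linspace(np.log(f[1]), np.log(f[-1]), n_out)   # u = ln f
    Xw = np.interp(np.exp(u), f, X)
    g = Xw * np.exp(u / 2.0)       # f^(1/2) weight makes it scale-unitary
    return np.abs(np.fft.fft(g))

# Toy check: a scaled envelope should give (nearly) the same magnitude.
f = np.linspace(0.0, 4000.0, 2048)
X = np.exp(-((f - 500.0) / 80.0) ** 2)
a = 1.3
Xs = np.sqrt(a) * np.interp(a * f, f, X, right=0.0)       # sqrt(a) * X(a f)
m1, m2 = scale_transform_mag(X, f), scale_transform_mag(Xs, f)
print(np.max(np.abs(m1 - m2)) / np.max(m1))   # small -> scale-invariant
```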

Journal ArticleDOI
TL;DR: On this task, the use of state information reduced the percentage of Gaussians computed to 10-15%, compared with 20-30% for the standard GS scheme, with little degradation in performance.
Abstract: This paper investigates the use of Gaussian selection (GS) to increase the speed of a large vocabulary speech recognition system. Typically, 30-70% of the computational time of a continuous density hidden Markov model-based (HMM-based) speech recognizer is spent calculating probabilities. The aim of GS is to reduce this load by selecting the subset of Gaussian component likelihoods that should be computed given a particular input vector. This paper examines new techniques for obtaining "good" Gaussian subsets or "shortlists." All the new schemes make use of state information, specifically, to which state each of the Gaussian components belongs. In this way, a maximum number of Gaussian components per state may be specified, hence reducing the size of the shortlist. The first technique introduced is a simple extension of the standard GS method, which uses this state information. Then, more complex schemes based on maximizing the likelihood of the training data are proposed. These new approaches are compared with the standard GS scheme on a large vocabulary speech recognition task. On this task, the use of state information reduced the percentage of Gaussians computed to 10-15%, compared with 20-30% for the standard GS scheme, with little degradation in performance.
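
A sketch of shortlist construction with the state constraint described above: for each VQ codeword (the centroids are assumed given), gather the nearby Gaussians, capping how many may come from any one state. The distance threshold, the cap, and the plain Euclidean distance are illustrative choices, not the paper's likelihood-based schemes.

```python
import numpy as np

def build_shortlists(means, states, centroids, max_per_state=4, theta=3.0):
    """Gaussian-selection shortlists with a per-state cap.
    means: (G, d) Gaussian means, states: length-G state indices,
    centroids: (C, d) VQ codewords."""
    shortlists = []
    for c in centroids:
        dist = np.sum((means - c) ** 2, axis=1)
        order = np.argsort(dist)
        picked = []
        per_state = {}
        for g in order:
            if dist[g] > theta and picked:     # always keep at least one
                break
            s = states[g]
            if per_state.get(s, 0) < max_per_state:
                picked.append(int(g))
                per_state[s] = per_state.get(s, 0) + 1
        shortlists.append(picked)
    return shortlists
```

At decode time each input vector is quantized to its nearest centroid and only the Gaussians on that codeword's shortlist are evaluated exactly; the remainder are backed off to a flooring value.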

Journal ArticleDOI
TL;DR: A comparison of the two sets of features indicates that J1(t) can be used to model the hearing perception much like the mel cepstral coefficients.
Abstract: A compact representation of speech is possible using Bessel functions because of the similarity between voiced speech and the Bessel functions. Both voiced speech and the Bessel functions exhibit quasiperiodicity and decaying amplitude with time. This paper presents the results of speaker identification experiments using features obtained from (1) the Fourier-Bessel expansion and (2) the cepstral representation of speech frames. Identification scores of 65% and 76% were achieved using features based on the J1(t) expansion of air-to-ground speech transmission databases of 143 and 1054 test utterances, respectively. The corresponding scores for the two databases using cepstral coefficients of a comparable size were 80% and 88%. A comparison of the two sets of features indicates that J1(t) can be used to model the hearing perception much like the mel cepstral coefficients.

Journal ArticleDOI
TL;DR: The experimental results for a rectangular room, in which the residue values are interpolated or extrapolated by using a cosine function or a linear prediction method, demonstrate that unknown RTFs can be well estimated at low frequencies from known (measured) R TFs by using the proposed methods.
Abstract: A method is proposed for modeling a room transfer function (RTF) by using common acoustical poles and their residues. The common acoustical poles correspond to the resonance frequencies (eigenfrequencies) of the room, so they are independent of the source and receiver positions. The residues correspond to the eigenfunctions of the room. Therefore, the residue, which is a function of the source and receiver positions, can be expressed using simple analytical functions for rooms with a simple geometry such as rectangular. That is, the proposed model can describe RTF variations using simple residue functions. Based on the proposed common-acoustical-pole and residue model, methods are also proposed for spatially interpolating and extrapolating RTFs. Because the common acoustical poles are invariant in a given room, the interpolation or extrapolation of RTFs is reformulated as a problem of interpolating or extrapolating residue values. The experimental results for a rectangular room, in which the residue values are interpolated or extrapolated by using a cosine function or a linear prediction method, demonstrate that unknown RTFs can be well estimated at low frequencies from known (measured) RTFs by using the proposed methods.