
Showing papers in "IEEE Transactions on Speech and Audio Processing in 2001"


Journal ArticleDOI
TL;DR: An unbiased noise estimator is developed which derives the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal by minimizing a conditional mean square estimation error criterion in each time step.
Abstract: We describe a method to estimate the power spectral density of nonstationary noise when a noisy speech signal is given. The method can be combined with any speech enhancement algorithm which requires a noise power spectral density estimate. In contrast to other methods, our approach does not use a voice activity detector. Instead, it tracks spectral minima in each frequency band without any distinction between speech activity and speech pause. By minimizing a conditional mean square estimation error criterion in each time step, we derive the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal. Based on the optimally smoothed power spectral density estimate and the analysis of the statistics of spectral minima, an unbiased noise estimator is developed. The estimator is well suited for real-time implementations. Furthermore, to improve the performance in nonstationary noise, we introduce a method to speed up the tracking of the spectral minima. Finally, we evaluate the proposed method in the context of speech enhancement and low bit rate speech coding with various noise types.
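
The core loop of such a tracker, recursive smoothing followed by a sliding spectral minimum, can be sketched briefly. The sketch below is only a simplified illustration with a fixed smoothing constant and invented names; the paper's actual contribution, the optimal time- and frequency-dependent smoothing parameter and the bias compensation of the minimum, is omitted here.

```python
import numpy as np

def noise_psd_min_tracking(noisy_psd_frames, alpha=0.85, win=100):
    """Toy minimum-statistics noise tracker.
    noisy_psd_frames: (n_frames, n_bins) array of |Y(k)|^2 per frame."""
    n_frames, _ = noisy_psd_frames.shape
    smoothed = np.empty_like(noisy_psd_frames)
    smoothed[0] = noisy_psd_frames[0]
    for t in range(1, n_frames):
        # first-order recursive smoothing; the paper derives an optimal
        # time- and frequency-dependent alpha instead of this fixed value
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * noisy_psd_frames[t]
    noise_est = np.empty_like(smoothed)
    for t in range(n_frames):
        lo = max(0, t - win + 1)
        # the minimum of the smoothed PSD over a sliding window tracks the
        # noise floor through speech activity, with no voice activity detector
        noise_est[t] = smoothed[lo:t + 1].min(axis=0)
    return noise_est  # the paper additionally compensates the bias of the minimum
```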

1,731 citations


Journal ArticleDOI
TL;DR: The posterior probabilities computed on word graphs are compared with two alternative confidence measures, i.e., the acoustic stability and the hypothesis density, and are shown to outperform all other confidence measures.
Abstract: In this paper, we present several confidence measures for large vocabulary continuous speech recognition. We propose to estimate the confidence of a hypothesized word directly as its posterior probability, given all acoustic observations of the utterance. These probabilities are computed on word graphs using a forward-backward algorithm. We also study the estimation of posterior probabilities on N-best lists instead of word graphs and compare both algorithms in detail. In addition, we compare the posterior probabilities with two alternative confidence measures, i.e., the acoustic stability and the hypothesis density. We present experimental results on five different corpora: the Dutch ARISE 1k evaluation corpus, the German Verbmobil '98 7k evaluation corpus, the English North American Business '94 20k and 64k development corpora, and the English Broadcast News '96 65k evaluation corpus. We show that the posterior probabilities computed on word graphs outperform all other confidence measures. The relative reduction in confidence error rate ranges between 19% and 35% compared to the baseline confidence error rate.
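
As a rough illustration of the N-best variant mentioned in the abstract: exponentiate and normalize the hypothesis scores, then sum the posteriors of all hypotheses containing a given word. This is a deliberately simplified sketch (confidence is assigned per word type rather than per word position, and the scaling factor is a free parameter); the paper's primary method instead runs a forward-backward pass over word graphs.

```python
import numpy as np

def word_confidences(nbest, scale=1.0):
    """nbest: list of (total_log_score, word_sequence) pairs.
    Returns word -> summed posterior of the hypotheses containing it."""
    scores = np.array([s for s, _ in nbest], dtype=float)
    post = np.exp(scale * (scores - scores.max()))  # numerically stable softmax
    post /= post.sum()
    conf = {}
    for p, (_, words) in zip(post, nbest):
        for w in set(words):
            conf[w] = conf.get(w, 0.0) + p
    return conf
```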

496 citations


Journal ArticleDOI
TL;DR: A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of the extracted audio features, is proposed to segment and classify audio signals.
Abstract: While current approaches for audiovisual data segmentation and classification are mostly focused on visual cues, audio signals may actually play a more important role in content parsing for many applications. An approach to automatic segmentation and classification of audiovisual data based on audio content analysis is proposed. The audio signal from movies or TV programs is segmented and classified into basic types such as speech, music, song, environmental sound, speech with music background, environmental sound with music background, silence, etc. Simple audio features including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks are extracted to ensure the feasibility of real-time processing. A heuristic rule-based procedure is proposed to segment and classify audio signals and built upon morphological and statistical analysis of the time-varying functions of these audio features. Experimental results show that the proposed scheme achieves an accuracy rate of more than 90% in audio classification.
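
Two of the low-cost features named above, short-time energy and the average zero-crossing rate, are cheap enough to compute per frame, which is what makes real-time processing feasible. A minimal sketch (frame sizes and names are illustrative):

```python
import numpy as np

def short_time_features(x, frame_len=512, hop=256):
    """Per-frame short-time energy and average zero-crossing rate."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2)
        # fraction of adjacent-sample sign changes
        zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
        feats.append((energy, zcr))
    return np.array(feats)
```

Rule-based thresholds on the time-varying behavior of such features then separate, e.g., silence (low energy) from speech and music segments.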

473 citations


Journal ArticleDOI
TL;DR: A linear-correction least-squares estimation procedure is proposed for the source localization problem under an additive measurement error model and yields an efficient source location estimator without assuming a priori knowledge of noise distribution.
Abstract: A linear-correction least-squares estimation procedure is proposed for the source localization problem under an additive measurement error model. The method, which can be easily implemented in a real-time system with moderate computational complexity, yields an efficient source location estimator without assuming a priori knowledge of noise distribution. Alternative existing estimators, including likelihood-based, spherical intersection, spherical interpolation, and quadratic-correction least-squares estimators, are reviewed and comparisons of their complexity, estimation consistency and efficiency against the Cramer-Rao lower bound are made. Numerical studies demonstrate that the proposed estimator performs better under many practical situations.

461 citations


Journal ArticleDOI
TL;DR: Three new features derived from the nonlinear Teager (1980) energy operator (TEO) are investigated for stress classification and it is shown that the TEO-CB-Auto-Env feature outperforms traditional pitch and mel-frequency cepstrum coefficients (MFCC) substantially.
Abstract: Studies have shown that variability introduced by stress or emotion can severely reduce speech recognition accuracy. Techniques for detecting or assessing the presence of stress could help improve the robustness of speech recognition systems. Although some acoustic variables derived from linear speech production theory have been investigated as indicators of stress, they are not always consistent. Three new features derived from the nonlinear Teager (1980) energy operator (TEO) are investigated for stress classification. It is believed that the TEO-based features are better able to reflect the nonlinear airflow structure of speech production under adverse stressful conditions. The features proposed include TEO-decomposed FM variation (TEO-FM-Var), normalized TEO autocorrelation envelope area (TEO-Auto-Env), and critical-band-based TEO autocorrelation envelope area (TEO-CB-Auto-Env). The proposed features are evaluated for the task of stress classification using simulated and actual stressed speech, and it is shown that the TEO-CB-Auto-Env feature outperforms traditional pitch and mel-frequency cepstrum coefficients (MFCC) substantially. The performance of the TEO-based features is maintained in both text-dependent and text-independent models, while the performance of the traditional features degrades in text-independent models. Overall neutral versus stress classification rates are also shown to be more consistent across different stress styles.
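
The discrete-time TEO at the heart of these features is a three-sample expression, psi[n] = x[n]^2 - x[n-1]*x[n+1]; the proposed features apply it within filter bands (critical bands in the case of TEO-CB-Auto-Env) and then analyze the area under the autocorrelation envelope of the result. The operator itself:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```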

433 citations


Journal ArticleDOI
TL;DR: A spherical harmonics analysis is used to derive performance bounds on how well an array of loudspeakers can recreate a three-dimensional (3-D) plane-wave sound field within a spherical region of space.
Abstract: Reproduction of a sound field is a fundamental problem in acoustic signal processing. In this paper, we use a spherical harmonics analysis to derive performance bounds on how well an array of loudspeakers can recreate a three-dimensional (3-D) plane-wave sound field within a spherical region of space. Specifically, we develop a relationship between the number of loudspeakers, the size of the reproduction sphere, the frequency range, and the desired accuracy. We also provide analogous results for the special case of reproduction of a two-dimensional (2-D) sound field. Results are verified through computer simulations.
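
The kind of relationship the paper derives can be illustrated with the rule of thumb usually quoted from this line of work: truncating the spherical harmonic expansion at order N, with N on the order of ceil(e*k*R/2) for wavenumber k and sphere radius R, requires (N+1)^2 expansion coefficients and hence at least that many loudspeakers. A sketch under that assumption (the constants and function names are mine, not the paper's):

```python
import numpy as np

def min_loudspeakers(freq_hz, radius_m, c=343.0):
    """Rule-of-thumb loudspeaker count for reproducing a plane-wave field
    up to freq_hz within a sphere of radius radius_m."""
    k = 2 * np.pi * freq_hz / c                    # wavenumber
    order = int(np.ceil(np.e * k * radius_m / 2))  # truncation order N
    return (order + 1) ** 2                        # harmonics up to order N

# e.g. min_loudspeakers(1000.0, 0.1): reproduction up to 1 kHz in a 10 cm sphere
```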

378 citations


Journal ArticleDOI
Yannis Stylianou1
TL;DR: The harmonic plus noise model (HNM) for concatenative text-to-speech (TTS) synthesis provides high-quality speech synthesis while outperforming other models for synthesis (e.g., TD-PSOLA) in intelligibility, naturalness, and pleasantness.
Abstract: This paper describes the application of the harmonic plus noise model (HNM) for concatenative text-to-speech (TTS) synthesis. In the context of HNM, speech signals are represented as a time-varying harmonic component plus a modulated noise component. The decomposition of a speech signal into these two components allows for more natural-sounding modifications of the signal (e.g., by using different and better adapted schemes to modify each component). The parametric representation of speech using HNM provides a straightforward way of smoothing discontinuities of acoustic units around concatenation points. Formal listening tests have shown that HNM provides high-quality speech synthesis while outperforming other models for synthesis (e.g., TD-PSOLA) in intelligibility, naturalness, and pleasantness.
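
The signal model itself is compact: a sum of harmonics of the fundamental plus a noise component, which HNM modulates and restricts to the band above the maximum voiced frequency. A minimal synthesis sketch, with plain white noise standing in for the modeled noise part and all parameter names invented for the example:

```python
import numpy as np

def hnm_synthesize(f0, amps, phases, noise_gain, fs, dur):
    """Harmonic-plus-noise synthesis: sum of harmonics of f0 plus noise."""
    t = np.arange(int(fs * dur)) / fs
    harmonic = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t + p)
                   for k, (a, p) in enumerate(zip(amps, phases)))
    noise = noise_gain * np.random.randn(len(t))  # stand-in for the modulated noise part
    return harmonic + noise
```

Handling the two components separately (e.g., time-scaling the harmonic part while regenerating the noise part) is what gives the model its flexibility for smoothing at concatenation points.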

371 citations


Journal ArticleDOI
TL;DR: This work sets out to find an objective spectral measure for discontinuity and studies the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities.
Abstract: A common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is the most likely cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity. To this end, several spectral distance measures are related to the results of a listening experiment. Then, we studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels on the basis of the best performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the number of audible discontinuities.
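
One simple instance of such an objective measure is the distance between the spectra on either side of the concatenation point. The sketch below uses a Euclidean distance between log-magnitude spectra; the paper compares several spectral distances against listener judgments rather than committing to this particular one.

```python
import numpy as np

def boundary_distance(unit_a, unit_b, n_fft=512):
    """Spectral mismatch across a concatenation point: distance between the
    log spectra of the last frame of unit_a and the first frame of unit_b."""
    fa = np.abs(np.fft.rfft(unit_a[-n_fft:], n_fft))
    fb = np.abs(np.fft.rfft(unit_b[:n_fft], n_fft))
    eps = 1e-12  # avoid log(0)
    return np.linalg.norm(np.log(fa + eps) - np.log(fb + eps))
```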

283 citations


Journal ArticleDOI
TL;DR: The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames and derives a voicing condition for speech frames based on the relation between the skewness and kurtosis of voiced speech.
Abstract: This paper presents a robust algorithm for voice activity detection (VAD) based on newly established properties of the higher order statistics (HOS) of speech. Analytical expressions for the third- and fourth-order cumulants of the LPC residual of short-term speech are derived assuming a sinusoidal model. The flat spectral feature of this residual results in distinct characteristics for these cumulants in terms of phase, periodicity and harmonic content and yields closed-form expressions for the skewness and kurtosis. Important properties about these cumulants and their similarity with the autocorrelation function are revealed from this exploratory part. They show that the HOS of speech are sufficiently distinct from those of Gaussian noise and can be used as a basis for speech detection. Their immunity to Gaussian noise makes them particularly useful in algorithms designed for low SNR environments. The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames. The variance of the HOS estimators is quantified and used to yield a likelihood measure for noise frames. Moreover, a voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech. The performance of the algorithm is compared to the ITU-T G.729B VAD in various noise conditions and quantified using the probabilities of correct and false classification. The results show that the proposed algorithm has an overall better performance than G.729B, with noticeable improvement for Gaussian-like noises, such as street and parking-garage noise, at moderate to low SNRs.
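
At the estimation level, the two HOS quantities the detector relies on reduce to the sample skewness and kurtosis of each LPC residual frame; both are asymptotically zero for Gaussian input, which is what gives the features their noise immunity. A minimal estimator sketch:

```python
import numpy as np

def hos_features(residual_frame):
    """Sample skewness and excess kurtosis of an LPC residual frame.
    Both vanish for Gaussian noise, unlike for voiced speech."""
    x = residual_frame - residual_frame.mean()
    m2 = np.mean(x ** 2)
    skew = np.mean(x ** 3) / m2 ** 1.5
    kurt = np.mean(x ** 4) / m2 ** 2 - 3.0
    return skew, kurt
```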

249 citations


Journal ArticleDOI
TL;DR: A modified version of the autocorrelation pitch extraction method, well known to be robust against noise, is proposed: exploiting the fact that the average magnitude difference function (AMDF) has characteristics similar to those of the autocorrelation function, the autocorrelation function is weighted by the reciprocal of the AMDF.
Abstract: In this paper, we propose a modified version of the autocorrelation pitch extraction method, which is well known to be robust against noise. Utilizing the fact that the average magnitude difference function (AMDF) has characteristics similar to those of the autocorrelation function, the autocorrelation function is weighted by the reciprocal of the AMDF. Simulation experiments show that the proposed pitch extraction method is useful in noisy environments.
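
The method reduces to a few lines: compute the autocorrelation and the AMDF over the candidate lag range, weight the former by the reciprocal of the latter, and pick the lag with the largest ratio. A sketch (the lag range and epsilon are illustrative choices, and the frame should span at least a few pitch periods):

```python
import numpy as np

def pitch_wautoc(frame, fs, f_lo=60.0, f_hi=400.0, eps=1e-6):
    """Pitch via autocorrelation weighted by the reciprocal of the AMDF."""
    n = len(frame)
    lags = np.arange(int(fs / f_hi), int(fs / f_lo))
    r = np.array([np.sum(frame[:n - k] * frame[k:]) for k in lags])           # autocorrelation
    d = np.array([np.mean(np.abs(frame[:n - k] - frame[k:])) for k in lags])  # AMDF
    best = lags[np.argmax(r / (d + eps))]  # AMDF dips where the autocorrelation peaks
    return fs / best
```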

245 citations


Journal ArticleDOI
TL;DR: An adaptive Karhunen-Loeve transform (KLT) tracking-based algorithm is proposed for enhancement of speech degraded by colored additive interference that decomposes noisy speech into its components along the axes of a KLT-based vector space of clean speech.
Abstract: An adaptive Karhunen-Loeve transform (KLT) tracking-based algorithm is proposed for enhancement of speech degraded by colored additive interference. This algorithm decomposes noisy speech into its components along the axes of a KLT-based vector space of clean speech. It is observed that the noise energy is disparately distributed along each eigenvector. These energies are obtained from noise samples gathered from silence intervals between speech samples. To obtain these silence intervals, we propose an efficient voice activity detector based on the output of the principal component eigenfilter, i.e., the eigenvector of the speech KLT with the greatest eigenvalue. Enhancement is performed by modifying each KLT component according to its noise and clean speech energies. The objective is to minimize the produced distortion when the residual noise power is limited to a specific level. Finally, the inverse KLT is performed and an estimate of the clean signal is synthesized. Our listening tests indicated that 71% of our subjects preferred the speech enhanced by the above method over former methods of enhancement of speech degraded by computer-generated white Gaussian noise. Our method was preferred by 80% of our subjects when we processed real samples of noisy speech gathered from various environments.
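
The eigen-domain operations involved can be sketched compactly. The version below is a simplified stand-in: it estimates the KLT from the noisy data and applies Wiener-style per-component gains, whereas the paper tracks the clean-speech KLT adaptively and chooses the gains by constrained distortion minimization.

```python
import numpy as np

def klt_enhance(noisy_frames, noise_frames):
    """Simplified eigen-domain enhancement; (n_frames, dim) arrays in and out."""
    cov = np.cov(noisy_frames, rowvar=False)
    _, v = np.linalg.eigh(cov)                    # KLT basis (eigenvectors)
    y = noisy_frames @ v                          # components of the noisy speech
    # the noise energy is distributed disparately along the eigenvectors
    noise_energy = np.mean((noise_frames @ v) ** 2, axis=0)
    speech_energy = np.maximum(np.mean(y ** 2, axis=0) - noise_energy, 0.0)
    gain = speech_energy / (speech_energy + noise_energy + 1e-12)
    return (y * gain) @ v.T                       # modify components, inverse KLT
```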

Journal ArticleDOI
TL;DR: Three cross-updated least mean square (LMS) adaptive filters are used to reduce mutual disturbances between the operation of the ANC controller and the modeling of the secondary path.
Abstract: A good active noise control (ANC) system with online secondary path modeling should have the property that the operation of the ANC controller and the modeling of the secondary path are mutually independent. A new ANC system with online secondary path modeling is presented. Three cross-updated least mean square (LMS) adaptive filters are used to reduce mutual disturbances between the operation of the ANC controller and the modeling of the secondary path. Computer simulations have been conducted and the results show that the proposed method is able to produce superior performance compared to existing methods.

Journal ArticleDOI
TL;DR: A noise suppression algorithm based on spectral subtraction that employs a noise- and speech-dependent gain function for each frequency component and shows improvement in speech quality and reduction of noise artifacts as compared with conventional spectral subtraction methods.
Abstract: In hands-free speech communication, the signal-to-noise ratio (SNR) is often poor, which makes it difficult to have a relaxed conversation. By using noise suppression, the conversation quality can be improved. This paper describes a noise suppression algorithm based on spectral subtraction. The method employs a noise and speech-dependent gain function for each frequency component. Proper measures have been taken to obtain a corresponding causal filter and also to ensure that the circular convolution originating from fast Fourier transform (FFT) filtering yields a truly linear filtering. A novel method that uses spectrum-dependent adaptive averaging to decrease the variance of the gain function is also presented. The results show a 10-dB background noise reduction for all input SNR situations tested in the range -6 to 16 dB, as well as improvement in speech quality and reduction of noise artifacts as compared with conventional spectral subtraction methods.
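
At its core, spectral subtraction applies a per-bin gain derived from the estimated noise and noisy-speech spectra. A minimal gain function with a spectral floor is sketched below; the paper's contributions (the causal-filter correction, truly linear FFT filtering, and spectrum-dependent adaptive averaging of the gain) sit on top of this basic step.

```python
import numpy as np

def ss_gain(noisy_power, noise_power, floor=0.1):
    """Per-bin magnitude-subtraction gain, clamped to a spectral floor."""
    gain = 1.0 - np.sqrt(noise_power / np.maximum(noisy_power, 1e-12))
    return np.maximum(gain, floor)  # the floor limits musical-noise artifacts
```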

Journal ArticleDOI
TL;DR: A new and generalizing approach to error concealment is described as part of a modified robust speech decoder that can be applied to any speech codec standard and preserves bit exactness in the case of an error free channel.
Abstract: In digital speech communication over noisy channels, there is a need for reducing the subjective effects of residual bit errors which have not been eliminated by channel decoding. This task is usually called error concealment. We describe a new and generalizing approach to error concealment as part of a modified robust speech decoder. It can be applied to any speech codec standard and preserves bit exactness in the case of an error-free channel. The proposed method requires bit reliability information provided by the demodulator, the equalizer, or specifically the channel decoder, and can additionally exploit a priori knowledge about codec parameters. We apply our algorithms to PCM, ADPCM, and GSM full-rate speech coding using AWGN, fading, and GSM channel models, respectively. It turns out that the speech quality is significantly enhanced, showing the desired inherent muting mechanism or graceful degradation behavior under extremely adverse transmission conditions.

Journal ArticleDOI
TL;DR: A structural maximum a posteriori (SMAP) approach to improve the MAP estimates obtained when the amount of adaptation data is small and the recognition results obtained in unsupervised adaptation experiments showed that SMAP estimation was effective even when only one utterance from a new speaker was used for adaptation.
Abstract: Maximum a posteriori (MAP) estimation has been successfully applied to speaker adaptation in speech recognition systems using hidden Markov models. When the amount of data is sufficiently large, MAP estimation yields recognition performance as good as that obtained using maximum-likelihood (ML) estimation. This paper describes a structural maximum a posteriori (SMAP) approach to improve the MAP estimates obtained when the amount of adaptation data is small. A hierarchical structure in the model parameter space is assumed and the probability density functions for model parameters at one level are used as priors for those of the parameters at adjacent levels. Results of supervised adaptation experiments using nonnative speakers' utterances showed that SMAP estimation reduced error rates by 61% when ten utterances were used for adaptation and that it yielded the same accuracy as MAP and ML estimation when the amount of data was sufficiently large. Furthermore, the recognition results obtained in unsupervised adaptation experiments showed that SMAP estimation was effective even when only one utterance from a new speaker was used for adaptation. An effective way to combine rapid supervised adaptation and on-line unsupervised adaptation was also investigated.
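
The building block underneath MAP (and hence SMAP) adaptation of a Gaussian mean is a count-weighted interpolation between the prior mean and the sample mean of the adaptation data; SMAP's addition is to arrange such priors hierarchically, with each level of the tree serving as the prior for the level below. A textbook sketch:

```python
def map_mean(xbar, n, prior_mean, tau):
    """MAP estimate of a Gaussian mean: with little data (small n) it stays
    near the prior; with much data it approaches the ML estimate xbar.
    tau controls the weight of the prior (works on scalars or numpy arrays)."""
    return (tau * prior_mean + n * xbar) / (tau + n)
```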

Journal ArticleDOI
TL;DR: One of the first robust LVCSR systems to use a syllable-level acoustic unit on telephone-bandwidth speech, exceeding the performance of a comparable triphone system in terms of both word error rate (WER) and complexity.
Abstract: Most large vocabulary continuous speech recognition (LVCSR) systems in the past decade have used a context-dependent (CD) phone as the fundamental acoustic unit. We present one of the first robust LVCSR systems that uses a syllable-level acoustic unit for LVCSR on telephone-bandwidth speech. This effort is motivated by the inherent limitations in phone-based approaches, namely the lack of an easy and efficient way for modeling long-term temporal dependencies. A syllable unit spans a longer time frame, typically three phones, thereby offering a more parsimonious framework for modeling pronunciation variation in spontaneous speech. We present encouraging results which show that a syllable-based system exceeds the performance of a comparable triphone system both in terms of word error rate (WER) and complexity. The WER of the best syllabic system reported here is 49.1% on a standard Switchboard evaluation, a small improvement over the triphone system. We also report results on a much smaller recognition task, OGI Alphadigits, which was used to validate some of the benefits syllables offer over triphones. The syllable-based system exceeds the performance of the triphone system by nearly 20%, an impressive accomplishment since the alphadigits application consists mostly of phone-level minimal pair distinctions.

Journal ArticleDOI
TL;DR: The main point of the paper is to show the close relationship between the nonzero principal components and the difference subspace together with the complementary close relation between the zero principal component and the common vector.
Abstract: The main point of the paper is to show the close relation between the nonzero principal components and the difference subspace together with the complementary close relation between the zero principal components and the common vector. A common vector representing each word-class is obtained from the eigenvectors of the covariance matrix of its own word-class; that is, the common vector is in the direction of a linear combination of the eigenvectors corresponding to the zero eigenvalues of the covariance matrix. The methods that use the nonzero principal components for recognition purposes suggest the elimination of all the features that are in the direction of the eigenvectors corresponding to the smallest eigenvalues (including the zero eigenvalues) of the covariance matrix whereas the common vector approach suggests the elimination of all the features that are in the direction of the eigenvectors corresponding to the largest, all nonzero eigenvalues of the covariance matrix.
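
The common vector can be computed directly from this description: project any one training sample onto the orthogonal complement of the difference subspace, i.e., of the span of the covariance eigenvectors with nonzero eigenvalues. A sketch assuming fewer samples than dimensions, so that zero eigenvalues exist:

```python
import numpy as np

def common_vector(samples, tol=1e-8):
    """Common vector of a word class: the projection of any sample onto the
    complement of the difference subspace; identical for every sample."""
    X = np.asarray(samples, dtype=float)      # (n_samples, dim), n_samples < dim
    w, v = np.linalg.eigh(np.cov(X, rowvar=False))
    B = v[:, w > tol]                         # eigenvectors with nonzero eigenvalues
    return X[0] - B @ (B.T @ X[0])            # remove the difference-subspace part
```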

Journal ArticleDOI
TL;DR: This paper presents a sinusoidal model based algorithm for enhancement of speech degraded by additive broad-band noise that shows considerable improvement over traditional spectral subtraction and Wiener filtering based schemes.
Abstract: This paper presents a sinusoidal model based algorithm for enhancement of speech degraded by additive broad-band noise. In order to ensure speech-like characteristics observed in clean speech, smoothness constraints are imposed on the model parameters using a spectral envelope surface (SES) smoothing procedure. Algorithm evaluation is performed using speech signals degraded by additive white Gaussian noise. Distortion as measured by objective speech quality scores showed a 34%-41% reduction over a SNR range of 5-to-20 dB. Objective and subjective evaluations also show considerable improvement over traditional spectral subtraction and Wiener filtering based schemes. Finally, in a subjective AB preference test, where enhanced signals were coded with the G729 codec, the proposed scheme was preferred over the traditional enhancement schemes tested for SNRs in the range of 5 to 20 dB.

Journal ArticleDOI
TL;DR: Evaluation on the Airline Travel Information System (ATIS) task shows that in comparison to its parent CDHMM system, a converted SDCHMM system achieves seven- to 18-fold reduction in memory requirement for acoustic models, and runs 30%-60% faster without any loss of recognition accuracy.
Abstract: Most contemporary laboratory recognizers require too much memory to run, and are too slow for mass applications. One major cause of the problem is the large parameter space of their acoustic models. In this paper, we propose a new acoustic modeling methodology which we call subspace distribution clustering hidden Markov modeling (SDCHMM) with the aim of achieving much more compact acoustic models. The theory of SDCHMM is based on tying the parameters of a new unit, namely the subspace distribution, of continuous density hidden Markov models (CDHMMs). SDCHMMs can be converted from CDHMMs by projecting the distributions of the CDHMMs onto orthogonal subspaces and then tying similar subspace distributions over all states and all acoustic models in each subspace. By exploiting the combinatorial effect of subspace distribution encoding, all original full-space distributions can be represented by combinations of a small number of subspace distribution prototypes. Consequently, there is a great reduction in the number of model parameters, and thus substantial savings in memory and computation. This renders SDCHMM very attractive in the practical implementation of acoustic models. Evaluation on the Airline Travel Information System (ATIS) task shows that in comparison to its parent CDHMM system, a converted SDCHMM system achieves seven- to 18-fold reduction in memory requirement for acoustic models, and runs 30%-60% faster without any loss of recognition accuracy.
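
The conversion step can be illustrated, for the means only, as per-stream clustering: split every mean vector into subvectors, run k-means within each stream, and replace each Gaussian's subvector by the index of its nearest prototype. This is a bare sketch of the idea (the actual method ties full subspace distributions, i.e., means and variances, with a distribution-level distortion measure) and assumes the feature dimension divides evenly into streams.

```python
import numpy as np

def tie_subspace_means(means, n_streams, n_proto, iters=20, seed=0):
    """Cluster per-stream subvectors of Gaussian means; return the codebooks
    and, for each Gaussian, its tuple of prototype indices (one per stream)."""
    rng = np.random.default_rng(seed)
    streams = np.split(np.asarray(means, dtype=float), n_streams, axis=1)
    codebooks, codes = [], []
    for s in streams:
        protos = s[rng.choice(len(s), n_proto, replace=False)]
        for _ in range(iters):                       # plain k-means
            d = ((s[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(1)
            for j in range(n_proto):
                if np.any(assign == j):
                    protos[j] = s[assign == j].mean(0)
        codebooks.append(protos)
        codes.append(assign)
    return codebooks, np.stack(codes, axis=1)
```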

Journal ArticleDOI
TL;DR: This paper presents a summary of the solutions adopted for the combined reduction of noise and echo in the context of mono-channel and two-channel sound pick-ups.
Abstract: The modern telecommunications field is concerned with freedom and, in this context, hands-free systems offer subscribers the possibility of talking more naturally, without using a handset. This new type of use leads to new problems which were negligible in traditional telephony, namely the superposition of noise and echo on the speech signal. To solve these problems and provide a quality that is sufficient for telecommunications, combined reduction of these disturbances is required. This paper presents a summary of the solutions adopted for this dual reduction in the context of mono-channel and two-channel sound pick-ups.

Journal ArticleDOI
TL;DR: The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.
Abstract: Frequency-warped signal processing techniques are attractive to many wideband speech and audio applications since they have a clear connection to the frequency resolution of human hearing. A warped version of linear predictive coding (LPC) is studied. The performance of conventional and warped LPC algorithms are compared in a simulated coding system using listening tests and conventional technical measures. The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.
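
Warped LPC replaces the unit delays of ordinary linear prediction with first-order all-pass sections, D(z) = (z^-1 - lambda) / (1 - lambda*z^-1), so the predictor operates on a warped frequency axis. In practice this means computing a "warped autocorrelation" by correlating the signal with successively all-passed copies of itself and then running Levinson-Durbin as usual. A sketch (lambda near 0.57 is the value commonly quoted for approximating the Bark scale at 16 kHz sampling; treat it as an assumption here):

```python
import numpy as np

def allpass(x, lam):
    """First-order all-pass: y[n] = -lam*x[n] + x[n-1] + lam*y[n-1]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = -lam * x[n] + (x[n - 1] if n else 0.0) + lam * (y[n - 1] if n else 0.0)
    return y

def warped_autocorr(x, order, lam=0.57):
    """Warped autocorrelation sequence; feed to Levinson-Durbin as in LPC."""
    x = np.asarray(x, dtype=float)
    r, y = np.zeros(order + 1), x.copy()
    for k in range(order + 1):
        r[k] = np.dot(x, y)   # correlation with the k-times all-passed signal
        y = allpass(y, lam)
    return r
```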

Journal ArticleDOI
E. Gunduzhan1, K. Momtahan1
TL;DR: A high performance packet loss concealment algorithm for pulse code modulation (PCM) coded speech that extracts the residual signal of the previously received speech by linear prediction analysis, uses periodic replication to generate an approximation for the excitation signal of missing speech, and generates synthesized speech using this excitation.
Abstract: One of the well-known problems in real-time packetized voice applications is the degradation in voice quality due to delayed or misrouted packets. When a voice packet does not arrive at the receiver on time, the receiver needs a packet loss concealment algorithm to generate a signal instead of the missing voice segment. In this paper we describe a high performance packet loss concealment algorithm for pulse code modulation (PCM) coded speech. The algorithm extracts the residual signal of the previously received speech by linear prediction analysis, uses periodic replication to generate an approximation for the excitation signal of missing speech, and generates synthesized speech using this excitation. It also performs overlap-and-add and scaling operations to smooth out transitions at frame boundaries. The new algorithm is compared to other algorithms by subjective quality tests, and is found to be better than the existing algorithms in some cases.
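
The chain of operations described (LPC analysis, residual extraction, periodic replication, LPC synthesis) can be condensed into a toy concealment routine. The sketch below assumes the pitch lag is already known and that the history buffer is longer than pitch_lag + order; the real algorithm adds the overlap-add smoothing and gain scaling at frame boundaries.

```python
import numpy as np

def conceal(history, pitch_lag, frame_len, order=10):
    """Synthesize a replacement for one lost frame from past speech."""
    x = np.asarray(history, dtype=float)
    # autocorrelation method: solve the normal equations for LPC coefficients
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-6 * np.eye(order), r[1:])
    # LPC residual: e[n] = x[n] - sum_k a[k] * x[n-1-k]
    e = np.array([x[n] - np.dot(a, x[n - order:n][::-1])
                  for n in range(order, len(x))])
    excitation = np.resize(e[-pitch_lag:], frame_len)   # periodic replication
    mem, out = list(x[-order:]), []
    for n in range(frame_len):                          # LPC synthesis filter
        yn = excitation[n] + np.dot(a, mem[::-1])
        out.append(yn)
        mem = mem[1:] + [yn]
    return np.array(out)
```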

Journal ArticleDOI
Jiri Navratil1
TL;DR: A particularly successful approach based on phonotactic-acoustic features is presented, with systems for language identification as well as unknown-language rejection that are computationally inexpensive and easily extensible to new languages without the need for linguistic experts.
Abstract: Automatic recognition of spoken languages has become an important feature in a variety of speech-enabled multilingual applications which, besides accuracy, also demand efficient and "linguistically scalable" algorithms. This paper deals with a particularly successful approach based on phonotactic-acoustic features and presents systems for language identification as well as for unknown-language rejection. An architecture with multipath decoding, improved phonotactic models using binary-tree structures, and acoustic pronunciation models serve as a framework for experiments and discussion on these two tasks. In particular, language identification accuracy on a telephone-speech task (NIST'95 evaluation) in six and nine languages is presented together with results from a perceptual experiment carried out with human listeners. The performance of language rejection based on phonotactic modeling combined with a monolingual LVCSR system in the domain of broadcast news transcription is also reported. Besides yielding state-of-the-art performance, the described systems are computationally inexpensive and easily extensible (scalable) to new languages without the need for linguistic experts.
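
The phonotactic component boils down to language-dependent n-gram statistics over phone-recognizer output: score an unknown utterance's phone string under each language's model and pick the best. A bigram sketch with add-one smoothing (the paper's models are considerably more refined, using binary-tree structures and multipath decoding):

```python
from collections import defaultdict
import math

def train_bigrams(phone_strings):
    """Bigram counts over phone sequences for one language."""
    counts, totals = defaultdict(int), defaultdict(int)
    for s in phone_strings:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def score(model, s, vocab_size=50):
    """Add-one smoothed bigram log-likelihood of phone string s."""
    counts, totals = model
    return sum(math.log((counts[(a, b)] + 1) / (totals[a] + vocab_size))
               for a, b in zip(s, s[1:]))

# identification: pick the language whose model scores the utterance highest
```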

Journal ArticleDOI
TL;DR: Carefully investigated experiments demonstrate that DFE achieves the design of a better recognizer and provides an innovative recognition-oriented analysis of the filter-bank, as an alternative to conventional analysis based on psychoacoustic expertise or heuristics.
Abstract: A pattern recognizer is usually a modular system which consists of a feature extractor module and a classifier module. Traditionally, these two modules have been designed separately, which may not result in an optimal recognition accuracy. To alleviate this fundamental problem, the authors have developed a design method, named discriminative feature extraction (DFE), that enables one to design the overall recognizer, i.e., both the feature extractor and the classifier, in a manner consistent with the objective of minimizing recognition errors. This paper investigates the application of this method to designing a speech recognizer that consists of a filter-bank feature extractor and a multi-prototype distance classifier. Carefully investigated experiments demonstrate that DFE achieves the design of a better recognizer and provides an innovative recognition-oriented analysis of the filter-bank, as an alternative to conventional analysis based on psychoacoustic expertise or heuristics.

Journal ArticleDOI
H. Sano1, T. Inoue1, A. Takahashi1, K. Terai, Y. Nakamura 
TL;DR: In this article, an active control system for low-frequency road noise in automobiles combined with an audio system is developed as a commercial application for the first time in the world, and installed in a station wagon.
Abstract: An active control system for low-frequency road noise in automobiles combined with an audio system is developed as a commercial application for the first time in the world, and installed in a station wagon. The purpose of this paper is to provide an outline of the system and describe the newly developed cost-reduction technology used for it, since the reduction of system costs is a major reason that active noise control technology could successfully be applied in a commercial product. The methods used to reduce costs include the utilization of feedback control, implementation by analogue circuits, and common use of the audio system's speakers. This system reduces low-frequency road noise in the front seat by about 10 dB and improves the audio system listening experience while driving.

Journal ArticleDOI
TL;DR: An algorithm similar to the well-known Baum-Welch (1970) algorithm for estimating the parameters of a hidden Markov model (HMM) is derived that is equivalent to maximizing the likelihood function for the standard parameterization of the HMM defined on the input data space.
Abstract: We derive an algorithm similar to the well-known Baum-Welch (1970) algorithm for estimating the parameters of a hidden Markov model (HMM). The new algorithm allows the observation PDF of each state to be defined and estimated using a different feature set. We show that estimating parameters in this manner is equivalent to maximizing the likelihood function for the standard parameterization of the HMM defined on the input data space. The processor becomes optimal if the state-dependent feature sets are sufficient statistics to distinguish each state individually from a common state.

Journal ArticleDOI
Hong Kook Kim1, R.V. Cox
TL;DR: The proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals and its performance is comparable to that of a wireline recognition system that uses the cepstrum as a feature set.
Abstract: We propose a feature extraction method for a speech recognizer that operates in digital communication networks. The feature parameters are basically extracted by converting the quantized spectral information of a speech coder into a cepstrum. We also include the voiced/unvoiced information obtained from the bitstream of the speech coder in the recognition feature set. We performed speaker-independent connected digit HMM recognition experiments under clean, background noise, and channel impairment conditions. From these results, we found that the speech recognition system employing the proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals. Its performance is comparable to that of a wireline recognition system that uses the cepstrum as a feature set. Next, we extended the evaluation of the proposed bitstream-based front-end to large vocabulary speech recognition with a name database. The recognition results proved that the proposed bitstream-based front-end also gives a comparable performance to the conventional wireline front-end.
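
Converting a codec's quantized spectral parameters into a cepstral feature vector typically passes through the LPC-to-cepstrum recursion sketched below (convention: A(z) = 1 - sum_k a[k] z^-k; the codec's LSF parameters would first be converted to these LPC coefficients, a step not shown here).

```python
import numpy as np

def lpc_to_cepstrum(a, n_cep):
    """LPC cepstrum via the standard recursion
    c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k], with a[n] = 0 for n > p."""
    p, c = len(a), np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```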

Journal ArticleDOI
TL;DR: A method-the pitch-scaled harmonic filter (PSHF)-which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach is proposed.
Abstract: Almost all speech contains simultaneous contributions from more than one acoustic source within the speaker's vocal tract. In this paper, we propose a method, the pitch-scaled harmonic filter (PSHF), which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach. The PSHF outputs periodic and aperiodic components that are estimates of the respective contributions of the different types of acoustic source. It produces four reconstructed time series signals by decomposing the original speech signal, first according to amplitude, and then according to power of the Fourier coefficients. Thus, one pair of periodic and aperiodic signals is optimized for subsequent time-series analysis, and another pair for spectral analysis. The performance of the PSHF algorithm is tested on synthetic signals, using three forms of disturbance (jitter, shimmer, and additive noise), and the results are used to predict its performance on real speech. Processing recorded speech examples elicited latent features from the signals, demonstrating the PSHF's potential for analysis of mixed-source speech.
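
The pitch-scaled idea can be reduced to its core: analyze a window spanning exactly b pitch periods so that voiced energy lands on DFT bins that are integer multiples of b; keeping only those bins reconstructs the periodic part, and the remainder estimates the aperiodic part. A bare sketch (the full PSHF also optimizes the window placement against pitch-estimate errors and produces the separate amplitude- and power-based decompositions described above; b = 4 is an illustrative choice):

```python
import numpy as np

def pshf_split(x, period, b=4):
    """Split the first b*period samples into periodic/aperiodic estimates."""
    n = b * period
    seg = x[:n] * np.hanning(n)
    spec = np.fft.rfft(seg)
    harm = np.zeros_like(spec)
    harm[::b] = spec[::b]                 # harmonics of f0 fall on every b-th bin
    periodic = np.fft.irfft(harm, n)
    return periodic, seg - periodic
```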

Journal ArticleDOI
TL;DR: Of the several nonlinearities considered, ideal half-wave rectification appears to be the best choice for speech, though its parameter must be readjusted for music; the smoothed rectifier avoids this readjustment but is a little more difficult to implement.
Abstract: In this paper, we investigate several types of nonlinearities used for the unique identification of receiving room impulse responses in stereo acoustic echo cancellation. The effectiveness is quantified by the mutual coherence of the transformed signals. The perceptual degradation is studied by psycho-acoustic experiments in terms of subjective quality and localization accuracy in the medial plane. The results indicate that, of the several nonlinearities considered, ideal half-wave rectification appears to be the best choice for speech. For music, the nonlinearity parameter of the ideal rectifier must be readjusted. The smoothed rectifier does not require this readjustment, but is a little more difficult to implement.
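
The half-wave nonlinearity commonly used in this context is easy to state: add a scaled positive half-wave to one channel and a scaled negative half-wave to the other, which breaks the linear relation between the channels while remaining nearly inaudible for small alpha. A sketch (the channel assignment and default alpha are illustrative):

```python
import numpy as np

def decorrelate(x_left, x_right, alpha=0.5):
    """Half-wave rectifier nonlinearity for stereo AEC preprocessing.
    alpha trades decorrelation against audible distortion."""
    xl = x_left + alpha * np.maximum(x_left, 0.0)    # positive half-wave
    xr = x_right + alpha * np.minimum(x_right, 0.0)  # negative half-wave
    return xl, xr
```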

Journal ArticleDOI
George Saon1, Mukund Padmanabhan1
TL;DR: The experimental results show that by augmenting both the acoustic vocabulary and the language model with these new tokens, the word recognition accuracy can be improved by absolute 2.8% (7% relative) on a voice mail continuous speech recognition task.
Abstract: We present a new approach to deriving compound words from a training corpus. The motivation for making compound words is because under some assumptions, speech recognition errors occur less frequently in longer words. Furthermore, they also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We introduce a measure based on the product between the direct and the reverse bigram probability of a pair of words for finding candidate pairs in order to create compound words. Our experimental results show that by augmenting both the acoustic vocabulary and the language model with these new tokens, the word recognition accuracy can be improved by absolute 2.8% (7% relative) on a voice mail continuous speech recognition task. We also compare the proposed measure for selecting compound words with other measures that have been described in the literature.