Showing papers on "Linear predictive coding" published in 2001


Journal ArticleDOI
TL;DR: An unbiased noise estimator is developed which derives the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal by minimizing a conditional mean square estimation error criterion in each time step.
Abstract: We describe a method to estimate the power spectral density of nonstationary noise when a noisy speech signal is given. The method can be combined with any speech enhancement algorithm which requires a noise power spectral density estimate. In contrast to other methods, our approach does not use a voice activity detector. Instead, it tracks spectral minima in each frequency band without any distinction between speech activity and speech pause. By minimizing a conditional mean square estimation error criterion in each time step, we derive the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal. Based on the optimally smoothed power spectral density estimate and the analysis of the statistics of spectral minima, an unbiased noise estimator is developed. The estimator is well suited for real-time implementations. Furthermore, to improve the performance in nonstationary noise, we introduce a method to speed up the tracking of the spectral minima. Finally, we evaluate the proposed method in the context of speech enhancement and low bit rate speech coding with various noise types.
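
The core recursion can be sketched as follows (a minimal sketch in Python: a fixed smoothing constant stands in for the paper's optimal time-varying parameter, and the bias compensation derived from the minima statistics is omitted; all names are illustrative):

```python
# Minimum-statistics style noise tracking: recursively smooth the noisy-speech
# PSD, then take the minimum over a sliding window as the noise estimate.
# Assumption: fixed alpha instead of the paper's optimal time-varying parameter.
import numpy as np

def estimate_noise_psd(noisy_psd_frames, alpha=0.85, win=100):
    """noisy_psd_frames: (n_frames, n_bins) periodogram of the noisy speech."""
    n_frames, _ = noisy_psd_frames.shape
    smoothed = np.zeros_like(noisy_psd_frames)
    noise_est = np.zeros_like(noisy_psd_frames)
    smoothed[0] = noisy_psd_frames[0]
    for t in range(1, n_frames):
        # First-order recursive smoothing of the noisy-speech PSD.
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * noisy_psd_frames[t]
        # Track the spectral minimum per frequency band; no VAD is needed.
        lo = max(0, t - win + 1)
        noise_est[t] = smoothed[lo:t + 1].min(axis=0)
    return noise_est
```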

1,731 citations


Book
01 Jan 2001
TL;DR: This book develops a discrete-time speech signal processing framework, covering speech production and perception, analysis/synthesis methods including linear prediction, homomorphic filtering, the short-time Fourier transform, filter-bank and sinusoidal models, and applications such as speech coding, enhancement, and speaker recognition.
Abstract: (NOTE: Each chapter begins with an introduction and concludes with a Summary, Exercises and Bibliography.) 1. Introduction. Discrete-Time Speech Signal Processing. The Speech Communication Pathway. Analysis/Synthesis Based on Speech Production and Perception. Applications. Outline of Book. 2. A Discrete-Time Signal Processing Framework. Discrete-Time Signals. Discrete-Time Systems. Discrete-Time Fourier Transform. Uncertainty Principle. z-Transform. LTI Systems in the Frequency Domain. Properties of LTI Systems. Time-Varying Systems. Discrete Fourier Transform. Conversion of Continuous Signals and Systems to Discrete Time. 3. Production and Classification of Speech Sounds. Anatomy and Physiology of Speech Production. Spectrographic Analysis of Speech. Categorization of Speech Sounds. Prosody: The Melody of Speech. Speech Perception. 4. Acoustics of Speech Production. Physics of Sound. Uniform Tube Model. A Discrete-Time Model Based on Tube Concatenation. Vocal Fold/Vocal Tract Interaction. 5. Analysis and Synthesis of Pole-Zero Speech Models. Time-Dependent Processing. All-Pole Modeling of Deterministic Signals. Linear Prediction Analysis of Stochastic Speech Sounds. Criterion of "Goodness". Synthesis Based on All-Pole Modeling. Pole-Zero Estimation. Decomposition of the Glottal Flow Derivative. Appendix 5.A: Properties of Stochastic Processes. Random Processes. Ensemble Averages. Stationary Random Process. Time Averages. Power Density Spectrum. Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis. 6. Homomorphic Signal Processing. Concept. Homomorphic Systems for Convolution. Complex Cepstrum of Speech-Like Sequences. Spectral Root Homomorphic Filtering. Short-Time Homomorphic Analysis of Periodic Sequences. Short-Time Speech Analysis. Analysis/Synthesis Structures. Contrasting Linear Prediction and Homomorphic Filtering. 7. Short-Time Fourier Transform Analysis and Synthesis. Short-Time Analysis. Short-Time Synthesis. Short-Time Fourier Transform Magnitude. Signal Estimation from the Modified STFT or STFTM. Time-Scale Modification and Enhancement of Speech. Appendix 7.A: FBS Method with Multiplicative Modification. 8. Filter-Bank Analysis/Synthesis. Revisiting the FBS Method. Phase Vocoder. Phase Coherence in the Phase Vocoder. Constant-Q Analysis/Synthesis. Auditory Modeling. 9. Sinusoidal Analysis/Synthesis. Sinusoidal Speech Model. Estimation of Sinewave Parameters. Synthesis. Source/Filter Phase Model. Additive Deterministic-Stochastic Model. Appendix 9.A: Derivation of the Sinewave Model. Appendix 9.B: Derivation of Optimal Cubic Phase Parameters. 10. Frequency-Domain Pitch Estimation. A Correlation-Based Pitch Estimator. Pitch Estimation Based on a "Comb Filter." Pitch Estimation Based on a Harmonic Sinewave Model. Glottal Pulse Onset Estimation. Multi-Band Pitch and Voicing Estimation. 11. Nonlinear Measurement and Modeling Techniques. The STFT and Wavelet Transform Revisited. Bilinear Time-Frequency Distributions. Aeroacoustic Flow in the Vocal Tract. Instantaneous Teager Energy Operator. 12. Speech Coding. Statistical Models of Speech. Scalar Quantization. Vector Quantization (VQ). Frequency-Domain Coding. Model-Based Coding. LPC Residual Coding. 13. Speech Enhancement. Introduction. Preliminaries. Wiener Filtering. Model-Based Processing. Enhancement Based on Auditory Masking. Appendix 13.A: Stochastic-Theoretic Parameter Estimation. 14. Speaker Recognition. Introduction. Spectral Features for Speaker Recognition. Speaker Recognition Algorithms. Non-Spectral Features in Speaker Recognition. Signal Enhancement for the Mismatched Condition. Speaker Recognition from Coded Speech. Appendix 14.A: Expectation-Maximization (EM) Estimation. Glossary. Speech Signal Processing. Units. Databases. Index. About the Author.

984 citations


Journal ArticleDOI
TL;DR: The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames and derives a voicing condition for speech frames based on the relation between the skewness and kurtosis of voiced speech.
Abstract: This paper presents a robust algorithm for voice activity detection (VAD) based on newly established properties of the higher order statistics (HOS) of speech. Analytical expressions for the third- and fourth-order cumulants of the LPC residual of short-term speech are derived assuming a sinusoidal model. The flat spectral feature of this residual results in distinct characteristics for these cumulants in terms of phase, periodicity and harmonic content and yields closed-form expressions for the skewness and kurtosis. Important properties about these cumulants and their similarity with the autocorrelation function are revealed from this exploratory part. They show that the HOS of speech are sufficiently distinct from those of Gaussian noise and can be used as a basis for speech detection. Their immunity to Gaussian noise makes them particularly useful in algorithms designed for low SNR environments. The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames. The variance of the HOS estimators is quantified and used to yield a likelihood measure for noise frames. Moreover, a voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech. The performance of the algorithm is compared to the ITU-T G.729B VAD in various noise conditions, and quantified using the probability of correct and false classifications. The results show that the proposed algorithm has an overall better performance than G.729B, with noticeable improvement in Gaussian-like noises, such as street and parking garage noise, and at moderate to low SNR.
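
A minimal sketch of the frame-level quantities the detector builds on, assuming autocorrelation-method LP analysis; the thresholds, estimator variances, and the voicing condition from the paper are omitted, and all names are illustrative:

```python
# Compute the LPC residual of a frame, then its sample skewness and excess
# kurtosis, which are near zero for Gaussian noise but not for speech.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual_hos(frame, order=10):
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])       # LP coefficients
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
    e = residual - residual.mean()
    m2 = np.mean(e ** 2)
    skewness = np.mean(e ** 3) / m2 ** 1.5              # third-order statistic
    kurtosis = np.mean(e ** 4) / m2 ** 2 - 3.0          # ~0 for Gaussian noise
    return skewness, kurtosis
```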

249 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features, and synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
Abstract: Describes a technique for synthesizing speech with arbitrary speaker characteristics using speaker independent speech units, which we call "average voice" units. The technique is based on an HMM-based text-to-speech (TTS) system and maximum likelihood linear regression (MLLR) adaptation algorithm. In the HMM-based TTS system, speech synthesis units are modeled by multi-space probability distribution (MSD) HMMs which can model spectrum and pitch simultaneously in a unified framework. We derive an extension of the MLLR algorithm to apply it to MSD-HMMs. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
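
The adaptation idea can be illustrated roughly as follows: all Gaussian means share one affine transform, so a few adaptation sentences constrain every unit at once. This toy sketch estimates the transform by least squares, a simplification of the EM-based maximum likelihood estimation the paper extends to MSD-HMMs; all names are illustrative:

```python
# MLLR-style mean adaptation: mu_adapted = W^T [mu; 1] for a shared transform W.
# Assumption: W fitted by least squares from (state mean, adaptation average) pairs.
import numpy as np

def estimate_mllr_transform(means, targets):
    """means, targets: (n_states, dim) source means and per-state averages
    of the target speaker's frames aligned to each state."""
    ext = np.hstack([means, np.ones((len(means), 1))])  # extended vectors [mu; 1]
    W, *_ = np.linalg.lstsq(ext, targets, rcond=None)   # shape (dim + 1, dim)
    return W

def adapt_means(means, W):
    ext = np.hstack([means, np.ones((len(means), 1))])
    return ext @ W                                      # adapted means, one per state
```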

158 citations


Patent
Andrew P. Dejaco1
12 Apr 2001
TL;DR: In this paper, the authors proposed a method for dynamically adjusting the average output data rate of the speech coder with minimal impact on the quality of the input speech, which can be applied to both unvoiced speech and temporally masked speech.
Abstract: It is an objective of the present invention to provide an optimized method of selection of the encoding mode that provides rate efficient coding of the input speech. It is a second objective of the present invention to identify and provide a means for generating a set of parameters ideally suited for this operational mode selection. Third, it is an objective of the present invention to provide identification of two separate conditions that allow low rate coding with minimal sacrifice to quality. The two conditions are the coding of unvoiced speech and the coding of temporally masked speech. It is a fourth objective of the present invention to provide a method for dynamically adjusting the average output data rate of the speech coder with minimal impact on speech quality.

157 citations


PatentDOI
Dan Chazan1, Ron Hoory1
TL;DR: In this article, the spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments, and an output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of speech segments.
Abstract: A method for speech synthesis includes receiving an input speech signal containing a set of speech segments, and estimating spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments. The spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments. An output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of the speech segments.
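
The integration step might look like the following sketch, assuming triangular window functions (the patent does not fix their shape); names are illustrative:

```python
# Integrate a spectral envelope over a bank of frequency-domain windows:
# each window contributes one element of the segment's feature vector.
import numpy as np

def window_bank(n_windows, n_bins):
    centers = np.linspace(0, n_bins - 1, n_windows + 2)
    f = np.arange(n_bins)
    bank = np.zeros((n_windows, n_bins))
    for i in range(n_windows):
        left, center, right = centers[i], centers[i + 1], centers[i + 2]
        # Triangular window rising to 1 at its center, 0 outside [left, right].
        bank[i] = np.clip(np.minimum((f - left) / (center - left),
                                     (right - f) / (right - center)), 0, None)
    return bank

def envelope_to_features(envelope, bank):
    """envelope: (n_bins,) spectral envelope of one interval -> (n_windows,) features."""
    return bank @ envelope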

151 citations


PatentDOI
Sudheer Sirivara1
TL;DR: In this article, a method and apparatus are provided for compressing and using a concatenative speech database in TTS systems, improving the quality of speech output generated by handheld TTS systems by allowing synthesis to occur on the client.
Abstract: A method and apparatus are provided for compressing and using a concatenative speech database in TTS systems to improve the quality of speech output generated by handheld TTS systems by allowing synthesis to occur on the client. According to one embodiment of the present invention, a G.723 encoder receives diphone waveforms and compresses them into diphone residuals. While compressing the diphone waveforms, the encoder generates Linear Predictive Coding (LPC) coefficients. The diphone residuals and the encoder-generated LPC coefficients are then stored in an encoder-generated compressed packet.

145 citations


Proceedings ArticleDOI
20 Dec 2001
TL;DR: In this paper, a common narrow-band speech signal is expanded into a wide-band speech signal, and the expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver.
Abstract: A common narrow-band speech signal is expanded into a wide-band speech signal. The expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver. The robust techniques suggested herein are based on speech acoustics and fundamentals of human hearing. That is, the techniques extend the harmonic structure of the speech signal during voiced speech segments and introduce a linearly estimated amount of speech energy in the wide frequency-band. During unvoiced speech segments, a fricated noise may be introduced in the upper frequency-band.

139 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech; however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Abstract: The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/difference pairs to measure the accuracy by which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.

135 citations


Patent
05 Jan 2001
TL;DR: In this article, a system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal, and the lower frequency range of the wideband speech signal is then reproduced using the received narrowband speech signal.
Abstract: A system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal. The lower frequency range of the wideband speech signal is reproduced using the received narrowband speech signal. The received narrowband speech signal is analyzed to determine its formants and pitch information. The upper frequency range of the wideband speech signal is synthesized using information derived from the received narrowband speech signal.

123 citations


Patent
Martin Holzapfel1
10 Sep 2001
TL;DR: In this article, a first speech model is trained with a first time pattern and a second speech model with a second time pattern, and the second model is initialized with the first model.
Abstract: A method and also a configuration for determining a descriptive feature of a speech signal, in which a first speech model is trained with a first time pattern and a second speech model is trained with a second time pattern. The second speech model is initialized with the first speech model.

Proceedings ArticleDOI
07 May 2001
TL;DR: A new algorithm is proposed for generating synthetic high-band frequency components from the low-band ones for wide-band speech synthesis; it is based on linear prediction (LPC) analysis-synthesis, combining spectral envelope extension with bandwidth extension of the LPC analysis residual by spectral folding.
Abstract: This paper contributes to narrowband speech enhancement by means of frequency bandwidth extension. A new algorithm is proposed for generating synthetic frequency components in the high band (i.e., 4-8 kHz) given the low-band ones (i.e., 0-4 kHz) for wide-band speech synthesis. It is based on linear prediction (LPC) analysis-synthesis. It consists of a spectral envelope extension that makes efficient use of line spectral frequencies (LSF), and a bandwidth extension of the LPC analysis residual using spectral folding. The low-band LSF of the synthesis signal are obtained from the input speech signal, and the high-band LSF are estimated from the low-band ones using statistical models. This estimation is achieved by means of four models that are distinguished by the first two reflection coefficients obtained from linear prediction analysis of the input signal.
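
The residual extension by spectral folding reduces to zero insertion, as in this minimal sketch (the LSF envelope extension and the four statistical high-band models are not shown; names are illustrative):

```python
# Spectral folding: inserting a zero between consecutive samples doubles the
# sampling rate and mirrors the 0-4 kHz residual spectrum into the 4-8 kHz band.
import numpy as np

def spectral_fold(residual_nb):
    """8 kHz narrowband LPC residual -> 16 kHz wideband excitation."""
    wb = np.zeros(2 * len(residual_nb))
    wb[::2] = residual_nb  # zero insertion: the spectrum folds about 4 kHz
    return wb
```

The folded excitation would then drive the synthesis filter built from the extended LSF envelope.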

PatentDOI
TL;DR: In this article, a system and method of audio processing provides enhanced speech recognition, where the multi-channel audio signal from the microphones may be processed by a beamforming network to generate a single-channel enhanced audio signal, on which voice activity is detected.
Abstract: A system and method of audio processing provides enhanced speech recognition. Audio input is received at a plurality of microphones. The multi-channel audio signal from the microphones may be processed by a beamforming network to generate a single-channel enhanced audio signal, on which voice activity is detected. Audio signals from the microphones are additionally processed by an adaptable noise cancellation filter having variable filter coefficients to generate a noise-suppressed audio signal. The variable filter coefficients are updated during periods of voice inactivity. A speech recognition engine may apply a speech recognition algorithm to the noise-suppressed audio signal and generate an appropriate output. The operation of the speech recognition engine and the adaptable noise cancellation filter may advantageously be controlled based on voice activity detected in the single-channel enhanced audio signal from the beamforming network.
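
One plausible reading of the gated adaptation, sketched with an NLMS filter (the patent does not mandate a specific adaptation rule; signal names are illustrative):

```python
# Adaptive noise cancellation whose coefficients are updated only while the
# beamformer-derived VAD reports voice inactivity, so speech does not leak
# into the noise model.
import numpy as np

def nlms_cancel(primary, reference, voice_active, n_taps=32, mu=0.1, eps=1e-8):
    """primary: speech + noise; reference: noise reference channel;
    voice_active: per-sample boolean VAD decision from the beamformed signal."""
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]
        out[n] = primary[n] - w @ x          # noise-suppressed sample
        if not voice_active[n]:              # freeze coefficients during speech
            w += mu * out[n] * x / (x @ x + eps)
    return out
```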

Journal ArticleDOI
TL;DR: The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.
Abstract: Frequency-warped signal processing techniques are attractive to many wideband speech and audio applications since they have a clear connection to the frequency resolution of human hearing. A warped version of linear predictive coding (LPC) is studied. The performance of conventional and warped LPC algorithms is compared in a simulated coding system using listening tests and conventional technical measures. The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.
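
Warped LP analysis replaces the unit delays of the autocorrelation computation with first-order allpass sections, so the spectrum is modeled on a Bark-like axis; a minimal sketch (the coder and listening-test machinery are omitted, and lam = 0.723 is a commonly used Bark-approximating value, an assumption rather than the paper's setting):

```python
# Warped LPC: autocorrelation lags are taken between the frame and its image
# after k cascaded allpass sections D(z) = (z^-1 - lam) / (1 - lam z^-1).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def warped_lpc(frame, order=16, lam=0.723):
    y = frame.copy()
    r = np.zeros(order + 1)
    r[0] = y @ frame
    for k in range(1, order + 1):
        y = lfilter([-lam, 1.0], [1.0, -lam], y)   # one more allpass stage
        r[k] = y @ frame                           # warped autocorrelation lag k
    a = solve_toeplitz(r[:order], r[1:order + 1])  # Yule-Walker on warped lags
    return np.concatenate(([1.0], -a))             # warped LP polynomial A(z)
```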

Journal ArticleDOI
TL;DR: The proposed time-frequency method has several interesting properties, the most important of which is the ability to simultaneously resolve all FM components of a multicomponent signal, as long as the STFT of the composite signal satisfies a simple separability condition.
Abstract: We present time-frequency methods which are well suited to the analysis of nonstationary multicomponent FM signals, such as speech. These methods are based on group delay, instantaneous frequency, and higher-order phase derivative surfaces computed from the short time Fourier transform (STFT). Unlike more conventional approaches, these methods do not assume a locally stationary approximation of the signal model. We describe the computation of the phase derivatives, the physical interpretation of these derivatives, and a re-mapping algorithm based on these phase derivatives. We show analytically, and by example, the convergence of the re-mapping to the FM representation of the signal. The methods are applied to speech to estimate signal parameters, such as the group delay of a transmission channel and speech formant frequencies. Our goal is to develop a unified method which can accurately estimate speech components in both time and frequency and to apply these methods to the estimation of instantaneous formant frequencies, effective excitation time, vocal tract group delay, and channel group delay. The proposed method has several interesting properties, the most important of which is the ability to simultaneously resolve all FM components of a multicomponent signal, as long as the STFT of the composite signal satisfies a simple separability condition. The method can provide super-resolution in both time and frequency in the sense that it can simultaneously provide time and frequency estimates of FM components, which have much better accuracy than the Heisenberg uncertainty of the STFT. Super-resolution provides the capability to accurately "re-map" each component of the STFT surface to the time and frequency of the FM signal component it represents. To attain high resolution and accuracy, the signal must be jointly estimated simultaneously in time and frequency. This is accomplished by estimating two surfaces, which are essentially the derivatives of the STFT phase with respect to time and frequency. To avoid phase ambiguities, the differentiation is performed as a cross-spectral product.
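
The cross-spectral product for the time derivative of the STFT phase can be sketched as follows: multiplying the STFT of the signal by the conjugate STFT of a one-sample-delayed copy gives the per-sample phase increment, i.e., the channelized instantaneous frequency, without phase unwrapping (a minimal sketch; the group-delay surface and remapping stage are omitted, and parameters are illustrative):

```python
# Channelized instantaneous frequency via a cross-spectral product of two
# STFTs offset by one sample; the phase of the product is the phase derivative.
import numpy as np
from scipy.signal import stft

def channelized_if(x, fs, nperseg=256, noverlap=192):
    _, _, s0 = stft(x[:-1], fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, s1 = stft(x[1:],  fs=fs, nperseg=nperseg, noverlap=noverlap)
    # Phase increment per sample, converted to Hz; shape is (freq, time).
    return np.angle(s1 * np.conj(s0)) * fs / (2 * np.pi)
```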

Proceedings ArticleDOI
07 May 2001
TL;DR: An improvement in state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance is demonstrated by using visual information in addition to the traditional audio stream, taking a decision fusion approach to the audio-visual information.
Abstract: We demonstrate an improvement in the state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information, in addition to the traditional audio one. We take a decision fusion approach for the audio-visual information, where the single-modality (audio- and visual- only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: the first uses an automatically extracted estimate of the audio stream reliability in order to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize WER (clean audio only results).
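
Decision fusion with a stream reliability weight reduces to an exponent-weighted combination of per-class log-likelihoods, as in this sketch; estimating the weight from audio reliability (or by discriminative model combination) is the paper's contribution and is not shown:

```python
# Combine audio-only and visual-only HMM scores with a reliability weight
# lam in [0, 1], then pick the best class.
import numpy as np

def fuse_and_classify(audio_loglik, visual_loglik, lam):
    """audio_loglik, visual_loglik: (n_classes,) per-class log-likelihoods."""
    fused = lam * audio_loglik + (1.0 - lam) * visual_loglik
    return int(np.argmax(fused))
```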

Proceedings ArticleDOI
07 May 2001
TL;DR: The study demonstrates the complementary nature of the two components, which are derived using linear prediction analysis of short segments of speech and captured implicitly by a feedforward autoassociative neural network.
Abstract: We study the effectiveness of the features extracted from the source and system components of the speech production process for the purpose of speaker recognition. The source and system components are derived using linear prediction (LP) analysis of short segments of speech. The source component is the LP residual derived from the signal, and the system component is a set of weighted linear prediction cepstral coefficients. The features are captured implicitly by a feedforward autoassociative neural network (AANN). Two separate speaker models are derived by training two AANN models using feature vectors corresponding to source and system components. A speaker recognition system for 20 speakers is built and tested using both the models to evaluate the performance of source and system features. The study demonstrates the complementary nature of the two components.
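
A minimal sketch of the two feature streams, assuming 10th-order autocorrelation-method LP analysis; the AANN speaker models are not shown, and the cepstral weighting is the common linear weighting, an assumption:

```python
# Source feature: the LP residual. System feature: weighted LP cepstral
# coefficients obtained from the LP-to-cepstrum recursion.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def source_system_features(frame, order=10, n_cep=12):
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)  # source
    c = np.zeros(n_cep + 1)                       # cepstrum of the all-pole model
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    weighted_cep = np.arange(1, n_cep + 1) * c[1:]                 # system
    return residual, weighted_cep
```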

Journal ArticleDOI
TL;DR: In this paper, autoregressive moving-average (ARMA) models of processes are studied in the context of cepstral analysis and homomorphic filtering, and it is shown that each such model determines and is completely determined by its finite windows of cepstral coefficients and covariance lags, and that the pole-zero model can be determined from the windows as the unique minimum of a convex objective function.
Abstract: One of the most widely used methods of spectral estimation in signal and speech processing is linear predictive coding (LPC). LPC has some attractive features, which account for its popularity, including the properties that the resulting modeling filter (i) matches a finite window of n+1 covariance lags, (ii) is rational of degree at most n, and (iii) has stable zeros and poles. The only limiting factor of this methodology is that the modeling filter is "all-pole," i.e., an autoregressive (AR) model. In this paper, we present a systematic description of all autoregressive moving-average (ARMA) models of processes that have properties (i)-(iii) in the context of cepstral analysis and homomorphic filtering. We show that each such ARMA model determines and is completely determined by its finite windows of cepstral coefficients and covariance lags. We show that these nth-order windows form local coordinates for all ARMA models of degree n and that the pole-zero model can be determined from the windows as the unique minimum of a convex objective function. We refine this optimization method by first noting that the maximum entropy design of an LPC filter is obtained by maximizing the zeroth cepstral coefficient, subject to the constraint (i). More generally, we modify this scheme to a more well-posed optimization problem where the covariance data enter as a constraint and the linear weights of the cepstral coefficients are "positive", in the sense that a certain pseudo-polynomial is positive, rather succinctly generalizing the maximum entropy method. This new problem is a homomorphic filter generalization of the maximum entropy method, providing a procedure for the design of any stable, minimum-phase modeling filter of degree less than or equal to n that interpolates the given covariance window. We present an algorithm for realizing these filters in a lattice-ladder form, given the covariance window and the moving average part of the model. While we also show how to determine the moving average part using cepstral smoothing, one can make use of any good a priori estimate for the system zeros to initialize the algorithm. We conclude the paper with an example of this method, incorporating an example from the literature on ARMA modeling.

Journal ArticleDOI
TL;DR: A method, the pitch-scaled harmonic filter (PSHF), is proposed which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach.
Abstract: Almost all speech contains simultaneous contributions from more than one acoustic source within the speaker's vocal tract. In this paper, we propose a method-the pitch-scaled harmonic filter (PSHF)-which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach. The PSHF outputs periodic and aperiodic components that are estimates of the respective contributions of the different types of acoustic source. It produces four reconstructed time series signals by decomposing the original speech signal, first, according to amplitude, and then according to power of the Fourier coefficients. Thus, one pair of periodic and aperiodic signals is optimized for subsequent time-series analysis, and another pair for spectral analysis. The performance of the PSHF algorithm is tested on synthetic signals, using three forms of disturbance (jitter, shimmer and additive noise), and the results were used to predict the performance on real speech. Processing recorded speech examples elicited latent features from the signals, demonstrating the PSHF's potential for analysis of mixed-source speech.
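
The pitch-scaled idea can be sketched as follows: with an analysis window spanning exactly B pitch periods, voiced energy concentrates on DFT bins at multiples of B, which yields the periodic estimate; this minimal sketch ignores the window-spread handling and the power-domain second decomposition of the full PSHF, and all names are illustrative:

```python
# Split one frame into periodic and aperiodic components by keeping or
# removing the DFT bins at multiples of B (the harmonics).
import numpy as np

def pshf_frame(frame, period_samples, B=4):
    n = B * period_samples                 # window = exactly B pitch periods
    x = frame[:n] * np.hanning(n)
    X = np.fft.rfft(x)
    harm = np.zeros_like(X)
    harm[::B] = X[::B]                     # bins at multiples of B = harmonics
    periodic = np.fft.irfft(harm, n)       # voiced (periodic) estimate
    aperiodic = np.fft.irfft(X - harm, n)  # turbulence-noise (aperiodic) estimate
    return periodic, aperiodic
```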

Proceedings ArticleDOI
07 May 2001
TL;DR: Using speech analysis techniques, one may design an effective data signal that can be used to hide an arbitrary message in a speech signal; experiments demonstrate the subliminal channel capacity of the speech data embedding technique developed here.
Abstract: The technique of embedding a digital signal into an audio recording or image using techniques that render the signal imperceptible has received significant attention. Embedding an imperceptible, cryptographically secure signal, or watermark, is seen as a potential mechanism that may be used to prove ownership or detect tampering. While there has been a considerable amount of attention devoted to the techniques of spread-spectrum signaling for use in image and audio watermarking applications, there has only been a limited study for embedding data signals in speech. Speech is an uncharacteristically narrow band signal given the perceptual capabilities of the human hearing system. However, using speech analysis techniques, one may design an effective data signal that can be used to hide an arbitrary message in a speech signal. Also included are experiments demonstrating the subliminal channel capacity of the speech data embedding technique developed here.

Proceedings ArticleDOI
07 May 2001
TL;DR: A two-module text to speech system (TTS) structure, which bypasses the prosody model that predicts numerical prosodic parameters for synthetic speech, and a multi-tier non-uniform unit selection method that makes the best decision on unit selection by minimizing the concatenated cost of a whole utterance.
Abstract: This paper proposes a two-module text to speech system (TTS) structure, which bypasses the prosody model that predicts numerical prosodic parameters for synthetic speech. Instead, many instances of each basic unit from a large speech corpus are classified into categories by a classification and regression tree (CART), in which the expectation of the weighted sum of square regression error of prosodic features is used as splitting criterion. Better prosody is achieved by keeping slender diversity in prosodic features of instances belonging to the same class. A multi-tier non-uniform unit selection method is presented. It makes the best decision on unit selection by minimizing the concatenated cost of a whole utterance. Since the largest available and suitable units are selected for concatenation, distortion caused by mismatches at concatenation points is minimized. Very natural and fluent speech is synthesized, according to informal listening tests.
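
Minimizing the cost of a whole utterance is a dynamic-programming search over candidate instances per unit, as in this sketch; the CART-derived target and concatenation costs themselves are placeholders here, not the paper's definitions:

```python
# Viterbi-style unit selection: pick one candidate per unit so that the sum
# of target costs and pairwise concatenation costs is minimal.
import numpy as np

def select_units(target_costs, concat_costs):
    """target_costs: list of (n_cand_i,) arrays, one per unit.
    concat_costs: list of (n_cand_i, n_cand_{i+1}) arrays between units.
    Returns the chosen candidate index for each unit."""
    best = [target_costs[0].astype(float)]
    back = []
    for i in range(1, len(target_costs)):
        total = best[-1][:, None] + concat_costs[i - 1] + target_costs[i][None, :]
        back.append(np.argmin(total, axis=0))   # best predecessor per candidate
        best.append(np.min(total, axis=0))
    path = [int(np.argmin(best[-1]))]
    for bp in reversed(back):                   # trace the optimal path back
        path.append(int(bp[path[-1]]))
    return path[::-1]
```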

Patent
01 Oct 2001
TL;DR: In this patent, speech is recognized and enhanced by fusing audio and visual speech recognition, with an audio speech recognizer and a visual speech recognizer operating on signals from audio and visual transducers.
Abstract: Recognizing and enhancing speech (22) is accomplished by fusing audio and visual speech recognition. An audio speech recognizer (70) determines a subset of speech elements (72) for speech segments (22) received from at least one audio transducer (66). A visual speech recognizer (74) determines a figure of merit (80) for at least one speech element (22) based on at least one image (64) received from at least one visual transducer (62). Speech (22) may also be enhanced by variably filtering (136) or editing (182) received audio signals (68) based on at least one visual speech parameter (134).

Proceedings ArticleDOI
07 May 2001
TL;DR: In this article, a method for speech/non-speech detection using a linear discriminant analysis (LDA) applied to mel frequency cepstrum coefficients (MFCC) is presented.
Abstract: In speech recognition, speech/non-speech detection must be robust to noise. In this paper, a method for speech/non-speech detection using a linear discriminant analysis (LDA) applied to mel frequency cepstrum coefficients (MFCC) is presented. The energy is the most discriminant parameter between noise and speech. But with this single parameter, the speech/non-speech detection system detects too many noise segments. The LDA applied to MFCC and the associated test reduces the detection of noise segments. This new algorithm is compared to the one based on the signal-to-noise ratio (Mauuary and Monne, 1993).
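
A minimal sketch of the detection scheme, using scikit-learn's LDA as a stand-in for the paper's exact training setup and test statistic:

```python
# Project MFCC vectors onto a single linear discriminant trained on labeled
# speech/noise frames, then classify each frame.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_detector(mfcc_frames, labels):
    """mfcc_frames: (n_frames, n_coeffs); labels: 1 = speech, 0 = non-speech."""
    lda = LinearDiscriminantAnalysis(n_components=1)
    lda.fit(mfcc_frames, labels)
    return lda

def detect(lda, mfcc_frames):
    return lda.predict(mfcc_frames)  # per-frame speech/non-speech decision
```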

Patent
05 Feb 2001
TL;DR: In this paper, a method and apparatus for encoding speech for communication to a decoder for reproduction of the speech where the speech signal is classified into steady state voiced (harmonic), stationary unvoiced, and "transitory" or "transition" speech, and a particular type of coding scheme is used for each class.
Abstract: A method and apparatus for encoding speech for communication to a decoder for reproduction of the speech where the speech signal is classified into steady state voiced (harmonic), stationary unvoiced, and “transitory” or “transition” speech, and a particular type of coding scheme is used for each class. Harmonic coding is used for steady state voiced speech, “noise-like” coding is used for stationary unvoiced speech, and a special coding mode is used for transition speech, designed to capture the location, the structure, and the strength of the local time events that characterize the transition portions of the speech. The compression schemes can be applied to the speech signal or to the LP residual signal.

Journal ArticleDOI
TL;DR: The proposed inverse-filter control method has the specific feature that it directly estimates resonant frequencies of the vocal tract, unlike spectral matching methods such as analysis-by-synthesis (A-b-S) or linear predictive coding (LPC).
Abstract: This paper proposes a new method for estimating formant frequencies of speech signals, based on inverse-filter control and zero-crossing frequency distributions. In this method, which is called the inverse-filter control (IFC) method, we use 32 basic inverse filters that are mutually controlled by weighted means of zero-crossing frequency distributions. After quick convergence of the inverse filters, we can gain four to six formant frequencies as final mean values of the zero-crossing frequencies. The proposed method (IFC) has the specific feature that it directly estimates resonant frequencies of a vocal tract, unlike analysis-by-synthesis (A-b-S) or linear predictive coding (LPC), which are spectral matching methods. Therefore, spectral shapes influence the formant estimation in IFC only indirectly. Although the superiority of IFC to LPC was not necessarily prominent in the systematic evaluation using synthetic speech, the estimates showed satisfactorily small errors for practical analysis. On the other hand, when observing some analysis examples of real speech, we found many fewer gross errors in IFC than in LPC. Lastly, we briefly describe a method for estimating a spectral envelope (or formant bandwidths) based on the obtained formant frequencies and the spectrum to be analyzed. According to the results, it is understandable that the existence of the wide-band formants also contributes to stable formant trajectories.
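
The measurement underlying IFC is the zero-crossing frequency of a (filtered) segment; a minimal sketch (the bank of 32 mutually controlled inverse filters and their convergence loop are not shown):

```python
# Zero-crossing frequency of a segment: half the zero-crossing rate times the
# sampling rate, a cheap dominant-frequency estimate for a filtered band.
import numpy as np

def zero_crossing_frequency(segment, fs):
    crossings = np.sum(np.signbit(segment[:-1]) != np.signbit(segment[1:]))
    return 0.5 * crossings * fs / (len(segment) - 1)  # Hz
```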

PatentDOI
TL;DR: In this article, a method and a system are presented for controlling the coding rates of a multimode coding system with respect to a sequence of input audio signal frames, eliminating or minimizing the overflow and underflow of the bit-stream buffer maintained by the coding system.
Abstract: A method and a system are provided for controlling the coding rates of a multimode coding system with respect to a sequence of input audio signal frames. The method eliminates or minimizes the overflow and underflow of a bit-stream buffer maintained by the coding system for temporarily recording bit-stream data prior to transmission or storage.

Journal ArticleDOI
TL;DR: It is shown theoretically that 16 bits are needed to achieve an average SD of 1 dB when quantizing ten-dimensional (10-D) spectrum vectors using a first-order recursive scheme; this result is validated in experiments, and it is shown how to approximate the SD with an L2-norm measure.
Abstract: A theoretical analysis of recursive speech spectrum coding, where predictive and finite state schemes are special cases, is presented. We evaluate the spectral distortion (SD) theoretically and design coders that minimize the SD. The analysis rests on three cornerstones: high-rate theory, PDF modeling, and an approximation of SD. A derivation of the mean L2-norm distortion of a recursive quantizer operating at high rate is provided. Also, the distortion distribution is supplied. The evaluation of the distortion expressions requires a model of the joint PDF of two consecutive spectrum vectors. The LPC spectrum source considered here has outcomes in a bounded region, and this is taken into account in the choice of model and modeling algorithm. It is further shown how to approximate the SD with an L2-norm measure. Combining the results, we show theoretically that 16 bits are needed to achieve an average SD of 1 dB when quantizing ten-dimensional (10-D) spectrum vectors using a first-order recursive scheme. A gain of six bits per frame is noted compared to memoryless quantization. These results rely on high-rate assumptions which are validated in experiments. There, actual high-rate optimal coders are designed and evaluated.
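
For reference, the spectral distortion analyzed here is, in its standard log-spectral form (a reconstruction from the abstract's definitions, not an equation copied from the paper):

```latex
\mathrm{SD}^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi}
  \Bigl( 10\log_{10} S(\omega) - 10\log_{10} \hat{S}(\omega) \Bigr)^{2} \, d\omega
```

where S and \hat{S} are the original and quantized LPC model spectra; the paper shows how to approximate this measure by an L2 norm in the parameter domain.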

Proceedings ArticleDOI
07 May 2001
TL;DR: The theory provides formulas relating minimum discrimination information for model selection and the mean squared error resulting when the MDI criterion is used in an optimized robust classified vector quantizer, which provides motivation for the use of Gauss mixture models for robust compression systems for general random vectors.
Abstract: Gauss mixtures are a popular class of models in statistics and statistical signal processing because they can provide good fits to smooth densities, because they have a rich theory, and because they can be well estimated by existing algorithms such as the EM (expectation maximization) algorithm. We here extend an information theoretic extremal property for source coding from Gaussian sources to Gauss mixtures using high rate quantization theory and extend a method originally used for LPC (linear predictive coding) speech vector quantization to provide a Lloyd clustering approach to the design of Gauss mixture models. The theory provides formulas relating minimum discrimination information (MDI) for model selection and the mean squared error resulting when the MDI criterion is used in an optimized robust classified vector quantizer. It also provides motivation for the use of Gauss mixture models for robust compression systems for general random vectors.
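
The EM route that the paper contrasts with its Lloyd-clustering design can be sketched in a few lines with scikit-learn; the MDI-based classified vector quantizer design itself is not reproduced:

```python
# Fit a Gauss mixture model to feature vectors by the EM algorithm, one of
# the "existing algorithms" the abstract refers to.
from sklearn.mixture import GaussianMixture

def fit_gauss_mixture(vectors, n_components=8):
    """vectors: (n_samples, dim) training data."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(vectors)
    return gmm
```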

Journal ArticleDOI
TL;DR: This work describes a system, based on linear predictive coding, for estimating wideband speech from narrowband, which employs both previously identified and novel techniques to improve speech quality.
Abstract: This paper addresses the problem of reconstructing wideband speech signals from observed narrowband speech signals. The goal of this work is to improve the perceived quality of speech signals which have been transmitted through narrowband channels or degraded during acquisition. We describe a system, based on linear predictive coding, for estimating wideband speech from narrowband. This system employs both previously identified and novel techniques. Experimental results are provided in order to illustrate the system's ability to improve speech quality. Both objective and subjective criteria are used to evaluate the quality of the processed speech signals.

Proceedings ArticleDOI
07 May 2001
TL;DR: This paper describes the SMV algorithm developed by Conexant, which will become a service option in CDMA systems such as IS-95 and cdma2000, providing a higher quality, flexibility, and capacity over the existing speech coding service options.
Abstract: During the years 1999 and 2000, the Telecommunication Industry Association (TIA) and the 3rd Generation Partnership Project 2 (3GPP2), managed a competition and a selection process for a new speech coding standard for CDMA applications. The new speech coding standard, which is coined Selectable Mode Vocoder (SMV), will become a service option in CDMA systems such as IS-95 and cdma2000, providing a higher quality, flexibility, and capacity over the existing speech coding service options, IS-96C, IS-127, and IS-733. Eight companies submitted candidates to selection phase. For all of the 36 test conditions, the Conexant SMV candidate was ranked at the top, or was statistically equivalent to other top-ranking candidates. The Conexant SMV candidate was chosen as the core speech coding technology for the SMV system. This paper describes the SMV algorithm developed by Conexant.