Showing papers on "Linear predictive coding" published in 2001


Journal ArticleDOI
TL;DR: An unbiased noise estimator is developed which derives the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal by minimizing a conditional mean square estimation error criterion in each time step.
Abstract: We describe a method to estimate the power spectral density of nonstationary noise when a noisy speech signal is given. The method can be combined with any speech enhancement algorithm which requires a noise power spectral density estimate. In contrast to other methods, our approach does not use a voice activity detector. Instead, it tracks spectral minima in each frequency band without any distinction between speech activity and speech pause. By minimizing a conditional mean square estimation error criterion in each time step, we derive the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal. Based on the optimally smoothed power spectral density estimate and the analysis of the statistics of spectral minima, an unbiased noise estimator is developed. The estimator is well suited for real-time implementations. Furthermore, to improve the performance in nonstationary noise, we introduce a method to speed up the tracking of the spectral minima. Finally, we evaluate the proposed method in the context of speech enhancement and low bit rate speech coding with various noise types.
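
The core recursion can be sketched as follows (a minimal sketch in Python: a fixed smoothing constant stands in for the paper's optimal time-varying parameter, and the bias compensation derived from the minima statistics is omitted; all names are illustrative):

```python
# Minimum-statistics style noise tracking: recursively smooth the noisy-speech
# PSD, then take the minimum over a sliding window as the noise estimate.
# Assumption: fixed alpha instead of the paper's optimal time-varying parameter.
import numpy as np

def estimate_noise_psd(noisy_psd_frames, alpha=0.85, win=100):
    """noisy_psd_frames: (n_frames, n_bins) periodogram of the noisy speech."""
    n_frames, _ = noisy_psd_frames.shape
    smoothed = np.zeros_like(noisy_psd_frames)
    noise_est = np.zeros_like(noisy_psd_frames)
    smoothed[0] = noisy_psd_frames[0]
    for t in range(1, n_frames):
        # First-order recursive smoothing of the noisy-speech PSD.
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * noisy_psd_frames[t]
        # Track the spectral minimum per frequency band; no VAD is needed.
        lo = max(0, t - win + 1)
        noise_est[t] = smoothed[lo:t + 1].min(axis=0)
    return noise_est
```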

1,731 citations


Book
01 Jan 2001
TL;DR: This book develops a discrete-time speech signal processing framework, covering speech production and perception, analysis/synthesis methods including linear prediction, homomorphic filtering, the short-time Fourier transform, filter-bank and sinusoidal models, and applications such as speech coding, enhancement, and speaker recognition.
Abstract: (NOTE: Each chapter begins with an introduction and concludes with a Summary, Exercises and Bibliography.) 1. Introduction. Discrete-Time Speech Signal Processing. The Speech Communication Pathway. Analysis/Synthesis Based on Speech Production and Perception. Applications. Outline of Book. 2. A Discrete-Time Signal Processing Framework. Discrete-Time Signals. Discrete-Time Systems. Discrete-Time Fourier Transform. Uncertainty Principle. z-Transform. LTI Systems in the Frequency Domain. Properties of LTI Systems. Time-Varying Systems. Discrete Fourier Transform. Conversion of Continuous Signals and Systems to Discrete Time. 3. Production and Classification of Speech Sounds. Anatomy and Physiology of Speech Production. Spectrographic Analysis of Speech. Categorization of Speech Sounds. Prosody: The Melody of Speech. Speech Perception. 4. Acoustics of Speech Production. Physics of Sound. Uniform Tube Model. A Discrete-Time Model Based on Tube Concatenation. Vocal Fold/Vocal Tract Interaction. 5. Analysis and Synthesis of Pole-Zero Speech Models. Time-Dependent Processing. All-Pole Modeling of Deterministic Signals. Linear Prediction Analysis of Stochastic Speech Sounds. Criterion of "Goodness". Synthesis Based on All-Pole Modeling. Pole-Zero Estimation. Decomposition of the Glottal Flow Derivative. Appendix 5.A: Properties of Stochastic Processes. Random Processes. Ensemble Averages. Stationary Random Process. Time Averages. Power Density Spectrum. Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis. 6. Homomorphic Signal Processing. Concept. Homomorphic Systems for Convolution. Complex Cepstrum of Speech-Like Sequences. Spectral Root Homomorphic Filtering. Short-Time Homomorphic Analysis of Periodic Sequences. Short-Time Speech Analysis. Analysis/Synthesis Structures. Contrasting Linear Prediction and Homomorphic Filtering. 7. Short-Time Fourier Transform Analysis and Synthesis. Short-Time Analysis. Short-Time Synthesis. Short-Time Fourier Transform Magnitude. Signal Estimation from the Modified STFT or STFTM. Time-Scale Modification and Enhancement of Speech. Appendix 7.A: FBS Method with Multiplicative Modification. 8. Filter-Bank Analysis/Synthesis. Revisiting the FBS Method. Phase Vocoder. Phase Coherence in the Phase Vocoder. Constant-Q Analysis/Synthesis. Auditory Modeling. 9. Sinusoidal Analysis/Synthesis. Sinusoidal Speech Model. Estimation of Sinewave Parameters. Synthesis. Source/Filter Phase Model. Additive Deterministic-Stochastic Model. Appendix 9.A: Derivation of the Sinewave Model. Appendix 9.B: Derivation of Optimal Cubic Phase Parameters. 10. Frequency-Domain Pitch Estimation. A Correlation-Based Pitch Estimator. Pitch Estimation Based on a "Comb Filter." Pitch Estimation Based on a Harmonic Sinewave Model. Glottal Pulse Onset Estimation. Multi-Band Pitch and Voicing Estimation. 11. Nonlinear Measurement and Modeling Techniques. The STFT and Wavelet Transform Revisited. Bilinear Time-Frequency Distributions. Aeroacoustic Flow in the Vocal Tract. Instantaneous Teager Energy Operator. 12. Speech Coding. Statistical Models of Speech. Scalar Quantization. Vector Quantization (VQ). Frequency-Domain Coding. Model-Based Coding. LPC Residual Coding. 13. Speech Enhancement. Introduction. Preliminaries. Wiener Filtering. Model-Based Processing. Enhancement Based on Auditory Masking. Appendix 13.A: Stochastic-Theoretic Parameter Estimation. 14. Speaker Recognition. Introduction. Spectral Features for Speaker Recognition. Speaker Recognition Algorithms. Non-Spectral Features in Speaker Recognition. Signal Enhancement for the Mismatched Condition. Speaker Recognition from Coded Speech. Appendix 14.A: Expectation-Maximization (EM) Estimation. Glossary. Speech Signal Processing. Units. Databases. Index. About the Author.

984 citations


Journal ArticleDOI
TL;DR: The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames and derives a voicing condition for speech frames based on the relation between the skewness and kurtosis of voiced speech.
Abstract: This paper presents a robust algorithm for voice activity detection (VAD) based on newly established properties of the higher order statistics (HOS) of speech. Analytical expressions for the third- and fourth-order cumulants of the LPC residual of short-term speech are derived assuming a sinusoidal model. The flat spectral feature of this residual results in distinct characteristics for these cumulants in terms of phase, periodicity and harmonic content and yields closed-form expressions for the skewness and kurtosis. Important properties about these cumulants and their similarity with the autocorrelation function are revealed from this exploratory part. They show that the HOS of speech are sufficiently distinct from those of Gaussian noise and can be used as a basis for speech detection. Their immunity to Gaussian noise makes them particularly useful in algorithms designed for low SNR environments. The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames. The variance of the HOS estimators is quantified and used to yield a likelihood measure for noise frames. Moreover, a voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech. The performance of the algorithm is compared to the ITU-T G.729B VAD in various noise conditions, and quantified using the probability of correct and false classifications. The results show that the proposed algorithm has an overall better performance than G.729B, with noticeable improvement in Gaussian-like noises, such as street and parking garage noise, and at moderate to low SNR.
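
A minimal sketch of the frame-level quantities the detector builds on, assuming autocorrelation-method LP analysis; the thresholds, estimator variances, and the voicing condition from the paper are omitted, and all names are illustrative:

```python
# Compute the LPC residual of a frame, then its sample skewness and excess
# kurtosis, which are near zero for Gaussian noise but not for speech.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual_hos(frame, order=10):
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])       # LP coefficients
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
    e = residual - residual.mean()
    m2 = np.mean(e ** 2)
    skewness = np.mean(e ** 3) / m2 ** 1.5              # third-order statistic
    kurtosis = np.mean(e ** 4) / m2 ** 2 - 3.0          # ~0 for Gaussian noise
    return skewness, kurtosis
```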

249 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features, and synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
Abstract: Describes a technique for synthesizing speech with arbitrary speaker characteristics using speaker independent speech units, which we call "average voice" units. The technique is based on an HMM-based text-to-speech (TTS) system and maximum likelihood linear regression (MLLR) adaptation algorithm. In the HMM-based TTS system, speech synthesis units are modeled by multi-space probability distribution (MSD) HMMs which can model spectrum and pitch simultaneously in a unified framework. We derive an extension of the MLLR algorithm to apply it to MSD-HMMs. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
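
The adaptation idea can be illustrated roughly as follows: all Gaussian means share one affine transform, so a few adaptation sentences constrain every unit at once. This toy sketch estimates the transform by least squares, a simplification of the EM-based maximum likelihood estimation the paper extends to MSD-HMMs; all names are illustrative:

```python
# MLLR-style mean adaptation: mu_adapted = W^T [mu; 1] for a shared transform W.
# Assumption: W fitted by least squares from (state mean, adaptation average) pairs.
import numpy as np

def estimate_mllr_transform(means, targets):
    """means, targets: (n_states, dim) source means and per-state averages
    of the target speaker's frames aligned to each state."""
    ext = np.hstack([means, np.ones((len(means), 1))])  # extended vectors [mu; 1]
    W, *_ = np.linalg.lstsq(ext, targets, rcond=None)   # shape (dim + 1, dim)
    return W

def adapt_means(means, W):
    ext = np.hstack([means, np.ones((len(means), 1))])
    return ext @ W                                      # adapted means, one per state
```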

158 citations


Patent
Andrew P. Dejaco1
12 Apr 2001
TL;DR: In this paper, the authors proposed a method for dynamically adjusting the average output data rate of the speech coder with minimal impact on the quality of the input speech, which can be applied to both unvoiced speech and temporally masked speech.
Abstract: It is an objective of the present invention to provide an optimized method of selection of the encoding mode that provides rate efficient coding of the input speech. It is a second objective of the present invention to identify and provide a means for generating a set of parameters ideally suited for this operational mode selection. Third, it is an objective of the present invention to provide identification of two separate conditions that allow low rate coding with minimal sacrifice to quality. The two conditions are the coding of unvoiced speech and the coding of temporally masked speech. It is a fourth objective of the present invention to provide a method for dynamically adjusting the average output data rate of the speech coder with minimal impact on speech quality.

157 citations


PatentDOI
Dan Chazan1, Ron Hoory1
TL;DR: In this article, the spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments, and an output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of speech segments.
Abstract: A method for speech synthesis includes receiving an input speech signal containing a set of speech segments, and estimating spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments. The spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments. An output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of the speech segments.
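
The integration step might look like the following sketch, assuming triangular window functions (the patent does not fix their shape); names are illustrative:

```python
# Integrate a spectral envelope over a bank of frequency-domain windows:
# each window contributes one element of the segment's feature vector.
import numpy as np

def window_bank(n_windows, n_bins):
    centers = np.linspace(0, n_bins - 1, n_windows + 2)
    f = np.arange(n_bins)
    bank = np.zeros((n_windows, n_bins))
    for i in range(n_windows):
        left, center, right = centers[i], centers[i + 1], centers[i + 2]
        # Triangular window rising to 1 at its center, 0 outside [left, right].
        bank[i] = np.clip(np.minimum((f - left) / (center - left),
                                     (right - f) / (right - center)), 0, None)
    return bank

def envelope_to_features(envelope, bank):
    """envelope: (n_bins,) spectral envelope of one interval -> (n_windows,) features."""
    return bank @ envelope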

151 citations


PatentDOI
Sudheer Sirivara1
TL;DR: In this article, a method and apparatus are provided for compressing and using a concatenative speech database in TTS systems, improving the quality of speech output generated by handheld TTS systems by allowing synthesis to occur on the client.
Abstract: A method and apparatus are provided for compressing and using a concatenative speech database in TTS systems to improve the quality of speech output generated by handheld TTS systems by allowing synthesis to occur on the client. According to one embodiment of the present invention, a G.723 encoder receives diphone waveforms and compresses them into diphone residuals. While compressing the diphone waveforms, the encoder generates Linear Predictive Coding (LPC) coefficients. The diphone residuals and the encoder-generated LPC coefficients are then stored in an encoder-generated compressed packet.

145 citations


Proceedings ArticleDOI
20 Dec 2001
TL;DR: In this paper, a common narrow-band speech signal is expanded into a wide-band speech signal, and the expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver.
Abstract: A common narrow-band speech signal is expanded into a wide-band speech signal. The expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver. The robust techniques suggested herein are based on speech acoustics and fundamentals of human hearing. That is, the techniques extend the harmonic structure of the speech signal during voiced speech segments and introduce a linearly estimated amount of speech energy in the wide frequency-band. During unvoiced speech segments, a fricated noise may be introduced in the upper frequency-band.

139 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech; however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Abstract: The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/difference pairs to measure the accuracy by which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.

135 citations


Patent
05 Jan 2001
TL;DR: In this article, a system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal, and the lower frequency range of the wideband speech signal is then reproduced using the received narrowband speech signal.
Abstract: A system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal. The lower frequency range of the wideband speech signal is reproduced using the received narrowband speech signal. The received narrowband speech signal is analyzed to determine its formants and pitch information. The upper frequency range of the wideband speech signal is synthesized using information derived from the received narrowband speech signal.

123 citations


Patent
Martin Holzapfel1
10 Sep 2001
TL;DR: In this article, a first speech model is trained with a first time pattern and a second speech model with a second time pattern, and the second model is initialized with the first model.
Abstract: A method and also a configuration for determining a descriptive feature of a speech signal, in which a first speech model is trained with a first time pattern and a second speech model is trained with a second time pattern. The second speech model is initialized with the first speech model.

Proceedings ArticleDOI
07 May 2001
TL;DR: A new algorithm is proposed for generating synthetic high-band frequency components from the low-band ones for wide-band speech synthesis; it is based on linear prediction (LPC) analysis-synthesis, combining spectral envelope extension with bandwidth extension of the LPC analysis residual by spectral folding.
Abstract: This paper contributes to narrowband speech enhancement by means of frequency bandwidth extension. A new algorithm is proposed for generating synthetic frequency components in the high band (i.e., 4-8 kHz) given the low-band ones (i.e., 0-4 kHz) for wide-band speech synthesis. It is based on linear prediction (LPC) analysis-synthesis. It consists of a spectral envelope extension that makes efficient use of line spectral frequencies (LSF), and a bandwidth extension of the LPC analysis residual using spectral folding. The low-band LSF of the synthesis signal are obtained from the input speech signal, and the high-band LSF are estimated from the low-band ones using statistical models. This estimation is achieved by means of four models that are distinguished by the first two reflection coefficients obtained from linear prediction analysis of the input signal.
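
The residual extension by spectral folding reduces to zero insertion, as in this minimal sketch (the LSF envelope extension and the four statistical high-band models are not shown; names are illustrative):

```python
# Spectral folding: inserting a zero between consecutive samples doubles the
# sampling rate and mirrors the 0-4 kHz residual spectrum into the 4-8 kHz band.
import numpy as np

def spectral_fold(residual_nb):
    """8 kHz narrowband LPC residual -> 16 kHz wideband excitation."""
    wb = np.zeros(2 * len(residual_nb))
    wb[::2] = residual_nb  # zero insertion: the spectrum folds about 4 kHz
    return wb
```

The folded excitation would then drive the synthesis filter built from the extended LSF envelope.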

PatentDOI
TL;DR: In this article, a system and method of audio processing provides enhanced speech recognition, where the multi-channel audio signal from the microphones may be processed by a beamforming network to generate a single-channel enhanced audio signal, on which voice activity is detected.
Abstract: A system and method of audio processing provides enhanced speech recognition. Audio input is received at a plurality of microphones. The multi-channel audio signal from the microphones may be processed by a beamforming network to generate a single-channel enhanced audio signal, on which voice activity is detected. Audio signals from the microphones are additionally processed by an adaptable noise cancellation filter having variable filter coefficients to generate a noise-suppressed audio signal. The variable filter coefficients are updated during periods of voice inactivity. A speech recognition engine may apply a speech recognition algorithm to the noise-suppressed audio signal and generate an appropriate output. The operation of the speech recognition engine and the adaptable noise cancellation filter may advantageously be controlled based on voice activity detected in the single-channel enhanced audio signal from the beamforming network.
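
One plausible reading of the gated adaptation, sketched with an NLMS filter (the patent does not mandate a specific adaptation rule; signal names are illustrative):

```python
# Adaptive noise cancellation whose coefficients are updated only while the
# beamformer-derived VAD reports voice inactivity, so speech does not leak
# into the noise model.
import numpy as np

def nlms_cancel(primary, reference, voice_active, n_taps=32, mu=0.1, eps=1e-8):
    """primary: speech + noise; reference: noise reference channel;
    voice_active: per-sample boolean VAD decision from the beamformed signal."""
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]
        out[n] = primary[n] - w @ x          # noise-suppressed sample
        if not voice_active[n]:              # freeze coefficients during speech
            w += mu * out[n] * x / (x @ x + eps)
    return out
```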

Journal ArticleDOI
TL;DR: The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.
Abstract: Frequency-warped signal processing techniques are attractive to many wideband speech and audio applications since they have a clear connection to the frequency resolution of human hearing. A warped version of linear predictive coding (LPC) is studied. The performance of conventional and warped LPC algorithms is compared in a simulated coding system using listening tests and conventional technical measures. The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.
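
Warped LP analysis replaces the unit delays of the autocorrelation computation with first-order allpass sections, so the spectrum is modeled on a Bark-like axis; a minimal sketch (the coder and listening-test machinery are omitted, and lam = 0.723 is a commonly used Bark-approximating value, an assumption rather than the paper's setting):

```python
# Warped LPC: autocorrelation lags are taken between the frame and its image
# after k cascaded allpass sections D(z) = (z^-1 - lam) / (1 - lam z^-1).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def warped_lpc(frame, order=16, lam=0.723):
    y = frame.copy()
    r = np.zeros(order + 1)
    r[0] = y @ frame
    for k in range(1, order + 1):
        y = lfilter([-lam, 1.0], [1.0, -lam], y)   # one more allpass stage
        r[k] = y @ frame                           # warped autocorrelation lag k
    a = solve_toeplitz(r[:order], r[1:order + 1])  # Yule-Walker on warped lags
    return np.concatenate(([1.0], -a))             # warped LP polynomial A(z)
```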

Journal ArticleDOI
TL;DR: The proposed time-frequency method has several interesting properties, the most important of which is the ability to simultaneously resolve all FM components of a multicomponent signal, as long as the STFT of the composite signal satisfies a simple separability condition.
Abstract: We present time-frequency methods which are well suited to the analysis of nonstationary multicomponent FM signals, such as speech. These methods are based on group delay, instantaneous frequency, and higher-order phase derivative surfaces computed from the short time Fourier transform (STFT). Unlike more conventional approaches, these methods do not assume a locally stationary approximation of the signal model. We describe the computation of the phase derivatives, the physical interpretation of these derivatives, and a re-mapping algorithm based on these phase derivatives. We show analytically, and by example, the convergence of the re-mapping to the FM representation of the signal. The methods are applied to speech to estimate signal parameters, such as the group delay of a transmission channel and speech formant frequencies. Our goal is to develop a unified method which can accurately estimate speech components in both time and frequency and to apply these methods to the estimation of instantaneous formant frequencies, effective excitation time, vocal tract group delay, and channel group delay. The proposed method has several interesting properties, the most important of which is the ability to simultaneously resolve all FM components of a multicomponent signal, as long as the STFT of the composite signal satisfies a simple separability condition. The method can provide super-resolution in both time and frequency in the sense that it can simultaneously provide time and frequency estimates of FM components, which have much better accuracy than the Heisenberg uncertainty of the STFT. Super-resolution provides the capability to accurately "re-map" each component of the STFT surface to the time and frequency of the FM signal component it represents. To attain high resolution and accuracy, the signal must be jointly estimated simultaneously in time and frequency. This is accomplished by estimating two surfaces, which are essentially the derivatives of the STFT phase with respect to time and frequency. To avoid phase ambiguities, the differentiation is performed as a cross-spectral product.
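
The cross-spectral product for the time derivative of the STFT phase can be sketched as follows: multiplying the STFT of the signal by the conjugate STFT of a one-sample-delayed copy gives the per-sample phase increment, i.e., the channelized instantaneous frequency, without phase unwrapping (a minimal sketch; the group-delay surface and remapping stage are omitted, and parameters are illustrative):

```python
# Channelized instantaneous frequency via a cross-spectral product of two
# STFTs offset by one sample; the phase of the product is the phase derivative.
import numpy as np
from scipy.signal import stft

def channelized_if(x, fs, nperseg=256, noverlap=192):
    _, _, s0 = stft(x[:-1], fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, s1 = stft(x[1:],  fs=fs, nperseg=nperseg, noverlap=noverlap)
    # Phase increment per sample, converted to Hz; shape is (freq, time).
    return np.angle(s1 * np.conj(s0)) * fs / (2 * np.pi)
```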

Proceedings ArticleDOI
07 May 2001
TL;DR: An improvement in state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance is demonstrated by using visual information in addition to the traditional audio stream, taking a decision fusion approach to the audio-visual information.
Abstract: We demonstrate an improvement in the state-of-the-art large vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information, in addition to the traditional audio one. We take a decision fusion approach for the audio-visual information, where the single-modality (audio- and visual- only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: the first uses an automatically extracted estimate of the audio stream reliability in order to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize WER (clean audio only results).
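
Decision fusion with a stream reliability weight reduces to an exponent-weighted combination of per-class log-likelihoods, as in this sketch; estimating the weight from audio reliability (or by discriminative model combination) is the paper's contribution and is not shown:

```python
# Combine audio-only and visual-only HMM scores with a reliability weight
# lam in [0, 1], then pick the best class.
import numpy as np

def fuse_and_classify(audio_loglik, visual_loglik, lam):
    """audio_loglik, visual_loglik: (n_classes,) per-class log-likelihoods."""
    fused = lam * audio_loglik + (1.0 - lam) * visual_loglik
    return int(np.argmax(fused))
```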

Proceedings ArticleDOI
07 May 2001
TL;DR: The study demonstrates the complementary nature of the two components, which are derived using linear prediction analysis of short segments of speech and captured implicitly by a feedforward autoassociative neural network.
Abstract: We study the effectiveness of the features extracted from the source and system components of the speech production process for the purpose of speaker recognition. The source and system components are derived using linear prediction (LP) analysis of short segments of speech. The source component is the LP residual derived from the signal, and the system component is a set of weighted linear prediction cepstral coefficients. The features are captured implicitly by a feedforward autoassociative neural network (AANN). Two separate speaker models are derived by training two AANN models using feature vectors corresponding to source and system components. A speaker recognition system for 20 speakers is built and tested using both the models to evaluate the performance of source and system features. The study demonstrates the complementary nature of the two components.
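
A minimal sketch of the two feature streams, assuming 10th-order autocorrelation-method LP analysis; the AANN speaker models are not shown, and the cepstral weighting is the common linear weighting, an assumption:

```python
# Source feature: the LP residual. System feature: weighted LP cepstral
# coefficients obtained from the LP-to-cepstrum recursion.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def source_system_features(frame, order=10, n_cep=12):
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)  # source
    c = np.zeros(n_cep + 1)                       # cepstrum of the all-pole model
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    weighted_cep = np.arange(1, n_cep + 1) * c[1:]                 # system
    return residual, weighted_cep
```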

Journal ArticleDOI
TL;DR: In this paper, autoregressive moving-average (ARMA) models of processes are studied in the context of cepstral analysis and homomorphic filtering, and it is shown that each such model determines and is completely determined by its finite windows of cepstral coefficients and covariance lags, and that the pole-zero model can be determined from the windows as the unique minimum of a convex objective function.
Abstract: One of the most widely used methods of spectral estimation in signal and speech processing is linear predictive coding (LPC). LPC has some attractive features, which account for its popularity, including the properties that the resulting modeling filter (i) matches a finite window of n+1 covariance lags, (ii) is rational of degree at most n, and (iii) has stable zeros and poles. The only limiting factor of this methodology is that the modeling filter is "all-pole," i.e., an autoregressive (AR) model. In this paper, we present a systematic description of all autoregressive moving-average (ARMA) models of processes that have properties (i)-(iii) in the context of cepstral analysis and homomorphic filtering. We show that each such ARMA model determines and is completely determined by its finite windows of cepstral coefficients and covariance lags. We show that these nth-order windows form local coordinates for all ARMA models of degree n and that the pole-zero model can be determined from the windows as the unique minimum of a convex objective function. We refine this optimization method by first noting that the maximum entropy design of an LPC filter is obtained by maximizing the zeroth cepstral coefficient, subject to the constraint (i). More generally, we modify this scheme to a more well-posed optimization problem where the covariance data enter as a constraint and the linear weights of the cepstral coefficients are "positive", in the sense that a certain pseudo-polynomial is positive, rather succinctly generalizing the maximum entropy method. This new problem is a homomorphic filter generalization of the maximum entropy method, providing a procedure for the design of any stable, minimum-phase modeling filter of degree less than or equal to n that interpolates the given covariance window. We present an algorithm for realizing these filters in a lattice-ladder form, given the covariance window and the moving average part of the model. While we also show how to determine the moving average part using cepstral smoothing, one can make use of any good a priori estimate for the system zeros to initialize the algorithm. We conclude the paper with an example of this method, incorporating an example from the literature on ARMA modeling.

Journal ArticleDOI
TL;DR: A method, the pitch-scaled harmonic filter (PSHF), is proposed which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach.
Abstract: Almost all speech contains simultaneous contributions from more than one acoustic source within the speaker's vocal tract. In this paper, we propose a method-the pitch-scaled harmonic filter (PSHF)-which aims to separate the voiced and turbulence-noise components of the speech signal during phonation, based on a maximum likelihood approach. The PSHF outputs periodic and aperiodic components that are estimates of the respective contributions of the different types of acoustic source. It produces four reconstructed time series signals by decomposing the original speech signal, first, according to amplitude, and then according to power of the Fourier coefficients. Thus, one pair of periodic and aperiodic signals is optimized for subsequent time-series analysis, and another pair for spectral analysis. The performance of the PSHF algorithm is tested on synthetic signals, using three forms of disturbance (jitter, shimmer and additive noise), and the results were used to predict the performance on real speech. Processing recorded speech examples elicited latent features from the signals, demonstrating the PSHF's potential for analysis of mixed-source speech.
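
The pitch-scaled idea can be sketched as follows: with an analysis window spanning exactly B pitch periods, voiced energy concentrates on DFT bins at multiples of B, which yields the periodic estimate; this minimal sketch ignores the window-spread handling and the power-domain second decomposition of the full PSHF, and all names are illustrative:

```python
# Split one frame into periodic and aperiodic components by keeping or
# removing the DFT bins at multiples of B (the harmonics).
import numpy as np

def pshf_frame(frame, period_samples, B=4):
    n = B * period_samples                 # window = exactly B pitch periods
    x = frame[:n] * np.hanning(n)
    X = np.fft.rfft(x)
    harm = np.zeros_like(X)
    harm[::B] = X[::B]                     # bins at multiples of B = harmonics
    periodic = np.fft.irfft(harm, n)       # voiced (periodic) estimate
    aperiodic = np.fft.irfft(X - harm, n)  # turbulence-noise (aperiodic) estimate
    return periodic, aperiodic
```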

Proceedings ArticleDOI
07 May 2001
TL;DR: Using speech analysis techniques, one may design an effective data signal that can be used to hide an arbitrary message in a speech signal; experiments demonstrate the subliminal channel capacity of the speech data embedding technique developed here.
Abstract: The technique of embedding a digital signal into an audio recording or image using techniques that render the signal imperceptible has received significant attention. Embedding an imperceptible, cryptographically secure signal, or watermark, is seen as a potential mechanism that may be used to prove ownership or detect tampering. While there has been a considerable amount of attention devoted to the techniques of spread-spectrum signaling for use in image and audio watermarking applications, there has only been a limited study for embedding data signals in speech. Speech is an uncharacteristically narrow band signal given the perceptual capabilities of the human hearing system. However, using speech analysis techniques, one may design an effective data signal that can be used to hide an arbitrary message in a speech signal. Also included are experiments demonstrating the subliminal channel capacity of the speech data embedding technique developed here.

Proceedings ArticleDOI
07 May 2001
TL;DR: A two-module text to speech system (TTS) structure, which bypasses the prosody model that predicts numerical prosodic parameters for synthetic speech, and a multi-tier non-uniform unit selection method that makes the best decision on unit selection by minimizing the concatenated cost of a whole utterance.
Abstract: This paper proposes a two-module text to speech system (TTS) structure, which bypasses the prosody model that predicts numerical prosodic parameters for synthetic speech. Instead, many instances of each basic unit from a large speech corpus are classified into categories by a classification and regression tree (CART), in which the expectation of the weighted sum of square regression error of prosodic features is used as splitting criterion. Better prosody is achieved by keeping slender diversity in prosodic features of instances belonging to the same class. A multi-tier non-uniform unit selection method is presented. It makes the best decision on unit selection by minimizing the concatenated cost of a whole utterance. Since the largest available and suitable units are selected for concatenation, distortion caused by mismatches at concatenation points is minimized. Very natural and fluent speech is synthesized, according to informal listening tests.
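
Minimizing the cost of a whole utterance is a dynamic-programming search over candidate instances per unit, as in this sketch; the CART-derived target and concatenation costs themselves are placeholders here, not the paper's definitions:

```python
# Viterbi-style unit selection: pick one candidate per unit so that the sum
# of target costs and pairwise concatenation costs is minimal.
import numpy as np

def select_units(target_costs, concat_costs):
    """target_costs: list of (n_cand_i,) arrays, one per unit.
    concat_costs: list of (n_cand_i, n_cand_{i+1}) arrays between units.
    Returns the chosen candidate index for each unit."""
    best = [target_costs[0].astype(float)]
    back = []
    for i in range(1, len(target_costs)):
        total = best[-1][:, None] + concat_costs[i - 1] + target_costs[i][None, :]
        back.append(np.argmin(total, axis=0))   # best predecessor per candidate
        best.append(np.min(total, axis=0))
    path = [int(np.argmin(best[-1]))]
    for bp in reversed(back):                   # trace the optimal path back
        path.append(int(bp[path[-1]]))
    return path[::-1]
```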

Patent
01 Oct 2001
TL;DR: In this patent, speech is recognized and enhanced by fusing audio and visual speech recognition, with an audio speech recognizer and a visual speech recognizer operating on signals from audio and visual transducers.
Abstract: Recognizing and enhancing speech (22) is accomplished by fusing audio and visual speech recognition. An audio speech recognizer (70) determines a subset of speech elements (72) for speech segments (22) received from at least one audio transducer (66). A visual speech recognizer (74) determines a figure of merit (80) for at least one speech element (22) based on at least one image (64) received from at least one visual transducer (62). Speech (22) may also be enhanced by variably filtering (136) or editing (182) received audio signals (68) based on at least one visual speech parameter (134).

Proceedings ArticleDOI
07 May 2001
TL;DR: In this article, a method for speech/non-speech detection using a linear discriminant analysis (LDA) applied to mel frequency cepstrum coefficients (MFCC) is presented.
Abstract: In speech recognition, speech/non-speech detection must be robust to noise. In this paper, a method for speech/non-speech detection using a linear discriminant analysis (LDA) applied to mel frequency cepstrum coefficients (MFCC) is presented. The energy is the most discriminant parameter between noise and speech. But with this single parameter, the speech/non-speech detection system detects too many noise segments. The LDA applied to MFCC and the associated test reduces the detection of noise segments. This new algorithm is compared to the one based on the signal-to-noise ratio (Mauuary and Monne, 1993).
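
A minimal sketch of the detection scheme, using scikit-learn's LDA as a stand-in for the paper's exact training setup and test statistic:

```python
# Project MFCC vectors onto a single linear discriminant trained on labeled
# speech/noise frames, then classify each frame.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_detector(mfcc_frames, labels):
    """mfcc_frames: (n_frames, n_coeffs); labels: 1 = speech, 0 = non-speech."""
    lda = LinearDiscriminantAnalysis(n_components=1)
    lda.fit(mfcc_frames, labels)
    return lda

def detect(lda, mfcc_frames):
    return lda.predict(mfcc_frames)  # per-frame speech/non-speech decision
```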

Patent
05 Feb 2001
TL;DR: In this paper, a method and apparatus for encoding speech for communication to a decoder for reproduction of the speech where the speech signal is classified into steady state voiced (harmonic), stationary unvoiced, and "transitory" or "transition" speech, and a particular type of coding scheme is used for each class.
Abstract: A method and apparatus for encoding speech for communication to a decoder for reproduction of the speech where the speech signal is classified into steady state voiced (harmonic), stationary unvoiced, and “transitory” or “transition” speech, and a particular type of coding scheme is used for each class. Harmonic coding is used for steady state voiced speech, “noise-like” coding is used for stationary unvoiced speech, and a special coding mode is used for transition speech, designed to capture the location, the structure, and the strength of the local time events that characterize the transition portions of the speech. The compression schemes can be applied to the speech signal or to the LP residual signal.

Journal ArticleDOI
TL;DR: The proposed inverse-filter control method has the specific feature that it directly estimates resonant frequencies of the vocal tract, unlike spectral matching methods such as analysis-by-synthesis (A-b-S) or linear predictive coding (LPC).
Abstract: This paper proposes a new method for estimating formant frequencies of speech signals, based on inverse-filter control and zero-crossing frequency distributions. In this method, which is called the inverse-filter control (IFC) method, we use 32 basic inverse filters that are mutually controlled by weighted means of zero-crossing frequency distributions. After quick convergence of the inverse filters, we can gain four to six formant frequencies as final mean values of the zero-crossing frequencies. The proposed method (IFC) has the specific feature that it directly estimates resonant frequencies of a vocal tract, unlike analysis-by-synthesis (A-b-S) or linear predictive coding (LPC), which are spectral matching methods. Therefore, spectral shapes influence the formant estimation in IFC only indirectly. Although the superiority of IFC to LPC was not necessarily prominent in the systematic evaluation using synthetic speech, the estimates showed satisfactorily small errors for practical analysis. On the other hand, when observing some analysis examples of real speech, we found many fewer gross errors in IFC than in LPC. Lastly, we briefly describe a method for estimating a spectral envelope (or formant bandwidths) based on the obtained formant frequencies and the spectrum to be analyzed. According to the results, it is understandable that the existence of the wide-band formants also contributes to stable formant trajectories.
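
The measurement underlying IFC is the zero-crossing frequency of a (filtered) segment; a minimal sketch (the bank of 32 mutually controlled inverse filters and their convergence loop are not shown):

```python
# Zero-crossing frequency of a segment: half the zero-crossing rate times the
# sampling rate, a cheap dominant-frequency estimate for a filtered band.
import numpy as np

def zero_crossing_frequency(segment, fs):
    crossings = np.sum(np.signbit(segment[:-1]) != np.signbit(segment[1:]))
    return 0.5 * crossings * fs / (len(segment) - 1)  # Hz
```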

PatentDOI
TL;DR: In this article, a method and a system are presented for controlling the coding rates of a multimode coding system with respect to a sequence of input audio signal frames, eliminating or minimizing the overflow and underflow of the bit-stream buffer maintained by the coding system.
Abstract: A method and a system are provided for controlling the coding rates of a multimode coding system with respect to a sequence of input audio signal frames. The method eliminates or minimizes the overflow and underflow of a bit-stream buffer maintained by the coding system for temporarily recording bit-stream data prior to transmission or storage.

Journal ArticleDOI
TL;DR: It is shown theoretically that 16 bits are needed to achieve an average SD of 1 dB when quantizing ten-dimensional (10-D) spectrum vectors using a first-order recursive scheme; this result is validated in experiments, and it is shown how to approximate the SD with an L2-norm measure.
Abstract: A theoretical analysis of recursive speech spectrum coding, where predictive and finite state schemes are special cases, is presented. We evaluate the spectral distortion (SD) theoretically and design coders that minimize the SD. The analysis rests on three cornerstones: high-rate theory, PDF modeling, and an approximation of SD. A derivation of the mean L2-norm distortion of a recursive quantizer operating at high rate is provided. Also, the distortion distribution is supplied. The evaluation of the distortion expressions requires a model of the joint PDF of two consecutive spectrum vectors. The LPC spectrum source considered here has outcomes in a bounded region, and this is taken into account in the choice of model and modeling algorithm. It is further shown how to approximate the SD with an L2-norm measure. Combining the results, we show theoretically that 16 bits are needed to achieve an average SD of 1 dB when quantizing ten-dimensional (10-D) spectrum vectors using a first-order recursive scheme. A gain of six bits per frame is noted compared to memoryless quantization. These results rely on high-rate assumptions which are validated in experiments. There, actual high-rate optimal coders are designed and evaluated.
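
For reference, the spectral distortion analyzed here is, in its standard log-spectral form (a reconstruction from the abstract's definitions, not an equation copied from the paper):

```latex
\mathrm{SD}^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi}
  \Bigl( 10\log_{10} S(\omega) - 10\log_{10} \hat{S}(\omega) \Bigr)^{2} \, d\omega
```

where S and \hat{S} are the original and quantized LPC model spectra; the paper shows how to approximate this measure by an L2 norm in the parameter domain.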

Proceedings ArticleDOI
07 May 2001
TL;DR: The theory provides formulas relating minimum discrimination information for model selection and the mean squared error resulting when the MDI criterion is used in an optimized robust classified vector quantizer, which provides motivation for the use of Gauss mixture models for robust compression systems for general random vectors.
Abstract: Gauss mixtures are a popular class of models in statistics and statistical signal processing because they can provide good fits to smooth densities, because they have a rich theory, and because they can be well estimated by existing algorithms such as the EM (expectation maximization) algorithm. We here extend an information theoretic extremal property for source coding from Gaussian sources to Gauss mixtures using high rate quantization theory and extend a method originally used for LPC (linear predictive coding) speech vector quantization to provide a Lloyd clustering approach to the design of Gauss mixture models. The theory provides formulas relating minimum discrimination information (MDI) for model selection and the mean squared error resulting when the MDI criterion is used in an optimized robust classified vector quantizer. It also provides motivation for the use of Gauss mixture models for robust compression systems for general random vectors.
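
The EM route that the paper contrasts with its Lloyd-clustering design can be sketched in a few lines with scikit-learn; the MDI-based classified vector quantizer design itself is not reproduced:

```python
# Fit a Gauss mixture model to feature vectors by the EM algorithm, one of
# the "existing algorithms" the abstract refers to.
from sklearn.mixture import GaussianMixture

def fit_gauss_mixture(vectors, n_components=8):
    """vectors: (n_samples, dim) training data."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(vectors)
    return gmm
```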

Journal ArticleDOI
TL;DR: This work describes a system, based on linear predictive coding, for estimating wideband speech from narrowband, which employs both previously identified and novel techniques to improve speech quality.
Abstract: This paper addresses the problem of reconstructing wideband speech signals from observed narrowband speech signals. The goal of this work is to improve the perceived quality of speech signals which have been transmitted through narrowband channels or degraded during acquisition. We describe a system, based on linear predictive coding, for estimating wideband speech from narrowband. This system employs both previously identified and novel techniques. Experimental results are provided in order to illustrate the system's ability to improve speech quality. Both objective and subjective criteria are used to evaluate the quality of the processed speech signals.

Proceedings ArticleDOI
07 May 2001
TL;DR: This paper describes the SMV algorithm developed by Conexant, which will become a service option in CDMA systems such as IS-95 and cdma2000, providing a higher quality, flexibility, and capacity over the existing speech coding service options.
Abstract: During the years 1999 and 2000, the Telecommunication Industry Association (TIA) and the 3rd Generation Partnership Project 2 (3GPP2), managed a competition and a selection process for a new speech coding standard for CDMA applications. The new speech coding standard, which is coined Selectable Mode Vocoder (SMV), will become a service option in CDMA systems such as IS-95 and cdma2000, providing a higher quality, flexibility, and capacity over the existing speech coding service options, IS-96C, IS-127, and IS-733. Eight companies submitted candidates to selection phase. For all of the 36 test conditions, the Conexant SMV candidate was ranked at the top, or was statistically equivalent to other top-ranking candidates. The Conexant SMV candidate was chosen as the core speech coding technology for the SMV system. This paper describes the SMV algorithm developed by Conexant.