
Showing papers on "Linear predictive coding" published in 2004


Journal ArticleDOI
TL;DR: A new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems is presented; it formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, yielding a highly discriminating decision rule and minimizing the average number of decision errors.

412 citations
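
A minimal sketch of the decision rule described above, assuming the long-term spectral envelope is the per-bin maximum over the last few frames and that an average noise spectrum is already available; the window length and threshold are illustrative, not values from the paper.

```python
import numpy as np

def ltsd_vad(frames, noise_psd, window=6, threshold_db=6.0):
    """frames: (n_frames, n_bins) magnitude spectra of recent frames.
    noise_psd: average noise power spectrum (n_bins,).
    Returns True (speech) when the long-term spectral divergence
    of the envelope from the noise spectrum exceeds threshold_db."""
    # Long-term spectral envelope: per-bin maximum over the window.
    ltse = frames[-window:].max(axis=0)
    # Divergence of the envelope from the average noise spectrum, in dB.
    ltsd = 10.0 * np.log10(np.mean(ltse ** 2 / noise_psd) + 1e-12)
    return ltsd > threshold_db

# Toy usage: noise-only frames vs. a louder burst.
rng = np.random.default_rng(0)
noise = rng.rayleigh(1.0, (50, 129))
noise_psd = (noise ** 2).mean(axis=0)
print(ltsd_vad(noise, noise_psd))        # typically False (non-speech)
print(ltsd_vad(noise * 4.0, noise_psd))  # True (treated as speech)
```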


PatentDOI
TL;DR: In this article, a plurality of synthesis speech segments are generated by synthesizing training speech segments labeled with phonetic contexts and input speech segments while altering the pitch/duration of the input speech segments in accordance with the pitch/duration of the training speech segments.
Abstract: In a synthesis unit generator, a plurality of synthesis speech segments are generated by synthesizing training speech segments labeled with phonetic contexts and input speech segments while altering the pitch/duration of the input speech segments in accordance with the pitch/duration of the training speech segments. Typical speech segments are selected from the input speech segments on the basis of a distance between the synthesis speech segments and the training speech segments, and are stored in a storage. In addition, a plurality of phonetic context clusters corresponding to the synthesis units are generated on the basis of the distance, and are stored in a storage. A synthesis speech signal is generated by reading out, from the storage, those of the synthesis units, which correspond to the phonetic context clusters including phonetic contexts of input phonemes, and connecting the selected synthesis units in a speech synthesizer.

203 citations


Journal ArticleDOI
TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.
Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance and ignores the manner in which speech recognition systems operate. In this paper a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.

147 citations


Proceedings ArticleDOI
24 Oct 2004
TL;DR: A rate control scheme for H.264 is presented by introducing the concept of a basic unit and a linear prediction model that is used to solve the chicken-and-egg dilemma existing in the rate control of H.264.
Abstract: This paper presents a rate control scheme for H.264 by introducing the concept of a basic unit and a linear prediction model. The basic unit can be a macroblock (MB), a slice, or a frame, and can be used to trade off overall coding efficiency against bit fluctuation. The linear model is used to solve the chicken-and-egg dilemma existing in the rate control of H.264. Both constant bit rate (CBR) and variable bit rate (VBR) cases are studied. Our scheme has been adopted by H.264.

119 citations
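
The "chicken and egg" dilemma mentioned above is that the quantization parameter must be chosen before a basic unit is coded, yet the unit's complexity (its mean absolute difference, MAD) is only known after coding. The sketch below illustrates the linear prediction workaround under stated assumptions: the model predicts the current MAD from the co-located basic unit of the previous frame, and the two coefficients are refit by ordinary least squares over a sliding window. The class, window length, and initial coefficients are illustrative, not taken from the paper or reference software.

```python
# Hypothetical helper class; not code from the H.264 reference software.
class MadPredictor:
    def __init__(self, a1=1.0, a2=0.0, window=20):
        self.a1, self.a2, self.window = a1, a2, window
        self.history = []  # (previous MAD, actual MAD) pairs

    def predict(self, prev_mad):
        # Linear model: MAD_current ~= a1 * MAD_previous + a2
        return self.a1 * prev_mad + self.a2

    def update(self, prev_mad, actual_mad):
        # Refit a1, a2 by least squares once the true MAD is known.
        self.history = (self.history + [(prev_mad, actual_mad)])[-self.window:]
        xs = [x for x, _ in self.history]
        ys = [y for _, y in self.history]
        n, sx, sy = len(xs), sum(xs), sum(ys)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, ys))
        denom = n * sxx - sx * sx
        if denom != 0:
            self.a1 = (n * sxy - sx * sy) / denom
            self.a2 = (sy - self.a1 * sx) / n

p = MadPredictor()
p.update(4.0, 4.4)
p.update(5.0, 5.6)
print(p.predict(4.5))  # MAD estimate available before coding the unit
```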


Patent
Hao Jiang, Hong-Jiang Zhang
TL;DR: In this paper, a portion of an audio signal is separated into multiple frames from which one or more different features are extracted, in combination with a set of rules, to classify the portion of the audio signal into one of multiple different classifications (for example, speech, non-speech, music, environment sound, silence).
Abstract: A portion of an audio signal is separated into multiple frames from which one or more different features are extracted. These different features are used, in combination with a set of rules, to classify the portion of the audio signal into one of multiple different classifications (for example, speech, non-speech, music, environment sound, silence, etc.). In one embodiment, these different features include one or more of line spectrum pairs (LSPs), a noise frame ratio, periodicity of particular bands, spectrum flux features, and energy distribution in one or more of the bands. The line spectrum pairs are also optionally used to segment the audio signal, identifying audio classification changes as well as speaker changes when the audio signal is speech.

102 citations


Journal ArticleDOI
TL;DR: A new information theoretic algorithm is proposed for signal enumeration in array processing, based on the predictive description length (PDL), defined as the length of a predictive code for the set of observations; the method can detect both coherent and noncoherent signals.
Abstract: In this paper, a new information theoretic algorithm is proposed for signal enumeration in array processing. The approach is based on predictive description length (PDL) that is defined as the length of a predictive code for the set of observations. We assume that several models, with each model representing a certain number of sources, will compete. The PDL criterion is computed for the candidate models and is minimized over all models to select the best model and to determine the number of signals. In the proposed method, the correlation matrix is decomposed into two orthogonal components in the signal and noise subspaces. The maximum likelihood (ML) estimates of the angles-of-arrival are used to find the projection of the sample correlation matrix onto the signal and noise subspaces. The summation of the ML estimates of these matrices is the ML estimate of the correlation matrix. This method can detect both coherent and noncoherent signals. The proposed method can be used online and can be applied to time-varying systems and target tracking.

102 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
Abstract: Autoregressive modeling is applied for approximating the temporal evolution of spectral density in critical-band-sized sub-bands of a segment of speech signal. The generalized autocorrelation linear predictive technique allows for a compromise between fitting the peaks and the troughs of the Hilbert envelope of the signal in the sub-band. The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.

61 citations
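
A rough sketch of the mechanism, assuming plain autocorrelation linear prediction rather than the paper's generalized autocorrelation compromise: an all-pole fit to a sub-band slice of the segment's cosine transform approximates the temporal (Hilbert) envelope of that sub-band, since the magnitude response of the fitted model traces the envelope over the duration of the segment. The bin range and model order are illustrative.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.signal import freqz

def levinson(r, order):
    """Levinson-Durbin: all-pole coefficients from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[1:i][::-1] @ a[1:i] + r[i]) / e
        a[1:i + 1] += k * a[i - 1::-1][:i]
        e *= 1.0 - k * k
    return a, e

def subband_temporal_envelope(x, lo, hi, order=20, npts=512):
    c = dct(x, type=2, norm='ortho')[lo:hi]      # sub-band, DCT domain
    r = np.correlate(c, c, 'full')[len(c) - 1:]  # autocorrelation
    a, e = levinson(r, order)
    # Squared magnitude response of 1/A(z) approximates the squared
    # Hilbert envelope of the sub-band signal across the segment.
    _, h = freqz([np.sqrt(e)], a, worN=npts)
    return np.abs(h) ** 2

rng = np.random.default_rng(1)
x = rng.standard_normal(4000) * np.hanning(4000)  # amplitude-modulated noise
print(subband_temporal_envelope(x, 100, 500)[:4])
```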


Proceedings ArticleDOI
17 May 2004
TL;DR: This paper compares entropy and Euclidean distance measures for VFR in ASR experiments using the Aurora2 and TI46 databases and finds that the entropy-based VFR outperforms both the earlier VFR approach and the fixed-rate system.
Abstract: Most speech processing algorithms analyze speech signals frame by frame with a fixed frame rate. Fixed-rate analysis is inconsistent with human speech perception and effectively assigns the same importance or 'weight' to all equi-duration frames. In Zhu et al. (2000), we proposed a variable frame rate (VFR) analysis technique that is based on a Euclidean distance measure. In this paper, we propose another approach for VFR based on the entropy of the signal. We compare entropy and Euclidean distance measures for VFR in ASR experiments using the Aurora2 and TI46 databases. Better performance is observed for the entropy-based VFR over our earlier VFR approach and over the fixed-rate system.

59 citations
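
One plausible reading of the entropy-driven analysis, sketched under stated assumptions (the spectral-entropy measure, the accumulation rule, and the threshold are all illustrative choices, not taken from the paper): a frame is emitted only when enough spectral entropy has accumulated since the last emitted frame, so the frame rate adapts to the information content of the signal.

```python
import numpy as np

def spectral_entropy(frame):
    # Entropy of the normalized power spectrum of one analysis frame.
    p = np.abs(np.fft.rfft(frame)) ** 2
    p = p / (p.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

def variable_frame_rate(x, frame_len=200, hop=80, budget=30.0):
    kept, acc = [], 0.0
    for start in range(0, len(x) - frame_len, hop):
        acc += spectral_entropy(x[start:start + frame_len])
        if acc >= budget:        # enough "information" accumulated
            kept.append(start)   # emit a frame at this position
            acc = 0.0
    return kept

rng = np.random.default_rng(0)
sig = np.concatenate([0.2 * rng.standard_normal(4000),
                      np.sin(0.3 * np.arange(4000))])
print(variable_frame_rate(sig)[:5])  # positions of the emitted frames
```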


Journal ArticleDOI
TL;DR: In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system and results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used.
Abstract: Studies by Shannon et al. [Science, 270, 303-304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152-1161 (1987)], and others show that human listeners can understand important aspects of the speech signal when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system. The parameters targeted include energy onset and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find abrupt acoustic events which signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Use of the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes "sonorant," "fricative," "stop," and "silence" results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used, 70.1%. The combination of the temporal parameters and cepstral parameters results in an accuracy of 74.8%.

57 citations
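
A simplified sketch of the onset/offset component of such a front end; the paper's adaptive algorithm is replaced here by a fixed dB-per-frame rise threshold and band filtering is omitted, so the function and its parameters are illustrative only.

```python
import numpy as np

def energy_landmarks(x, fs=16000, frame_s=0.01, rise_db=9.0):
    """Mark abrupt rises/falls in smoothed frame energy (in seconds)."""
    n = int(fs * frame_s)
    frames = x[:len(x) // n * n].reshape(-1, n)
    e_db = 10 * np.log10((frames ** 2).sum(axis=1) + 1e-10)
    e_db = np.convolve(e_db, np.ones(3) / 3, mode='same')  # light smoothing
    d = np.diff(e_db)
    # Adjacent frames may both trigger; clustering is omitted here.
    onsets = np.where(d > rise_db)[0] * n / fs    # abrupt energy rise
    offsets = np.where(d < -rise_db)[0] * n / fs  # abrupt energy fall
    return onsets, offsets

fs = 16000
t = np.arange(fs) / fs
burst = np.where((t > 0.3) & (t < 0.6), np.sin(2 * np.pi * 150 * t), 0.0)
print(energy_landmarks(burst, fs))  # onset near 0.3 s, offset near 0.6 s
```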


Proceedings ArticleDOI
17 May 2004
TL;DR: An iterative tracking algorithm is described and evaluated that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework and provides meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
Abstract: A novel approach is developed for efficient and accurate tracking of vocal tract resonances, which are natural frequencies of the resonator from larynx to lips, in fluent speech. The tracking algorithm is based on a version of the structured speech model consisting of continuous-valued hidden dynamics and a piecewise-linearized prediction function from resonance frequencies and bandwidths to LPC cepstra. We present details of the piecewise linearization design process and an adaptive training technique for the parameters that characterize the prediction residuals. An iterative tracking algorithm is described and evaluated that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework. Experiments on tracking vocal tract resonances in Switchboard speech data demonstrate high accuracy in the results, as well as the effectiveness of residual training embedded in the algorithm. Our approach differs from traditional formant trackers in that it provides meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.

57 citations
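
The prediction function at the core of the tracker maps resonance frequencies and bandwidths to LPC cepstra; for an all-pole model with one pole pair per resonance this map has the well-known closed form c_n = Σ_i (2/n) exp(-π n b_i / f_s) cos(2π n f_i / f_s), which the tracker linearizes piecewise inside the Kalman filter. Only the forward map is sketched below; the sampling rate and formant values are illustrative.

```python
import numpy as np

def resonances_to_cepstrum(freqs_hz, bws_hz, fs=8000, n_ceps=15):
    """LPC cepstra of an all-pole model with the given resonances."""
    n = np.arange(1, n_ceps + 1)[:, None]  # cepstral index
    f = np.asarray(freqs_hz, dtype=float)[None, :]
    b = np.asarray(bws_hz, dtype=float)[None, :]
    terms = (2.0 / n) * np.exp(-np.pi * n * b / fs) \
        * np.cos(2 * np.pi * n * f / fs)
    return terms.sum(axis=1)

# Three nominal formants of a neutral vowel, 80 Hz bandwidth each.
print(resonances_to_cepstrum([500, 1500, 2500], [80, 80, 80])[:5])
```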


Proceedings ArticleDOI
17 May 2004
TL;DR: Only non-temporal, frame-level features are used, so that the proposed system scales from isolated notes to solo instrumental phrases without the need for temporal segmentation of solo music.
Abstract: Speech and audio processing techniques are used along with statistical pattern recognition principles to solve the problem of music instrument recognition. Only non-temporal, frame-level features are used, so that the proposed system scales from isolated notes to solo instrumental phrases without the need for temporal segmentation of solo music. Based on their effectiveness in speech, line spectral frequencies (LSF) are proposed as features for music instrument recognition. The proposed system has also been evaluated using MFCC and LPCC features. Gaussian mixture models and K-nearest neighbour classifiers are used for classification. The experimental dataset included the UIowa MIS and the C Music Corporation RWC databases. Our best result at the instrument family level is about 95%, and about 90% at the instrument level, when classifying 14 instruments.
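
For readers unfamiliar with the proposed features, the sketch below shows one standard way to obtain LSFs from a frame (an LPC fit via Levinson-Durbin, then the angles of the unit-circle roots of the symmetric and antisymmetric polynomials); the GMM/k-NN classification stage is omitted, and the test frame and model order are illustrative.

```python
import numpy as np

def lpc(frame, order):
    """LPC polynomial A(z) via autocorrelation and Levinson-Durbin."""
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[1:i][::-1] @ a[1:i] + r[i]) / e
        a[1:i + 1] += k * a[i - 1::-1][:i]
        e *= 1.0 - k * k
    return a

def lsf(a):
    """Line spectral frequencies: root angles of P(z) and Q(z)."""
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.angle(np.concatenate([np.roots(p), np.roots(q)]))
    return np.sort(angles[(angles > 0) & (angles < np.pi)])

frame = np.sin(2 * np.pi * 0.07 * np.arange(400)) * np.hamming(400)
print(lsf(lpc(frame, 10)))  # ten increasing frequencies in (0, pi)
```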


Book
15 Oct 2004
TL;DR: The theory and methods for quality enhancement of clean speech signals and distorted speech signals such as those that have undergone a band limitation, for instance, in a telephone network are described.
Abstract: Bandwidth Extension of Speech Signals describes the theory and methods for quality enhancement of clean speech signals and distorted speech signals such as those that have undergone a band limitation, for instance, in a telephone network. Problems and the respective solutions are discussed for the different approaches. The different approaches are evaluated and a real-time implementation of the most promising approach is presented. The book includes topics related to speech coding, pattern/speech recognition, speech enhancement, statistics and digital signal processing in general.

Patent
23 Sep 2004
TL;DR: Pitch detection of speech signals finds numerous applications in karaoke, voice recognition, and scoring; while most existing techniques rely on time-domain methods, the invention utilizes frequency-domain methods.
Abstract: Pitch detection of speech signals finds numerous applications in karaoke, voice recognition, and scoring. While most of the existing techniques rely on time domain methods, the invention utilizes frequency domain methods. There is provided a method and system for determining the pitch of speech from a speech signal. The method includes the steps of: producing or obtaining the speech signal; classifying the speech signal into voiced, unvoiced or silence sections using speech signal energy levels; applying a Fourier transform to the speech signal and obtaining speech signal parameters; determining peaks of the Fourier-transformed speech signal; tracking the speech signal parameters of the determined peaks to select partials; and determining the pitch from the selected partials using a two-way mismatch error calculation.
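
A simplified sketch of the two-way mismatch step, in the spirit of Maher and Beauchamp's formulation: each pitch candidate is scored by how far its predicted harmonics fall from the measured peaks and how far the measured peaks fall from the predicted harmonics. The patent's peak tracking, partial selection, and error weighting are omitted; the harmonic count and candidate grid are assumptions.

```python
import numpy as np

def twm_pitch(peak_freqs, candidates, n_harm=8):
    """Pick the candidate F0 minimizing the combined mismatch error."""
    peaks = np.asarray(peak_freqs, dtype=float)
    best, best_err = None, np.inf
    for f0 in candidates:
        pred = f0 * np.arange(1, n_harm + 1)
        # Predicted-to-measured: each harmonic to its nearest peak.
        e_pm = np.mean([np.min(np.abs(peaks - p)) / p for p in pred])
        # Measured-to-predicted: each peak to its nearest harmonic.
        e_mp = np.mean([np.min(np.abs(pred - m)) / m for m in peaks])
        if e_pm + e_mp < best_err:
            best, best_err = f0, e_pm + e_mp
    return best

peaks = [220.0, 441.0, 659.0, 882.0]   # roughly harmonic partials
cands = np.arange(80.0, 400.0, 1.0)
print(twm_pitch(peaks, cands))         # ~220 Hz
```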

Patent
Marc Boillot, John G. Harris
31 Dec 2004
TL;DR: In this paper, a speech filter (108) is proposed to enhance the loudness of a speech signal by expanding the formant regions of the speech signal beyond a natural bandwidth of the formants.
Abstract: A speech filter (108) enhances the loudness of a speech signal by expanding the formant regions of the speech signal beyond a natural bandwidth of the formant regions. The energy level of the speech signal is maintained so that the filtered speech signal contains the same energy as the pre-filtered signal. By expanding the formant regions of the speech signal on a critical band scale corresponding to human hearing, the listener of the speech signal perceives it to be louder even though the signal contains the same energy.

Patent
26 Nov 2004
TL;DR: A speech synthesis system stores a group of speech units in a memory and selects speech units from the group based on prosodic information of the target speech; the selected units correspond to segments of the phoneme string of the target speech and minimize the distortion between the resulting synthetic speech and the target speech.
Abstract: A speech synthesis system stores a group of speech units in a memory and selects a plurality of speech units from the group based on prosodic information of the target speech. The selected speech units correspond to segments obtained by segmenting a phoneme string of the target speech, and are chosen to minimize the distortion of the synthetic speech generated from them relative to the target speech. For each segment, a new speech unit is generated by fusing the selected speech units, yielding a plurality of new speech units corresponding to the segments, and synthetic speech is generated by concatenating the new speech units.

Journal ArticleDOI
TL;DR: A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed and a maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise.
Abstract: A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, information on the time and location of speech events can be obtained. The information on the detected speech events is then utilized in the robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise. The coefficients of the beamformer are kept updated based on the information of the speech events. The information on the speech events is also used by the speech recognizer for extracting the speech segment.

Proceedings ArticleDOI
Yuriy Reznik
17 May 2004
TL;DR: Two alternative schemes for encoding of the prediction residual adopted in the MPEG-4 ALS (audio lossless coding) standard for lossless audio coding are described and analytical and experimental analysis of their performance is provided.
Abstract: We describe two alternative schemes for encoding of the prediction residual adopted in the MPEG-4 ALS (audio lossless coding) standard for lossless audio coding. We explain choices of algorithms used in their design and provide both analytical and experimental analysis of their performance.
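
The simpler of the two schemes is Rice (Golomb power-of-two) coding of the residuals. The round trip below is a hedged sketch of that idea with a zigzag sign mapping; the standard's block partitioning and the more elaborate BGMC alternative are not shown.

```python
def zigzag(v):
    """Map signed residuals 0,-1,1,-2,2,... to 0,1,2,3,4,..."""
    return (v << 1) if v >= 0 else ((-v << 1) - 1)

def rice_encode(values, k):
    bits = []
    for v in values:
        u = zigzag(v)
        bits += [1] * (u >> k) + [0]  # unary quotient, 0-terminated
        bits += [(u >> i) & 1 for i in range(k - 1, -1, -1)]  # remainder
    return bits

def rice_decode(bits, count, k):
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos]:
            q, pos = q + 1, pos + 1
        pos += 1  # skip the terminating 0
        r = 0
        for _ in range(k):
            r, pos = (r << 1) | bits[pos], pos + 1
        u = (q << k) | r
        out.append(u >> 1 if u % 2 == 0 else -((u + 1) >> 1))
    return out

res = [0, -1, 3, -4, 2, 0, 1]
enc = rice_encode(res, k=2)
assert rice_decode(enc, len(res), k=2) == res
print(len(enc), "bits for", len(res), "residuals")
```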

Proceedings ArticleDOI
27 Sep 2004
TL;DR: This paper describes a new technique, called the empirical mode decomposition (EMD), for adaptively representing nonstationary signals as sums of zero-mean AM-FM components that allows the analysis of frequency composition of one-dimensional signals.
Abstract: This paper describes a new technique, called the empirical mode decomposition (EMD), recently pioneered by N. E. Huang et al. for adaptively representing nonstationary signals as sums of zero-mean AM-FM components [N. E. Huang, et al., 1998]. The components, called intrinsic mode functions (IMFs), allow the analysis of the frequency composition of one-dimensional signals. Applied to a speech signal, the EMD allows us to study its different intrinsic oscillatory modes. In addition, LPC analysis of each mode provides an estimate of the formants. The method is first applied to a sum of pure tones, where the different modes recover all of the frequencies present in the signal.
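
A compact sifting sketch under simplifying assumptions (cubic-spline envelopes, a fixed number of sifting iterations, no boundary treatment); production implementations add stopping criteria and end-effect handling. On a two-tone test signal the first IMF should capture the faster oscillation and the next one the slower.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift(x, n_iter=10):
    """One IMF: repeatedly subtract the mean of the extrema envelopes."""
    h, t = x.copy(), np.arange(len(x))
    for _ in range(n_iter):
        mx = argrelextrema(h, np.greater)[0]
        mn = argrelextrema(h, np.less)[0]
        if len(mx) < 4 or len(mn) < 4:
            break
        upper = CubicSpline(mx, h[mx])(t)
        lower = CubicSpline(mn, h[mn])(t)
        h = h - (upper + lower) / 2.0
    return h

def emd(x, n_imfs=3):
    imfs, resid = [], x.astype(float)
    for _ in range(n_imfs):
        imf = sift(resid)
        imfs.append(imf)
        resid = resid - imf
    return imfs, resid

t = np.linspace(0, 1, 2000)
x = np.sin(2 * np.pi * 40 * t) + np.sin(2 * np.pi * 5 * t)
imfs, r = emd(x, n_imfs=2)
print([int(np.abs(np.fft.rfft(m)).argmax()) for m in imfs])  # ~[40, 5]
```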

Journal Article
TL;DR: An automated system from the training stage to the recognition stage without the need of manual cropping for speech signals is developed to evaluate the performance of the automatic speech recognition (ASR) system.
Abstract: This paper investigates the use of feed-forward multi-layer perceptrons trained by back-propagation in speech recognition. The paper also proposes an automatic technique for both training and recognition. The use of neural networks for speaker-independent isolated word recognition on small vocabularies is studied, and an automated system from the training stage to the recognition stage, without the need for manual cropping of speech signals, is developed to evaluate the performance of the automatic speech recognition (ASR) system. Linear predictive coding (LPC) is applied at an early stage to represent the speech signal in frames. Features from the selected frames are used to train multi-layer perceptrons (MLP) using back-propagation. The same routine is applied to the speech signal during the recognition stage, and unknown test patterns are classified to the nearest patterns. In short, the selected frames represent the local features of the speech signal, and all of them contribute to the global similarity for the whole speech signal. The analysis, design and development of the automated system are done in MATLAB, in which a speaker-independent isolated-digit recogniser is developed.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: Experimental results show that the watermark is imperceptible and the algorithm is robust to many attacks, such as low pass filtering, resampling, MP3 compression and so on.
Abstract: A digital audio watermarking algorithm based on the discrete wavelet transform is presented. After pre-processing and spread-spectrum (SS) modulation, a visually significant binary image is embedded in the low-to-middle frequency coefficients of the audio in the wavelet domain. A watermark detection scheme using linear predictive coding is presented that does not require the original signal during watermark extraction. The BER is improved by 10%-15% in this algorithm compared with the algorithm in Wang and Chai (2003). Experimental results show that the watermark is imperceptible and that the algorithm is robust to many attacks, such as low-pass filtering, resampling, MP3 compression and so on.

Journal ArticleDOI
TL;DR: A speech enhancement algorithm which leads to significant quality and intelligibility improvements when used as a preprocessor to a low bit rate speech coder and special emphasis is placed on enhancing the performance of the preprocessor in nonstationary noise environments.
Abstract: We describe a speech enhancement algorithm which leads to significant quality and intelligibility improvements when used as a preprocessor to a low bit rate speech coder. This algorithm was developed in conjunction with the mixed excitation linear prediction (MELP) coder which, by itself, is highly susceptible to environmental noise. The paper presents novel as well as known speech and noise estimation techniques and combines them into a highly effective speech enhancement system. The algorithm is based on short-time spectral amplitude estimation, soft-decision gain modification, tracking of the a priori probability of speech absence, and minimum statistics noise power estimation. Special emphasis is placed on enhancing the performance of the preprocessor in nonstationary noise environments.

Patent
16 Feb 2004
TL;DR: In this paper, the clean speech value and the noise value are estimated from the noisy speech signal and then used to define a gain on a filter, with the numerator being guaranteed to be positive.
Abstract: A method and apparatus identify a clean speech signal from a noisy speech signal. To do this, a clean speech value and a noise value are estimated from the noisy speech signal. The clean speech value and the noise value are then used to define a gain on a filter. The noisy speech signal is applied to the filter to produce the clean speech signal. Under some embodiments, the noise value and the clean speech value are used in both the numerator and the denominator of the filter gain, with the numerator being guaranteed to be positive.
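
A minimal sketch of the gain construction described above, assuming oracle clean and noise power spectra stand in for the estimates (the patent's estimators and smoothing are out of scope here): both values appear in numerator and denominator, and the numerator is clamped so the gain stays positive.

```python
import numpy as np

def spectral_gain(clean_psd, noise_psd, floor=1e-4):
    # Numerator clamped positive, as in the claim; Wiener-like ratio.
    num = np.maximum(clean_psd, floor * noise_psd)
    return num / (clean_psd + noise_psd)

def apply_filter(noisy_frame, gain_per_bin):
    spec = np.fft.rfft(noisy_frame)
    return np.fft.irfft(gain_per_bin * spec, n=len(noisy_frame))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 0.05 * np.arange(256))
noisy = clean + 0.5 * rng.standard_normal(256)
g = spectral_gain(np.abs(np.fft.rfft(clean)) ** 2,
                  np.full(129, 64.0))  # flat PSD of the 0.5-sigma noise
# Residual error, typically well below the 0.25 noise variance.
print(np.mean((apply_filter(noisy, g) - clean) ** 2))
```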

Patent
04 Jun 2004
TL;DR: A speech correction apparatus includes a speaker for generating guidance speech, a microphone set at a hearing position, an acoustic-characteristic setting unit for separating ambient noise from the guidance speech at the hearing position, an operating unit, and a speech correcting filter for correcting the sound pressure level of the guidance speech generated by the speaker based on the average power of the guidance speech and the average power of the ambient noise.
Abstract: A speech correction apparatus includes a speaker for generating guidance speech; a microphone set at a hearing position; an acoustic-characteristic setting unit for separating ambient noise from the guidance speech at the hearing position; an operating unit; a speech correcting filter for correcting the sound pressure level of the guidance speech generated by the speaker based on the average power of the guidance speech and the average power of the ambient noise which are separated; a loudness-compensating-gain calculating unit; and a speech-head correcting unit for correcting the average power of the guidance speech corresponding to the speech head at the border between a silent state and a speech state of the guidance speech.

Proceedings Article
01 Sep 2004
TL;DR: The NR is part of the VMR-WB speech codec, recently selected as a new 3GPP2 standard for wideband speech applications in the cdma2000 3G wireless system.
Abstract: We present a new low-complexity noise reduction (NR) method based on spectral subtraction and overlap-add analysis/synthesis. A voicing-dependent cut-off frequency is introduced, dividing the speech spectrum into two parts. At the lower end, the NR gain function varies with frequency bins to minimize distortion at pitch harmonic frequencies while maximizing the suppression between them. At the higher end, the gain function is estimated per critical band, reducing energy variations. The gain function is further smoothed over time with a smoothing factor adaptive to the actual NR gain, to prevent distortion on voiced speech onsets. The NR is part of the VMR-WB speech codec, recently selected as a new 3GPP2 standard for wideband speech applications in the cdma2000 3G wireless system.

Proceedings ArticleDOI
17 May 2004
TL;DR: The paper describes the basic elements of the codec, points out envisaged applications, and gives an outline of the standardization process.
Abstract: Lossless coding is to become the latest extension of the MPEG-4 audio standard. In response to a call for proposals, many companies have submitted lossless audio codecs for evaluation. The codec of the Technical University of Berlin was chosen as reference model for MPEG-4 audio lossless coding (ALS), attaining working draft status in July 2003. The encoder is based on linear prediction, which enables high compression even with moderate complexity, while the corresponding decoder is straightforward. The paper describes the basic elements of the codec, points out envisaged applications, and gives an outline of the standardization process.

Proceedings ArticleDOI
17 May 2004
TL;DR: Experimental results show clear improvements over different VAD methods in speech/pause discrimination and speech recognition performance, and the proposed VAD reduces misclassification errors in highly noisy environments by using a noise reduction stage before the long-term spectral tracking.
Abstract: The paper focuses on an improved voice activity detection algorithm employing long-term signal processing and maximum spectral component tracking. The benefits of this approach were analyzed in a previous work (Ramirez, J. et al., Proc. EUROSPEECH 2003, p.3041-4, 2003), with clear improvements in speech/non-speech discriminability and speech recognition performance in noisy environments. Two refinements are now considered. The first, which improves the performance of the VAD in low-noise conditions, uses an adaptive-length frame window to track the long-term spectral components. The second reduces misclassification errors in highly noisy environments by adding a noise reduction stage before the long-term spectral tracking. Experimental results show clear improvements over different VAD methods in speech/pause discrimination and speech recognition performance. In particular, improvements in recognition rate were reported when the proposed VAD replaced the VADs of the ETSI advanced front-end (AFE) for distributed speech recognition (DSR).

Journal ArticleDOI
TL;DR: The subject of this work is the robust estimation of speech presence probability of every spectral component of a speech signal impinging on a linear microphone array based on the generalized likelihood ratio test applied to the multichannel framework and far-field, wideband sources.
Abstract: The subject of this work is the robust estimation of the speech presence probability of every spectral component of a speech signal impinging on a linear microphone array. The approach is based on the generalized likelihood ratio test (GLRT) applied to the multichannel framework and far-field, wideband sources. It is shown that under certain distributional assumptions the GLRT provides a framework for speech presence detection by exploiting both the spatial localization and the spectral content of the speech signal. The efficiency of the approach, and its superiority over a state-of-the-art one-channel speech presence estimation technique, is illustrated for additive white Gaussian noise in the acoustical field at low signal-to-noise ratio (SNR).

Proceedings ArticleDOI
17 May 2004
TL;DR: Two extension tools for enhancing the compression performance of prediction-based lossless audio coding are proposed: progressive-order prediction of the starting samples at random access points, where information about previous samples is not available, and interchannel joint coding.
Abstract: Two extension tools for enhancing the compression performance of prediction-based lossless audio coding are proposed. One is progressive-order prediction of the starting samples at the random access points, where the information of previous samples is not available. The first sample is coded as is, the second is predicted by first-order prediction, the third is predicted by second-order prediction, and so on. This can be efficiently carried out with PARCOR (PARtial autoCORrelation) coefficients. The second tool is interchannel joint coding. Both predictive coefficients and prediction error signals are efficiently coded by interchannel differential or three-tap adaptive prediction. These new prediction tools lead to a steady reduction in bit rate when random access is activated and the interchannel correlation is strong.
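
A hedged sketch of the progressive-order idea at a random access point: sample 0 is sent verbatim, sample 1 is predicted at order 1, sample 2 at order 2, and so on up to the full order. The per-order coefficient sets fall out of the Levinson recursion over the PARCOR coefficients, as the abstract notes; quantization and entropy coding are omitted and the PARCOR values below are illustrative.

```python
import numpy as np

def predictors_from_parcor(ks):
    """LPC coefficient arrays a_m for every order m = 1..len(ks)."""
    preds, a = [], np.array([1.0])
    for k in ks:
        a = np.concatenate([a, [0.0]])
        a[1:] += k * a[:-1][::-1]  # Levinson order-update step
        preds.append(a.copy())
    return preds

def progressive_residual(x, ks):
    preds = predictors_from_parcor(ks)
    res = [x[0]]                   # first sample coded as is
    for n in range(1, len(x)):
        m = min(n, len(ks))        # highest usable order at sample n
        a = preds[m - 1]
        pred = -np.dot(a[1:m + 1], x[n - 1::-1][:m])
        res.append(x[n] - pred)
    return np.array(res)

x = np.cumsum(np.random.default_rng(2).standard_normal(32))  # toy signal
r = progressive_residual(x, ks=[-0.9, 0.2])
print(np.var(x), np.var(r))  # residual variance should be much smaller
```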

Proceedings ArticleDOI
21 Nov 2004
TL;DR: This paper investigates the use of neural networks in recognizing 6 Malay vowels of Malay children in a speaker-independent manner using multi-layer perceptron with one hidden layer to recognize these vowels.
Abstract: Most of the speech recognitions are based on adult speech sounds. Less research is done in the recognition of children speech sounds. The speech of children is more dynamic and inconsistent if compared to adult's speech. This paper investigates the use of neural networks in recognizing 6 Malay vowels of Malay children in a speaker-independent manner. Multi-layer perceptron with one hidden layer was used to recognize these vowels. The multi-layer perceptron was trained and tested with speech samples of Malay children with their ages between seven and ten years old. A single frame of cepstral coefficients was extracted around the vowel onset point using linear predictive coding. The vowel length was examined from 5 ms to 70 ms. Experiments were conducted to determine the optimal vowel length as well as the number of cepstral coefficients.