
Showing papers on "Linear predictive coding published in 1999"


PatentDOI
TL;DR: A high-quality speech synthesizer that, in various embodiments, concatenates speech waveforms referenced by a large speech database; speech quality is further improved by speech unit selection and concatenation smoothing.
Abstract: A high quality speech synthesizer in various embodiments concatenates speech waveforms referenced by a large speech database. Speech quality is further improved by speech unit selection and concatenation smoothing.

318 citations


Journal ArticleDOI
TL;DR: This contribution presents a detailed analysis of a widely used set of parameters, the mel frequency cepstral coefficients (MFCCs), and suggests a new parameterization approach taking into account the whole energy zone in the spectrum.
Abstract: The focus of a continuous speech recognition process is to match an input signal with a set of words or sentences according to some optimality criteria. The first step of this process is parameterization, whose major task is data reduction by converting the input signal into parameters while preserving virtually all of the speech signal information dealing with the text message. This contribution presents a detailed analysis of a widely used set of parameters, the mel frequency cepstral coefficients (MFCCs), and suggests a new parameterization approach taking into account the whole energy zone in the spectrum. Results obtained with the proposed new coefficients give a confidence interval about their use in a large-vocabulary speaker-independent continuous-speech recognition system.
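As a point of reference for the parameterization under analysis, here is a minimal sketch of the conventional MFCC front end (filterbank size, coefficient count, and sampling rate are typical choices, not values from the paper; the paper's proposed coefficients modify this pipeline to cover the whole energy zone of the spectrum, which is not reproduced here):

```python
import numpy as np

def mel(f):            # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def imel(m):           # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=8000, n_filt=24, n_ceps=12):
    """MFCCs for one frame: power spectrum -> mel filterbank -> log -> DCT."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n))) ** 2
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filt, n // 2 + 1))
    for i in range(n_filt):        # triangular filters between adjacent edges
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(fbank @ spec + 1e-10)
    k = np.arange(n_filt)          # DCT-II decorrelates the log energies
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filt))
    return dct @ log_e

coeffs = mfcc(np.random.randn(256))   # 12 cepstral coefficients for one frame
```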

194 citations


Journal ArticleDOI
TL;DR: This work follows a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and finds that by transmitting the cepstral coefficients the authors can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly.
Abstract: We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web (WWW). We compare a server-only processing model where the client encodes and transmits the speech signal to the server, to a model where the recognition front end runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly. We find that the required bit rate to achieve the recognition performance of high-quality unquantized speech is just 2000 bits per second.
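The 2000 bit/s figure is easy to sanity-check: at a typical 100 frames per second it leaves 20 bits per cepstral vector. A minimal sketch of a scalar quantizer fitting that budget follows (the per-coefficient bit allocation and value range are hypothetical; the paper's actual quantizer design is not specified here):

```python
import numpy as np

FRAME_RATE = 100                       # frames/s (10 ms hop, assumed)
BITS = [4, 3, 3, 2, 2, 2, 2, 1, 1]    # hypothetical allocation: 20 bits/frame
assert sum(BITS) * FRAME_RATE == 2000  # -> 2000 bit/s

def quantize(ceps, lo=-10.0, hi=10.0):
    """Uniform scalar quantization of one cepstral vector under the bit budget."""
    out = []
    for c, b in zip(ceps, BITS):
        levels = 2 ** b
        step = (hi - lo) / levels
        out.append(int(np.clip((c - lo) // step, 0, levels - 1)))
    return out                         # indices to transmit

def dequantize(idx, lo=-10.0, hi=10.0):
    return [lo + (q + 0.5) * (hi - lo) / 2 ** b for q, b in zip(idx, BITS)]
```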

118 citations


Proceedings ArticleDOI
20 Jun 1999
TL;DR: A new technique for highband spectral envelope prediction is proposed, based upon codebook mapping with codebooks split by voicing; combined with a suitable highband excitation synthesis scheme, it produces a significant quality improvement in speech that has been coded using narrowband standards.
Abstract: Telephone speech is typically bandlimited to 4 kHz, resulting in a 'muffled' quality. Coding speech with a bandwidth greater than 4 kHz reduces this distortion, but requires a higher bit rate to avoid other types of distortion. An alternative to coding wider bandwidth speech is to exploit correlations between the 0-4 kHz and 4-8 kHz speech bands to re-synthesize wideband speech from decoded narrowband speech. This paper proposes a new technique for highband spectral envelope prediction, based upon codebook mapping with codebooks split by voicing. An objective comparison with several existing methods reveals that this new technique produces the smallest highband spectral distortion. Combined with a suitable highband excitation synthesis scheme, this envelope prediction scheme produces a significant quality improvement in speech that has been coded using narrowband standards.
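A minimal sketch of the core mapping step, assuming pre-trained paired codebooks (the toy random codebooks below stand in for codebooks trained on paired narrowband/highband spectra):

```python
import numpy as np

def predict_highband_env(nb_feat, voiced, codebooks):
    """Codebook mapping split by voicing: pick the voiced or unvoiced codebook
    pair, find the nearest narrowband codeword, and return the highband
    envelope stored at the same index."""
    nb_cb, hb_cb = codebooks['voiced' if voiced else 'unvoiced']
    d = np.sum((nb_cb - nb_feat) ** 2, axis=1)   # distance to each codeword
    return hb_cb[np.argmin(d)]

rng = np.random.default_rng(0)                    # toy stand-in codebooks
codebooks = {v: (rng.standard_normal((64, 10)), rng.standard_normal((64, 8)))
             for v in ('voiced', 'unvoiced')}
env = predict_highband_env(rng.standard_normal(10), voiced=True,
                           codebooks=codebooks)
```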

116 citations


Patent
24 Aug 1999
TL;DR: In this paper, a method of encoding an input speech signal using a multi-rate encoder having a plurality of encoding rates is disclosed, where a high-pass filter and then a perceptual weighting filter are applied to the signal to generate a first target signal.
Abstract: A method of encoding an input speech signal using a multi-rate encoder having a plurality of encoding rates is disclosed. A high-pass filter and then a perceptual weighting filter are applied to such signal to generate a first target signal. An adaptive codebook vector is identified from an adaptive codebook using the first target signal by filtering the vector to generate a filtered adaptive codebook vector. An adaptive codebook gain for the adaptive codebook vector is calculated and an error signal minimized. The adaptive codebook gain is adaptively reduced based on one encoding rate from the plurality of encoding rates to generate a reduced adaptive codebook gain. A second target signal based at least on the first target signal and the reduced adaptive codebook gain is generated. The input speech signal is converted into an encoded speech based on the second target signal.
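A sketch of the generic adaptive-codebook search underlying the first steps above (the lag range, subframe length, and repetition rule for short lags are conventional choices; the patent's rate-dependent gain reduction would then scale the returned gain):

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_cb_search(target, past_exc, h, lags=range(20, 144)):
    """For each candidate pitch lag, filter the past-excitation segment through
    the weighted synthesis impulse response h and keep the lag maximizing the
    normalized correlation with the target. Requires len(past_exc) >= max lag.
    Returns (lag, gain)."""
    n, L = len(target), len(past_exc)
    best_lag, best_gain, best_score = None, 0.0, -np.inf
    for lag in lags:
        if lag >= n:
            v = past_exc[L - lag:L - lag + n]
        else:   # short lags: repeat the last `lag` samples to fill the subframe
            v = np.tile(past_exc[-lag:], n // lag + 1)[:n]
        y = lfilter(h, [1.0], v)          # filtered adaptive codebook vector
        e = float(y @ y)
        if e <= 0.0:
            continue
        c = float(target @ y)
        if c * c / e > best_score:        # minimizes the error signal energy
            best_lag, best_gain, best_score = lag, c / e, c * c / e
    return best_lag, best_gain
```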

111 citations


Patent
10 Aug 1999
TL;DR: In this article, a speech or voice activity detector (VAD) is provided for detecting whether speech signals are present in individual time frames of an input signal, and a state machine is coupled to the VAD and having a plurality of states.
Abstract: A system and method for removing noise from a signal containing speech (or a related, information carrying signal) and noise. A speech or voice activity detector (VAD) is provided for detecting whether speech signals are present in individual time frames of an input signal. The VAD comprises a speech detector that receives as input the input signal and examines the input signal in order to generate a plurality of statistics that represent characteristics indicative of the presence or absence of speech in a time frame of the input signal, and generates an output based on the plurality of statistics representing a likelihood of speech presence in a current time frame; and a state machine coupled to the speech detector and having a plurality of states. The state machine receives as input the output of the speech detector and transitions between the plurality of states based on a state at a previous time frame and the output of the speech detector for the current time frame. The state machine generates as output a speech activity status signal based on the state of the state machine, which provides a measure of the likelihood of speech being present during the current time frame. The VAD may be used in a noise reduction system.
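The patent leaves the states and transition rules unspecified; the sketch below shows one plausible four-state machine driven by a per-frame speech likelihood (the states, thresholds, and hangover length are assumptions):

```python
from enum import Enum, auto

class S(Enum):
    SILENCE = auto()
    ONSET = auto()
    SPEECH = auto()
    HANGOVER = auto()

class VadStateMachine:
    def __init__(self, on=0.7, off=0.3, hang=8):
        self.state, self.on, self.off = S.SILENCE, on, off
        self.hang, self.count = hang, 0

    def step(self, likelihood):
        """Transition on the previous state plus the detector output, then
        report the speech activity status the abstract describes."""
        if self.state == S.SILENCE:
            if likelihood > self.on:
                self.state = S.ONSET
        elif self.state == S.ONSET:
            self.state = S.SPEECH if likelihood > self.off else S.SILENCE
        elif self.state == S.SPEECH:
            if likelihood < self.off:
                self.state, self.count = S.HANGOVER, self.hang
        else:                              # HANGOVER: bridge short pauses
            if likelihood > self.on:
                self.state = S.SPEECH
            else:
                self.count -= 1
                if self.count == 0:
                    self.state = S.SILENCE
        return self.state is not S.SILENCE  # speech activity status signal
```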

104 citations


Proceedings ArticleDOI
20 Jun 1999
TL;DR: The adaptive multi-rate (AMR) speech coder currently under standardization for GSM systems as part of the AMR speech service is described; it provides seamless switching on 20 ms frame boundaries, and its quality on GSM channels is significantly higher than for existing services.
Abstract: In this paper, we describe the adaptive multi-rate (AMR) speech coder currently under standardization for GSM systems as part of the AMR speech service. The coder is a multi-rate ACELP coder with 8 modes operating at bit-rates from 12.2 kbit/s down to 4.75 kbit/s. The coder modes are integrated in a common structure where the bit-rate scalability is realized mainly by altering the quantization schemes for the different parameters. The coder provides seamless switching on 20 ms frame boundaries. The quality when used on GSM channels is significantly higher than for existing services.

85 citations


Patent
Allen Gersho1, Vladimir Cuperman1, Ajit V. Rao1, Tung-Chiang Yang1, Sassan Ahmadi1, Fenghua Liu1 
23 Dec 1999
TL;DR: In this paper, the authors proposed a method for speech coding wherein the speech signal is represented by an excitation signal applied to a synthesis filter, and the speech is partitioned into frames and subframes.
Abstract: A speech coder (12) and a method for speech coding wherein the speech signal is represented by an excitation signal applied to a synthesis filter. The speech is partitioned into frames and subframes. A classifier (22) identifies which of several categories the speech frame belongs to, and a different coding method is applied to represent the excitation for each category. For some categories, one or more windows are identified for the frame where all or most of the excitation signal samples are assigned by a coding scheme. Performance is enhanced by coding the important segments of the excitation more accurately. The window locations are determined from a linear prediction residual by identifying peaks of the smoothed residual energy contour. The method adjusts the frame and subframe boundaries so that each window is located entirely within a modified subframe or frame. This eliminates the artificial restriction incurred when coding a frame or subframe in isolation, without regard for the local behavior of the speech signal across frame or subframe boundaries.
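A sketch of the window-location step as described (inverse filter, smooth the squared residual, pick energy peaks); the smoothing length, peak criterion, and window size are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def locate_windows(speech, lpc, win=32):
    """Window locations from the smoothed LP residual energy contour.
    `lpc` holds the predictor coefficients a_k of A(z) = 1 - sum_k a_k z^-k."""
    residual = lfilter(np.concatenate(([1.0], -np.asarray(lpc))), [1.0], speech)
    energy = np.convolve(residual ** 2, np.hamming(31), mode='same')
    peaks = [i for i in range(1, len(energy) - 1)
             if energy[i - 1] < energy[i] >= energy[i + 1]
             and energy[i] > 2.0 * energy.mean()]       # illustrative threshold
    return [(max(0, p - win // 2), min(len(speech), p + win // 2))
            for p in peaks]
```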

83 citations


Proceedings ArticleDOI
20 Jun 1999
TL;DR: A hybrid ACELP/TCX algorithm for coding speech and music signals at 16, 24, and 32 kbit/s is presented, which switches between algebraic code excited linear prediction (ACELP) and transform coded excitation (TCX) modes on a 20-ms frame basis.
Abstract: A hybrid ACELP/TCX algorithm for coding speech and music signals at 16, 24, and 32 kbit/s is presented. The algorithm switches between algebraic code excited linear prediction (ACELP) and transform coded excitation (TCX) modes on a 20-ms frame basis. Applying TCX on 20 ms frames improved the quality for music signals. Special care was taken to alleviate the switching artifacts between the two modes resulting in a transparent switching process. Subjective test results showed that for speech signals, the performance at 16, 24, and 32 kbit/s, is equivalent to G.722 at 48, 56, and 64 kbit/s, respectively. For music signals, the quality at 24 kbit/s was found equivalent to G.722 at 56 kbit/s. However, at 16 kbit/s, the quality for music was slightly lower than G.722 at 48 kbit/s.

76 citations


Journal ArticleDOI
TL;DR: The STI can be computed using speech probe waveforms and the values of the resulting indices are as good predictors of intelligibility scores as those derived from MTFs by theoretical methods.
Abstract: A method for computing the speech transmission index (STI) using real speech stimuli is presented and evaluated. The method reduces the effects of some of the artifacts that can be encountered when speech waveforms are used as probe stimuli. Speech-based STIs are computed for conversational and clearly articulated speech in several noisy, reverberant, and noisy-reverberant environments and compared with speech intelligibility scores. The results indicate that, for each speaking style, the speech-based STI values are monotonically related to intelligibility scores for the degraded speech conditions tested. Therefore, the STI can be computed using speech probe waveforms and the values of the resulting indices are as good predictors of intelligibility scores as those derived from MTFs by theoretical methods.
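For context, the final STI computation from a matrix of modulation transfer values m is standard; what the paper changes is how m is estimated (from intensity envelopes of speech probes rather than from theoretical MTFs). A sketch of the standard back end, using the classical seven-octave-band weights from the STI literature:

```python
import numpy as np

# Octave-band weights (125 Hz ... 8 kHz) from the classical STI formulation
WEIGHTS = np.array([0.13, 0.14, 0.11, 0.12, 0.19, 0.17, 0.14])

def sti_from_mtf(m):
    """STI from a (7 bands x 14 modulation frequencies) matrix of modulation
    transfer values: apparent SNR per cell, clipped to +/-15 dB, averaged,
    and mapped to [0, 1]. The speech-based estimation of m is omitted here."""
    m = np.clip(m, 1e-6, 1 - 1e-6)
    snr = np.clip(10.0 * np.log10(m / (1.0 - m)), -15.0, 15.0)  # apparent SNR
    ti = (snr + 15.0) / 30.0                                    # transmission index
    return float(WEIGHTS @ ti.mean(axis=1))                     # band-weighted STI
```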

74 citations


Patent
21 Dec 1999
TL;DR: In this paper, variable rate coding of a speech signal is proposed to achieve low average bit rates by only employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) during portions of the speech where this fidelity is required for acceptable output.
Abstract: A method and apparatus for the variable rate coding of a speech signal. An input speech signal is classified and an appropriate coding mode is selected based on this classification. For each classification, the coding mode that achieves the lowest bit rate with an acceptable quality of speech reproduction is selected. Low average bit rates are achieved by only employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) during portions of the speech where this fidelity is required for acceptable output. Lower bit rate modes are used during portions of speech where these modes produce acceptable output. The input speech signal is classified into active and inactive regions. Active regions are further classified into voiced, unvoiced, and transient regions. Various coding modes are applied to active speech, depending upon the required level of fidelity. Coding modes may be utilized according to the strengths and weaknesses of each particular mode. The apparatus dynamically switches between these modes as the properties of the speech signal vary with time. Where appropriate, regions of speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate. This coding is used in a dynamic fashion whenever unvoiced speech or background noise is detected.
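A compact sketch of the classification-to-mode logic described above (the feature names, thresholds, and mode labels are placeholders; the patent selects, per class, the lowest-rate mode that still yields acceptable quality):

```python
def select_mode(energy, zero_cross_rate, pitch_gain, prev_voiced):
    """Map a frame classification to a coding mode, lowest acceptable rate first."""
    if energy < 1e-4:                  # inactive: model as pseudo-random noise
        return 'inactive_low_rate'
    voiced = pitch_gain > 0.6 and zero_cross_rate < 0.25
    if voiced != prev_voiced:          # voicing change: transient region
        return 'transient_full_rate'   # high-fidelity mode only where required
    return 'voiced_mid_rate' if voiced else 'unvoiced_low_rate'
```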

Proceedings ArticleDOI
A.J. Accardi1, R.V. Cox
15 Mar 1999
TL;DR: In this paper, a modified version of Ephraim and Van Trees's (see IEEE Trans. Speech and Audio Proc., vol.3, p.251-66, 1995) spectral domain constrained signal subspace estimator is used in this manner, obtaining a system with greater flexibility and similar performance.
Abstract: Ephraim and Malah's (1984, 1985) MMSE-LSA speech enhancement algorithm, while robust and effective, is difficult to tune and adjust for the tradeoff between noise reduction and distortion. We suggest a means of generalizing this design, which allows for other estimators besides the MMSE-LSA to be used within the same supporting framework. When a modified version of Ephraim and Van Trees's (see IEEE Trans. Speech and Audio Proc., vol.3, p.251-66, 1995) spectral domain constrained signal subspace estimator is used in this manner, we obtain a system with greater flexibility and similar performance. We also explore the possibility of using different speech enhancement techniques as pre-processors for different parameter extraction modules of the IS-641 speech coder (a 7.4 kbit/s ACELP codec). We show that such a strategy can increase the quality of the coded speech and lead to a system that is more robust to differing noise types.

Journal ArticleDOI
TL;DR: The overall results suggest that speech intelligibility is not severely impaired as long as the filtered spectral components have a rate of change between 1 and 16 Hz.
Abstract: The intelligibility of syllables whose cepstral trajectories were temporally filtered was measured. The speech signals were transformed to their LPC cepstral coefficients, and these coefficients were passed through different filters. These filtered trajectories were recombined with the residuals and the speech signal reconstructed. The intelligibility of the reconstructed speech segments was then measured in two perceptual experiments for Japanese syllables. The effect of various low-pass, high-pass, and bandpass filtering is reported, and the results summarized using a theoretical approach based on the independence of the contributions in different modulation bands. The overall results suggest that speech intelligibility is not severely impaired as long as the filtered spectral components have a rate of change between 1 and 16 Hz.
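The core manipulation is easy to state in code: treat each cepstral coefficient as a time series sampled at the frame rate and filter it in the modulation domain. A sketch (the filter order and zero-phase choice are mine; reconstruction through the LPC residual is omitted):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def filter_trajectories(cepstra, frame_rate=100.0, band=(1.0, 16.0)):
    """Band-pass filter each LPC-cepstrum trajectory in the modulation domain.
    `cepstra` is (n_frames, n_coeffs) sampled at `frame_rate` frames/s; the
    1-16 Hz band is the region the experiments found essential."""
    nyq = frame_rate / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype='band')
    return filtfilt(b, a, cepstra, axis=0)   # zero-phase filtering along time
```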

Patent
Amitava Das1
26 Feb 1999
TL;DR: In this paper, a closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder includes a high-rate, time-domain coding mode, a low-rate, frequency-domain coding mode, and a closed-loop mode-selection mechanism for selecting a coding mode for the coder based upon the speech content of frames input to the coder.
Abstract: A closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder includes a high-rate, time-domain coding mode, a low-rate, frequency-domain coding mode, and a closed-loop mode-selection mechanism for selecting a coding mode for the coder based upon the speech content of frames input to the coder. Transition speech (i.e., from unvoiced speech to voiced speech, or vice versa) frames are encoded with the high-rate, time-domain coding mode, which may be a CELP coding mode. Voiced speech frames are encoded with the low-rate, frequency-domain coding mode, which may be a harmonic coding mode. Phase parameters are not encoded by the frequency-domain coding mode, and are instead modeled in accordance with, e.g., a quadratic phase model. For each speech frame encoded with the frequency-domain coding mode, the initial phase value is taken to be the initial phase value of the immediately preceding speech frame encoded with the frequency-domain coding mode. If the immediately preceding speech frame was encoded with the time-domain coding mode, the initial phase value of the current speech frame is computed from the decoded speech frame information of the immediately preceding, time-domain-encoded speech frame. Each speech frame encoded with the frequency-domain coding mode may be compared with the corresponding input speech frame to obtain a performance measure. If the performance measure falls below a predefined threshold value, the input speech frame is encoded with the time-domain coding mode.
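The quadratic phase model mentioned above has a standard closed form: if a harmonic's frequency moves linearly from w0 to w1 over the frame, the phase track is its integral. A sketch:

```python
import numpy as np

def quadratic_phase(phi0, w0, w1, T, n):
    """Quadratic phase model used in harmonic coding: with the frequency
    interpolated linearly from w0 to w1 (rad/sample) over a frame of T
    samples, phi(t) = phi0 + w0*t + (w1 - w0)*t^2/(2*T). Per the abstract,
    phi0 is carried over from the previous frequency-domain frame, or
    recomputed from the decoded samples of a time-domain-coded frame."""
    t = np.arange(n)
    return phi0 + w0 * t + (w1 - w0) * t ** 2 / (2.0 * T)
```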

Journal ArticleDOI
TL;DR: By combining an interframe quantizer and a memoryless "safety-net" quantizer, it is demonstrated that the advantages of both quantization strategies can be utilized, and the performance for both noiseless and noisy channels improves.
Abstract: In linear predictive speech coding algorithms, transmission of linear predictive coding (LPC) parameters-often transformed to the line spectrum frequencies (LSF) representation-consumes a large part of the total bit rate of the coder. Typically, the LSF parameters are highly correlated from one frame to the next, and a considerable reduction in bit rate can be achieved by exploiting this interframe correlation. However, interframe coding leads to error propagation if the channel is noisy, which possibly cancels the achievable gain. In this paper, several algorithms for exploiting interframe correlation of LSF parameters are compared. Especially, performance for transmission over noisy channels is examined, and methods to improve noisy channel performance are proposed. By combining an interframe quantizer and a memoryless "safety-net" quantizer, we demonstrate that the advantages of both quantization strategies can be utilized, and the performance for both noiseless and noisy channels improves. The results indicate that the best interframe method performs as well as a memoryless quantizing scheme, with 4 bits less per frame. Subjective listening tests have been employed that verify the results from the objective measurements.
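A sketch of the combined scheme: per frame, both the interframe (predictive) quantizer and the memoryless safety-net quantizer are tried, and one bit signals the winner. The first-order prediction, the prediction factor, and the plain nearest-neighbor VQ are stand-ins for the trained quantizers in the paper:

```python
import numpy as np

def encode_lsf(lsf, prev_dec, mean_lsf, pred_cb, safety_cb, rho=0.65):
    """Try predictive and safety-net quantization; keep the lower-distortion
    one. Returns ((mode, index), decoded_lsf); the mode costs one bit."""
    def vq(x, cb):
        i = int(np.argmin(np.sum((cb - x) ** 2, axis=1)))
        return i, cb[i]

    pred = mean_lsf + rho * (prev_dec - mean_lsf)   # interframe prediction
    i_p, e_hat = vq(lsf - pred, pred_cb)            # quantize the residual
    i_s, s_hat = vq(lsf, safety_cb)                 # memoryless quantization
    cand_p = pred + e_hat
    if np.sum((lsf - cand_p) ** 2) <= np.sum((lsf - s_hat) ** 2):
        return ('pred', i_p), cand_p
    return ('safe', i_s), s_hat                     # breaks error propagation
```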

Proceedings ArticleDOI
20 Jun 1999
TL;DR: Novel solutions for pre-processing noisy speech prior to low bit rate speech coding using a new adaptive limiting algorithm for the a priori signal-to-noise ratio (SNR) estimate and a novel overlap/add scheme are presented.
Abstract: In this paper we present novel solutions for pre-processing noisy speech prior to low bit rate speech coding. We strive especially to improve the estimation of spectral parameters and to reduce the additional algorithmic delay caused by the enhancement pre-processor. While the former is achieved using a new adaptive limiting algorithm for the a priori signal-to-noise ratio (SNR) estimate, the latter makes use of a novel overlap/add scheme. Our enhancement techniques were evaluated in conjunction with the 2400 bps mixed excitation linear prediction (MELP) coder by means of formal and informal listening tests.
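A sketch of a decision-directed a priori SNR estimate with a lower limit, plus one plausible adaptation rule for that limit (the constants and the adaptation rule are assumptions; the paper's algorithm is not reproduced here):

```python
import numpy as np

def a_priori_snr(noisy_psd, noise_psd, prev_clean_psd, snr_floor, alpha=0.98):
    """Decision-directed a priori SNR per frequency bin, limited from below."""
    gamma = noisy_psd / np.maximum(noise_psd, 1e-12)        # a posteriori SNR
    xi = (alpha * prev_clean_psd / np.maximum(noise_psd, 1e-12)
          + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
    return np.maximum(xi, snr_floor)

def update_floor(snr_floor, long_term_snr_db, lo=0.12, hi=0.25, rate=0.9):
    """One plausible adaptation: a higher floor (gentler suppression) in
    low-SNR conditions to limit musical noise; constants are assumptions."""
    target = hi if long_term_snr_db < 5.0 else lo
    return rate * snr_floor + (1.0 - rate) * target
```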


Patent
TL;DR: Audio and video speech recognition are combined to improve the robustness of speech recognition systems in noisy environments: a video signal and an associated audio signal are processed, the most likely viseme associated with the two signals is determined, and thereafter the most likely phoneme is determined.
Abstract: The combination of audio and video speech recognition in a manner to improve the robustness of speech recognition systems in noisy environments. Contemplated are methods and apparatus in which a video signal associated with a video source and an audio signal associated with the video signal are processed, the most likely viseme associated with the audio signal and video signal is determined and, thereafter, the most likely phoneme associated with the audio signal and video signal is determined.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The results indicate that the new paradigm in general and the auditory model in particular form a promising basis for the coding of both speech and audio at low bit rates.
Abstract: For speech coders which fall within the class of waveform coders, the reconstructed signal approaches the original with increasing bit rate. In such coders, the distortion criterion generally operates on the speech signal or a signal obtained by adaptive linear filtering of the speech signal. To satisfy computational and delay constraints, the distortion criterion must be reduced to a very simple approximation of the auditory system. This drawback of conventional approaches motivates a new speech coding paradigm in which the coding is performed in a domain where the single-letter squared-error criterion forms an accurate representation of perception. The new paradigm requires a model of the auditory periphery which is accurate, can be inverted with relatively low computational effort, and which represents the signal with relatively few parameters. We develop such a model of the auditory periphery and discuss its suitability for speech coding. The results indicate that the new paradigm in general and our auditory model in particular form a promising basis for the coding of both speech and audio at low bit rates.

Proceedings Article
01 Jan 1999
TL;DR: A formalism for data imputation based on the probability distributions of individual Hidden Markov model states is presented; a potential advantage of the approach is that it can be followed by conventional techniques such as cepstral features or artificial neural networks for speech recognition.
Abstract: Within the context of continuous-density HMM speech recognition in noise, we report on imputation of missing time-frequency regions using emission state probability distributions. Spectral subtraction and local signal-to-noise estimation based criteria are used to separate the present from the missing components. We consider two approaches to the problem of classification with missing data: marginalization and data imputation. A formalism for data imputation based on the probability distributions of individual Hidden Markov model states is presented. We report on recognition experiments comparing state based data imputation to marginalization in the context of connected digit recognition of speech mixed with factory noise at various global signal-to-noise ratios, and wideband restoration of speech. Potential advantages of the approach are that it can be followed by conventional techniques like cepstral features or artificial neural networks for speech recognition.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The perceptual phase capacity in low pitched speech is found to be much higher than it is for high pitched speech, which is consistent with the well known fact that speech coding schemes which preserve the phase accurately work better for male voices, while coders which put more weight on the amplitude spectrum of the speech signal result in better quality for female speech.
Abstract: In this paper we define perceptual phase capacity as the size of a codebook of phase spectra necessary to represent all possible phase spectra in a perceptually accurate manner. We determine the perceptual phase capacity for voiced speech. To this purpose, we use an auditory model which indicates if phase spectrum changes are audible or not. The correct performance of the model was adjusted and verified by listening tests. The perceptual phase capacity in low pitched speech is found to be much higher than it is for high pitched speech. Our results are consistent with the well known fact that speech coding schemes which preserve the phase accurately work better for male voices, while coders which put more weight on the amplitude spectrum of the speech signal result in better quality for female speech.

Patent
13 Jan 1999
TL;DR: In this article, an adaptive endpointer system and method are used in speech recognition applications, such as telephone-based Internet browsers, to determine barge-in events during the processing of speech.
Abstract: An adaptive endpointer system and method are used in speech recognition applications, such as telephone-based Internet browsers, to determine barge-in events during the processing of speech. The endpointer system includes a signal energy level estimator for estimating signal levels in speech data; a noise energy level estimator for estimating noise levels in the speech data; and a barge-in detector for increasing a threshold used in comparing the signal levels and the noise levels to detect the barge-in event in the speech data corresponding to a speech prompt during speech recognition.
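One plausible reading of the threshold-raising logic, sketched with running signal and noise level trackers (the time constants, margin, and prompt-dependent raise are illustrative values, not the patent's):

```python
class BargeInDetector:
    def __init__(self, margin_db=9.0, prompt_extra_db=6.0):
        self.sig, self.noise = -60.0, -60.0
        self.margin, self.extra = margin_db, prompt_extra_db

    def step(self, frame_db, prompt_active):
        """Compare a fast signal tracker against a slow noise-floor tracker;
        raise the threshold while a prompt plays so prompt echo does not
        trigger a false barge-in. Returns True on detected barge-in."""
        self.sig = max(frame_db, 0.9 * self.sig + 0.1 * frame_db)
        self.noise = min(frame_db, 0.999 * self.noise + 0.001 * frame_db)
        thresh = self.noise + self.margin + (self.extra if prompt_active else 0.0)
        return self.sig > thresh
```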

Patent
Adil Benyassine1, Eyal Shlomot1
TL;DR: In this article, a method and apparatus for generating frame voicing decisions for an incoming speech signal having periods of active voice and non-active voice for a speech encoder in a speech communications system is presented.
Abstract: A method and apparatus for generating frame voicing decisions for an incoming speech signal having periods of active voice and non-active voice for a speech encoder in a speech communications system. A predetermined set of parameters is extracted from the incoming speech signal, including a pitch gain and a pitch lag. A frame voicing decision is made for each frame of the incoming speech signal according to values calculated from the extracted parameters. The predetermined set of parameters further includes a frame full band energy, and a set of spectral parameters called Line Spectral Frequencies (LSF).

Patent
14 Oct 1999
TL;DR: In this article, a bandwidth expansion method is presented in which the frequency characteristics of high-frequency components of broad-band signals can be adjusted to the liking of the user, overflow due to addition is prevented without power variations being perceived by the user, the number of broad-band formants is reduced, and emphasis is placed on the rough structure of the spectrum.
Abstract: A bandwidth expanding method and apparatus in which frequency characteristics of high-frequency components of broad band signals can be adjusted to the liking of the user, overflow due to addition is prevented from occurring without power variations being perceived by a user, the number of broad band formants is reduced, and emphasis is attached to the rough structure of the spectrum, so that the produced broad band speech signals can be improved in quality. To this end, in a speech bandwidth expansion device, frequency characteristics of the frequency components not less than 3400 Hz are adjusted by preset alterable parameter values and summed to the original narrow band speech components. If overflow has occurred in a sample, the high-range gain of the sample is lowered to a level below the overflow level before proceeding to addition. Also, broad band autocorrelation γW is generated and inverse-transformed in an inverse parameter conversion unit to produce broad band linear prediction coefficient αW to synthesize the broad-band speech in a linear predictive coding synthesis unit.
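A sketch of the overflow handling described above, assuming 16-bit samples: wherever the narrowband signal plus the gain-adjusted high band would clip, the high-band contribution on that sample is backed off just enough to fit (one plausible reading of the patent's rule):

```python
import numpy as np

def add_highband(narrow, high, gain=1.0, limit=32767):
    """Add gain-adjusted high-band components to the narrowband signal,
    lowering the high-band contribution on any sample whose sum would
    overflow the 16-bit range."""
    out = narrow.astype(np.int64) + np.round(gain * high).astype(np.int64)
    over = np.abs(out) > limit
    # Back off the high-band contribution just enough to stay at the rail
    # (equivalent to lowering that sample's high-range gain before addition)
    out[over] = np.sign(out[over]) * limit
    return out.astype(np.int16)
```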

Proceedings ArticleDOI
15 Mar 1999
TL;DR: A combined adaptive transform codec (ATC) and code-excited linear prediction (CELP) algorithm for the compression of wideband (7 kHz) signals is described, along with a switching scheme between the CELP and ATC modes and a frame erasure concealment technique.
Abstract: This paper describes a combined adaptive transform codec (ATC) and code-excited linear prediction (CELP) algorithm, called ATCELP, for the compression of wideband (7 kHz) signals. The CELP algorithm applies mainly to speech, whereas the ATC mode is selected for music and noise signals. We propose a switching scheme between CELP and ATC mode and describe a frame erasure concealment technique. Subjective listening tests have shown that the ATCELP codec at bit rates of 16, 24 and 32 kbit/s achieved performances close to those of the CCITT G.722 at 48, 56 and 64 kbit/s, respectively, under most operating conditions.

Journal ArticleDOI
TL;DR: A novel method is proposed to continuously estimate the SNR across the frequency bands without the need for a speech detector, based on a sinusoidal model for speech and a Gaussian assumption about the noise.
Abstract: This article addresses the problem of instantaneous signal-to-noise ratio (SNR) estimation during speech activity for the purpose of improving the performance of speech enhancement algorithms. It is shown that the kurtosis of noisy speech may be used to individually estimate speech and noise energies when speech is divided into narrow bands. Based on this concept, a novel method is proposed to continuously estimate the SNR across the frequency bands without the need for a speech detector. The derivations are based on a sinusoidal model for speech and a Gaussian assumption about the noise. Experimental results using recorded speech and noise show that the model and the derivations are valid, though not entirely accurate across the whole spectrum; it is also found that many noise types encountered in mobile telephony are not far from Gaussianity as far as higher-order statistics are concerned, making this scheme quite effective.
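The sinusoid-plus-Gaussian model gives a closed-form link between kurtosis and in-band SNR, which the sketch below inverts (a simplified, frame-based illustration of the idea; the paper's estimator tracks these statistics recursively across narrow bands):

```python
import numpy as np

def band_snr_from_kurtosis(x):
    """Estimate in-band SNR from the kurtosis of a noisy narrowband signal,
    assuming the speech component is a sinusoid (kurtosis 1.5) and the noise
    is Gaussian (kurtosis 3). Under that model,
        kurt = (1.5*r**2 + 6*r + 3) / (r + 1)**2,  with r = SNR,
    which inverts to the closed form below."""
    x = x - np.mean(x)
    m2, m4 = np.mean(x ** 2), np.mean(x ** 4)
    kurt = np.clip(m4 / (m2 ** 2 + 1e-20), 1.5 + 1e-6, 3.0)
    r = ((6.0 - 2.0 * kurt) + np.sqrt(6.0 * (3.0 - kurt))) / (2.0 * (kurt - 1.5))
    return r                                    # linear SNR in this band

# Example: sinusoid at 0 dB SNR in Gaussian noise
t = np.arange(8000) / 8000.0
x = np.sqrt(2.0) * np.sin(2 * np.pi * 440 * t) + np.random.randn(8000)
print(10 * np.log10(band_snr_from_kurtosis(x)))  # roughly 0 dB
```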

Proceedings ArticleDOI
S.A. Ramprashad1
20 Jun 1999
TL;DR: This multimode transform predictive coder (MTPC) shows improved performance on both speech and audio inputs when compared to a single-mode transform predictive coder (TPC).
Abstract: Speech and audio coding are often considered to be two separate technologies, each almost independently developing different techniques for signal compression. At low bit rates the gap in performance between the two technologies begins to be noticeable; speech coders work better on speech and audio coders perform better on music. The challenge is to merge the two technologies into a single coding paradigm which will work as well as either of the two regardless of the input signal. Presented is a multimode speech and audio coder which can adapt almost continuously between a speech and audio coding mode. This multimode transform predictive coder (MTPC) shows improved performance on both speech and audio inputs when compared to a single-mode transform predictive coder (TPC).

Patent
22 Jan 1999
TL;DR: A communication device capable of screening speech recognizer input includes a microprocessor (110) connected to communication interface circuitry (115), memory (120), audio circuitry (130), an optional keypad (140), a display (150), and a vibrator/buzzer (160).
Abstract: A communication device capable of screening speech recognizer input includes a microprocessor (110) connected to communication interface circuitry (115), memory (120), audio circuitry (130), an optional keypad (140), a display (150), and a vibrator/buzzer (160). Audio circuitry (130) is connected to microphone (133) and speaker (135). Microprocessor (110) includes a speech/noise classifier and speech recognition technology. Microprocessor (110) analyzes a speech signal to determine speech waveform parameters within a speech acquisition window. Microprocessor (110) compares the speech waveform parameters to determine whether an error exists in the signal format of the speech signal. Microprocessor (110) informs the user when an error exists in the signal format and instructs the user how to correct the signal format to eliminate the error.

Patent
Shihua Wang1
19 Oct 1999
TL;DR: In this paper, a speech encoding method using analysis-by-synthesis includes sampling an input speech and dividing the resulting speech samples into frames and subframes, the frames are analyzed to determine coefficients for the synthesis filter.
Abstract: A speech encoding method using analysis-by-synthesis includes sampling an input speech and dividing the resulting speech samples into frames and subframes. The frames are analyzed to determine coefficients for the synthesis filter. The subframes are categorized into unvoiced, voiced and onset categories. Based on the category, a different coding scheme is used. The coded speech is fed into the synthesis filter, the output of which is compared to the input speech samples to produce an error signal. The coding is then adjusted per the error signal.

Patent
20 May 1999
TL;DR: In this paper, a speech transmission system with background noise dependent processing elements in the speech encoder (12, 36) and/or the speech decoder (30, 48) is proposed.
Abstract: In a speech transmission system, an input speech signal is applied to a speech encoder (12, 36) for encoding the input speech signal. The encoded speech signal is transmitted via a communication channel (10) to a speech decoder (30, 48). In order to improve the performance of the transmission system in the presence of background noise, it is proposed to introduce background noise dependent processing elements in the speech encoder (12, 36) and/or in the speech decoder (30, 48). In a first embodiment of the invention, the parameters of the perceptual weighting filter (124) in the speech encoder (12, 36) are derived by calculating linear prediction coefficients (â) from a speech signal which is processed by means of a high-pass filter (82). In a second embodiment of the invention, an adaptive post filter (150) in a speech decoder (30, 48) is bypassed when the noise level exceeds a threshold value.