
Showing papers on "Linear predictive coding published in 1994"


Journal ArticleDOI
01 Oct 1994
TL;DR: The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications.
Abstract: The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to: represent the spectral properties of speech, provide for speech waveform matching, and "optimize" the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards. The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications. Although the emphasis is on the new low-rate coders, we attempt to provide a comprehensive survey by covering some of the traditional methodologies as well. We feel that this approach will not only point out key references but will also provide valuable background to the beginner. The paper starts with a historical perspective and continues with a brief discussion on the speech properties and performance measures. We then proceed with descriptions of waveform coders, sinusoidal transform coders, linear predictive vocoders, and analysis-by-synthesis linear predictive coders. Finally, we present concluding remarks followed by a discussion of opportunities for future research.

461 citations
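The linear predictive vocoders surveyed above all rest on the same short-term analysis step: fitting an all-pole model to a speech frame. As a rough, self-contained illustration (not any one coder from the survey), the sketch below estimates LPC coefficients with the classic autocorrelation method and the Levinson-Durbin recursion; the function name `lpc` and its interface are invented for this example.

```python
import numpy as np

def lpc(x, order):
    """Estimate linear-prediction coefficients with the autocorrelation
    method (Levinson-Durbin recursion). Returns a = [1, a1, ..., ap]
    such that the predictor is x[t] ~= -sum(a[k] * x[t-k], k=1..p)."""
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]   # sum_j a[j] * r[i-j]
        k = -acc / err                       # reflection coefficient
        a_new = a.copy()
        a_new[1:i] += k * a[1:i][::-1]
        a_new[i] = k
        a = a_new
        err *= 1.0 - k * k                   # residual-energy update
    return a, err
```

Applied to a frame generated by a known second-order all-pole process, the recursion recovers the model coefficients to within estimation noise.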


Journal ArticleDOI
TL;DR: This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification, and it is shown that performance differences between the basic features are small, and the major gains are due to the channel compensation techniques.
Abstract: This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification. The goal is to keep all processing and classification steps constant and to vary only the features and compensations used to allow a controlled comparison. A general, maximum-likelihood classifier based on Gaussian mixture densities is used as the classifier, and experiments are conducted on the King speech database, a conversational, telephone-speech database. The features examined are mel-frequency and linear-frequency filterbank cepstral coefficients, linear prediction cepstral coefficients, and perceptual linear prediction (PLP) cepstral coefficients. The channel compensation techniques examined are cepstral mean removal, RASTA processing, and a quadratic trend removal technique. It is shown for this database that performance differences between the basic features are small, and the major gains are due to the channel compensation techniques. The best "across-the-divide" recognition accuracy of 92% is obtained for both high-order LPC features and band-limited filterbank features.

336 citations
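Of the channel compensation techniques this study credits with the major gains, cepstral mean removal is the simplest: a stationary convolutional channel adds a constant offset in the cepstral domain, so subtracting the per-utterance mean cancels it. A minimal sketch (the function name is invented for this example):

```python
import numpy as np

def cepstral_mean_removal(cepstra):
    """Subtract the per-utterance mean from each cepstral dimension.
    cepstra: (frames, coefficients) array. A fixed channel appears as a
    constant additive offset per coefficient, which the mean removes."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Adding a fixed "channel" vector to every frame leaves the compensated features unchanged, which is exactly the invariance the technique buys.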


Journal ArticleDOI
01 Jun 1994
TL;DR: Current activity in speech compression is dominated by research and development of a family of techniques commonly described as code-excited linear prediction (CELP) coding, which offer a quality versus bit rate tradeoff that significantly exceeds most prior compression techniques.
Abstract: Speech and audio compression has advanced rapidly in recent years spurred on by cost-effective digital technology and diverse commercial applications. Recent activity in speech compression is dominated by research and development of a family of techniques commonly described as code-excited linear prediction (CELP) coding. These algorithms exploit models of speech production and auditory perception and offer a quality versus bit rate tradeoff that significantly exceeds most prior compression techniques for rates in the range of 4 to 16 kb/s. Techniques have also been emerging in recent years that offer enhanced quality in the neighborhood of 2.4 kb/s over traditional vocoder methods. Wideband audio compression is generally aimed at a quality that is nearly indistinguishable from consumer compact-disc audio. Subband and transform coding methods combined with sophisticated perceptual coding techniques dominate in this arena with nearly transparent quality achieved at bit rates in the neighborhood of 128 kb/s per channel.

234 citations
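The core of every CELP coder in this family is an analysis-by-synthesis codebook search: each candidate excitation vector is passed through the LPC synthesis filter and the one that best matches the target (with its closed-form optimal gain) is selected. A bare-bones sketch, with invented names and no perceptual weighting:

```python
import numpy as np

def synthesize(a, excitation):
    """All-pole synthesis filter 1/A(z), with a = [1, a1, ..., ap]."""
    y = np.zeros(len(excitation))
    for t in range(len(excitation)):
        y[t] = excitation[t]
        for j in range(1, len(a)):
            if t - j >= 0:
                y[t] -= a[j] * y[t - j]
    return y

def celp_search(target, codebook, a):
    """Analysis-by-synthesis search: return (index, gain, error) of the
    codevector whose synthesized, optimally scaled output best matches
    the target frame."""
    best = (None, 0.0, np.inf)
    for i, c in enumerate(codebook):
        s = synthesize(a, c)
        g = (target @ s) / (s @ s)           # closed-form optimal gain
        e = np.sum((target - g * s) ** 2)
        if e < best[2]:
            best = (i, g, e)
    return best
```

When the target was itself synthesized from a codevector, the search recovers that entry and its gain exactly.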


Journal ArticleDOI
TL;DR: This paper reviews methods for mapping from the acoustical properties of a speech signal to the geometry of the vocal tract that generated the signal and shows how this nonuniqueness can be alleviated by imposing continuity constraints.
Abstract: This paper reviews methods for mapping from the acoustical properties of a speech signal to the geometry of the vocal tract that generated the signal. Such mapping techniques are studied for their potential application in speech synthesis, coding, and recognition. Mathematically, the estimation of the vocal tract shape from its output speech is a so-called inverse problem, where the direct problem is the synthesis of speech from a given time-varying geometry of the vocal tract and glottis. Different mappings are discussed: mapping via articulatory codebooks, mapping by nonlinear regression, mapping by basis functions, and mapping by neural networks. Besides being nonlinear, the acoustic-to-geometry mapping is also nonunique, i.e., more than one tract geometry might produce the same speech spectrum. The authors show how this nonuniqueness can be alleviated by imposing continuity constraints.

220 citations


PatentDOI
TL;DR: In this article, an automated method is proposed for modifying a speech signal in a telephone network by applying a gain factor that is a function of the level of background noise at a given destination, and transmitting the modified speech signal to the destination.
Abstract: An automated method for modifying a speech signal in a telephone network by applying a gain factor which is a function of the level of background noise at a given destination, and transmitting the modified speech signal to the destination. The gain applied may be a function of both the background noise level and the original speech signal. Either a linear or a non-linear (e.g., compressed) amplification of the original speech signal may be performed, where a compressed amplification results in the higher level portions of the speech signal being amplified by a smaller gain factor than lower level portions. The speech signal may be separated into a plurality of subbands, each resultant subband signal being individually modified in accordance with the present invention. In this case, each subband speech signal is amplified by a gain factor based on a corresponding subband noise signal, generated by separating the background noise signal into a corresponding plurality of subbands. The individual modified subband signals may then be combined to form the resultant modified speech signal.

214 citations
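The patent's "compressed amplification" means the portion of the signal above some level is boosted by a smaller factor than the portion below it, with the overall gain driven by the destination's noise level. A hedged sketch of both pieces; the function names, knee-style transfer curve, and gain law are illustrative choices, not the patent's claimed formulas:

```python
import numpy as np

def compress_amplify(x, gain_low_level, gain_high_level, knee):
    """Amplify sample magnitudes below the knee by the larger factor and
    the portion above the knee by the smaller factor (compression),
    keeping the transfer curve continuous at the knee."""
    mag = np.abs(x)
    out = np.where(
        mag < knee,
        mag * gain_low_level,
        knee * gain_low_level + (mag - knee) * gain_high_level,
    )
    return np.sign(x) * out

def noise_dependent_gain(noise_power, ref_power, max_gain=4.0):
    """Gain grows with destination background noise, capped at max_gain
    and never attenuating (illustrative law, not the patent's)."""
    return min(max_gain, max(1.0, float(np.sqrt(noise_power / ref_power))))
```

In a subband version, `noise_dependent_gain` would be evaluated once per subband from the corresponding subband noise power, as the abstract describes.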


PatentDOI
TL;DR: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems.
Abstract: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems. An input speech data set is provided to a speech recognition training processor, whose bandwidth is higher than a telephone bandwidth. The process performs a series of alterations to the input speech data set to obtain a modified speech data set. The modified speech data set enables the speech recognition processor to perform speech recognition on voice signals from a telephone system.

159 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: This paper describes the application of transform coded excitation (TCX) coding to encoding wideband speech and audio signals in the bit rate range of 16 kbits/s to 32 kbits/s and proposes novel quantization procedures including inter-frame prediction in the frequency domain.
Abstract: This paper describes the application of transform coded excitation (TCX) coding to encoding wideband speech and audio signals in the bit rate range of 16 kbits/s to 32 kbits/s. The approach uses a combination of time domain (linear prediction; pitch prediction) and frequency domain (transform coding; dynamic bit allocation) techniques, and utilizes a synthesis model similar to that of linear prediction coders such as CELP. However, at the encoder, the high complexity analysis-by-synthesis technique is bypassed by directly quantizing the so-called target signal in the frequency domain. The innovative excitation is derived at the decoder by inverse filtering the quantized target signal. The algorithm is intended for applications whereby a large number of bits is available for the innovative excitation. The TCX algorithm is utilized to encode wideband speech and audio signals with a 50-7000 Hz bandwidth. Novel quantization procedures including inter-frame prediction in the frequency domain are proposed to encode the target signal. The proposed algorithm achieves very high quality for speech at 16 kbits/s, and for music at 24 kbits/s.

93 citations


Journal ArticleDOI
TL;DR: The authors propose using the singular value decomposition (SVD) approach to detect the instants of glottal closure from the speech signal; the method amounts to calculating the Frobenius norms of signal matrices and is therefore computationally efficient.
Abstract: The detection of glottal closure instants has been a necessary step in several applications of speech processing, such as voice source analysis, speech prosody manipulation and speech synthesis. The paper presents a new algorithm for glottal closure detection that compares favorably with other methods available in terms of robustness and computational efficiency. The authors propose to use the singular value decomposition (SVD) approach to detect the instants of glottal closure from the speech signal. The proposed SVD method amounts to calculating the Frobenius norms of signal matrices and therefore is computationally efficient. Moreover, it produces well-defined and reliable peaks that indicate the instants of glottal closure. Finally, with the introduction of the total linear least squares technique, two other proposed methods are reinvestigated and unified into the SVD framework.

91 citations


Proceedings ArticleDOI
19 Apr 1994
TL;DR: The technique, called nonlinear predictive coding, is shown to be superior to the LPC technique and two different nonlinear predictors are presented, one based on a second-order Volterra filter, and the other on a time delay neural network, which is found to be the more suitable for speech coding applications.
Abstract: Addresses the question of how to extract the nonlinearities in speech with the prime purpose of facilitating coding of the residual signal in residual excited coders. The short-term prediction of speech in speech coders is extensively based on linear models, e.g. the linear predictive coding technique (LPC), which is one of the most basic elements in modern speech coders. This technique does not allow extraction of nonlinear dependencies. If nonlinearities are absent from speech the technique is sufficient, but if the speech contains nonlinearities the technique is inadequate. The authors give evidence for nonlinearities in speech and propose nonlinear short-term predictors that can substitute the LPC technique. The technique, called nonlinear predictive coding, is shown to be superior to the LPC technique. Two different nonlinear predictors are presented. The first is based on a second-order Volterra filter, and the second is based on a time delay neural network. The latter is shown to be the more suitable for speech coding applications.

82 citations
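A second-order Volterra predictor extends the linear tap set with all pairwise products of the past samples, and its coefficients can still be fit by ordinary least squares. A minimal sketch (function names invented; the paper's actual filter configuration may differ), demonstrated on a purely quadratic deterministic series that a linear predictor cannot capture:

```python
import numpy as np

def volterra_design(x, p=2):
    """Regressor matrix of a second-order Volterra predictor: a constant
    term, the p linear taps, and all pairwise products of those taps."""
    rows, targets = [], []
    for t in range(p, len(x)):
        past = x[t - p : t][::-1]                      # x[t-1], ..., x[t-p]
        quad = np.outer(past, past)[np.triu_indices(p)]
        rows.append(np.concatenate([[1.0], past, quad]))
        targets.append(x[t])
    return np.array(rows), np.array(targets)

def fit_predictor(F, y):
    """Least-squares coefficient fit; returns (weights, mean squared error)."""
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w, float(np.mean((F @ w - y) ** 2))
```

On a series with a genuine quadratic dependency, the full Volterra fit drives the residual essentially to zero while the linear-only fit (the first three columns: constant plus two taps) cannot.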


PatentDOI
Walter Kellermann1
TL;DR: In this paper, a speech processing arrangement has at least two microphones for supplying microphone signals formed by speech components and noise components to microphone signal branches that are coupled to an adder device used for forming a sum signal.
Abstract: A speech processing arrangement has at least two microphones for supplying microphone signals formed by speech components and noise components to microphone signal branches that are coupled to an adder device used for forming a sum signal. The microphone signals are delayed and weighted by weight factors in the microphone signal branches. The arrangement includes an evaluation circuit that a) receives the microphone signals, b) estimates the noise components, c) estimates the speech components by forming the difference between one of the microphone signals and the estimated noise component for this microphone signal, d) selects one of the microphone signals as a reference signal which contains a reference noise component and a reference speech component, e) forms speech signal ratios by dividing the estimated speech components by the estimated reference speech component, f) forms noise signal ratios by dividing the powers of the estimated noise components by the power of the estimated reference noise component, and g) determines the weight factors by dividing each speech signal ratio by the associated noise signal ratio. The signal-to-noise ratio corresponds to the ratio of the power of the speech component to the power of the noise component of the sum signal. Because the speech signals are correlated and noise signals are uncorrelated, the sum signal available on the output of the adder device has a reduced noise component yielding improved speech audibility. Real-time computation of the weight factors eliminates any annoying delay during a conversation held using the speech processing arrangement.

78 citations


PatentDOI
TL;DR: An apparatus and method of coding speech including a first circuit being coupled to receive a first signal, the first signal corresponds to the speech signal, for generating an encoded signal responsive to the selected set of parameters and the mode.
Abstract: An apparatus and method of coding speech. The apparatus includes a first circuit being coupled to receive a first signal, the first signal corresponds to the speech signal. The first circuit is for generating a first set of parameters corresponding to the first frame. The apparatus includes a second circuit, being coupled to receive a second signal and the first set of parameters, the second signal corresponding to the speech signal, and the second circuit is for generating a third signal. The apparatus further includes a pulse train analyzer, being coupled to the second circuit, for generating a third match value, a third set of parameters, and a third excitation value. The apparatus further including a fourth circuit, being coupled to the second circuit, for generating a fourth match value, a fourth set of parameters, and a fourth excitation value. The apparatus further including a fifth circuit, being coupled to the third circuit and the fourth circuit, for selecting a mode corresponding to a match value. The apparatus further including a sixth circuit, being coupled to the fifth circuit, for selecting a selected set of parameters and a selected excitation corresponding to the mode. The apparatus further including a seventh circuit, being coupled to the first circuit and the sixth circuit, for generating an encoded signal responsive to the selected set of parameters and the mode.

Proceedings ArticleDOI
Stephan Euler1, J. Zinke1
19 Apr 1994
TL;DR: The authors use a Gaussian classifier for estimation of the coding condition of a test utterance and the combination of this classifier and coder specific word models yields a high overall recognition performance.
Abstract: Examines the influence of different coders in the range from 64 kbit/sec to 4.8 kbit/sec on both a speaker independent isolated word recognizer and a speaker verification system. Applying systems trained with 64 kbit/sec to e.g. the 4.8 kbit/sec data increases the error rate of the word recognizer by a factor of three. For rates below 13 kbit/sec the speaker verification is more affected than the word recognition. The performance improves significantly if word models are provided for the individual coding conditions. Therefore, the authors use a Gaussian classifier for estimation of the coding condition of a test utterance. The combination of this classifier and coder specific word models yields a high overall recognition performance.

Journal ArticleDOI
TL;DR: The authors describe a novel approach to speech recognition by directly modeling the statistical characteristics of the speech waveforms, which allows them to remove the need for speech preprocessors, which conventionally serve the role of converting speech waveforms into frame-based speech data subject to a subsequent modeling process.
Abstract: The authors describe a novel approach to speech recognition by directly modeling the statistical characteristics of the speech waveforms. This approach allows them to remove the need for using speech preprocessors, which conventionally serve a role of converting speech waveforms into frame-based speech data subject to a subsequent modeling process. Central to their method is the representation of the speech waveforms as the output of a time-varying filter excited by a Gaussian source time-varying in its power. In order to formulate a speech recognition algorithm based on this representation, the time variation in the characteristics of the filter and of the excitation source is described in a compact and parametric form of the Markov chain. They analyze in detail the comparative roles played by the filter modeling and by the source modeling in speech recognition performance. Based on the result of the analysis, they propose and evaluate a normalization procedure intended to remove the sensitivity of speech recognition accuracy to often uncontrollable speech power variations. The effectiveness of the proposed speech-waveform modeling approach is demonstrated in a speaker-dependent, discrete-utterance speech recognition task involving 18 highly confusable stop consonant-vowel syllables. The high accuracy obtained shows promising potentials of the proposed time-domain waveform modeling technique for speech recognition.

Proceedings ArticleDOI
08 Jun 1994
TL;DR: Experiments are described to develop a new technique that requires only the received speech, which uses perceptually-based speaker-independent speech parameters such as perceptual-linear prediction coefficients and the perceptually weighted Bark spectrum to estimate subjective quality.
Abstract: Objective speech quality measures automatically assess performance of communication systems without the need for human listeners. Typical objective quality methods are based on some distortion measure between the known input speech record and the received output signal. This paper describes experiments to develop a new technique that requires only the received speech. The algorithm uses perceptually-based speaker-independent speech parameters such as perceptual-linear prediction coefficients and the perceptually weighted Bark spectrum. Parameters derived from a variety of undegraded source speech material provide reference centroids corresponding to high speech quality. The average distance between output speech parameters and the nearest reference centroid provides an indication of speech degradation, which is used to estimate subjective quality. The paper presents algorithm results for speech processed through low bit-rate codecs and subjected to bit errors due to impaired channel conditions. Output-based quality measures would be a valuable tool for monitoring performance of speech communication systems such as digital mobile radio networks and mobile satellite systems.
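The scoring rule described here is simple to state: average, over output frames, the distance to the nearest high-quality reference centroid. A minimal sketch of that rule (Euclidean distance and the function name are illustrative choices; the paper uses perceptually based parameters):

```python
import numpy as np

def degradation_score(frames, centroids):
    """Average distance from each output-speech parameter frame to its
    nearest high-quality reference centroid; larger means more degraded.
    frames: (n, d) array; centroids: (k, d) array."""
    d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```

Frames drawn exactly from the reference centroids score zero, and perturbing them increases the score, which is the monotonicity the quality estimate relies on.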

Journal ArticleDOI
TL;DR: A pitch predictor exploiting the present interpolation strategy, with an update rate of 50 Hz, provides a subjective speech quality similar to a conventional pitch predictor where the parameters are updated for every pitch cycle.
Abstract: The pitch predictor contributes greatly to the efficiency of current analysis-by-synthesis speech coders by mapping the past reconstructed signal into the present. However, for good performance, its parameters must be updated often (once every 2.5-7.5 ms). A slower update rate of the pitch-predictor delay results in time misalignment between the original signal and the pitch-predictor contribution to the reconstructed signal. The authors introduce a new procedure that allows a slow update rate of the pitch-predictor parameters without this problem. In this method the original signal is modified in a closed-loop fashion such that the parameter values obtained by interpolation of open-loop estimates form the optimal encoding of the modified signal. This new paradigm is a generalization of the familiar analysis-by-synthesis principle and can be used for interpolation of both the pitch-predictor delay and gain. The authors compare, by means of a subjective test, speech signals encoded with different versions of the code-excited linear prediction (CELP) coder. The comparison shows that a pitch predictor exploiting the present interpolation strategy, with an update rate of 50 Hz, provides a subjective speech quality similar to a conventional pitch predictor whose parameters are updated every pitch cycle.

PatentDOI
Juin-Hwey Chen1
TL;DR: Modified perceptual weighting parameters and a novel use of postfiltering greatly improve tandeming of a number of encodings and decodings while retaining high quality reproduction.
Abstract: A code-excited linear-predictive (CELP) coder for speech or audio transmission at compressed (e.g., 16 kb/s) data rates is adapted for low-delay (e.g., less than five ms. per vector) coding by performing spectral analysis of at least a portion of a previous frame of simulated decoded speech to determine a synthesis filter of a much higher order than conventionally used for decoding synthesis and then transmitting only the index for the vector which produces the lowest internal error signal. Modified perceptual weighting parameters and a novel use of postfiltering greatly improve tandeming of a number of encodings and decodings while retaining high quality reproduction.

PatentDOI
Yumi Takizawa1
TL;DR: In this article, the relationship between durations of each recognition unit is obtained by a duration training circuit and, at the time of recognizing speech, a beginning and an end of input speech are detected by a speech period sensing circuit, and then by using the relationship and the input speech period length, the durations of the recognition units in the input speech are estimated.
Abstract: At the time of training reference speech, the relationship between durations of each recognition unit is obtained by a duration training circuit and, at the time of recognizing speech, a beginning and an end of input speech are detected by a speech period sensing circuit, and then by using the relationship and the input speech period length, the durations of the recognition units in the input speech are estimated. Next, the reference speech and the input speech are matched by the matching means by using the calculated estimation values in such a manner that the recognition units have a duration close to that of the estimated values.

Patent
11 Jul 1994
TL;DR: In this article, a method and system for encoding and decoding of speech signals at a low bit rate is presented, where continuous input speech is divided into voiced and unvoiced time segments of a predetermined length.
Abstract: A method and system is provided for encoding and decoding of speech signals at a low bit rate. The continuous input speech is divided into voiced and unvoiced time segments of a predetermined length. The encoder of the system uses a linear predictive coding model for the unvoiced speech segments and harmonic frequencies decomposition for the voiced speech segments. Only the magnitudes of the harmonic frequencies are determined using the discrete Fourier transform of the voiced speech segments. The decoder synthesizes voiced speech segments using the magnitudes of the transmitted harmonics and estimates the phase of each harmonic from the signal in the preceding speech segments. Unvoiced speech segments are synthesized using linear prediction coding (LPC) coefficients obtained from codebook entries for the poles of the LPC coefficient polynomial. Boundary conditions between voiced and unvoiced segments are established to ensure amplitude and phase continuity for improved output speech quality.

Journal ArticleDOI
W.B. Kleijn1, J. Haagen1
TL;DR: The characteristic waveform is decomposed into a slowly evolving waveform and a rapidly evolving waveform, representing the quasi-periodic and other components of speech, respectively; the decomposition allows efficient coding of voiced and unvoiced speech at bit rates between 2 and 8 kb/s.
Abstract: The speech signal is represented by an evolving characteristic waveform. The characteristic waveform is decomposed into a slowly evolving waveform and a rapidly evolving waveform, representing the quasi-periodic and other components of speech, respectively. These two evolving waveforms have fundamentally different quantization requirements. The decomposition allows efficient coding of voiced and unvoiced speech at bit rates between 2 and 8 kb/s.
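The split into slowly and rapidly evolving waveforms can be thought of as lowpass filtering the sequence of extracted characteristic waveforms along the evolution (time) axis and keeping the residual as the rapid part. A toy sketch with a moving average standing in for the lowpass filter (the paper's actual filter and alignment steps are more elaborate):

```python
import numpy as np

def decompose(waveforms, span=5):
    """Split a sequence of aligned characteristic waveforms (one row per
    update instant) into a slowly evolving waveform (moving average along
    the evolution axis) and the rapidly evolving remainder."""
    T = len(waveforms)
    sew = np.empty_like(waveforms)
    for t in range(T):
        lo, hi = max(0, t - span // 2), min(T, t + span // 2 + 1)
        sew[t] = waveforms[lo:hi].mean(axis=0)   # smooth across instants
    rew = waveforms - sew                        # residual, rapid part
    return sew, rew
```

By construction the two parts sum back to the original, and a perfectly periodic (non-evolving) signal lands entirely in the slowly evolving component, matching the quasi-periodic/other split the abstract describes.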

Proceedings ArticleDOI
19 Apr 1994
TL;DR: A toll quality speech codec at 8 kbit/s with a 10 ms speech-frame currently under standardization by the CCITT is presented and initial subjective tests showed that the codec quality is equivalent to that of G.726 ADPCM in error-free conditions and it performs adequately under tandeming conditions.
Abstract: A toll quality speech codec at 8 kbit/s with a 10 ms speech-frame currently under standardization by the CCITT is presented. The encoding algorithm is based on algebraic code-excited linear prediction (ACELP). Efficient pitch and codebook search strategies, along with efficient quantization procedures, have been developed to achieve toll quality encoded speech with a complexity implementable on current fixed-point DSP chips. Initial subjective tests showed that the codec quality is equivalent to that of G.726 ADPCM at 32 kbit/s in error-free conditions and it outperforms G.726 under error conditions. The codec can support a frame erasure rate up to 3% with slight degradation and performs adequately under tandeming conditions. The algorithm has been implemented on a single fixed-point DSP for the CCITT qualification test. It requires about 24 MIPS.

PatentDOI
Bruno Lozach1
TL;DR: A system for predictive coding of a digital speech signal with embedded codes used in any transmission system or for storing speech signals and makes it possible to deliver indices representing the coded speech signal.
Abstract: A system for predictive coding of a digital speech signal with embedded codes used in any transmission system or for storing speech signals. The coded digital signal (Sn) is formed by a coded speech signal and, if appropriate, by auxiliary data. A perceptual weighting filter is formed by a filter for short-term prediction of the speech signal to be coded, in order to produce a frequency distribution of the quantization noise. A circuit makes it possible to perform the subtraction from the perceptual signal of the contribution of the past excitation signal P0 n to deliver an updated perceptual signal Pn. A long-term prediction circuit is formed, as a closed loop, from a dictionary updated by the modelled past excitation r1 n for the lowest throughput and makes it possible to deliver an optimal waveform and an associated estimated gain which make up the estimated perceptual signal P1 n. An orthonormal transform module includes an adaptive transform module and a module for progressive modelling by orthogonal vectors, thus making it possible to deliver indices representing the coded speech signal. A circuit makes it possible to insert auxiliary data by stealing bits from the coded speech signal. Decoding is performed through extraction of the data signal and transmission of the indices representing the coded speech signal, which is modelled at the minimum throughput.

Journal ArticleDOI
TL;DR: This correspondence presents an experimental system that uses an energy-tracking operator and a related energy separation algorithm to automatically find speech formants and amplitude/frequency modulations in voiced speech segments.
Abstract: This correspondence presents an experimental system that uses an energy-tracking operator and a related energy separation algorithm to automatically find speech formants and amplitude/frequency modulations in voiced speech segments. Initial estimates of formant center frequencies are provided by either LPC or morphological spectral peak picking. These estimates are then shown to be improved by a combination of bandpass filtering and iterative application of energy separation.

Journal ArticleDOI
TL;DR: A new method for VFR using the norm of the derivative parameters in deciding to retain or to discard a frame is introduced, and informal inspection of speech spectrograms shows that this new method puts more emphasis on the transient regions of the speech signal.
Abstract: Variable frame rate (VFR) analysis is a technique used in speech processing and recognition for discarding frames that are too much alike. The article introduces a new method for VFR. Instead of calculating the distance between frames, the norm of the derivative parameters is used in deciding to retain or to discard a frame. Informal inspection of speech spectrograms shows that this new method puts more emphasis on the transient regions of the speech signal. Experimental results with a hidden Markov model (HMM) based system show that the new method outperforms the classical method.
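The selection rule is easy to state: retain a frame only when the norm of its parameter derivative (here approximated by the frame-to-frame difference) exceeds a threshold, so steady regions are thinned out and transients are kept. A minimal sketch with an invented function name and a simple first-difference derivative:

```python
import numpy as np

def vfr_select(features, threshold):
    """Variable-frame-rate selection: always keep frame 0, then keep a
    frame only when the norm of its delta (frame-to-frame derivative of
    the parameter vector) reaches the threshold."""
    kept = [0]
    for t in range(1, len(features)):
        if np.linalg.norm(features[t] - features[t - 1]) >= threshold:
            kept.append(t)
    return kept
```

On a parameter track that is flat except at two jumps, only the first frame and the two transition frames survive, which is the emphasis on transient regions the abstract reports.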

Proceedings ArticleDOI
19 Apr 1994
TL;DR: The MOS subjective test shows that 4.075 kbps M-LCELP synthetic speech quality is high, and that its quality is mostly equivalent to that for an 8 kbps North American full-rate VSELP coder.
Abstract: This paper presents the M-LCELP (multi-mode learned code excited LPC) speech coder, which has been developed for the North American half-rate digital cellular systems. M-LCELP develops the following techniques to achieve high-quality synthetic speech at 4 kbps: (1) Multimode and multi-codebook coding, (2) Pitch lag differential coding with pitch tracking, (3) A two-stage joint design regular-pulse codebook with common phase structure in voiced frames, (4) An efficient vector quantization for LSP parameters, (5) An adaptive MA type comb filter to suppress excitation signal inter-harmonic noise. The MOS subjective test shows that 4.075 kbps M-LCELP synthetic speech quality is high, and that its quality is mostly equivalent to that for an 8 kbps North American full-rate VSELP coder.

Journal ArticleDOI
TL;DR: The advantage of the nonlinear prediction capability of neural networks is exploited and applied to the design of improved predictive speech coders, resulting in a fully vector-quantized, code-excited, nonlinear predictive speech coder.
Abstract: Recent studies have shown that nonlinear predictors can achieve about 2-3 dB improvement in speech prediction over conventional linear predictors. In this paper, we exploit the advantage of the nonlinear prediction capability of neural networks and apply it to the design of improved predictive speech coders. Our studies concentrate on the following three aspects: (a) the development of short-term (formant) and long-term (pitch) nonlinear predictive vector quantizers; (b) the analysis of the output variance of the nonlinear predictive filter with respect to the input disturbance; (c) the design of nonlinear predictive speech coders. The above studies have resulted in a fully vector-quantized, code-excited, nonlinear predictive speech coder. Performance evaluations and comparisons with linear predictive speech coding are presented. These tests have shown the applicability of nonlinear prediction in speech coding and the improvement in coding performance.

Proceedings ArticleDOI
29 Mar 1994
TL;DR: The authors introduce a simple and effective formulation of variable-dimension vector quantization (VDVQ) which quantizes variable- dimension vectors using a single universal codebook having fixed dimension yet covering the entire range of input vector dimensions under consideration.
Abstract: Optimal vector quantization of variable-dimension vectors is in principle feasible with a set of fixed-dimension VQ codebooks; for typical applications, however, such a multi-codebook approach demands grossly excessive and impractical storage and computational complexity. Efficient quantization of variable-dimension spectral shape vectors is the most challenging encoding task in an important family of low bit-rate vocoders. The authors introduce a simple and effective formulation of variable-dimension vector quantization (VDVQ) which quantizes variable-dimension vectors using a single universal codebook of fixed dimension that covers the entire range of input vector dimensions under consideration. The technique is applied to variable-dimension spectral shape vectors, leading to a high-quality speech coder at the low bit rate of 2.5 kb/s. The combination of a universal spectral codebook and structured VQ reduces storage and computational complexity, yet delivers high quantization efficiency and enhanced perceptual quality of the coded speech.
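A toy version of the universal-codebook idea can be sketched as follows. The index-mapping rule here (evenly spaced sampling of each codevector) is an illustrative assumption; in the actual VDVQ scheme the mapping follows the harmonic structure of the spectrum:

```python
import numpy as np

def sample_codevector(c, n):
    """Sample n elements of a fixed-dimension codevector at evenly
    spaced positions, yielding a dimension-n comparison vector."""
    idx = (np.arange(n) * len(c)) // n
    return c[idx]

def vdvq_encode(x, codebook):
    """Quantize a variable-dimension vector x with one universal
    codebook: every codevector is sampled down to len(x) before the
    nearest-neighbour search."""
    n = len(x)
    dists = [np.sum((sample_codevector(c, n) - x) ** 2) for c in codebook]
    return int(np.argmin(dists))
```

One codebook thus serves every input dimension, which is exactly what removes the multi-codebook storage burden described above.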

Proceedings ArticleDOI
08 Jun 1994
TL;DR: An adaptive coding system that adjusts the rate allocation according to actual channel conditions; both the objective and the subjective speech quality of the adaptive coders are superior to those of their non-adaptive counterparts.
Abstract: Although mobile communication channels are time-varying, most systems allocate the combined rate between the speech coder and the error correction coder according to a nominal channel condition. This generally leads to a pessimistic design and consequently an inefficient utilization of the available resources, such as bandwidth and power. This paper describes an adaptive coding system that adjusts the rate allocation according to actual channel conditions. Two types of variable-rate speech coders are considered, embedded coders and multimode coders, both based on code-excited linear prediction (CELP); the variable-rate channel coders are based on rate-compatible punctured convolutional (RCPC) codes. A channel estimator at the receiver tracks both the short-term and the long-term fading condition of the channel. The estimated channel state information is then used to vary the rate allocation between the speech and channel coders on a frame-by-frame basis, by sending an appropriate rate-adjustment command through a feedback channel. Experimental results show that both the objective and the subjective speech quality of the adaptive coders are superior to those of their non-adaptive counterparts: improvements of up to 1.35 dB in SEGSNR of the speech signal and up to 0.9 in informal MOS have been found at a combined rate of 12.8 kbit/s. In addition, the multimode coders perform better than their embedded counterparts.
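The frame-by-frame rate-allocation step can be sketched as a threshold table on the estimated channel SNR. The thresholds and the mode set below are hypothetical; only the idea of keeping the combined 12.8 kbit/s rate fixed while trading speech rate against channel-code strength comes from the paper:

```python
def allocate_rates(est_snr_db, total_kbps=12.8):
    """Pick a (speech kbps, RCPC code rate) pair so that
    speech / code_rate == total_kbps.

    Poorer channels get a lower speech rate and a stronger
    (lower-rate) channel code; better channels spend the bit
    budget on the speech coder instead.
    """
    modes = [            # (SNR threshold dB, speech kbps, RCPC rate)
        (4.0,  4.0, 0.3125),
        (10.0, 6.4, 0.5),
        (16.0, 9.6, 0.75),
    ]
    for thresh, speech, rate in modes:
        if est_snr_db < thresh:
            return speech, rate
    return 11.2, 0.875   # best channel: weakest protection
```

In the paper the selected mode is signalled to the transmitter over a feedback channel; here that step is left out.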

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This work investigates the use of four candidate speech models in the context of high-quality text-to-speech systems (HQ-TTS), addresses problems typically encountered by their prosody-matching and segment-concatenation modules, and compares their performance with respect to the segment database compression ratio they allow, the computational load of the related synthesis algorithms, and their intelligibility and subjective segmental quality.
Abstract: We investigate the use of four candidate speech models in the context of high-quality text-to-speech systems (HQ-TTS), address problems typically encountered by their prosody-matching and segment-concatenation modules, and compare their performance with respect to the segment database compression ratio they allow, the computational load of the related synthesis algorithms, and their intelligibility and subjective segmental quality. The models addressed are: the classical auto-regressive (LPC) model; the hybrid harmonic/stochastic (H/S) model proposed by Griffin and Lim (1988) and by Abrantes, Marques and Trancoso (1991); the 'null' model, as implemented by the time-domain pitch-synchronous overlap-add (TD-PSOLA) synthesis algorithm; and the multi-band re-synthesis pitch-synchronous overlap-add (MBR-PSOLA) model.
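The 'null' model's TD-PSOLA resynthesis can be sketched as follows. This is a bare-bones version that assumes the analysis pitch marks are already known and fixes a single target period; the window choice and mark-selection rule are simplified assumptions, not the algorithm as published:

```python
import numpy as np

def td_psola(x, analysis_marks, target_period):
    """Minimal TD-PSOLA: extract two-period Hann-windowed grains
    around analysis pitch marks, re-space them at the target pitch
    period, and overlap-add them."""
    out = np.zeros(len(x))
    t = target_period                       # first synthesis mark
    while t + target_period < len(x):
        # pick the analysis mark closest to the synthesis instant
        m = min(analysis_marks, key=lambda a: abs(a - t))
        lo, hi = m - target_period, m + target_period
        if 0 <= lo and hi <= len(x):
            grain = x[lo:hi] * np.hanning(hi - lo)
            out[t - target_period:t + target_period] += grain
        t += target_period
    return out
```

Because each grain spans two target periods and synthesis marks are one period apart, the Hann windows overlap-add to an approximately constant envelope, which is what lets the pitch be changed without resampling the spectrum.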

Journal ArticleDOI
J.R.B. de Marca
TL;DR: A quantizer design procedure which is tailored for the transmission frame structure of the North-American (TIA) cellular communication standard is presented, basically a split vector quantizer which makes use of interframe prediction to lower the number of bits required for quantization.
Abstract: The design of the half-rate coder for the North American cellular communication standard poses a challenging problem. The desired speech quality is that of the full-rate coder, but the total bit rate specified for the half-rate coder is only 6.5 kb/s. Since the mobile communication channel is very noisy, error-correcting codes are necessary, which leaves only about 4 kb/s for the actual speech information. This very restricted bit budget requires that fewer than 30 bits be allotted to the quantization of the LSF parameters, which precludes scalar quantization. In this work, a quantizer design procedure tailored to the transmission frame structure of the North American (TIA) cellular communication standard is presented. It is basically a split vector quantizer which uses interframe prediction to lower the number of bits required for quantization. Since the prediction is performed only on every other frame, error propagation is limited. The quantization technique yields good performance in the 26-27 bits/frame range, and its implementation complexity is less than 0.5 MIPS.
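The split/predictive structure can be sketched like this. The split boundaries, the prediction coefficient, and the codebook sizes below are illustrative assumptions, not the values of the TIA design:

```python
import numpy as np

def split_vq(residual, codebooks, splits):
    """Quantize each sub-vector of the residual with its own codebook;
    return the chosen indices and the quantized residual."""
    idxs, rec = [], []
    for (s, e), cb in zip(splits, codebooks):
        d = np.sum((cb - residual[s:e]) ** 2, axis=1)
        j = int(np.argmin(d))
        idxs.append(j)
        rec.append(cb[j])
    return idxs, np.concatenate(rec)

def encode_lsf(lsf, prev_rec, frame_no, codebooks, splits, alpha=0.6):
    """Apply interframe prediction on every other frame only, so a
    channel error cannot propagate past the next unpredicted frame."""
    pred = alpha * prev_rec if frame_no % 2 == 1 else np.zeros_like(lsf)
    idxs, q_res = split_vq(lsf - pred, codebooks, splits)
    return idxs, pred + q_res      # indices to transmit, decoded LSF vector
```

Only the codebook indices are transmitted; the decoder repeats the prediction from its own previous reconstruction, which is why limiting prediction to alternate frames bounds error propagation.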

PatentDOI
TL;DR: This VSELP speech coder uses a single- or multi-segment vector quantizer of the reflection coefficients based on a Fixed-Point-Lattice-Technique (FLAT), together with a pre-quantizer that reduces the vector codebook search complexity and a high-resolution scalar quantizer that reduces the amount of memory needed to store the reflection coefficient vector codebooks.
Abstract: A Vector-Sum Excited Linear Predictive Coding (VSELP) speech coder provides improved quality and reduced complexity over a typical speech coder. VSELP uses a codebook which has a predefined structure such that the computations required for the codebook search process can be significantly reduced. This VSELP speech coder uses single or multi-segment vector quantizer of the reflection coefficients based on a Fixed-Point-Lattice-Technique (FLAT). Additionally, this speech coder uses a pre-quantizer to reduce the vector codebook search complexity and a high-resolution scalar quantizer to reduce the amount of memory needed to store the reflection coefficient vector codebooks. Resulting in a high quality speech coder with reduced computations and storage requirements.