
Showing papers on "Linear predictive coding published in 1976"


Journal ArticleDOI
Frederick Jelinek1
01 Apr 1976
TL;DR: Experimental results are presented that indicate the power of the methods and concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding.
Abstract: Statistical methods useful in automatic recognition of continuous speech are described. They concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding. Experimental results are presented that indicate the power of the methods.

1,024 citations


Journal ArticleDOI
TL;DR: A pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal, which has been found to provide reliable classification with speech segments as short as 10 ms.
Abstract: In speech analysis, the voiced-unvoiced decision is usually performed in conjunction with pitch analysis. The linking of the voiced-unvoiced (V-UV) decision to pitch analysis not only results in unnecessary complexity, but makes it difficult to classify short speech segments which are less than a few pitch periods in duration. In this paper, we describe a pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal. In this method, five different measurements are made on the speech segment to be classified. The measured parameters are the zero-crossing rate, the speech energy, the correlation between adjacent speech samples, the first predictor coefficient from a 12-pole linear predictive coding (LPC) analysis, and the energy in the prediction error. The speech segment is assigned to a particular class based on a minimum-distance rule obtained under the assumption that the measured parameters are distributed according to the multidimensional Gaussian probability density function. The means and covariances for the Gaussian distribution are determined from manually classified speech data included in a training set. The method has been found to provide reliable classification with speech segments as short as 10 ms and has been used for both speech analysis-synthesis and recognition applications. A simple nonlinear smoothing algorithm is described to provide a smooth 3-level contour of an utterance for use in speech recognition applications. Quantitative results and several examples illustrating the performance of the method are included in the paper.
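The minimum-distance Gaussian rule described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it uses only three of the five measurements (the LPC-derived features are omitted), and the class means and covariances would in practice come from manually labeled training data.

```python
import numpy as np

def frame_features(frame):
    """Three of the five measurements used in the paper: zero-crossing
    rate, log energy, and adjacent-sample correlation (the two
    LPC-derived features are omitted in this sketch)."""
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    log_e = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    r1 = np.dot(frame[:-1], frame[1:]) / (np.dot(frame, frame) + 1e-12)
    return np.array([zcr, log_e, r1])

def classify(x, class_means, class_covs):
    """Minimum-distance rule for Gaussian classes: pick the class with
    the smallest Mahalanobis distance plus log-determinant term
    (labels here: 0 = silence, 1 = unvoiced, 2 = voiced)."""
    scores = []
    for m, c in zip(class_means, class_covs):
        d = x - m
        scores.append(float(d @ np.linalg.solve(c, d)) + np.log(np.linalg.det(c)))
    return int(np.argmin(scores))
```

With equal covariances the log-determinant term cancels and the rule reduces to a weighted nearest-mean classifier; the full method additionally smooths the resulting 3-level decision contour.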

479 citations


Proceedings ArticleDOI
12 Apr 1976
TL;DR: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum, which provides a means for controlling and reducing quantizing noise in the coding.
Abstract: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum. The approach provides a means for controlling and reducing quantizing noise in the coding. Each sub-band is quantized with an accuracy (bit allocation) based upon perceptual criteria. As a result, the quality of the coded signal is improved over that obtained from a single full-band coding of the total spectrum. In one implementation, the individual sub-bands are low-pass translated before coding. In another, "integer-band" sampling is employed to alias the signal in an advantageous way before coding. Other possibilities extend to complex demodulation of the sub-bands, and to representing the subband signals in terms of envelopes and phase-derivatives. In all techniques, adaptive quantization is used for the coding, and a parsimonious allocation of bits is made across the bands. Computer simulations are made to demonstrate the signal qualities obtained for codings at 16 and 9.6 Kbits/sec.
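A toy version of the band-splitting idea can be sketched as follows. This is a loose frequency-domain stand-in, not the paper's design: the two bands are formed by halving the FFT spectrum rather than by low-pass translation or integer-band sampling, and the bit allocations are arbitrary illustrative values.

```python
import numpy as np

def adaptive_quantize(x, bits):
    """Uniform quantizer whose step size adapts to the block's peak level."""
    step = (2.0 * np.max(np.abs(x)) + 1e-12) / (2 ** bits)
    return np.round(x / step) * step

def subband_code(signal, bits_per_band=(5, 2)):
    """Toy two-band coder: the low and high halves of the spectrum are
    quantized separately, with more bits for the perceptually more
    important low band, then the signal is resynthesized."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    cut = len(spec) // 2

    def q(band, bits):
        # quantize real and imaginary parts with the band's bit budget
        return (adaptive_quantize(band.real, bits)
                + 1j * adaptive_quantize(band.imag, bits))

    coded = np.concatenate([q(spec[:cut], bits_per_band[0]),
                            q(spec[cut:], bits_per_band[1])])
    return np.fft.irfft(coded, n)
```

Because most of the energy of voiced speech sits in the low band, spending the bit budget there keeps the reconstruction error small relative to a single full-band quantizer at the same total rate.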

276 citations


Journal ArticleDOI
G. White1, R. Neely2
TL;DR: Automatic speech recognition experiments are described in which several popular preprocessing and classification strategies are compared and it is shown that dynamic programming is of major importance for recognition of polysyllabic words.
Abstract: Automatic speech recognition experiments are described in which several popular preprocessing and classification strategies are compared. Preprocessing is done either by linear predictive analysis or by bandpass filtering. The two approaches are shown to produce similar recognition scores. The classifier uses either linear time stretching or dynamic programming to achieve time alignment. It is shown that dynamic programming is of major importance for recognition of polysyllabic words. The speech is compressed into a quasi-phoneme character string or preserved uncompressed. Best results are obtained with uncompressed data, using nonlinear time registration for multisyllabic words.
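The dynamic-programming time alignment the authors found essential for polysyllabic words can be sketched as a standard DTW recursion (a generic textbook form, not the specific variant used in these experiments):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming time alignment between two feature sequences
    (rows = frames), using Euclidean frame distances and the usual
    three-way recursion over insert/delete/match steps."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Unlike linear time stretching, the recursion can absorb locally uneven speaking rates: a sequence played at half speed still aligns with zero cost.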

165 citations


Journal ArticleDOI
TL;DR: It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.
Abstract: This paper presents the results of an examination of rapid amplitude compression following high-pass filtering as a method for processing speech, prior to reception by the listener, as a means of enhancing the intelligibility of speech in high noise levels. Arguments supporting this particular signal processing method are based on the results of previous perceptual studies of speech in noise. In these previous studies, it has been shown that high-pass filtered/clipped speech offers a significant gain in the intelligibility of speech in white noise over that for unprocessed speech at the same signal-to-noise ratios. Similar results have also been obtained for speech processed by high-pass filtering alone. The present paper explores these effects and proposes the use of high-pass filtering followed by rapid amplitude compression as a signal processing method for enhancing the intelligibility of speech in noise. It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.

131 citations


Journal ArticleDOI
01 Apr 1976
TL;DR: The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.
Abstract: For many applications, it is desirable to be able to convert arbitrary English text to natural and intelligible sounding speech. This transformation between two surface forms is facilitated by first obtaining the common underlying abstract linguistic representation which relates to both text and speech surface representations. Calculation of these abstract bases then permits proper selection of phonetic segments, lexical stress, juncture, and sentence-level stress and intonation. The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.

116 citations


Journal ArticleDOI
TL;DR: Results suggest that LPC analysis/synthesis is fairly immune to the degradation of DPCM quantization, while the effects of DM quantization are more severe and the effects of additive white noise are the most serious.
Abstract: An important problem in some communication systems is the performance of linear prediction (LPC) analysis with speech inputs that have been corrupted by (signal-correlated) quantization distortion or additive white noise. To gain a first insight into this problem, a high-quality speech sample was deliberately degraded by using various degrees (bit rates of 16 kbps and more) of differential PCM (DPCM), and delta modulation (DM) quantization, and by the introduction of additive white noise. The resulting speech samples were then analyzed to obtain the LPC control signals: pitch, gain, and the linear prediction coefficients. These control parameters were then compared to the parameters measured in the original, high quality signal. The measurements of pitch perturbations were assessed on the basis of how many points exceeded an appropriate difference limen. A distance measure proposed by Itakura was used to compare the original LPC coefficients with the coefficients measured from the degraded speech. In addition, the measured control signals were used to synthesize speech for perceptual evaluation. Results suggest that LPC analysis/synthesis is fairly immune to the degradation of DPCM quantization. The effects of DM quantization are more severe and the effects of additive white noise are the most serious.
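The Itakura-style comparison of LPC coefficient sets mentioned above can be sketched as follows. The analysis uses the standard autocorrelation method with a Levinson-Durbin recursion; the exact measure and analysis conditions in the paper may differ.

```python
import numpy as np

def lpc(x, p):
    """Order-p LPC by the autocorrelation method (Levinson-Durbin).
    Returns coefficients a = [1, a1, ..., ap] and residual energy."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / e                       # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        e *= 1.0 - k * k
    return a, e

def itakura_distance(a_test, x_ref, p):
    """Itakura log-likelihood-ratio: log of the residual energy of
    a_test on the reference autocorrelation, relative to that of the
    optimal reference coefficients; zero when a_test equals them."""
    n = len(x_ref)
    r = np.array([np.dot(x_ref[:n - k], x_ref[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p + 1)]
                  for i in range(p + 1)])
    a_ref, _ = lpc(x_ref, p)
    return float(np.log((a_test @ R @ a_test) / (a_ref @ R @ a_ref)))
```

Because the reference coefficients minimize the quadratic form in the denominator, the measure is nonnegative and grows as the degraded-speech coefficients drift from the clean ones.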

109 citations


Proceedings ArticleDOI
12 Apr 1976
TL;DR: An LPC vocoder employing the recently developed method of linear predictive warping (LPW) achieves improved speech quality for the same bit rate with a minimally redundant model.
Abstract: In ordinary linear prediction the speech spectral envelope is modeled by an all-pole spectrum. The error criterion employed guarantees a uniform fit across the whole frequency range. However, we know from speech perception studies that low frequencies are more important than high frequencies for perception. Therefore, a minimally redundant model would strive to achieve a uniform perceptual fit across the spectrum, which means that it should be able to represent low frequencies more accurately than high frequencies. This is achieved in the LPCW vocoder: an LPC vocoder employing our recently developed method of linear predictive warping (LPW). The result is improved speech quality for the same bit rate.

64 citations


Journal ArticleDOI
TL;DR: The system described in this paper is subdivided into three main steps: pitch extraction, segmentation, and formant analysis, which uses an adaptive digital filter in time-domain transforming the speech signal into a signal similar to the glottal waveform.
Abstract: The system described in this paper is subdivided into three main steps: pitch extraction, segmentation, and formant analysis. The pitch extractor uses an adaptive digital filter in the time domain, transforming the speech signal into a signal similar to the glottal waveform. Using the levels of the speech signal and the differenced signal as parameters in the time domain, the subsequent segmentation algorithm derives a signal parameter which describes the speed of articulatory movement. From this, the signal is divided into "stationary" and "transitional" segments; one stationary segment is associated with one phoneme. For the formant tracking procedure, a subset of the pitch periods is selected by the segmentation algorithm and is transformed into the frequency domain. The formant tracking algorithm uses a maximum detection strategy and continuity criteria for adjacent spectra. After this step, the total parameter set is offered to an adaptive universal pattern classifier which is trained on selected material before use. For stationary phonemes, the recognition rate is about 85 percent when training material and test material are uttered by the same speaker. The recognition rate increases to about 90 percent when segmentation results are used.

47 citations


Journal ArticleDOI
Harvey F. Silverman1, N. Dixon
TL;DR: Of those evaluated, a linearly mean-corrected minimum distance measure, on a 40-point spectral representation with a square (or cube) norm was consistently superior to the other methods.
Abstract: An important consideration in speech processing involves classification of speech spectra. Several methods for performing this classification are discussed. A number of these were selected for comparative evaluation. Two measures of performance, accuracy and stability, were derived through the use of an automatic performance evaluation system. Over 3000 hand-labeled spectra were used. Of those evaluated, a linearly mean-corrected minimum distance measure, on a 40-point spectral representation with a square (or cube) norm, was consistently superior to the other methods.

39 citations


Proceedings ArticleDOI
01 Apr 1976
TL;DR: This report presents results obtained in some experiments on the computer recognition of continuous speech with two simple languages having vocabularies of 11 and 250 words.
Abstract: This report presents results obtained in some experiments on the computer recognition of continuous speech. The experiments deal with two simple languages having vocabularies of 11 and 250 words.

Journal ArticleDOI
TL;DR: In this paper, three methods of extracting resonance information from predictor-coefficient coded speech are compared: finding roots of the polynomial in the denominator of the transfer function using Newton iteration, picking peaks in the spectrum of the transferred function, and picking peaks on the negative of the second derivative of the spectrum.
Abstract: Three methods of extracting resonance information from predictor-coefficient coded speech are compared. The methods are finding roots of the polynomial in the denominator of the transfer function using Newton iteration, picking peaks in the spectrum of the transfer function, and picking peaks in the negative of the second derivative of the spectrum. A relationship was found between the bandwidth of a resonance and the magnitude of the second derivative peak. Data, accumulated from a total of about two minutes of running speech from both female and male talkers, are presented illustrating the relative effectiveness of each method in locating resonances. The second-derivative method was shown to locate about 98 percent of the significant resonances while the simple peak-picking method located about 85 percent.
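The root-finding method can be sketched with a generic polynomial root call standing in for the Newton iteration used in the paper; candidate frequencies and 3 dB bandwidths follow from the pole angles and radii in the usual way.

```python
import numpy as np

def formants_from_lpc(a, fs):
    """Formant candidates from the roots of the LPC denominator
    A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p, sampled at fs Hz.
    Returns (frequencies, bandwidths) sorted by frequency."""
    roots = np.roots(a)                          # poles of the all-pole model
    roots = roots[np.imag(roots) > 0.0]          # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bands = -fs / np.pi * np.log(np.abs(roots))  # 3 dB bandwidth estimate
    order = np.argsort(freqs)
    return freqs[order], bands[order]
```

The bandwidth formula makes concrete the relationship the paper reports between resonance bandwidth and spectral sharpness: poles closer to the unit circle give narrower, more prominent peaks.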

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of utterance, and to reduce the noise level in speech signal.
Abstract: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of an utterance, and to reduce the noise level in the speech signal. A period of the short-time autocorrelation function is sampled and spliced after a change of the time scale. The transformed speech is quite natural and free from distortion. Applications of SPAC are expected in many fields, such as improvement of speech quality, narrowband transmission, communication aids for the hard of hearing, information services for the blind, unscrambling of helium speech, stenography, and so on.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: This paper considers distance measures for determining the deviation between two smoothed short-time speech spectra, and suggests Flanagan's results on difference limens for formant frequencies as one basis for checking the perceptual consistency of a measure.
Abstract: This paper considers distance measures for determining the deviation between two smoothed short-time speech spectra. Since such distance measures are employed in speech processing applications that either involve or relate to human perceptual judgment, the effectiveness of these measures will be enhanced if they provide results consistent with human speech perception. As a first step, we suggest Flanagan's results on difference limens for formant frequencies as one basis for checking the perceptual consistency of a measure. A general necessary condition for perceptual consistency is derived for a class of spectral distance measures. A class of perceptually consistent measures obtained through experimental investigations is then described, and results obtained using one such measure under Flanagan's test conditions are presented.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A modified linear prediction method based on the Karhunen-Loève expansion of the correlation matrix of the speech samples, obtained via a new normalization of the parameters, which shows that, due to some important properties of Toeplitz matrices, the poles of the AR model lie on the unit circle.
Abstract: This paper introduces a modified linear prediction method based on the Karhunen-Loève expansion of the correlation matrix of the speech samples. This result is obtained via a new normalization of the parameters. It is shown that, due to some important properties of Toeplitz matrices, the poles of the AR model lie on the unit circle. Consequently, only the formant frequencies are computed and the result can be interpreted as a special discrete Fourier transform. Application to speech analysis is developed with a comparison to the usual linear prediction and cepstrum methods.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: The real time implementation of a Linear Predictive Coding algorithm that has been developed over the past five years is described, using a modification of the Covariance Method for the analyzer and the system for pitch extraction and smoothing.
Abstract: This paper describes the real time implementation of a Linear Predictive Coding algorithm that has been developed over the past five years. The algorithm chosen for the analyzer is a modification of the Covariance Method introduced by B. S. Atal [1],[2] of Bell Labs. The system for pitch extraction uses a minimum distance function correlation technique. A dynamic programming algorithm [3] is used for pitch smoothing and correction of isolated pitch errors. The synthesizer uses a transversal filter. Considerable time has been devoted to optimizing the running time and integer scaling of the different algorithms for real time implementation on a 16 bit mini-computer.
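The synthesis step of such a vocoder can be sketched in direct recursive form. This is one common realization of the all-pole LPC synthesis filter; the paper's transversal-filter structure and its fixed-point integer scaling are not reproduced here.

```python
import numpy as np

def lpc_synthesize(a, excitation):
    """All-pole LPC synthesis, s[n] = e[n] - sum_k a[k] s[n-k],
    driven by an excitation signal (pulses for voiced frames,
    noise for unvoiced frames)."""
    p = len(a) - 1
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k] * s[n - k]   # feedback through past output
        s[n] = acc
    return s
```

Driving the filter with an impulse reproduces the model's impulse response, which is a quick sanity check when porting the recursion to fixed-point hardware.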

Patent
13 Oct 1976
TL;DR: In linear predictive coding (LPC) as discussed by the authors, the input analog signal is divided by filters into multiple, contiguous, substantially equal bandwidth signal components and each component is digitized and processed by a separate standard LPC transmit-receive system.
Abstract: Improved fidelity and reduced computational requirements are achieved in a linear predictive coding (LPC) system that utilizes multichannel signal processing and reduced sampling rates. The input analog signal is divided by filters into multiple, contiguous, substantially equal bandwidth signal components and each component is digitized and processed by a separate standard LPC transmit-receive system. Each transmit-receive system operates at a sampling rate that is equal to the normal sampling rate for the signal being processed divided by the number of channels or signal components used. The received signal components are filtered, converted to analog, and summed.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: A voice-operated question-answering system for seat reservation is constructed by computer simulation, and promising results are obtained.
Abstract: The speech recognition system composing part of a question-answering system operated by conversational speech is described. The recognition system consists of two processing stages: an acoustic processing stage and a linguistic processing stage. In the acoustic processing stage, input speech is analyzed and transformed into a phoneme sequence which usually contains ambiguities and errors caused in segmentation and phoneme recognition. In the linguistic processing stage, the phoneme sequence containing ambiguities and errors is converted into the correct word sequence by the use of linguistic knowledge such as phoneme rewriting rules, lexicon, syntax, semantics, and pragmatics. A voice-operated question-answering system for seat reservation is constructed by computer simulation, and promising results are obtained.

Journal ArticleDOI
TL;DR: The goal of this paper is to provide a unified tutorial development of the various algorithms used and proposed for speech data compression by providing sufficient theoretical background to establish the algorithm relationships without stressing mathematical rigor.

Proceedings ArticleDOI
John Makhoul1
01 Apr 1976
TL;DR: This paper presents a general analysis-synthesis scheme for the arbitrary spectral distortion of speech signals without the need for pitch extraction; linear predictive warping, cepstral warping, and autocorrelation warping are given as examples of the general scheme.
Abstract: The spectral distortion of speech signals, without affecting the pitch or the speed of the signal, has met with some difficulty due to the need for pitch extraction. This paper presents a general analysis-synthesis scheme for the arbitrary spectral distortion of speech signals without the need for pitch extraction. Linear predictive warping, cepstral warping, and autocorrelation warping are given as examples of the general scheme. Applications include the unscrambling of helium speech, spectral compression for the hard of hearing, bit rate reduction in speech compression systems, and efficiency of spectral representation for speech recognition systems.

Journal ArticleDOI
TL;DR: This paper considers temporal rearrangement or scrambling of the lpc code sequence, as well as the alternative of perturbing individual samples in the sequence by means of pseudo-random additive or multiplicative noise.
Abstract: This paper discusses several manipulations of lpc (linear predictive coding) parameters for providing speech encryption. Specifically, the paper considers temporal rearrangement or scrambling of the lpc code sequence, as well as the alternative of perturbing individual samples in the sequence by means of pseudo-random additive or multiplicative noise. The latter approach is believed to have greater encryption potential than the temporal scrambling technique, in terms of the time needed to “break the secrecy code.” The encryption techniques are assessed on the basis of perceptual experiments, as well as by means of a quantitative assessment of speech-spectrum distortion, as given by an appropriate “distance” measure.
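The temporal-scrambling alternative can be sketched as a key-seeded permutation of the LPC frame sequence. This is a toy illustration of the idea, not the paper's scheme; `key` is a hypothetical shared secret from which the receiver regenerates the same permutation.

```python
import numpy as np

def scramble(frames, key):
    """Key-seeded pseudo-random temporal permutation of LPC frames
    (rows = frames of LPC parameters)."""
    rng = np.random.default_rng(key)
    perm = rng.permutation(len(frames))
    return frames[perm], perm

def unscramble(scrambled, perm):
    """Invert the permutation; a receiver holding the key would
    rebuild perm with the same seeded generator."""
    out = np.empty_like(scrambled)
    out[perm] = scrambled
    return out
```

As the paper notes, such scrambling is weaker than perturbing the parameter values themselves: the frame contents are left intact, so an eavesdropper who guesses the block boundaries can attack the permutation directly.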

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A technique that splits the spectrum into two equal halves and performs a piecewise LPC approximation to each half is described; the fidelity is expected to be higher than that of standard LPC.
Abstract: A great deal of current research in the area of narrowband digital speech compression makes use of the Linear Prediction Coding (LPC) algorithm to extract the vocal tract spectrum. This paper describes a technique that splits the spectrum into two equal halves and performs a piecewise LPC approximation to each half. By taking advantage of the classical benefits of piecewise approximation, the fidelity is expected to be higher than that of standard LPC. In addition, by making use of under-sampling and spectrum folding, computational requirements are reduced by about 40%. PLPC has been implemented in real time on the CSP-30 computer at the Speech Research and Development Facility of the Communications Security Engineering Office (DCW) at ESD.

Proceedings ArticleDOI
01 Feb 1976
TL;DR: The Institute for Mathematical Studies in the Social Sciences at Stanford (IMSSS) has developed a synthesis system, MISS, designed to test the effectiveness of computer-generated speech in the context of complex CAI programs.
Abstract: The Institute for Mathematical Studies in the Social Sciences at Stanford (IMSSS) has developed a synthesis system, MISS (Microprogrammed Intoned Speech Synthesizer), designed to test the effectiveness of computer-generated speech in the context of complex CAI programs. No one method of computer controlled speech production is completely satisfactory for all the uses of computer-assisted instruction (CAI). The choice of synthesis method is strongly related to the kinds of curriculums and instructional designs that will use speech. We chose to use acoustic modelling by linear predictive coding as the method of synthesis for MISS. In Section 2 we describe criteria appropriate for organizing the comparison of voice response systems for use with instructional computers. Then we describe the particular requirements imposed by curriculums at IMSSS, review general voice synthesis techniques, and finally discuss our actual choice. In Sections 3 and 4 we outline the hardware and software that have been created to support MISS in operational CAI at Stanford. In Section 5 we discuss the applications of audio to CAI.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: Characteristics of common sources of noise and distortion are described in this paper and their effect in shaping the spectrum of speech is discussed.
Abstract: Parameter or feature extraction from the speech signal forms the basis for systems designed for speech recognition, speaker verification, speech bandwidth compression, etc. The parameters in general are critically dependent upon the short-time spectrum of speech. The input speech waveform is, however, subjected to several types of noise and distortion due to background noise sources, reverberation, close speaking into a microphone, telephone system imperfections, etc. These factors modify the spectrum of the speech signal and hence the parameters extracted. Characteristics of common sources of noise and distortion are described in this paper and their effect in shaping the spectrum of speech is discussed. Steps to reduce the influence of some noises while producing speech input to a system are suggested. Methods of normalization of spectral distortions due to noise and the effect of such normalization on parameter extraction are also discussed.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: This paper reports the results of an investigation of a computable Quality Comparison Measure (called the QCM) for linear predictive systems, a weighted combination of differences between the input and output speech parameters for a series of spoken sentences.
Abstract: This paper reports the results of an investigation of a computable Quality Comparison Measure (called the QCM) for linear predictive systems. The measure described is easily obtained by a synthesis-analysis procedure. It is a weighted combination of differences between the input and output speech parameters for a series of spoken sentences. Results are presented that demonstrate a high correlation between QCM and listener preference scores. The QCM offers an alternative to costly and time consuming formal listening procedures.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A new spectral interpretation of this method for waveform matching is developed, describing how the accuracy of such a system can be increased by training on multiple repetitions of the reference vocabulary, and how both the reference parameter storage requirements and the computation rate can be reduced by half by omitting redundant speech information.
Abstract: The linear prediction residual has been shown to be an effective technique for isolated word recognition. This paper develops a new spectral interpretation of this method for waveform matching, describes how the accuracy of such a system can be increased by training on multiple repetitions of the reference vocabulary, and shows how both the reference parameter storage requirements and the computation rate can be reduced by half by omitting redundant speech information. Results of experiments vary from a low of 88% when the vocabulary consisted of 25 pairs of rhyming words differing only by the initial consonant sound, to a high of 98.1% when the vocabulary consisted of 107 flight commands having an average of two syllables per word.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: This research has resulted in the development of a new pitch-synchronous analysis technique for the extraction of accurate formant information from speech signals that is an improvement over current methods of analysis in terms of accuracy and temporal resolution.
Abstract: This research has resulted in the development of a new pitch-synchronous analysis technique for the extraction of accurate formant information from speech signals. The method is an improvement over current methods of analysis in terms of accuracy and temporal resolution. This is achieved by extension of the signal from one pitch period into the next, using a speech production model based on linear prediction. The result is higher accuracy in the determination of formant frequencies, bandwidths and amplitudes, and the ability to follow rapid formant transitions. The method performs equally well with nasal and high pitched sounds. The method is applied to the speech recognition and the speaker identification problems.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: Semantic and syntactic information is used to resolve ambiguities and to yield higher-order decisions in automatic speech recognition and understanding systems.
Abstract: Automatic speech recognition and understanding are currently receiving considerable attention. Most approaches to problems in these areas involve rather complicated systems. Typically, the acoustic waveform is first segmented into units such as phonemes or syllables. Semantic and syntactic information is then used to resolve ambiguities and to yield higher-order decisions. This complexity is probably necessary if the most general speech-recognition problems are to be solved.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: The quantitative rules obtained for generating the SSRU's are expected to be useful, at least as a preliminary investigation tool, for synthesis-by-rule.
Abstract: Summary form only given, as follows. The paper deals with the application of the linear prediction technique to speech synthesis of both the Italian and German languages by Standard Speech Reproducing Units (SSRU), i.e., by combining elementary speech segments of standardized characteristics extracted from utterances of native speakers. The main feature of the method presented is the possibility of synthesizing, in a highly intelligible form, any message of such languages with a very limited amount of data. So far the use of linear predictive coding of the previously realized SSRU sets allowed a memory occupation of less than 16 kbytes for the synthesis of Italian and less than 32 kbytes for the combined synthesis of Italian and German. The data flow rate is about 1 kb/s. A key property of the code with respect to previously used methods (i.e., simple concatenation of original segments) lies in the possibility of greatly enhancing the naturalness of the synthesized speech by varying the pitch, amplitude, and duration of the synthetic segments. Further, the quantitative rules obtained for generating the SSRUs are expected to be useful, at least as a preliminary investigation tool, for synthesis-by-rule.