Showing papers on "Speech coding" published in 1977


Journal ArticleDOI
R. Zelinski, P. Noll
TL;DR: The main result is that this adaptive transform coder performs better than all known nonpitch-tracking coding schemes; it extends the range of speech waveform coding to lower bit rates and closes the gap between vocoders and predictive waveform coders.
Abstract: This paper discusses speech coding systems based upon transform coding (TC). It compares several transforms and shows that the cosine transform leads to a nearly optimum performance for almost all speech sounds. Various adaptive coding strategies are then investigated, and a coding scheme is proposed that is based on a nonadaptive discrete cosine transform (DCT), on an adaptive bit assignment, and on adaptive quantization. The adaptation is controlled by a short-term basis spectrum that is derived from the transform coefficients prior to coding and transmission and that is transmitted as side information to the receiver. The main result is that this adaptive transform coder performs better than all known nonpitch-tracking coding schemes; it extends the range of speech waveform coding to lower bit rates and closes the gap between vocoders and predictive waveform coders.

340 citations
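
The adaptive bit assignment mentioned in the abstract above can be illustrated with the classical log-variance allocation rule. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `allocate_bits`, the iterative handling of negative allocations, and the use of per-coefficient variance estimates as a stand-in for the short-term basis spectrum are all assumptions made for illustration. Rounding to integer bit counts is omitted.

```python
import math

def allocate_bits(variances, avg_bits):
    """Split avg_bits * N bits across N coefficients by the log-variance rule.

    Hypothetical helper: coefficients whose share would be negative get
    zero bits and the budget is redistributed over the rest.
    All variances must be strictly positive.
    """
    n = len(variances)
    total = avg_bits * n
    bits = [0.0] * n
    active = list(range(n))
    while active:
        # Mean log-variance (log of the geometric mean) over active coefficients.
        mean_log = sum(math.log2(variances[i]) for i in active) / len(active)
        share = total / len(active)
        trial = {i: share + 0.5 * (math.log2(variances[i]) - mean_log)
                 for i in active}
        negative = [i for i in active if trial[i] < 0]
        if not negative:
            for i in active:
                bits[i] = trial[i]
            break
        # Drop coefficients that would need a negative bit count and retry.
        active = [i for i in active if i not in negative]
    return bits
```

For example, `allocate_bits([16.0, 4.0, 1.0, 0.25], 2)` spreads 8 bits so that each factor of four in variance is worth one extra bit.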


Journal ArticleDOI
TL;DR: A template-matching procedure which uses as its basic waveform features a set of linear prediction coefficients and is used in conjunction with a dynamic-programming time-warp algorithm developed by Bridle and a novel method for using multiple templates.
Abstract: This paper considers the problem of automatically detecting and locating key words in a stream of continuous speech. The system described here is a template-matching procedure which uses as its basic waveform features a set of linear prediction coefficients. The similarity measure between a segment of the template and a segment of the incoming speech stream is taken to be a ratio of minimum prediction residuals. This similarity measure is used in conjunction with a dynamic-programming time-warp algorithm developed by Bridle and a novel method for using multiple templates. Using templates and incoming speech spoken by the same person in a quiet room, an accuracy in excess of 99 percent was obtained. Further experiments are described which explore cross-speaker word spotting and the effects of noise on system performance. The results of these experiments suggest that the technique described in this paper could well form the basis for a practical system.

63 citations
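
The dynamic-programming time warp named in the abstract above can be sketched with the textbook recurrence. This is an assumption-laden illustration, not Bridle's algorithm or the paper's system: `dtw_distance` is a hypothetical name, the frame distance `dist` is supplied by the caller (the paper uses a ratio of prediction residuals, which is not implemented here), and no slope constraints or word-spotting endpoint relaxation are included.

```python
def dtw_distance(template, segment, dist):
    """Minimum accumulated frame distance between two feature sequences.

    Textbook dynamic time warping with steps (1,0), (0,1), (1,1).
    `dist(a, b)` is any non-negative frame-to-frame distance.
    """
    inf = float("inf")
    n, m = len(template), len(segment)
    # cost[i][j]: best accumulated distance aligning the first i template
    # frames with the first j segment frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(template[i - 1], segment[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # segment frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

With scalar features and `dist=lambda a, b: abs(a - b)`, a template `[1, 2, 3]` matches the stretched segment `[1, 2, 2, 3]` at zero cost, which is the time-alignment behaviour the abstract relies on.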


Journal ArticleDOI
TL;DR: It is shown by theoretical argument and by experiment that selection of an undriven segment of voiced speech for analysis by linear predictive coding (LPC) gives more accurate estimates of the poles of the vocal-tract model.
Abstract: We show by theoretical argument and by experiment with both synthetic and real data that selection of an undriven segment of voiced speech for analysis by linear predictive coding (LPC) gives more accurate estimates of the poles of the vocal-tract model. In the case of voiced nasal phonemes, this technique provides a simple algorithm for separately determining the poles and the zeros in the model and illustrates the desirability of identifying the portions of the speech wave during which there is a significant driving input. A key problem which remains is the development of a practical algorithm for selecting such segments for analysis.

40 citations


Journal ArticleDOI
TL;DR: The quality of LPC (linear predictive coding) analyzed and synthesized speech was evaluated and subject preference as a function of the pitch range of the speaker and the transmission environment used in the recording is discussed.
Abstract: A subjective evaluation of seven pitch detectors has been carried out using synthetic speech. The evaluation is intended to complement the objective performance evaluation of the same pitch detection algorithms in the investigation of Rabiner et al. [1]. In the earlier study, each of the seven algorithms was evaluated on the basis of its performance with respect to four different types of errors. The standard of comparison was a semiautomatically determined pitch contour of each utterance in the experimental corpus. In the present study, the quality of LPC (linear predictive coding) analyzed and synthesized speech was evaluated. The pitch contour used in the synthesis was obtained either from one of the seven pitch detectors or from the semiautomatic pitch analysis. Using a computer-controlled sort board, an experiment was run in which each of eight listeners was asked to rank the nine versions of each utterance (the natural version was included to provide a stable anchor point). Results are presented on the overall preference for each pitch detector. In addition, subject preference as a function of the pitch range of the speaker and the transmission environment used in the recording is discussed. The present results are compared to those obtained in the earlier objective performance study.

36 citations


Journal ArticleDOI
P. de Souza
TL;DR: It is shown that Itakura's prediction-residual ratio is intuitively unsatisfactory and theoretically misleading as a distance measure, and two slower, but more accurate statistical means of comparison are suggested.
Abstract: This paper considers the problem of comparing two sets of (LPC) coefficients or, more generally, that of comparing two short segments of speech via LPC techniques. It is shown that Itakura's prediction-residual ratio is intuitively unsatisfactory and theoretically misleading as a distance measure. Two slower, but more accurate statistical means of comparison are suggested, and these are supported by evidence from a simulation study.

34 citations


Journal ArticleDOI
TL;DR: The goal was a low-power, low-cost, compact special-purpose realization of a narrow-band speech terminal, and the resultant design is a general-purpose two-bus structure running at a 150 ns cycle time.
Abstract: A microprocessor realization for a linear predictive vocoder is presented. The goal was a low-power, low-cost, compact special-purpose realization of a narrow-band speech terminal. The resultant design is a general-purpose two-bus structure running at a 150 ns cycle time, using as the basic signal processing element, four of the AMD 2901 CPE chips. This basic structure is augmented by a four-cycle multiplier to allow for sufficient signal processing power. The design concessions that mark the linear predictive coding microprocessor (LPCM) as a special-purpose machine designed to be a speech terminal are: limited I/O and limited memory. The present design requires 162 dual-in-line packages, dissipates less than 45 W and occupies about 1/3 ft³.

32 citations


Journal ArticleDOI
N. Dixon, Harvey F. Silverman
TL;DR: The modular acoustic processor (MAP), a complex experimental system for automatic derivation of phonemic string output for continuous speech, has stages dedicated to signal analysis, spectral classification, phonemic segmentation, phonetic (steady state) classification, phoneme boundary placement, dyadic (transitional) classification, and final phoneme string consolidation.
Abstract: The modular acoustic processor (MAP), a complex experimental system for automatic derivation of phonemic string output for continuous speech, has stages dedicated to signal analysis, spectral classification, phonemic segmentation, phonemic (steady state) classification, phoneme boundary placement, dyadic (transitional) classification, and final phoneme string consolidation. This paper presents the concepts of and some details concerning these five stages. Results on a large body of continuous speech data, prepared by an automatic evaluation system, will also be presented.

28 citations


Journal ArticleDOI
TL;DR: This paper describes a speech analysis-synthesis system based on stationary linear prediction formulation that uses a variable analysis frame size concept and the k-parameters are used to represent the spectral information in the speech.
Abstract: This paper describes a speech analysis-synthesis system based on stationary linear prediction formulation. This system uses a variable analysis frame size concept. The k-parameters are used to represent the spectral information in the speech. The statistical and quantization properties of k-parameters are studied in detail. A method for calculating the analysis frame size based on energy and pitch period variations within a speech waveform has been developed. The speech analysis-synthesis system has been implemented on the computing facility of the Signal Processing Laboratory at Case Western Reserve University. Average data rates of 4800, 3600, and 2400 bits/s have been achieved on a limited speech data base of male speakers.

22 citations
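
The k-parameters (reflection coefficients) in the abstract above are commonly obtained from frame autocorrelations with the Levinson-Durbin recursion. The sketch below shows that standard recursion only; it is not the paper's variable-frame-size system, the name `reflection_coefficients` is hypothetical, and the sign convention (error filter A(z) = 1 + a1 z^-1 + ...) is an assumption, since conventions differ between texts.

```python
def reflection_coefficients(r, order):
    """Levinson-Durbin recursion.

    r: autocorrelation values r[0..order] of one analysis frame.
    Returns (k, err): the reflection coefficients k[0..order-1] and the
    final prediction-error energy.
    """
    a = [1.0] + [0.0] * order   # predictor polynomial, a[0] is always 1
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        # Correlation of the current prediction error with lag m.
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)    # error energy shrinks at each order
    return ks, err
```

For a first-order process with autocorrelation `[1.0, 0.5, 0.25]`, the first coefficient is -0.5 and the second is zero, since one predictor tap already explains the correlation.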


Patent
11 May 1977
TL;DR: In this paper, a subscription TV decoder for processing encoded audio and video signals in which the TV program audio information is received as an audio subcarrier includes means for decoding the video signal, a filter for separating the decoded video and audio signals, and a filter that separates the audio signal from the audio information sub-carrier.
Abstract: A subscription TV decoder for processing encoded audio and video signals in which the TV program audio information is received as an audio subcarrier includes means for decoding the video signal, a filter for separating the decoded video and audio signals, and a filter for separating the audio signal from the audio information subcarrier. The audio information subcarrier is multiplied to raise it to the frequency of the audio signal and is then recombined with the decoded video signal.

18 citations



Proceedings ArticleDOI
09 May 1977
TL;DR: A data compression technique, Adaptive Differential PCM with Time Assignment Speech Interpolation (TASI), that is capable of reducing the bit rate required for PCM encoded speech is described and evaluated using computer simulation.
Abstract: A data compression technique, Adaptive Differential PCM (ADPCM) with Time Assignment Speech Interpolation (TASI), that is capable of reducing the bit rate required for PCM encoded speech is described. The particular case of 2:1 compression in a T1 system environment is described in detail and evaluated using computer simulation. The ADPCM/TASI system has wider dynamic range and less degradation under loading than standard PCM. Signal-to-noise ratios provide an objective metric. An audio tape containing computer processed speech for various ADPCM/TASI systems in various environments accompanies this presentation.

Journal ArticleDOI
TL;DR: This work reports on construction of a TTL hardware multi-path sequential encoder which uses the so-called M algorithm search procedure, and hardware peculiar to this type of encoder is discussed, including architecture of the search algorithm sorter, the squared error calculator, and the code generator.
Abstract: Recent work has shown the usefulness of sequential or "tree" encoding of speech. We report on construction of a TTL hardware multi-path sequential encoder which uses the so-called M algorithm search procedure. The device attains a signal-to-noise ratio of about 20 dB at 16 kbits/s. Hardware peculiar to this type of encoder is discussed, including architecture of the search algorithm sorter, the squared error calculator, and the code generator.

Proceedings ArticleDOI
R. Crochiere, M. R. Sambur
09 May 1977
TL;DR: The standard fixed sub-band coding scheme has been modified to allow the center frequency of the two upper bands to vary in accordance with the dynamic movement of the vocal tract resonances F2 and F3 to produce moderate-quality, intelligible speech.
Abstract: The standard fixed sub-band coding scheme has been modified to allow the center frequency of the two upper bands to vary in accordance with the dynamic movement of the vocal tract resonances F2 and F3. A relatively simple zero-crossing technique is used to measure the formants F2 and F3. Through the use of this variable band coder, it is possible to produce moderate-quality, intelligible speech at 4.8 kb/s (quality is slightly less than that of a 7.2-kb/s fixed sub-band coder and equal to that of about a 16-kb/s ADM coder). The reasonably good intelligibility of the 4.8-kb/s variable-band coded speech can be attributed to the coder's attempt to capture and encode those spectral components of the signal that are perceptually most significant (the region around the formants). The major advantage of the variable-band scheme is that its implementation is considerably less complex than other waveform coding schemes or vocoder systems that can produce intelligible, narrowband speech.

Proceedings ArticleDOI
C. Un, S. Yang
01 May 1977
TL;DR: A new quantization method for coding reflection coefficients in linear predictive coding (LPC) of speech employs piece-wise linear quantization and requires statistical properties of the LPC reflection coefficients.
Abstract: We present a new quantization method for coding reflection coefficients in linear predictive coding (LPC) of speech. It employs piece-wise linear quantization and requires statistical properties of the LPC reflection coefficients. Although the quantization scheme is based on the density of the frequencies of the coefficient values, it does not neglect the importance of spectral sensitivity. In our informal subjective listening tests it was observed that the quality of synthetic speech with the transmission rate of 2.4 kbits/s coded by the piecewise linear quantization method was equivalent to the quality with the rate of 3 kbits/s coded by a linear quantization method.


Journal ArticleDOI
R. Crochiere, M. R. Sambur
TL;DR: In this article, a variable band coding scheme was proposed to allow the center frequency of the two upper bands to vary in accordance with the dynamic movement of the vocal tract resonances F2 and F3.
Abstract: The standard fixed sub-band coding scheme has been modified to allow the center frequency of the two upper bands to vary in accordance with the dynamic movement of the vocal tract resonances F2 and F3. A relatively simple zero-crossing technique is used to measure the formants F2 and F3. Through the use of this variable band coder, it is possible to produce moderate-quality, intelligible speech at 4.8 kb/s (quality is slightly less than that of a 7.2-kb/s fixed sub-band coder and equal to that of about a 16-kb/s ADM coder). The reasonably good intelligibility of the 4.8-kb/s variable-band coded speech can be attributed to the coder's attempt to capture and encode those spectral components of the signal that are perceptually most significant (the region around the formants). The major advantage of the variable-band scheme is that its implementation is considerably less complex than other waveform coding schemes or vocoder systems that can produce intelligible, narrowband speech.

Proceedings ArticleDOI
01 May 1977
TL;DR: Variable data rate LPC speech compression schemes are employed to transmit LPC parameters only when speech characteristics have changed sufficiently since the last transmission, yielding improved speech quality relative to fixed-rate schemes for a given average transmission rate.
Abstract: Variable data rate LPC speech compression schemes are employed to transmit LPC parameters only when speech characteristics have changed sufficiently since the last transmission, yielding improved speech quality relative to fixed-rate schemes for a given average transmission rate. Transmission of variable-rate LPC speech over fixed-rate channels is accomplished using transmit and receive buffers, with resulting transmission delays. Development of proper buffer control strategy is essential to minimize losses caused by exhausting either buffer, or by corrective actions, namely, forced or suppressed transmission. Certain aspects of such strategy and their impact on speech quality and data rate are discussed for a narrowband (2400 bps) speech transmission system.

Journal ArticleDOI
01 Oct 1977
TL;DR: The results of an extensive investigation of the properties of 64-point Hadamard transformed speech are presented in this article, where detailed information is given about the probability density functions of the Hadamard coefficients, the average power-density spectrum in the Hadamard domain and the logical-autocorrelation function.
Abstract: The results of an extensive investigation of the properties of 64-point Hadamard transformed speech are presented. Detailed information is given about the probability density functions of the Hadamard coefficients, the average power-density spectrum in the Hadamard domain and the logical-autocorrelation function. The results indicate that good-quality speech can be reconstructed from 6 to 8 dominant Hadamard coefficients, but that the use of fewer coefficients is unlikely to lead to the reconstruction of speech of acceptable quality. The results of a preliminary series of listening tests are presented and these confirm conclusions drawn from the statistical properties of the transformed speech. It is shown that the number of bits needed for coefficient labelling constitutes a significant proportion of the total number of bits needed to represent Hadamard transformed speech. A technique is presented for reducing by more than 50% the number of labelling bits needed, and it is explained how, by using this technique, it should be possible to obtain good quality speech when using a transmission bit rate of 8 kbits/s.
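
Reconstruction from a few dominant Hadamard coefficients, as described in the abstract above, can be sketched with a fast Walsh-Hadamard transform. The helper names `fwht`, `keep_dominant`, and `inverse_fwht` are hypothetical, the transform is in natural (Hadamard) ordering, and selecting by largest magnitude is an assumption about what "dominant" means. The paper's labelling-bit reduction technique is not shown.

```python
def fwht(x):
    """Fast Walsh-Hadamard transform, natural order, unnormalized.

    len(x) must be a power of two (64 in the abstract's setting).
    """
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b   # butterfly
        h *= 2
    return y

def keep_dominant(coeffs, count):
    """Zero all but the `count` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                   reverse=True)
    kept = set(order[:count])
    return [c if i in kept else 0 for i, c in enumerate(coeffs)]

def inverse_fwht(coeffs):
    """Inverse transform: the same butterflies followed by division by N."""
    n = len(coeffs)
    return [v / n for v in fwht(coeffs)]
```

A block would be approximated as `inverse_fwht(keep_dominant(fwht(block), 8))`. With 64 coefficients, naming one retained position takes 6 bits if sent naively, which is why the labelling overhead discussed in the abstract matters.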

Journal ArticleDOI
TL;DR: The strategy is a shift of circuit emphasis from analog to digital in order to take full advantage of low-cost, low-speed digital processing technology such as MOS/LSI to achieve the desired objectives.
Abstract: An Adaptive Delta Modulator and demodulator are used as the first and last stages in a system for coding and decoding telephone signals into μ = 255 Companded Pulse Code Modulation. The system objectives are to devise an economic coder and decoder that is reliable, free of potentiometer adjustments, and convenient for automated manufacturing for large quantity production. The strategy is a shift of circuit emphasis from analog to digital in order to take full advantage of low-cost, low-speed digital processing technology such as MOS/LSI to achieve the desired objectives. The system structure, digital signal processing, system implementation, and performance of the prototype are discussed in this paper.
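
The μ = 255 companding law that the abstract above targets can be written down directly. This sketch shows only the continuous μ-law curve on samples normalized to [-1, 1]; the function names are hypothetical, and practical telephone codecs use a segmented piecewise-linear approximation of this curve with 8-bit codes, which is not reproduced here. The delta-modulation front end of the paper is also out of scope.

```python
import math

MU = 255.0

def mu_compress(x):
    """Continuous mu-law compression of x in [-1, 1] to [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

Small amplitudes are stretched and large ones compressed, so a uniform quantizer applied to `mu_compress(x)` gives finer resolution to quiet speech.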

01 Oct 1977
TL;DR: Results of an examination of four methods for processing speech so as to enhance its intelligibility in the presence of wideband random noise at the source are described.
Abstract: This report describes results of an examination of four methods for processing speech so as to enhance its intelligibility in the presence of wideband random noise at the source. The four methods were: (1) INTEL, a method which involves processing in both the first and second order spectral domains; (2) Spectral subtraction, which involves a simple subtraction of the average noise spectrum from the first-order spectrum; (3) Minimum mean square error filtering, which involves filtering speech in such a way as to minimize the mean square error between a signal and its expected value in noise; and (4) Methods based upon suppressing the frequency content of a speech plus noise signal between pitch harmonics of the speech signal. To carry out a study of methods to enhance speech intelligibility in noise, two general-purpose computer processing systems were implemented. The first, a terminal interactive system for generation, analysis, and graphic display of synthetic voiced speech sounds, provided considerable insight into the effect of various processing algorithms upon speech and upon speech in noise. The second computer processing system has been developed for the processing of real speech. It involves use of a DDP-116 data converter and a Honeywell 6000 Computer.

Journal ArticleDOI
TL;DR: It is shown that in addition to speech interpolation, substantial savings can be achieved through the use of redundancy-reducing transcoding schemes, one for voiced and one for unvoiced sounds.
Abstract: This paper focuses on the efficient transmission of digital telephone-quality speech in a multichannel situation. It is shown that in addition to speech interpolation, substantial savings can be achieved through the use of redundancy-reducing transcoding schemes, one for voiced and one for unvoiced sounds. Specific waveform properties and coding schemes for both types of sounds, as well as the means for the discrimination, are discussed. The performance of two experimental systems which, respectively, triple and quadruple digital carrier capacity is presented.

Proceedings ArticleDOI
01 May 1977
TL;DR: The theoretical and practical limits for voiced and unvoiced segmentation using two distinct bit patterns in block quantization are outlined, and useful constraints for implementing a transform encoder on a fast digital signal processor are derived with computer simulations.
Abstract: Fourier transformation and block quantization offer a good tool for decorrelation and data compression of speech signals as well as for correlated random processes. Polar plane quantization versus Cartesian quantization is presented and the objective and subjective performance is discussed according to the effects of introducing phase errors. The theoretical and practical limits for voiced and unvoiced segmentation using two distinct bit patterns in block quantization are outlined. Useful constraints for implementing a transform encoder on a fast digital signal processor are derived with computer simulations, demonstrating high speech qualities for data rates down to 8-16 kbit/s.

Journal ArticleDOI
01 Dec 1977
TL;DR: The paper describes the extension of the area function to an areagraph display for continuous speech developed for the training of continuous speech, and various forms of the areagraph are described and compared with the spectrograph.
Abstract: The classical displays used in speech analysis are the spectrum for single sounds and the spectrograph for continuous speech. Recent work using linear prediction analysis has led to the display of the vocal tract area function and this has been found useful in the speech training of single sounds for the deaf. The paper describes the extension of the area function to an areagraph display for continuous speech. This has been developed for the training of continuous speech, and various forms of the areagraph are described and compared with the spectrograph. Areagraph displays are also thought to be potentially useful in applications other than speech training.


Proceedings ArticleDOI
Subhro Das, Charles C. Tappert
01 May 1977
TL;DR: A real-time Adaptive Differential Pulse Code Modulation system is described which employs an inexpensive, currently about $25, stack-architecture microprocessor which performs all processing between the taking of input speech samples.
Abstract: A real-time Adaptive Differential Pulse Code Modulation (ADPCM) system is described which employs an inexpensive, currently about $25, stack-architecture microprocessor. The coder operates at 3 bits/sample and a 10 kHz rate and performs all processing between the taking of input speech samples. The ADPCM tables are stored in the stack for rapid manipulation of table column and row pointers allowing implementation of real-time operation.
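
A 3 bits/sample ADPCM loop of the general kind described in the abstract above can be sketched as follows. Everything specific here is an assumption for illustration: the first-order predictor (previous reconstructed sample), the step-size multipliers, the step limits, and the function names are not taken from the paper, and the table-driven stack implementation it describes is not reproduced.

```python
# Illustrative step-size multipliers indexed by the 2-bit magnitude code.
MULTIPLIERS = (0.9, 0.9, 1.25, 1.75)
STEP_MIN, STEP_MAX = 1.0, 2048.0

def _update(pred, step, code):
    """Shared reconstruction rule so encoder and decoder stay in lockstep."""
    mag = code & 3                      # low two bits: magnitude level
    delta = (mag + 0.5) * step          # mid-point of the quantizer cell
    if code & 4:                        # high bit: sign
        delta = -delta
    pred += delta
    step = min(STEP_MAX, max(STEP_MIN, step * MULTIPLIERS[mag]))
    return pred, step

def adpcm_encode(samples, step=16.0):
    """Encode samples to 3-bit codes (0..7)."""
    pred, codes = 0.0, []
    for s in samples:
        diff = s - pred
        mag = min(3, int(abs(diff) / step))
        code = mag | (4 if diff < 0 else 0)
        codes.append(code)
        pred, step = _update(pred, step, code)
    return codes

def adpcm_decode(codes, step=16.0):
    """Rebuild the waveform from 3-bit codes."""
    pred, out = 0.0, []
    for code in codes:
        pred, step = _update(pred, step, code)
        out.append(pred)
    return out
```

Because the step size is adapted from the transmitted code alone, the decoder needs no side information, which keeps the per-sample work small enough for the kind of real-time budget the abstract mentions.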

Proceedings ArticleDOI
N. Dixon
01 May 1977
TL;DR: The rationale for and some examples from an application hierarchy and a recognition-then-segmentation approach will be presented; this approach has been used fairly successfully in phonemic segmentation of continuous speech.
Abstract: In Automatic Recognition of Continuous Speech (ARCS), one approach is to segment the speech continuum approximately at the phoneme level as an initial step in abstracting lexical and/or semantic content. If heuristic rules are used for this segmentation, the order of rule application and the character of the data to be used by the rules become important considerations. The rationale for and some examples from an application hierarchy and a recognition-then-segmentation approach will be presented; this approach has been used fairly successfully in phonemic segmentation of continuous speech.

Proceedings ArticleDOI
07 Nov 1977
TL;DR: A single data line zero-crossing detector input to a microprocessor or other computer is evaluated for use in lieu of analog/digital conversion in the processing of complex continuous-time waveforms.
Abstract: A single data line zero-crossing detector input to a microprocessor or other computer is evaluated for use in lieu of analog/digital conversion in the processing of complex continuous-time waveforms.

Proceedings ArticleDOI
01 May 1977
TL;DR: An efficient yet simple way to resolve the problem of on-line speech/data-modem identification for a class of FSK and PSK modems by using a pattern classifier technique based on four parameters extracted from 8-ms block analysis.
Abstract: The application of efficient speech compression schemes to digital telephony is hindered by the potential transit of non-speech signal generated by data modems. This obstacle can be overcome through on-line speech/data-modem identification. This paper presents an efficient yet simple way to resolve this problem for a class of FSK and PSK modems. The approach uses a pattern classifier technique based on four parameters extracted from 8-ms block analysis. These are the average log magnitude, the extremum count, the zero-crossing count and max-to-mean amplitude deviation. The speech/data-modem decision is obtained through a piecewise-linear partitioning of the parameter space.
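
The four block parameters listed in the abstract above can be computed along the following lines. The exact definitions are assumptions, since the abstract names the parameters without defining them: here the extremum count is the number of slope sign changes, the max-to-mean deviation is the peak magnitude minus the mean magnitude, and a small `floor` avoids taking the logarithm of zero. At an assumed 8 kHz sampling rate an 8-ms block would be 64 samples. The piecewise-linear classifier itself is not shown.

```python
import math

def block_features(block, floor=1e-6):
    """Return (avg_log_mag, extrema, zero_crossings, max_to_mean) for one block.

    Hypothetical definitions; see the accompanying text for the assumptions.
    """
    mags = [abs(v) for v in block]
    avg_log_mag = sum(math.log(m + floor) for m in mags) / len(mags)
    # Sign changes between consecutive samples.
    zero_crossings = sum(1 for a, b in zip(block, block[1:])
                         if (a < 0) != (b < 0))
    # Local maxima and minima: the slope changes sign at the middle sample.
    extrema = sum(1 for a, b, c in zip(block, block[1:], block[2:])
                  if (b - a) * (c - b) < 0)
    mean_mag = sum(mags) / len(mags)
    max_to_mean = max(mags) - mean_mag
    return avg_log_mag, extrema, zero_crossings, max_to_mean
```

A constant-envelope modem tone would be expected to give a small max-to-mean deviation and very regular crossing counts, whereas speech varies widely from block to block. That contrast is the intuition behind using these parameters, not a result taken from the paper.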

Proceedings ArticleDOI
01 Dec 1977
TL;DR: The covariance lattice linear prediction method is shown to have some advantages over the other methods for segmenting speech into regions where the spectrum is approximately stationary.
Abstract: Linear predictive analysis/synthesis methods offer an efficient means of low bit rate encoding of speech signals. This method involves the time segmentation of the speech into regions where the spectrum is approximately stationary. The analysis/synthesis technique is discussed as well as four methods of determining linear predictive approximations. A comparison of these methods on a synthetic speech-like signal and on real speech is presented. The covariance lattice linear prediction method is shown to have some advantages over the other methods for segmenting speech into regions where the spectrum is approximately stationary.

DOI
01 Jan 1977
TL;DR: This thesis reviews recent research into the speech perception process and revises the analysis by synthesis model of speech perception, revealing that the human auditory system is innately equipped to divide stimuli (both speech and non-speech) that vary along certain acoustic dimensions into discrete classes.
Abstract: During the time when a child learns the sound system of his language, there is much evidence that the child can perceive phonological distinctions and therefore detect phonetic differences before he can produce these distinctions. This evidence is often provided to disprove the hypothesis that the child could be using an "active" model of speech perception. One such model, the analysis by synthesis model of speech perception, supposes that decoding of the acoustic signal employs the articulatory representation that would be required to produce the hypothesized identity of the incoming signal. The model proposes that while the human auditory system is innately equipped to handle the segments contained in speech, that the correlations between the acoustic information and articulation are learned with experience and form the basis for the division of the continuous acoustic signal into discrete categories of speech sounds. This thesis reviews recent research into the speech perception process and revises the analysis by synthesis model. It reveals that the human auditory system is innately equipped to divide stimuli (both speech and non-speech) that vary along certain acoustic dimensions into discrete classes. The unique processing that results for speech stimuli, occurs when the stimuli is recognized as having a function in the system of language. Hence the requirements for phonetic processing involve the psychological realization that stimulus originated in the human vocal tract. This investigation then reviewed the available literature on the perception and production of children acquiring language to determine