
Showing papers on "Speech coding published in 1976"


Journal ArticleDOI
TL;DR: A pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal, which has been found to provide reliable classification with speech segments as short as 10 ms.
Abstract: In speech analysis, the voiced-unvoiced decision is usually performed in conjunction with pitch analysis. The linking of the voiced-unvoiced (V-UV) decision to pitch analysis not only results in unnecessary complexity, but makes it difficult to classify short speech segments which are less than a few pitch periods in duration. In this paper, we describe a pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal. In this method, five different measurements are made on the speech segment to be classified. The measured parameters are the zero-crossing rate, the speech energy, the correlation between adjacent speech samples, the first predictor coefficient from a 12-pole linear predictive coding (LPC) analysis, and the energy in the prediction error. The speech segment is assigned to a particular class based on a minimum-distance rule obtained under the assumption that the measured parameters are distributed according to the multidimensional Gaussian probability density function. The means and covariances for the Gaussian distribution are determined from manually classified speech data included in a training set. The method has been found to provide reliable classification with speech segments as short as 10 ms and has been used for both speech analysis-synthesis and recognition applications. A simple nonlinear smoothing algorithm is described to provide a smooth 3-level contour of an utterance for use in speech recognition applications. Quantitative results and several examples illustrating the performance of the method are included in the paper.
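The minimum-distance rule described above is easy to sketch in code. The fragment below is a minimal illustration under stated assumptions, not the authors' implementation: equal class priors, a per-class covariance estimate, and placeholder training vectors (in practice each row would hold the five measured parameters of one hand-labeled segment).

```python
import numpy as np

def train_class_stats(features_by_class):
    """Estimate a mean vector and covariance matrix for each class
    from hand-labeled training vectors (rows = segments)."""
    stats = {}
    for label, X in features_by_class.items():
        X = np.asarray(X, dtype=float)
        stats[label] = (X.mean(axis=0), np.cov(X, rowvar=False))
    return stats

def classify_segment(x, stats):
    """Assign x to the class minimizing the Mahalanobis-style
    distance (x - m)^T C^{-1} (x - m), i.e., a Gaussian
    minimum-distance rule with equal priors assumed."""
    best, best_d = None, np.inf
    for label, (m, C) in stats.items():
        d = (x - m) @ np.linalg.inv(C) @ (x - m)
        if d < best_d:
            best, best_d = label, d
    return best
```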

479 citations


Proceedings ArticleDOI
12 Apr 1976
TL;DR: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum, which provides a means for controlling and reducing quantizing noise in the coding.
Abstract: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum. The approach provides a means for controlling and reducing quantizing noise in the coding. Each sub-band is quantized with an accuracy (bit allocation) based upon perceptual criteria. As a result, the quality of the coded signal is improved over that obtained from a single full-band coding of the total spectrum. In one implementation, the individual sub-bands are low-pass translated before coding. In another, "integer-band" sampling is employed to alias the signal in an advantageous way before coding. Other possibilities extend to complex demodulation of the sub-bands, and to representing the subband signals in terms of envelopes and phase-derivatives. In all techniques, adaptive quantization is used for the coding, and a parsimonious allocation of bits is made across the bands. Computer simulations are made to demonstrate the signal qualities obtained for codings at 16 and 9.6 Kbits/sec.
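The central idea, giving each sub-band its own bit allocation, can be illustrated with a toy two-band coder. This is a sketch under assumptions, not the paper's system: the band split is done by FFT masking rather than low-pass translation or integer-band sampling, the quantizer is a plain uniform one rather than an adaptive one, and all parameter values are invented.

```python
import numpy as np

def uniform_quantize(x, bits, full_scale):
    """Uniform midtread quantizer with roughly 2**bits levels
    spanning +/- full_scale."""
    step = 2.0 * full_scale / (2 ** bits)
    return np.clip(np.round(x / step) * step, -full_scale, full_scale)

def two_band_code(x, fs, split_hz, bits_low, bits_high):
    """Split x at split_hz via FFT masking, quantize each band with
    its own bit budget, and sum the decoded bands."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    low = np.fft.irfft(np.where(f <= split_hz, X, 0), n=len(x))
    high = np.fft.irfft(np.where(f > split_hz, X, 0), n=len(x))
    amp = np.max(np.abs(x)) + 1e-12
    return (uniform_quantize(low, bits_low, amp)
            + uniform_quantize(high, bits_high, amp))
```

For speech-like signals whose energy is concentrated in the low band, spending more bits there reduces the total quantizing noise, which is the effect the paper exploits perceptually.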

276 citations


Journal ArticleDOI
TL;DR: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum, which provides a means for controlling and reducing quantizing noise in the coding.
Abstract: A rationale is advanced for digitally coding speech signals in terms of sub-bands of the total spectrum. The approach provides a means for controlling and reducing quantizing noise in the coding. Each sub-band is quantized with an accuracy (bit allocation) based upon perceptual criteria. As a result, the quality of the coded signal is improved over that obtained from a single full-band coding of the total spectrum. In one implementation, the individual sub-bands are low-pass translated before coding. In another, “integer-band” sampling is employed to alias the signal in an advantageous way before coding. Other possibilities extend to complex demodulation of the sub-bands, and to representing the sub-band signals in terms of envelopes and phase-derivatives. In all techniques, adaptive quantization is used for the coding, and a parsimonious allocation of bits is made across the bands. Computer simulations are made to demonstrate the signal qualities obtained for codings at 16 and 9.6 kb/s.

252 citations



Journal ArticleDOI
TL;DR: It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.
Abstract: This paper presents the results of an examination of rapid amplitude compression following high-pass filtering as a method for processing speech, prior to reception by the listener, as a means of enhancing the intelligibility of speech in high noise levels. Arguments supporting this particular signal processing method are based on the results of previous perceptual studies of speech in noise. In these previous studies, it has been shown that high-pass filtered/clipped speech offers a significant gain in the intelligibility of speech in white noise over that for unprocessed speech at the same signal-to-noise ratios. Similar results have also been obtained for speech processed by high-pass filtering alone. The present paper explores these effects and it proposes the use of high-pass filtering followed by rapid amplitude compression as a signal processing method for enhancing the intelligibility of speech in noise. It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.

131 citations


Journal ArticleDOI
01 Apr 1976
TL;DR: The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.
Abstract: For many applications, it is desirable to be able to convert arbitrary English text to natural and intelligible sounding speech. This transformation between two surface forms is facilitated by first obtaining the common underlying abstract linguistic representation which relates to both text and speech surface representations. Calculation of these abstract bases then permits proper selection of phonetic segments, lexical stress, juncture, and sentence-level stress and intonation. The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.

116 citations


PatentDOI
TL;DR: In this paper, formant frequencies determined at successive intervals in the speech signal are each divided by a fixed value, greater than 1, and another fixed value is added, to obtain what are called transposed formant frequencies.
Abstract: A hearing aid system and method includes apparatus for receiving a spoken speech signal, apparatus coupled to the receiving apparatus for determining at successive intervals in the speech signal the frequency and amplitude of the largest formants, apparatus for determining at successive intervals the fundamental frequency of the speech signal, and apparatus for determining at successive intervals whether or not the speech signal is voiced or unvoiced. Each successively determined formant frequency is divided by a fixed value, greater than 1, and added thereto is another fixed value, to obtain what are called transposed formant frequencies. The fundamental frequency is also divided by a fixed value, greater than 1, to obtain a transposed fundamental frequency. At the successive intervals, sine waves having frequencies corresponding to the transposed formant frequencies and the transposed fundamental frequency are generated, and these sine waves are combined to obtain an output signal which is applied to a transducer for producing an auditory signal. The amplitudes of the sine waves are functions of the amplitudes of corresponding formants. If it is determined that the speech signal is unvoiced, then no sine wave corresponding to the transposed fundamental frequency is produced and the other sine waves are noise modulated. The auditory signal produced by the transducer in effect constitutes a coded signal occupying a frequency range lower than the frequency range of normal speech and yet which is in the residual-hearing range of many hearing-impaired persons.

74 citations


Journal ArticleDOI
TL;DR: An alternate approach is described that uses the least mean square (LMS) gradient, stochastic-approximation algorithm commonly used in many other adaptive systems; a complete 8-coefficient hardware system based on this approach has been designed, constructed, and is described in this paper.
Abstract: Adaptive linear prediction (ALP) recently has received a great deal of attention for spectral analysis, system modeling, and speech encoding. The conventional approach used to implement ALP involves the computation of a sample covariance matrix for a block of data and solution of an associated set of simultaneous equations to obtain the predictor coefficients. This paper describes an alternate approach that uses the least mean square (LMS) gradient, stochastic-approximation algorithm, commonly used in many other adaptive systems. A complete 8-coefficient hardware system based on this approach has been designed and constructed and is described in this paper. The system consists of an analyzer that computes the eight ALP coefficients in real time and a reconstructor that forms an all-pole model filter using the computed coefficients. Several examples are presented to illustrate the concepts introduced. Each example includes an analytical discussion followed by experimental verification. Applications of ALP for spectral analysis, instantaneous frequency measurement, and speech encoding are discussed and experimental results obtained with the real-time hardware are presented.
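The LMS alternative to the block-covariance solution can be written in a few lines. The sketch below is a generic LMS linear predictor, not the hardware system's exact arithmetic; the order, step size `mu`, and floating-point math are illustrative assumptions (the real system used integer scaling).

```python
import numpy as np

def lms_predictor(x, order=8, mu=0.05):
    """Adapt an `order`-tap linear predictor with the LMS
    stochastic-gradient rule; returns the prediction-error signal
    and the final coefficient vector."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]   # most recent sample first
        e[n] = x[n] - w @ past        # prediction error
        w += 2 * mu * e[n] * past     # LMS coefficient update
    return e, w
```

A reconstructor of the kind described would then run an all-pole filter built from `w` to resynthesize the signal from the error.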

63 citations


Journal ArticleDOI
TL;DR: The system described in this paper is subdivided into three main steps: pitch extraction, segmentation, and formant analysis, which uses an adaptive digital filter in time-domain transforming the speech signal into a signal similar to the glottal waveform.
Abstract: The system described in this paper is subdivided into three main steps: pitch extraction, segmentation, and formant analysis. The pitch extractor uses an adaptive digital filter in the time domain, transforming the speech signal into a signal similar to the glottal waveform. Using the levels of the speech signal and the differenced signal as parameters in the time domain, the subsequent segmentation algorithm derives a signal parameter which describes the speed of articulatory movement. From this, the signal is divided into "stationary" and "transitional" segments; one stationary segment is associated with one phoneme. For the formant tracking procedure, a subset of the pitch periods is selected by the segmentation algorithm and is transformed into the frequency domain. The formant tracking algorithm uses a maximum-detection strategy and continuity criteria for adjacent spectra. After this step, the total parameter set is offered to an adaptive universal pattern classifier which is trained on selected material beforehand. For stationary phonemes, the recognition rate is about 85 percent when training material and test material are uttered by the same speaker. The recognition rate is increased to about 90 percent when segmentation results are used.

47 citations


Journal ArticleDOI
Harvey F. Silverman1, N. Dixon
TL;DR: Of those evaluated, a linearly mean-corrected minimum distance measure, on a 40-point spectral representation with a square (or cube) norm was consistently superior to the other methods.
Abstract: An important consideration in speech processing involves classification of speech spectra. Several methods for performing this classification are discussed. A number of these were selected for comparative evaluation. Two measures of performance, accuracy and stability, were derived through the use of an automatic performance evaluation system. Over 3000 hand-labeled spectra were used. Of those evaluated, a linearly mean-corrected minimum distance measure, on a 40-point spectral representation with a square (or cube) norm, was consistently superior to the other methods.

39 citations


Journal ArticleDOI
TL;DR: It is found that delayed encoding allows a fairly general predictor to be used without causing instability problems, and simulations indicate that considerable improvement can be achieved by matching the feedback filter to the input process.
Abstract: This concise paper is concerned with the problem of improved delta-coding by using delayed decision instead of bit-by-bit decision. It is found that delayed encoding allows a fairly general predictor to be used without causing instability problems. Simulations indicate that considerable improvement can be achieved by matching the feedback filter to the input process. Delayed encoding requires a search algorithm for making the decisions. Some proposals of algorithms that are efficient from a computational point of view are presented. Particular interest is attached to a highly truncated version of the Viterbi algorithm, which seems very promising.

Journal ArticleDOI
TL;DR: In this paper, three methods of extracting resonance information from predictor-coefficient coded speech are compared: finding roots of the polynomial in the denominator of the transfer function using Newton iteration, picking peaks in the spectrum of the transferred function, and picking peaks on the negative of the second derivative of the spectrum.
Abstract: Three methods of extracting resonance information from predictor-coefficient coded speech are compared. The methods are finding roots of the polynomial in the denominator of the transfer function using Newton iteration, picking peaks in the spectrum of the transfer function, and picking peaks in the negative of the second derivative of the spectrum. A relationship was found between the bandwidth of a resonance and the magnitude of the second derivative peak. Data, accumulated from a total of about two minutes of running speech from both female and male talkers, are presented illustrating the relative effectiveness of each method in locating resonances. The second-derivative method was shown to locate about 98 percent of the significant resonances while the simple peak-picking method located about 85 percent.
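The first method, root-finding on the denominator polynomial, is the easiest to sketch today. The fragment below is a generic illustration (polynomial roots via `np.roots` rather than the paper's Newton iteration); the sign convention for the predictor coefficients and the standard pole-radius bandwidth formula are noted in comments.

```python
import numpy as np

def resonances_from_lpc(a, fs):
    """Given predictor coefficients a[1..p] of A(z) = 1 - sum a_k z^-k,
    return (frequency_Hz, bandwidth_Hz) pairs from the complex roots
    of the denominator polynomial."""
    # np.roots expects highest power first: 1, -a1, ..., -ap
    roots = np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))
    out = []
    for r in roots:
        if np.imag(r) > 0:                        # one of each conjugate pair
            f = np.angle(r) * fs / (2 * np.pi)    # resonance frequency
            bw = -np.log(np.abs(r)) * fs / np.pi  # bandwidth from pole radius
            out.append((f, bw))
    return sorted(out)
```

The bandwidth-from-radius relation above is the same kind of link the paper observes between resonance bandwidth and the magnitude of the second-derivative peak.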

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of an utterance, and to reduce the noise level in a speech signal.
Abstract: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of an utterance, and to reduce the noise level in a speech signal. A period of the short-time autocorrelation function is sampled and spliced after a change of the time scale. Transformed speech is quite natural and free from distortion. Applications of SPAC are expected in many fields, such as improvement of speech quality, narrow-band transmission, communication aids for the hard of hearing, information services for the blind, unscrambling of helium speech, stenography, and so on.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: The real time implementation of a Linear Predictive Coding algorithm that has been developed over the past five years is described, using a modification of the Covariance Method for the analyzer and the system for pitch extraction and smoothing.
Abstract: This paper describes the real time implementation of a Linear Predictive Coding algorithm that has been developed over the past five years. The algorithm chosen for the analyzer is a modification of the Covariance Method introduced by B. S. Atal [1],[2] of Bell Labs. The system for pitch extraction uses a minimum distance function correlation technique. A dynamic programming algorithm [3] is used for pitch smoothing and correction of isolated pitch errors. The synthesizer uses a transversal filter. Considerable time has been devoted to optimizing the running time and integer scaling of the different algorithms for real time implementation on a 16 bit mini-computer.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: Subjective quality ratings of PCM coded speech were obtained with the aims of determining the effects of certain coder parameters and their interactions on speech quality, finding objective measures for predicting perceived distortions, and providing guidelines for optimizing coder design.
Abstract: An experiment was performed to investigate: (a) the influence of PCM code parameters on subjective speech quality, (b) objective measures for predicting perceived distortions and (c) optimum combinations of code parameters. The results indicate that listener opinions depended strongly on coder clipping level and step size but only weakly on bandwidth. Clipping noise power proved a poor predictor of perceived overload distortion; clipping percentage was more useful. Granular noise power was a good predictor of granular distortion. For a given bit rate, the coder with the highest quality rating was not the one with minimum total (clipping + granular) noise power, contrary to traditional wisdom.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: The voice-operated question-answering system for seat reservation is constructed by computer simulation technique and the promising results are obtained.
Abstract: The speech recognition system composing part of a question-answering system operated by conversational speech is described. The recognition system consists of two stages: an acoustic processing stage and a linguistic processing stage. In the acoustic processing stage, input speech is analyzed and transformed into a phoneme sequence which usually contains ambiguities and errors caused in segmentation and phoneme recognition. In the linguistic processing stage, the phoneme sequence containing ambiguities and errors is converted into the correct word sequence by the use of linguistic knowledge such as phoneme rewriting rules, a lexicon, syntax, semantics, and pragmatics. A voice-operated question-answering system for seat reservation has been constructed by computer simulation, and promising results have been obtained.

Book ChapterDOI
01 Jan 1976
TL;DR: The fundamental frequency (F0) is the rate at which glottal volume velocity pulses are applied to the vocal tract, i.e., the driving function to the model is periodic with a period of 1/F0.
Abstract: The fundamental frequency (F0) is a basic parameter in acoustical studies of speech. It is also a necessary parameter for low bit rate speech coding systems. It is generally considered to be one of the acoustical correlates to the perceived intonation pattern of speech. If the fundamental frequency of a speaker is constant, the speech would be perceived as being machine-like or monotone. If the speaker is excited, the fundamental frequency generally increases. It is the acoustical correlate to the rate at which the vocal folds open and close (or vibrate). If the folds are vibrating rapidly, a high fundamental frequency will be measured. In the linear speech production model, the fundamental frequency is the rate at which glottal volume velocity pulses are applied to the vocal tract, i.e., the driving function to the model is periodic with a period of 1/F0.
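Since the model's excitation is periodic with period 1/F0, F0 can be estimated by measuring that period. The sketch below uses a short-time autocorrelation peak pick, one common approach rather than anything prescribed by this chapter; the search-range bounds are illustrative.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Pick the autocorrelation peak within the plausible pitch-period
    range and return fs / lag as the F0 estimate in Hz."""
    frame = frame - np.mean(frame)
    # one-sided autocorrelation, r[0] at zero lag
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / f0_max)          # shortest allowed pitch period
    hi = int(fs / f0_min)          # longest allowed pitch period
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag
```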

Proceedings ArticleDOI
John Makhoul1
01 Apr 1976
TL;DR: This paper presents a general analysis-synthesis scheme for the arbitrary spectral distortion of speech signals without the need for pitch extraction; linear predictive warping, cepstral warping, and autocorrelation warping are given as examples of the general scheme.
Abstract: The spectral distortion of speech signals, without affecting the pitch or the speed of the signal, has met with some difficulty due to the need for pitch extraction. This paper presents a general analysis-synthesis scheme for the arbitrary spectral distortion of speech signals without the need for pitch extraction. Linear predictive warping, cepstral warping, and autocorrelation warping are given as examples of the general scheme. Applications include the unscrambling of helium speech, spectral compression for the hard of hearing, bit rate reduction in speech compression systems, and efficiency of spectral representation for speech recognition systems.

Journal ArticleDOI
TL;DR: A simple algorithm for locating the beginning and end of a speech utterance has been developed that has been tested in computer simulations and has been constructed with standard integrated circuit technology.
Abstract: When speech is coded using a differential pulse-code modulation system with an adaptive quantizer, the digital code words exhibit considerable variation among all quantization levels during both voiced and unvoiced speech intervals. However, because of limits on the range of step sizes, during silent intervals the code words vary only slightly among the smallest quantization steps. Based on this principle, a simple algorithm for locating the beginning and end of a speech utterance has been developed. This algorithm has been tested in computer simulations and has been constructed with standard integrated circuit technology.
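The principle, codewords confined to the smallest quantizer levels during silence, suggests a simple frame-based detector. The sketch below is an illustration with invented parameters (frame length, "small codeword" threshold, activity fraction), not the paper's algorithm or hardware.

```python
def find_endpoints(code_mags, frame_len=80, small=1, active_frac=0.2):
    """Label a frame as speech when more than `active_frac` of its
    codeword magnitudes exceed the smallest quantizer levels, then
    return (first, last) speech frame indices, or None if no speech."""
    n_frames = len(code_mags) // frame_len
    active = []
    for i in range(n_frames):
        frame = code_mags[i * frame_len:(i + 1) * frame_len]
        big = sum(1 for c in frame if c > small)
        active.append(big > active_frac * frame_len)
    speech = [i for i, a in enumerate(active) if a]
    return (speech[0], speech[-1]) if speech else None
```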

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A technique that splits the spectrum into two equal halves and performs a piecewise LPC approximation to each half is described, and the fidelity is expected to be higher than standard LPC.
Abstract: A great deal of current research in the area of narrowband digital speech compression makes use of the Linear Prediction Coding (LPC) algorithm to extract the vocal tract spectrum. This paper describes a technique that splits the spectrum into two equal halves and performs a piecewise LPC approximation to each half. By taking advantage of the classical benefits of piecewise approximation, the fidelity is expected to be higher than standard LPC. In addition, by making use of under-sampling and spectrum folding, computational requirements are reduced by about 40%. PLPC has been implemented in real time on the CSP-30 computer at the Speech Research and Development Facility of the Communications Security Engineering Office (DCW) at ESD.

Journal ArticleDOI
TL;DR: Spelled Speech can be used as feedback by a blind typist to monitor her typing and correct her typing mistakes.
Abstract: Spelled Speech can be used as feedback by a blind typist to monitor and correct her typing. The speech can be produced using a computer or a small portable digital apparatus.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: Characteristics of common sources of noise and distortion are described in this paper and their effect in shaping the spectrum of speech is discussed.
Abstract: Parameter or feature extraction from the speech signal forms the basis for systems designed for speech recognition, speaker verification, speech bandwidth compression, etc. The parameters in general are critically dependent upon the short-time spectrum of speech. The input speech waveform is, however, subjected to several types of noise and distortion due to background noise sources, reverberation, close speaking into a microphone, telephone system imperfections, etc. These factors modify the spectrum of the speech signal and hence the parameters extracted. Characteristics of common sources of noise and distortion are described in this paper and their effect in shaping the spectrum of speech is discussed. Steps to reduce the influence of some noises while producing speech input to a system are suggested. Methods of normalization of spectral distortions due to noise and the effect of such normalization on parameter extraction are also discussed.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: This research has resulted in the development of a new pitch-synchronous analysis technique for the extraction of accurate formant information from speech signals that is an improvement over current methods of analysis in terms of accuracy and temporal resolution.
Abstract: This research has resulted in the development of a new pitch-synchronous analysis technique for the extraction of accurate formant information from speech signals. The method is an improvement over current methods of analysis in terms of accuracy and temporal resolution. This is achieved by extension of the signal from one pitch period into the next, using a speech production model based on linear prediction. The result is higher accuracy in the determination of formant frequencies, bandwidths and amplitudes, and the ability to follow rapid formant transitions. The method performs equally well with nasal and high pitched sounds. The method is applied to the speech recognition and the speaker identification problems.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: A new adaptive quantizer for speech digitization has been derived that adjusts its dynamic range to match that of the speech waveform and further adjusts its range to compensate for the increased signal strength that follows a pitch pulse.
Abstract: A new adaptive quantizer for speech digitization has been derived. It is similar to known adaptive quantizers in that it adjusts its dynamic range to match that of the speech waveform. In addition, it further adjusts its range to compensate for the increased signal strength that follows a pitch pulse. The new quantizer bases its adaptation on its own output and no side information is required. When combined with a variable length source coding scheme, the new quantizer offers a significant improvement in signal-to-noise ratio and in subjective speech quality. The technique is applicable to a broad range of digitization methods including adaptive delta modulation and various forms of ADPCM.
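A quantizer that adapts its range from its own output alone, with no side information, can be sketched in the style of Jayant's one-word-memory adaptation. This is a generic backward-adaptive quantizer, not the paper's pitch-compensating design; the bit width, multipliers, and step limits are illustrative assumptions.

```python
import numpy as np

def jayant_quantize(x, bits=3, step0=0.1, m_inner=0.9, m_outer=1.6,
                    step_min=1e-4, step_max=10.0):
    """Backward-adaptive quantizer: the step size is rescaled after
    each sample from the previous output code alone, so a decoder
    can track it with no side information."""
    levels = 2 ** (bits - 1)          # magnitude levels (sign is extra)
    step = step0
    codes, decoded = [], []
    for s in x:
        c = int(np.clip(np.floor(abs(s) / step), 0, levels - 1))
        q = np.sign(s) * (c + 0.5) * step
        codes.append(c)
        decoded.append(q)
        # expand the range after large codes, shrink it after small ones
        step = np.clip(step * (m_outer if c >= levels - 1 else m_inner),
                       step_min, step_max)
    return codes, np.array(decoded)
```

Because the step-size update depends only on the transmitted code, a matching decoder reproduces `step` exactly and inverts the quantization without extra bits.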

Proceedings ArticleDOI
12 Apr 1976
TL;DR: Semantic and syntactic information is used to resolve ambiguities and to yield higher-order decisions in automatic speech recognition and understanding systems.
Abstract: Automatic speech recognition and understanding are currently receiving considerable attention. Most approaches to problems in these areas involve rather complicated systems. Typically, the acoustic waveform is first segmented into units such as phonemes or syllables. Semantic and syntactic information is then used to resolve ambiguities and to yield higher-order decisions. This complexity is probably necessary if the most general speech-recognition problems are to be solved.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: The quantitative rules obtained for generating the SSRU's are expected to be useful, at least as a preliminary investigation tool, for synthesis-by-rule.
Abstract: Summary form only given, as follows. The paper deals with the application of the linear prediction technique to speech synthesis of both the Italian and German languages by Standard Speech Reproducing Units (SSRU), i.e., by combining elementary speech segments of standardized characteristics extracted from utterances of native speakers. The main feature of the method presented is the possibility of synthesizing in a highly intelligible form any message of such languages with a very limited amount of data. So far, the use of linear predictive coding of the previously realized SSRU sets allowed a memory occupation of less than 16 kbytes for the synthesis of Italian and less than 32 kbytes for the combined synthesis of Italian and German. The data flow rate is about 1 kb/s. A key property of the code with respect to methods previously used (i.e., simple concatenation of original segments) lies in the possibility of greatly enhancing the naturalness of the synthesized speech by varying the pitch, amplitude, and duration of the synthetic segments. Further, the quantitative rules obtained for generating the SSRUs are expected to be useful, at least as a preliminary investigation tool, for synthesis-by-rule.


Journal ArticleDOI
TL;DR: As an alternative to the spectrograph technique for speech analysis, an areagraph technique is presented in which the instantaneous vocal-tract area function is plotted against time with distance along the tract as the y-ordinate and area denoted by intensity modulation.
Abstract: As an alternative to the spectrograph technique for speech analysis, an areagraph technique is presented in which the instantaneous vocal-tract area function (derived from linear prediction analysis) is plotted against time with distance along the tract as the y-ordinate and area denoted by intensity modulation. Since the display is related to a physical quantity, it has a number of advantages over the spectrograph. An application to speech training is described.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: It proves necessary to find a simple set of intonation patterns without taking into account the complexity of syntactic sentence structure and the many derived rules.
Abstract: Speech synthesis by dyad concatenation produces intelligible speech, but the lack of prosodic features like rhythm and intonation gives the speech an unnatural and unpleasant sound. Given the short-term applied objectives, it proves necessary to find a simple set of intonation patterns without taking into account the complexity of syntactic sentence structure and the many derived rules. Intrinsic characteristics of each dyad are stored, and a very simplified grammar is used to automatically superimpose on them a pitch pattern that is a function of the following parameters: the type of sentence, the end of each kind of syntagm, word boundaries, and word position within the sentence.

Journal ArticleDOI
TL;DR: A series of listening and communicability tests has been undertaken using speech in which anomaly effects have been introduced by simulation techniques, introduced at controlled rates in simulated networks using a variety of speech encoding techniques and packetization strategies.
Abstract: When speech is transmitted in a packet‐switched network the variability in packet delays inherent in such a net tends to produce occasional anomalies or “glitches” in the output speech when packets fail to arrive at the destination in a timely fashion. While the frequency of occurrence of these anomalies can be minimized at the expense of buffering and increased overall speech delay, it is likely that a practical network design would represent a compromise which allowed some degradation of the output speech under worst case load conditions. To provide some basic data on the subjective effects of such anomalies a series of listening and communicability tests has been undertaken using speech in which anomaly effects have been introduced by simulation techniques. Anomalies resulting from packet losses due to delay dispersion as well as variation in average delay are introduced at controlled rates in simulated networks using a variety of speech encoding techniques and packetization strategies. Preliminary tes...