
Showing papers in "IEEE Transactions on Acoustics, Speech, and Signal Processing in 1976"


Journal ArticleDOI
TL;DR: A maximum likelihood estimator is developed for determining the time delay between signals received at two spatially separated sensors in the presence of uncorrelated noise; the estimator's receiver prefilters accentuate the signal passed to the correlator at frequencies for which the signal-to-noise (S/N) ratio is highest and suppress the noise power.
Abstract: A maximum likelihood (ML) estimator is developed for determining time delay between signals received at two spatially separated sensors in the presence of uncorrelated noise. This ML estimator can be realized as a pair of receiver prefilters followed by a cross correlator. The time argument at which the correlator achieves a maximum is the delay estimate. The ML estimator is compared with several other proposed processors of similar form. Under certain conditions the ML estimator is shown to be identical to one proposed by Hannan and Thomson [10] and MacDonald and Schultheiss [21]. Qualitatively, the role of the prefilters is to accentuate the signal passed to the correlator at frequencies for which the signal-to-noise (S/N) ratio is highest and, simultaneously, to suppress the noise power. The same type of prefiltering is provided by the generalized Eckart filter, which maximizes the S/N ratio of the correlator output. For low S/N ratio, the ML estimator is shown to be equivalent to Eckart prefiltering.
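The correlator core of such an estimator is easy to sketch. Below is a minimal unweighted cross-correlation delay estimator in Python; the ML prefilters are omitted, and all names are illustrative rather than taken from the paper:

```python
import numpy as np

def estimate_delay(x, y, fs):
    """Delay of y relative to x, in seconds, from the peak of the
    cross-correlation. This is the unweighted special case; the ML
    estimator would prefilter both channels before correlating."""
    r = np.correlate(y, x, mode="full")
    lag = np.argmax(r) - (len(x) - 1)   # lag in samples
    return lag / fs

# y is x delayed by 5 samples at fs = 1000 Hz
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
x = np.concatenate([s, np.zeros(5)])
y = np.concatenate([np.zeros(5), s])
delay = estimate_delay(x, y, 1000.0)    # expected ~0.005 s
```

With real noise on both channels the peak broadens, which is exactly why the prefiltering discussed in the abstract matters.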

4,317 citations


Journal ArticleDOI
TL;DR: A comparative performance study of seven pitch detection algorithms was conducted on a speech data base of eight utterances spoken by three males, three females, and one child, to assess the algorithms' relative performance as a function of recording condition and pitch range of the various speakers.
Abstract: A comparative performance study of seven pitch detection algorithms was conducted. A speech data base, consisting of eight utterances spoken by three males, three females, and one child was constructed. Telephone, close talking microphone, and wideband recordings were made of each of the utterances. For each of the utterances in the data base, a "standard" pitch contour was semiautomatically measured using a highly sophisticated interactive pitch detection program. The "standard" pitch contour was then compared with the pitch contour that was obtained from each of the seven programmed pitch detectors. The algorithms used in this study were 1) a center clipping, infinite-peak clipping, modified autocorrelation method (AUTOC), 2) the cepstral method (CEP), 3) the simplified inverse filtering technique (SIFT) method, 4) the parallel processing time-domain method (PPROC), 5) the data reduction method (DARD), 6) a spectral flattening linear predictive coding (LPC) method, and 7) the average magnitude difference function (AMDF) method. A set of measurements was made on the pitch contours to quantify the various types of errors which occur in each of the above methods. Included among the error measurements were the average and standard deviation of the error in pitch period during voiced regions, the number of gross errors in the pitch period, and the average number of voiced-unvoiced classification errors. For each of the error measurements, the individual pitch detectors could be rank ordered as a measure of their relative performance as a function of recording condition, and pitch range of the various speakers. Performance scores are presented for each of the seven pitch detectors based on each of the categories of error.
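For a flavor of what these detectors compute, here is a bare-bones autocorrelation pitch estimator; it is a toy sketch, far simpler than any of the seven algorithms studied, and the names and parameter values are illustrative:

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=60.0, fmax=500.0):
    """Pick the autocorrelation peak in the lag range for [fmin, fmax] Hz."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])      # lag of the strongest periodicity
    return fs / lag                     # pitch estimate in Hz

fs = 8000
t = np.arange(0, 0.03, 1.0 / fs)                # 30 ms frame
frame = np.sign(np.sin(2 * np.pi * 100 * t))    # crude 100 Hz voiced-like signal
f0 = pitch_autocorr(frame, fs)                  # close to 100 Hz
```

The errors the study tabulates (gross period errors, V-UV confusions) are precisely the failure modes of naive estimators like this one.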

793 citations


Journal ArticleDOI
TL;DR: The likelihood ratio, cepstral measure, and cosh measure are easily evaluated recursively from linear prediction filter coefficients, and each has a meaningful and interrelated frequency domain interpretation.
Abstract: The properties and interrelationships among four measures of distance in speech processing are theoretically and experimentally discussed. The root mean square (rms) log spectral distance, cepstral distance, likelihood ratio (minimum residual principle or delta coding (DELCO) algorithm), and a cosh measure (based upon two nonsymmetrical likelihood ratios) are considered. It is shown that the cepstral measure bounds the rms log spectral measure from below, while the cosh measure bounds it from above. A simple nonlinear transformation of the likelihood ratio is shown to be highly correlated with the rms log spectral measure over expected ranges. Relationships between distance measure values and perception are also considered. The likelihood ratio, cepstral measure, and cosh measure are easily evaluated recursively from linear prediction filter coefficients, and each has a meaningful and interrelated frequency domain interpretation. Fortran programs are presented for computing the recursively evaluated distance measures.
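The recursion from predictor coefficients to cepstral coefficients referred to here is short enough to state directly. A sketch, assuming the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the inverse filter (this is a generic illustration of the standard recursion, not the paper's Fortran listing):

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole model 1/A(z) via the recursion
    c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k},
    where a = [1, a_1, ..., a_p]."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        a_n = a[n] if n <= p else 0.0
        c[n] = -a_n - sum((k / n) * c[k] * a[n - k]
                          for k in range(1, n) if n - k <= p)
    return c[1:]

# One-pole check: A(z) = 1 - 0.5 z^-1 has cepstrum c_n = 0.5**n / n
ceps = lpc_to_cepstrum([1.0, -0.5], 3)   # [0.5, 0.125, 0.0416...]
```

Squared differences of two such cepstral sequences give the cepstral distance the abstract compares against the rms log spectral measure.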

653 citations


Journal ArticleDOI
TL;DR: In this paper, the frequency response of a two-dimensional spatially invariant linear system through which an image has been passed and blurred is estimated for the cases of uniform linear camera motion and an out-of-focus lens system.
Abstract: This paper is concerned with the digital estimation of the frequency response of a two-dimensional spatially invariant linear system through which an image has been passed and blurred. For the cases of uniform linear camera motion and an out-of-focus lens system it is shown that the power cepstrum of the image contains sufficient information to identify the blur. Methods for deblurring are presented, including restoration of the density version of the image. The restoration procedure consumes only a modest amount of computation time. Results are demonstrated on images blurred in the camera.
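The power cepstrum the authors exploit is simply the inverse transform of the log power spectrum: a convolved blur (or, equivalently, an echo) shows up as an additive spike at the corresponding lag. A 1-D sketch with illustrative names, demonstrating the echo case of the same mechanism:

```python
import numpy as np

def power_cepstrum(x):
    """Inverse FFT of the log power spectrum (1-D)."""
    p = np.abs(np.fft.fft(x)) ** 2
    return np.real(np.fft.ifft(np.log(p + 1e-12)))

# An echo at lag 10 produces a cepstral peak at quefrency 10 -- the same
# mechanism that reveals a motion blur's extent in the 2-D image case.
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
x = s + 0.5 * np.roll(s, 10)
c = power_cepstrum(x)
lag = int(np.argmax(c[1:512])) + 1   # expected: 10
```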

489 citations


Journal ArticleDOI
TL;DR: A pattern recognition approach is described for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal; the method has been found to provide reliable classification with speech segments as short as 10 ms.
Abstract: In speech analysis, the voiced-unvoiced decision is usually performed in conjunction with pitch analysis. The linking of voiced-unvoiced (V-UV) decision to pitch analysis not only results in unnecessary complexity, but makes it difficult to classify short speech segments which are less than a few pitch periods in duration. In this paper, we describe a pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal. In this method, five different measurements are made on the speech segment to be classified. The measured parameters are the zero-crossing rate, the speech energy, the correlation between adjacent speech samples, the first predictor coefficient from a 12-pole linear predictive coding (LPC) analysis, and the energy in the prediction error. The speech segment is assigned to a particular class based on a minimum-distance rule obtained under the assumption that the measured parameters are distributed according to the multidimensional Gaussian probability density function. The means and covariances for the Gaussian distribution are determined from manually classified speech data included in a training set. The method has been found to provide reliable classification with speech segments as short as 10 ms and has been used for both speech analysis-synthesis and recognition applications. A simple nonlinear smoothing algorithm is described to provide a smooth 3-level contour of an utterance for use in speech recognition applications. Quantitative results and several examples illustrating the performance of the method are included in the paper.
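The decision rule amounts to a Gaussian minimum-distance classifier on the measurement vector. A compact sketch (equal priors assumed; the per-class log-determinant term of the full Gaussian rule is omitted for brevity, and the training data and names are illustrative, not the paper's):

```python
import numpy as np

def fit(classes):
    """classes: {label: (n_samples, n_features) training array} ->
    per-class mean vector and covariance matrix."""
    return {lab: (X.mean(axis=0), np.cov(X, rowvar=False))
            for lab, X in classes.items()}

def classify(x, models):
    """Assign x to the class minimizing the Mahalanobis distance
    d^2 = (x - m)^T C^{-1} (x - m)."""
    def d2(m, C):
        diff = np.asarray(x, float) - m
        return float(diff @ np.linalg.solve(C, diff))
    return min(models, key=lambda lab: d2(*models[lab]))

train = {
    "voiced":   np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]]),
    "unvoiced": np.array([[10., 10.], [11., 10.], [10., 11.], [11., 11.]]),
}
models = fit(train)
label = classify([0.2, 0.3], models)   # "voiced"
```

In the paper the feature vector has five components (zero crossings, energy, adjacent-sample correlation, first LPC coefficient, prediction error energy) and a third "silence" class.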

479 citations


Journal ArticleDOI
TL;DR: A more substantial gain can be obtained in the direct realization of a uniform bank of recursive filters through combination of the polyphase network with a discrete Fourier transform (DFT) computer; savings in hardware result from the low sensitivity of the structure to coefficient word lengths.
Abstract: The digital filtering process can be achieved by a set of phase shifters with suitable characteristics. A particular set, named polyphase network, is defined and analyzed. It permits the use of recursive devices for efficient sample-rate alteration. The comparison with conventional filters shows that, with the same active memory, a reduction of computation rate approaching a factor of 2 can be achieved when the alteration factor increases. A more substantial gain can be obtained in the direct realization of a uniform bank of recursive filters through combination of the polyphase network with a discrete Fourier transform (DFT) computer; savings in hardware also result from the low sensitivity of the structure to coefficient word lengths.
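The computational saving behind polyphase sample-rate alteration can be seen in a few lines: split the filter into M phase subfilters so every multiplication happens at the low (output) rate, and no work is spent on samples that would be discarded. A minimal FIR decimator sketch (illustrative names; the paper's networks are recursive, but the polyphase decomposition is the same idea):

```python
import numpy as np

def polyphase_decimate(x, h, M):
    """Filter x with FIR h and decimate by M using M polyphase branches."""
    x = np.asarray(x, float)
    h = np.asarray(h, float)
    h = np.concatenate([h, np.zeros((-len(h)) % M)])  # pad to a multiple of M
    n_out = (len(x) + len(h) - 1 + M - 1) // M
    y = np.zeros(n_out)
    for m in range(M):
        hm = h[m::M]                                  # phase-m subfilter
        # u_m[i] = x[i*M - m]: the phase-m input stream at the low rate
        um = x[0::M] if m == 0 else np.concatenate([[0.0], x[M - m::M]])
        ym = np.convolve(um, hm)
        y[:len(ym)] += ym
    return y

# Agrees with filtering at the high rate and then discarding samples:
rng = np.random.default_rng(1)
x = rng.standard_normal(50)
h = rng.standard_normal(12)
direct = np.convolve(x, h)[::3]
poly = polyphase_decimate(x, h, 3)
```

Feeding the M branch outputs into a DFT instead of summing them yields the uniform filter bank mentioned in the abstract.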

420 citations


Journal ArticleDOI
TL;DR: In this article, singular value decomposition (SVD) and pseudoinverse techniques are used for image restoration with space-variant point spread functions (SVPSF's).
Abstract: The use of singular value decomposition (SVD) techniques in digital image processing is of considerable interest for those facilities with large computing power and stringent imaging requirements. The SVD methods are useful for image as well as quite general point spread function (impulse response) representations. The methods represent simple extensions of the theory of linear filtering. Image enhancement examples will be developed illustrating these principles. The most interesting cases of image restoration are those which involve space variant imaging systems. The SVD, combined with pseudoinverse techniques, provides insight into these types of restorations. Illustrations of large scale N^2 × N^2 point spread function matrix representations are discussed along with separable space variant N^2 × N^2 point spread function matrix examples. Finally, analysis and methods for obtaining a pseudoinverse of separable space variant point spread functions (SVPSF's) are presented with a variety of object and imaging system degradations.
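The pseudoinverse restoration step can be sketched in a few lines: truncate small singular values so the inversion does not amplify noise in near-singular directions. This is a generic truncated-SVD sketch on a small 1-D blur, not the paper's large-scale separable formulation:

```python
import numpy as np

def pinv_restore(H, g, rel_tol=1e-6):
    """Estimate f from g = H f using a truncated-SVD pseudoinverse:
    singular values below rel_tol * s_max are dropped."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    keep = s > rel_tol * s[0]
    return Vt[keep].T @ ((U[:, keep].T @ g) / s[keep])

# Well-conditioned 1-D blur: restoration recovers the object.
H = (0.5 * np.eye(5)
     + 0.25 * np.eye(5, k=1)
     + 0.25 * np.eye(5, k=-1))
f = np.array([1., 2., 3., 4., 5.])
g = H @ f
f_hat = pinv_restore(H, g)   # close to f
```

For a space-variant system, H is no longer Toeplitz, which is exactly where the SVD machinery earns its keep.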

362 citations


Journal ArticleDOI
TL;DR: In this article, the concept of spectral factorization is extended to two dimensions in such a way as to preserve the analytic characteristics of the factors, and the resulting factors are shown to be recursively computable and stable, in agreement with one-dimensional (1-D) spectral factorization.
Abstract: The concept of spectral factorization is extended to two dimensions in such a way as to preserve the analytic characteristics of the factors. The factorization makes use of a homomorphic transform procedure due to Wiener. The resulting factors are shown to be recursively computable and stable in agreement with one-dimensional (1-D) spectral factorization. The factors are not generally two-dimensional (2-D) polynomials, but can be approximated as such. These results are applied to 2-D recursive filtering, filter design, and a computationally attractive stability test for recursive filters.

255 citations


Journal ArticleDOI
TL;DR: This paper discusses a digital formulation of the phase vocoder, an analysis-synthesis system providing a parametric representation of a speech waveform by its short-time Fourier transform, designed to be an identity system in the absence of any parameter modifications.
Abstract: This paper discusses a digital formulation of the phase vocoder, an analysis-synthesis system providing a parametric representation of a speech waveform by its short-time Fourier transform. Such a system is of interest both for data-rate reduction and for manipulating basic speech parameters. The system is designed to be an identity system in the absence of any parameter modifications. Computational efficiency is achieved by employing the fast Fourier transform (FFT) algorithm to perform the bulk of the computation in both the analysis and synthesis procedures, thereby making the formulation attractive for implementation on a minicomputer.
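The identity property — analysis followed by unmodified synthesis returning the input — can be demonstrated with a short weighted overlap-add sketch. This is a generic FFT-based STFT identity system in the spirit of the paper, not its exact filter-bank formulation; all names are illustrative:

```python
import numpy as np

def stft(x, win, hop):
    """Short-time Fourier transform: windowed frames, one rFFT per frame."""
    n = len(win)
    return np.array([np.fft.rfft(x[i:i + n] * win)
                     for i in range(0, len(x) - n + 1, hop)])

def istft(X, win, hop):
    """Weighted overlap-add synthesis; with unmodified coefficients this
    is an identity system wherever the window-energy envelope is nonzero."""
    n = len(win)
    total = (len(X) - 1) * hop + n
    y = np.zeros(total)
    wsum = np.zeros(total)
    for i, spec in enumerate(X):
        start = i * hop
        y[start:start + n] += np.fft.irfft(spec, n) * win
        wsum[start:start + n] += win ** 2
    return y / np.maximum(wsum, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
win = np.hanning(64)
y = istft(stft(x, win, 16), win, 16)   # y matches x away from the edges
```

Time-scale or pitch modification enters by altering the hop or the phases between analysis and synthesis, which is where the "phase" in phase vocoder comes in.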

240 citations


Journal ArticleDOI
TL;DR: A binary arithmetic that permits the exact computation of the Fermat number transform (FNT) is described, and the general multiplication of two integers modulo F_t required in the computation of FNT convolution is discussed.
Abstract: A binary arithmetic that permits the exact computation of the Fermat number transform (FNT) is described. This technique involves arithmetic in a binary code corresponding to the simplest one of a set of code translations from the normal binary representation of each integer in the ring of integers modulo a Fermat number F_t = 2^b + 1, b = 2^t. The resulting FNT binary arithmetic operations are of the complexity of 1's complement arithmetic, as in the case of a previously proposed technique which corresponds to another one of the set of code translations. The general multiplication of two integers modulo F_t required in the computation of FNT convolution is discussed.
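The transform itself is a DFT over the integers modulo F_t, with the complex root of unity replaced by a power of 2 so that twiddle multiplications reduce to bit shifts in hardware. A direct (non-fast) sketch using F_2 = 17, α = 2, N = 8 — a generic number-theoretic transform illustrating exact convolution, not the paper's binary-coded arithmetic:

```python
def fnt(x, q, alpha):
    """Number-theoretic transform X[k] = sum_n x[n] * alpha^(n k) mod q."""
    N = len(x)
    return [sum(x[n] * pow(alpha, n * k, q) for n in range(N)) % q
            for k in range(N)]

def ifnt(X, q, alpha):
    """Inverse transform: scale by N^-1, use alpha^-1 as the root."""
    N = len(X)
    n_inv = pow(N, -1, q)
    a_inv = pow(alpha, -1, q)
    return [n_inv * sum(X[k] * pow(a_inv, n * k, q) for k in range(N)) % q
            for n in range(N)]

# Exact cyclic convolution mod F_2 = 17 with alpha = 2 (order 8 mod 17):
q, alpha = 17, 2
x = [1, 2, 3, 4, 0, 0, 0, 0]
h = [1, 1, 0, 0, 0, 0, 0, 0]
y = ifnt([a * b % q for a, b in zip(fnt(x, q, alpha), fnt(h, q, alpha))],
         q, alpha)   # cyclic convolution of x and h, with no roundoff error
```

Because all arithmetic is exact modular integer arithmetic, the convolution result has no roundoff error — the property that motivates FNT convolvers.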

233 citations


Journal ArticleDOI
TL;DR: An efficient algorithm for obtaining solutions is given and shown to be closely related to a well-known algorithm of Levinson and the Jury stability test, which suggests that they are fundamental in the numerical analysis of stable discrete-time linear systems.
Abstract: It is common practice to partially characterize a filter with a finite portion of its impulse response, with the objective of generating a recursive approximation. This paper discusses the use of mixed first- and second-order information, in the form of a finite portion of the impulse response and autocorrelation sequences. The discussion encompasses a number of techniques and algorithms for this purpose. Two approximation problems are studied: an interpolation problem and a least squares problem. These are shown to be closely related. The linear systems which form the solutions to these problems are shown to be stable. An efficient algorithm for obtaining solutions is given and shown to be closely related to a well-known algorithm of Levinson and the Jury stability test. The close connection between these algorithms suggests that they are fundamental in the numerical analysis of stable discrete-time linear systems.

Journal ArticleDOI
TL;DR: Computation of the autocorrelation function of the clipped speech is easily implemented in digital hardware using simple combinatorial logic, i.e., an up-down counter can be used to compute each correlation point.
Abstract: A high-quality pitch detector has been built in digital hardware and operates in real time at a 10 kHz sampling rate. The hardware is capable of providing energy as well as pitch-period estimates. The pitch and energy computations are performed 100 times/s (i.e., once per 10 ms interval). The algorithm to estimate the pitch period uses center clipping, infinite peak clipping, and a simplified autocorrelation analysis. The analysis is performed on a 300 sample section of speech which is both center clipped and infinite peak clipped, yielding a three-level speech signal where the levels are -1, 0, and +1 depending on the relation of the original speech sample to the clipping threshold. Thus computation of the autocorrelation function of the clipped speech is easily implemented in digital hardware using simple combinatorial logic, i.e., an up-down counter can be used to compute each correlation point. The pitch detector has been interfaced to the NOVA computer facility of the Acoustics Research Department at Bell Laboratories.
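The clipping and counting steps are simple enough to sketch directly. In this illustrative Python version (names and the 0.3 clipping ratio are assumptions, not taken from the paper), every autocorrelation product on the three-level signal is -1, 0, or +1, which is why a single up-down counter per lag suffices in hardware:

```python
import numpy as np

def three_level_clip(x, ratio=0.3):
    """Center clip then infinite peak clip: +1 above the clipping level,
    -1 below its negative, 0 in between."""
    cl = ratio * np.max(np.abs(x))
    return np.where(x > cl, 1, np.where(x < -cl, -1, 0))

def clipped_autocorr(c, lag):
    """On a {-1, 0, +1} signal each product is -1, 0 or +1, so one
    correlation point reduces to an up-down count."""
    n = len(c) - lag
    return int(np.sum(c[:n] * c[lag:]))

# Synthetic tone with a 40-sample period: strong positive correlation
# at the true period, negative at the half period.
x = np.sin(2 * np.pi * np.arange(300) / 40)
c = three_level_clip(x)
r_period = clipped_autocorr(c, 40)   # large and positive
r_half = clipped_autocorr(c, 20)     # negative (half-period anticorrelation)
```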

Journal ArticleDOI
TL;DR: A classification is given of the various possible nonlinear effects that can occur in recursive digital filters due to signal quantization and adder overflow, which include limit cycles, overflow oscillations, and quantization noise.
Abstract: A classification is given of the various possible nonlinear effects that can occur in recursive digital filters due to signal quantization and adder overflow. The effects include limit cycles, overflow oscillations, and quantization noise. A review is given of recent literature on this subject. Alternative methods of avoiding some of these nonlinear phenomena are discussed.

Journal ArticleDOI
TL;DR: The family of filters generated from a prototype filter H(z) is shown to possess certain common properties: coordinate-free quantities (second-order modes) that are invariant under frequency transformation. This invariance is significant in the design of low-noise fixed-point digital filter structures.
Abstract: The family of filters {H(F(z)):F(z) a frequency transformation} generated from a prototype filter H(z) is shown to possess certain common properties. These are coordinate-free quantities (called second-order modes) which are invariant under frequency transformation. The invariance is significant in the design of low-noise fixed-point digital filter structures since the second-order modes characterize the minimum attainable noise. Filter structures (including parallel, cascade, and ladder configurations) are studied whose output noise is essentially independent of bandwidth and center frequency. An analysis of direct form structures (whether isolated or as one section within a cascade or parallel configuration) results in an expression giving the dominant term in the output noise as a function of the parameter in the low-pass-to-low-pass transformation. This noise term approaches infinity as bandwidth approaches zero. Thus, for narrowband filters, a difference of several orders of magnitude in the output noise can exist between a scaled direct form (having six multiplications per two-pole section) and the optimal form (having nine multiplications per two-pole section).

Journal ArticleDOI
G. White, R. Neely
TL;DR: Automatic speech recognition experiments are described in which several popular preprocessing and classification strategies are compared and it is shown that dynamic programming is of major importance for recognition of polysyllabic words.
Abstract: Automatic speech recognition experiments are described in which several popular preprocessing and classification strategies are compared. Preprocessing is done either by linear predictive analysis or by bandpass filtering. The two approaches are shown to produce similar recognition scores. The classifier uses either linear time stretching or dynamic programming to achieve time alignment. It is shown that dynamic programming is of major importance for recognition of polysyllabic words. The speech is compressed into a quasi-phoneme character string or preserved uncompressed. Best results are obtained with uncompressed data, using nonlinear time registration for multisyllabic words.

Journal ArticleDOI
TL;DR: In this paper, an alternative form of the fast Fourier transform (FFT) is developed, which has the peculiarity that none of the multiplying constants required are complex-most are pure imaginary.
Abstract: An alternative form of the fast Fourier transform (FFT) is developed. The new algorithm has the peculiarity that none of the multiplying constants required are complex: most are pure imaginary. The advantages of the new form would, therefore, seem to be most pronounced in systems for which multiplications are most costly.

Journal ArticleDOI
J. Wise, J. Caprio, T. Parks
TL;DR: In this article, a method for estimating the pitch period of voiced speech sounds based on a maximum likelihood (ML) formulation was developed, which is capable of resolution finer than one sampling period and is shown to perform better in the presence of noise than the cepstrum method.
Abstract: A method for estimating the pitch period of voiced speech sounds is developed based on a maximum likelihood (ML) formulation. It is capable of resolution finer than one sampling period and is shown to perform better in the presence of noise than the cepstrum method.


Journal ArticleDOI
TL;DR: Significant time-saving can be achieved by a simple modification to the radix-2 decimation-in-time fast Fourier transform (FFT) algorithm when the data sequence to be transformed contains a large number of zero-valued samples.
Abstract: Significant time-saving can be achieved by a simple modification to the radix-2 decimation-in-time fast Fourier transform (FFT) algorithm when the data sequence to be transformed contains a large number of zero-valued samples. The time-saving is accomplished by replacing M - L stages of the FFT computation with a simple recopying procedure, where 2^M is the total number of points to be transformed, of which only 2^L are nonzero.

Journal ArticleDOI
TL;DR: It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.
Abstract: This paper presents the results of an examination of rapid amplitude compression following high-pass filtering as a method for processing speech, prior to reception by the listener, as a means of enhancing the intelligibility of speech in high noise levels. Arguments supporting this particular signal processing method are based on the results of previous perceptual studies of speech in noise. In these previous studies, it has been shown that high-pass filtered/clipped speech offers a significant gain in the intelligibility of speech in white noise over that for unprocessed speech at the same signal-to-noise ratios. Similar results have also been obtained for speech processed by high-pass filtering alone. The present paper explores these effects and it proposes the use of high-pass filtering followed by rapid amplitude compression as a signal processing method for enhancing the intelligibility of speech in noise. It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.

Journal ArticleDOI
TL;DR: This report describes the technical approach used and the support hardware and software developed, and gives overall performance figures, detailed statistics showing the importance of each rule, and listings of a translation program and another used in rule development.
Abstract: Speech synthesizers for computer voice output are most useful when not restricted to a prestored vocabulary. The simplest approach to unrestricted text-to-speech translation uses a small set of letter-to-sound rules, each specifying a pronunciation for one or more letters in some context. Unless this approach yields sufficient intelligibility, routine addition of text-to-speech translation to computer systems is unlikely, since more elaborate approaches, embodying large pronunciation dictionaries or linguistic analysis, require too much of the available computing resources. The work here described demonstrates the practicality of routine text-to-speech translation. A set of 329 letter-to-sound rules has been developed. These translate English text into the international phonetic alphabet (IPA), producing correct pronunciations for approximately 90 percent of the words, or nearly 97 percent of the phonemes, in an average text sample. Most of the remaining words have single errors easily correctable by the listener. Another set of rules translates IPA into the phonetic coding for a particular commercial speech synthesizer. This report describes the technical approach used and the support hardware and software developed. It gives overall performance figures, detailed statistics showing the importance of each rule, and listings of a translation program and another used in rule development.
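The flavor of context-dependent letter-to-sound rules is easy to illustrate. Below is a toy rule applier with a handful of made-up rules — NOT the paper's actual 329-rule set; rule order encodes "most specific first", so the soft-c rules are tried before the general 'c' rule:

```python
# Each rule: (letters, required right context, IPA output); first match wins.
RULES = [
    ("ch", "", "tʃ"),
    ("c", "e", "s"),   # 'c' before 'e' is soft
    ("c", "i", "s"),
    ("c", "", "k"),    # otherwise hard
    ("a", "", "æ"),
    ("h", "", "h"),
    ("t", "", "t"),
]

def to_ipa(word):
    out, i = [], 0
    while i < len(word):
        for letters, right, ipa in RULES:
            if (word.startswith(letters, i)
                    and word.startswith(right, i + len(letters))):
                out.append(ipa)
                i += len(letters)
                break
        else:
            i += 1   # no rule matched; a real system would have a default
    return "".join(out)

ipa = to_ipa("cat")   # "kæt"
```

The paper's rules also condition on left context and handle suffixes; the point here is only the match-and-advance control structure.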

Journal ArticleDOI
TL;DR: In this article, a theorem dealing with interesting properties of a polynomial D(z), being the denominator of the transfer function of a stable discrete system, is presented and proven.
Abstract: A theorem is presented and proven, dealing with interesting properties of a polynomial D(z), being the denominator of the transfer function of a stable discrete system. The relationships to equivalent properties of a Hurwitz polynomial are considered.

Journal ArticleDOI
TL;DR: It is shown theoretically that the two-pair quantization scheme has a 10-bit superiority over the other quantization schemes considered, in the sense of theoretically assuring that a maximum overall log spectral deviation will not be exceeded.
Abstract: The topic of quantization and bit allocation in speech processing is studied using an L2 norm. Closed-form expressions are derived for the root mean square (rms) spectral deviation due to variations in one, two, or multiple parameters. For one-parameter variation, the reflection coefficients, log area ratios, and inverse sine coefficients are studied. It is shown that, depending upon the criterion chosen, either log area ratio or inverse sine quantization can be viewed as optimal. From a practical point of view, it is shown experimentally that very little difference exists among the various quantization methods beyond the second coefficient. Two-parameter variations are studied in terms of formant frequency and bandwidth movement and in terms of a two-pair quantization scheme. A lower bound on the number of quantization levels required to satisfy a given maximum spectral deviation is derived along with the two-pair quantization scheme which approximately satisfies the bound. It is shown theoretically that the two-pair quantization scheme has a 10-bit superiority over the other quantization schemes mentioned above, in the sense of theoretically assuring that a maximum overall log spectral deviation will not be exceeded.
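The parameter transformations under study are one-liners. A sketch of the standard definitions (note the sign convention of the log area ratio varies in the literature; the round trip below is consistent either way):

```python
import numpy as np

def log_area_ratio(k):
    """g_i = log((1 - k_i) / (1 + k_i)) for reflection coefficients |k_i| < 1."""
    k = np.asarray(k, float)
    return np.log((1.0 - k) / (1.0 + k))

def inverse_sine(k):
    """s_i = arcsin(k_i)."""
    return np.arcsin(np.asarray(k, float))

def lar_to_reflection(g):
    """Invert the log area ratio transform."""
    e = np.exp(np.asarray(g, float))
    return (1.0 - e) / (1.0 + e)

k = np.array([0.9, -0.5, 0.1])
g = log_area_ratio(k)   # round trip: lar_to_reflection(g) recovers k
```

Quantizing g (or arcsin k) uniformly instead of k itself is what flattens the spectral sensitivity near |k| = 1, which is the practical point of both transforms.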

Journal ArticleDOI
TL;DR: In this article, the authors compared the effectiveness of the discrete cosine and Fourier transforms in decorrelating sampled signals with Markov-1 statistics, and showed that the DCT offers a higher (or equal) effectiveness than the discrete Fourier transform for all values of the correlation coefficient.
Abstract: This correspondence compares the effectiveness of the discrete cosine and Fourier transforms in decorrelating sampled signals with Markov-1 statistics. It is shown that the discrete cosine transform (DCT) offers a higher (or equal) effectiveness than the discrete Fourier transform (DFT) for all values of the correlation coefficient. The mean residual correlation is shown to vanish as the inverse square root of the sample size.
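The comparison is easy to reproduce numerically: build the Markov-1 covariance R[i, j] = ρ^|i-j|, transform it, and measure what is left off the diagonal. The mean absolute off-diagonal element below is a crude proxy for the paper's residual-correlation measure, and all names are illustrative:

```python
import numpy as np

def markov1_cov(N, rho):
    """Covariance of a first-order Markov process: R[i, j] = rho^|i-j|."""
    i = np.arange(N)
    return rho ** np.abs(i[:, None] - i[None, :])

def dct_matrix(N):
    """Orthonormal DCT-II matrix (rows are basis vectors)."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    C[0] /= np.sqrt(2.0)
    return C

def mean_offdiag(T, R):
    """Mean absolute off-diagonal element of T R T^H."""
    S = T @ R @ T.conj().T
    off = np.abs(S - np.diag(np.diag(S)))
    N = len(S)
    return off.sum() / (N * (N - 1))

N, rho = 16, 0.9
R = markov1_cov(N, rho)
F = np.fft.fft(np.eye(N)) / np.sqrt(N)    # unitary DFT matrix
dct_resid = mean_offdiag(dct_matrix(N), R)
dft_resid = mean_offdiag(F, R)            # DCT leaves less residual correlation
```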

Journal ArticleDOI
TL;DR: Results suggest that LPC analysis/synthesis is fairly immune to the degradation of DPCM quantization; the effects of DM quantization are more severe, and the effects of additive white noise are the most serious.
Abstract: An important problem in some communication systems is the performance of linear prediction (LPC) analysis with speech inputs that have been corrupted by (signal-correlated) quantization distortion or additive white noise. To gain a first insight into this problem, a high-quality speech sample was deliberately degraded by using various degrees (bit rates of 16 kbps and more) of differential PCM (DPCM), and delta modulation (DM) quantization, and by the introduction of additive white noise. The resulting speech samples were then analyzed to obtain the LPC control signals: pitch, gain, and the linear prediction coefficients. These control parameters were then compared to the parameters measured in the original, high quality signal. The measurements of pitch perturbations were assessed on the basis of how many points exceeded an appropriate difference limen. A distance measure proposed by Itakura was used to compare the original LPC coefficients with the coefficients measured from the degraded speech. In addition, the measured control signals were used to synthesize speech for perceptual evaluation. Results suggest that LPC analysis/synthesis is fairly immune to the degradation of DPCM quantization. The effects of DM quantization are more severe and the effects of additive white noise are the most serious.

Journal ArticleDOI
TL;DR: A particularly simple way to control fast Fourier transform (FFT) hardware that allows parallel organization of the memory such that at any stage the two inputs and outputs of each butterfly belong to different memory units, hence can always be accessed in parallel.
Abstract: A particularly simple way to control fast Fourier transform (FFT) hardware is described. The method produces the indices both for inputs of each butterfly operation and for the appropriate W. In addition, this method allows parallel organization of the memory such that at any stage the two inputs and outputs of each butterfly belong to different memory units, hence can always be accessed in parallel.

Journal ArticleDOI
TL;DR: In this paper, a new noise expression for the class of fixed-point digital filters described by the state equations is formulated, and two methods of its computation are discussed, and the effects of possible structure transformation and state-amplitude scalings are then incorporated in this expression, and results have been analyzed.
Abstract: A new noise expression is formulated for the class of fixed-point digital filters described by the state equations, and two methods of its computation are discussed. The effects of possible structure transformation and state-amplitude scalings are then incorporated in this expression, and the results have been analyzed. In particular, it is shown that the output noise and state amplitudes are inversely proportional, and that an elementary transformation is well suited for a step-by-step generation of a low-noise filter.

Journal ArticleDOI
TL;DR: A statistical model for roundoff errors is used to predict the output noise of the decimation-in-time and decimation-in-frequency FFT algorithms under two's complement arithmetic with either rounding or chopping, for both radix-2 and mixed-radix FFT's.
Abstract: A statistical model for roundoff errors is used to predict the output noise of the two common forms of the fast Fourier transform (FFT) algorithm, the decimation-in-time and decimation-in-frequency forms. This paper deals with two's complement arithmetic with either rounding or chopping. The total mean-square errors and the mean-square errors for the individual points are derived for radix-2 FFT's. Results for mixed-radix FFT's are also given.

Journal ArticleDOI
TL;DR: It is shown that a design which uses finite impulse response (FIR) filters for each stage, and which is minimized for storage is essentially minimized in terms of computation rate as well, and that multistage IIR designs can be somewhat more efficient computationally than single-stage designs; however, the storage efficiency is worse than that of the single- stage IIR design.
Abstract: In this paper several issues concerning the design and implementation of multistage decimators, interpolators, and narrow-band filters are discussed. In particular, the question of designing these systems in terms of minimum storage rather than minimum computation rate is examined. It is shown that a design which uses finite impulse response (FIR) filters for each stage, and which is minimized for storage is essentially minimized in terms of computation rate as well. The problem of further improvements in designing decimators and interpolators by taking advantage of DON'T CARE frequency bands is also discussed. For the early stages in a multistage design it is shown that fairly significant reductions in filter order can be achieved in this manner. A third issue in the design process is the question of practical schemes for efficient implementation of multistage decimators and interpolators in both hardware and software. One such efficient implementation is discussed in this paper. Finally, the problem of designing multistage decimators and interpolators using elliptic infinite impulse response (IIR) filters is discussed. It is shown that multistage IIR designs can be somewhat more efficient computationally than single-stage designs; however, the storage efficiency of the multistage IIR design is worse than that of the single-stage IIR design.

Journal ArticleDOI
TL;DR: The hardware design and implementation of a Fermat number transform (FNT) is described; the arithmetic logic design is treated in detail, and a new data representation for integers modulo a Fermat number is derived.
Abstract: The hardware design and implementation of a Fermat number transform (FNT) is described. The arithmetic logic design is treated in detail and a new data representation for integers modulo a Fermat number is derived. In addition, the FNT is compared with the fast Fourier transform (FFT) on the basis of hardware required for a pipeline convolver.