
Showing papers on "Speech enhancement published in 1984"


Journal ArticleDOI
TL;DR: In this article, a system which utilizes a minimum mean square error (MMSE) estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm.
Abstract: This paper focuses on the class of speech enhancement systems which capitalize on the major importance of the short-time spectral amplitude (STSA) of the speech signal in its perception. A system which utilizes a minimum mean-square error (MMSE) STSA estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm. In this paper we derive the MMSE STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables. We analyze the performance of the proposed STSA estimator and compare it with a STSA estimator derived from the Wiener estimator. We also examine the MMSE STSA estimator under uncertainty of signal presence in the noisy observations. In constructing the enhanced signal, the MMSE STSA estimator is combined with the complex exponential of the noisy phase. It is shown here that the latter is the MMSE estimator of the complex exponential of the original phase, which does not affect the STSA estimation. The proposed approach results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise. The complexity of the proposed algorithm is approximately that of other systems in the discussed class.

3,905 citations
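
The gain function at the heart of such an MMSE STSA system can be sketched numerically. The snippet below is an illustrative implementation of the standard closed-form MMSE STSA gain as a function of the a priori SNR xi and the a posteriori SNR gamma; the function name and the use of SciPy's scaled Bessel functions are my choices, not the paper's code.

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled modified Bessel functions


def mmse_stsa_gain(xi, gamma):
    """Spectral gain of the MMSE short-time spectral amplitude estimator.

    xi    : a priori SNR of a spectral component
    gamma : a posteriori SNR of the same component
    """
    v = xi / (1.0 + xi) * gamma
    # i0e(x) = exp(-x) * I0(x), so the exp(-v/2) factor in the gain
    # formula is absorbed without numerical overflow for large v.
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * (
        (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    )


# At high SNR the gain approaches the Wiener gain xi / (1 + xi).
print(mmse_stsa_gain(100.0, 100.0))  # close to 100/101
```

The enhanced amplitude is this gain times the noisy amplitude, combined with the noisy phase as the abstract describes.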



Journal ArticleDOI
TL;DR: An algorithm to estimate a signal from its modified short-time Fourier transform (STFT) by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT magnitude is presented.
Abstract: In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.

1,899 citations
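
The iterative magnitude-only synthesis described above alternates between inverting the current complex STFT estimate and restoring the target magnitude with the re-analyzed phase. A minimal sketch using SciPy's stft/istft; the window length, overlap, and random phase initialization are illustrative assumptions, not the paper's exact analysis parameters.

```python
import numpy as np
from scipy.signal import stft, istft


def iterative_stft_magnitude(target_mag, n_iter=100, nperseg=256, noverlap=192):
    """Estimate a signal whose STFT magnitude matches target_mag.

    Each iteration inverts the current complex STFT estimate, re-analyzes the
    result, and keeps its phase while restoring the target magnitude; the
    STFT-magnitude squared error is non-increasing across iterations.
    Assumes istft followed by stft preserves the time-frequency grid shape.
    """
    rng = np.random.default_rng(0)
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, target_mag.shape))
    x = None
    for _ in range(n_iter):
        _, x = istft(target_mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, X = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(X))
    return x
```

The per-iteration cost is one forward and one inverse transform, consistent with the abstract's point that the dominant computation is the DFT.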


Proceedings ArticleDOI
19 Mar 1984
TL;DR: Results for a speaker dependent connected digit speech recognition task with a base error rate of 1.6%, show that preprocessing the noisy unknown speech with a 10 dB signal-to-noise ratio reduces the error rate from 42% to 10%.
Abstract: Acoustic noise suppression is treated as a problem of finding the minimum mean square error estimate of the speech spectrum from a noisy version. This estimate equals the conditional expectation of the speech spectrum given the noisy spectral value, the mean noise power, and the mean speech power. It is shown that speech is not Gaussian. This results in an optimal estimate which is a non-linear function of the spectral magnitude. This function differs from the Wiener filter, especially at high instantaneous signal-to-noise ratios. Since both speech and Gaussian noise have a uniform phase distribution, the optimal estimator of the phase equals the noisy phase. The paper describes how the estimator can be calculated directly from noise-free speech. It describes how to find the optimal estimator for the complex spectrum, the magnitude, the squared magnitude, the log magnitude, and the root-magnitude spectra. Results for a speaker-dependent connected digit speech recognition task with a base error rate of 1.6% show that preprocessing the noisy unknown speech at a 10 dB signal-to-noise ratio reduces the error rate from 42% to 10%. If the template data are also preprocessed in the same way, the error rate reduces to 2.1%, thus recovering 99% of the recognition performance lost due to noise.

138 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: Algorithms based on spectral subtraction are developed for improving the intelligibility of speech that has been interfered by a second talker's voice, and significant gain in intelligibility for low signal-to-noise ratio conditions is achieved.
Abstract: Algorithms based on spectral subtraction are developed for improving the intelligibility of speech that has been interfered by a second talker's voice. A number of new properties of spectral subtraction are shown, including the effects of phase on the output speech intelligibility, and the choice of magnitude spectral differences for best results. A harmonic extraction algorithm is also developed. Results of formal testing on the final system show that significant gain in intelligibility for low signal-to-noise ratio conditions is achieved.

45 citations
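
Plain magnitude spectral subtraction, the operation these algorithms build on, can be sketched as follows. The fixed per-frequency interference estimate, the spectral floor, and the parameter values are illustrative assumptions, not the paper's harmonic-extraction system.

```python
import numpy as np
from scipy.signal import stft, istft


def spectral_subtraction(noisy, interferer_mag, fs, nperseg=256, floor=0.02):
    """Subtract an interference magnitude estimate from the noisy STFT.

    interferer_mag : per-frequency magnitude estimate of the interference
                     (e.g. averaged over frames known to contain only it)
    floor          : fraction of the noisy magnitude kept as a spectral
                     floor, which limits musical-noise artifacts
    """
    _, _, X = stft(noisy, fs=fs, nperseg=nperseg)
    mag = np.abs(X)
    clean_mag = np.maximum(mag - interferer_mag[:, None], floor * mag)
    # reuse the noisy phase, as in most subtraction-based systems
    _, x = istft(clean_mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=nperseg)
    return x
```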


Proceedings ArticleDOI
19 Mar 1984
TL;DR: It is shown, on analyses of both synthetic and natural speech, that the averaged parabolic approximation between harmonic peaks of the voiced speech spectrum reduces the sensitivity of the LP analysis to changes in the fundamental frequency F0 and to noise.
Abstract: In spite of its extensive use, speech analysis based on linear prediction (LP) is liable to various causes of inaccuracy. This paper presents a novel approach to improving the accuracy of the voiced speech production model estimated by the LP method. The presented method uses interpolation between spectral points which are least influenced by artifacts in the spectral analysis and by noise in the signal. We show, on analyses of both synthetic and natural speech, that the averaged parabolic approximation between harmonic peaks of the voiced speech spectrum reduces the sensitivity of the LP analysis to changes in the fundamental frequency F0 and to noise. The method is well suited for combination with the Spectral Transform LP method, previously proposed by the authors [1].

34 citations
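
Parabolic interpolation of a spectral peak from three neighboring bins, the elementary operation behind an averaged parabolic approximation between harmonic peaks, can be sketched as follows; this is a generic three-point interpolator, not the authors' full method.

```python
def parabolic_peak(mags, k):
    """Refine the position and height of a spectral peak at bin k by
    fitting a parabola through bins k-1, k, k+1.

    Returns (refined position in fractional bins, refined peak height).
    """
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    delta = 0.5 * (a - c) / (a - 2.0 * b + c)  # offset in bins, in (-0.5, 0.5)
    height = b - 0.25 * (a - c) * delta
    return k + delta, height
```

On log-magnitude spectra this is exact for a Gaussian-shaped peak and a good approximation for windowed harmonic peaks.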



Proceedings ArticleDOI
K. Oh1, C. Un
19 Mar 1984
TL;DR: It has been found that for pitch detection of noisy speech, the algorithms that use an AMDF or an autocorrelation function yield better performance than the others.
Abstract: Results of a performance comparison study of eight pitch extraction algorithms for noisy as well as clean speech are presented. These algorithms are the autocorrelation method with center clipping, the autocorrelation method with modified center clipping, the simplified inverse filter tracking (SIFT) method, the average magnitude difference function (AMDF) method, the pitch detection method based on LPC inverse filtering and AMDF, the data reduction method, the parallel processing method, and the cepstrum method. It has been found that for pitch detection of noisy speech, the algorithms that use an AMDF or an autocorrelation function yield better performance than the others. A pitch detector that uses center-clipped speech as its input signal is effective for pitch extraction from noisy speech. In general, preprocessing such as LPC inverse filtering or center clipping of the input speech yields a remarkable improvement in pitch detection.

26 citations
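
The AMDF method that performs well in this comparison can be sketched as follows; a minimal illustrative implementation with arbitrary search limits, not any of the eight algorithms exactly as tested.

```python
import numpy as np


def amdf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch as the lag of the deepest AMDF valley.

    Note: the global minimum can land on a multiple of the true period
    (octave error); practical detectors take the first valley below a
    threshold, often after center clipping or LPC inverse filtering.
    """
    lags = range(int(fs / fmax), int(fs / fmin) + 1)
    amdf = [np.mean(np.abs(frame[:-lag] - frame[lag:])) for lag in lags]
    best = lags[int(np.argmin(amdf))]
    return fs / best
```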


Proceedings ArticleDOI
01 Mar 1984
TL;DR: A system for speech analysis and enhancement which combines signal processing and symbolic processing in a closely coupled manner and attempts to reconstruct the original speech waveform using symbolic processing to help model the signal and to guide reconstruction.
Abstract: This paper describes a system for speech analysis and enhancement which combines signal processing and symbolic processing in a closely coupled manner. The system takes as input both a noisy speech signal and a symbolic description of the speech signal. The system attempts to reconstruct the original speech waveform using symbolic processing to help model the signal and to guide reconstruction. The system uses various signal processing algorithms for parameter estimation and reconstruction.

20 citations


Proceedings ArticleDOI
19 Mar 1984
TL;DR: For the applications of speech synthesis from speech model parameters, time-scale modification of clean speech, speech enhancement by spectral subtraction, and helium speech enhancement, significant improvement is not gained by using the LSEE-MSTFTM algorithm.
Abstract: In this paper, speech synthesis directly from the processed Short-Time Fourier Transform Magnitude (STFTM) using the LSEE-MSTFTM algorithm [6,7] is compared to more conventional algorithms for several speech processing applications. For the applications considered, the most improvement occurs for time-scale modification of multiple speaker speech and noisy speech since these input signals are not well modeled by the analysis/synthesis system used for comparison. However, for the applications of speech synthesis from speech model parameters, time-scale modification of clean speech, speech enhancement by spectral subtraction, and helium speech enhancement, significant improvement is not gained by using the LSEE-MSTFTM algorithm. Significantly better results are not obtained since a good STFT phase estimate is available and employed in the conventional approaches to these applications.

15 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: An end-point detector for LPC speech using squared prediction error look-ahead and automatic/manual threshold determination is described, which is relatively immune to transient pulses and various low-level noises, yet preserves low- level speech sounds such as weak fricatives to a significant extent under moderate noise conditions.
Abstract: An end-point detector for LPC speech using squared-prediction-error look-ahead and automatic/manual threshold determination is described. The detector is algorithmically simple, computationally efficient, and uses only one decision parameter. Preliminary tests indicate that it is relatively immune to transient pulses and various low-level noises, yet preserves low-level speech sounds such as weak fricatives to a significant extent under moderate noise conditions. Tests indicate that 93.8% of automatically determined endpoints agree to within two frames of manually determined endpoints. The detector is especially suitable for use in vector-quantization-based LPC systems, where the squared prediction error is easily available.
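
An endpoint decision driven by a single frame-level parameter with look-ahead, in the spirit of the detector described here, can be sketched as follows; the threshold rule and look-ahead length are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np


def detect_endpoints(frame_err, threshold, look_ahead=3):
    """Mark speech frames where the decision parameter (e.g. squared
    prediction error) exceeds threshold for at least `look_ahead`
    consecutive frames, which suppresses short transient pulses."""
    above = np.asarray(frame_err) > threshold
    speech = np.zeros_like(above)
    run = 0
    for i, a in enumerate(above):
        run = run + 1 if a else 0
        if run >= look_ahead:
            speech[i - look_ahead + 1 : i + 1] = True
    return speech
```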

Proceedings ArticleDOI
19 Mar 1984
TL;DR: It is demonstrated that good (>10 dB) cancellation of the additive noise and little speech distortion can be achieved by having the reference microphone attached to the outside of the facemask and by updating the filter coefficients only during silence intervals.
Abstract: In this paper we discuss some preliminary results on using Widrow's Adaptive Noise Cancelling (ANC) algorithm to reduce the background noise present in a fighter pilot's speech. With a dominant noise source present and with the pilot wearing an oxygen facemask, we demonstrate that good (>10 dB) cancellation of the additive noise and little speech distortion can be achieved by having the reference microphone attached to the outside of the facemask and by updating the filter coefficients only during silence intervals.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A simple method is presented for extracting the amplitudes and locations of a multiple impulse excitation model which allows a more accurate recomputation of the autoregressive coefficients based upon incorporating the multipulse excitation.
Abstract: One of the sources of degradation in LPC-synthesized speech is the mechanical quality due to a single impulse excitation per pitch period. This paper presents a simple method for extracting the amplitudes and locations of a multiple impulse excitation model. These multipulse parameters are obtained very easily from the autoregressive (LPC) residual. Additionally, a method is developed which allows a more accurate recomputation of the autoregressive coefficients based upon incorporating the multipulse excitation.
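
Extracting multipulse amplitudes and locations directly from the LPC residual can be sketched as below; this illustrative version simply keeps the N largest residual samples, a stand-in for, not a reproduction of, the paper's procedure.

```python
import numpy as np
from scipy.signal import lfilter


def multipulse_excitation(x, a, n_pulses=8):
    """Pick pulse locations and amplitudes from the LPC residual.

    a : prediction polynomial [1, a1, ..., ap]; the residual is the
        output of the inverse filter A(z) applied to the speech x.
    """
    residual = lfilter(a, [1.0], x)
    locs = np.sort(np.argsort(np.abs(residual))[-n_pulses:])
    excitation = np.zeros_like(x)
    excitation[locs] = residual[locs]
    return locs, excitation
```

Synthesizing with 1/A(z) driven by this sparse excitation then replaces the single impulse per pitch period the abstract identifies as the source of the mechanical quality.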

Proceedings ArticleDOI
01 Mar 1984
TL;DR: Results from formal speech intelligibility and quality tests in simulated fighter aircraft cockpit noise show clearly that each of the two-sensor signals under test outperforms the signal from the gradient microphone alone and that the performance improvement generally increases with the noise level.
Abstract: The aim of this work is to develop multisensor configurations for transducing speech in order to achieve enhanced immunity to acoustic background noise. We performed detailed measurements of the sound field in the vicinity of the mouth and neck during speech using pressure and pressure-gradient microphones and an accelerometer. We investigated the properties of the measured signals from the various sensor types and positions through long-term and short-term spectral analyses and from articulation index scores computed assuming ambient noise typical of that in a fighter aircraft cockpit. From the results of this investigation, we developed a two-sensor configuration involving an accelerometer and a gradient microphone. Results from formal speech intelligibility and quality tests in simulated fighter aircraft cockpit noise show clearly that each of the two-sensor signals under test outperforms the signal from the gradient microphone alone and that the performance improvement generally increases with the noise level.

Proceedings ArticleDOI
T. Eger1, J. Su, L. Varner
01 Mar 1984
TL;DR: Although the algorithm is an extension of the spectral subtraction concept, the soft nonlinearity provides less distortion of low-amplitude speech components and less sensitivity to the subtraction level than previously reported techniques.
Abstract: This paper discusses a nonlinear spectrum processing approach to speech enhancement. The technique incorporates a "soft" nonlinearity which suppresses low-level noise components while passing higher-level speech components. Although the algorithm is an extension of the spectral subtraction concept, the soft nonlinearity provides less distortion of low-amplitude speech components and less sensitivity to the subtraction level than previously reported techniques. In addition to noticeable improvement in perceptual quality, the algorithm offers a substantial increase in SNR.
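
A "soft" suppression curve of the general kind described, one that attenuates low-level spectral components smoothly instead of subtracting a fixed amount, can be illustrated as below. The particular sigmoid-like gain is my own stand-in, not the authors' nonlinearity.

```python
import numpy as np


def soft_suppress(mag, noise_level, p=2.0):
    """Soft spectral gain: approaches 1 well above the noise level and
    rolls off smoothly toward 0 well below it, avoiding the hard
    rectification edge of plain subtraction."""
    r = (mag / noise_level) ** p
    return mag * r / (1.0 + r)
```

Because the gain varies smoothly near the noise level, small errors in the subtraction level shift the knee slightly instead of gating components on and off.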

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A simplified multi-pulse analysis is proposed here with particular emphasis on the speech model developed as a result of a two-pass method, incorporating knowledge of the estimated pulse locations and amplitudes along with perceptual synthesis error weighting.
Abstract: The introduction of multi-pulse excitation for LPC coders has increased the quality achievable for digitally coded speech at bit rates in the 9.6 Kbps range. A simplified multi-pulse analysis is proposed here with particular emphasis on the speech model developed as a result of a two-pass method. In the first pass, estimated LPC parameters generated by conventional covariance analysis are used to generate the forward prediction error; the multi-pulse sequence is then detected by thresholding the residual. The second pass generates the final LPC parameters by a covariance analysis incorporating knowledge of the estimated pulse locations and amplitudes along with perceptual synthesis error weighting. Experimental results are presented to demonstrate the method.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: The results indicate that smoothing in addition to twicing provides significant performance improvement in high noise background, and the twicing algorithm reduces the error rate in the noisy environments by a significant amount.
Abstract: In this paper we present a study on the performance of a speaker-dependent continuous speech recognition algorithm at various background noise levels, including the mismatch tolerance of the algorithm. This mismatch exists in most applications, where the system is trained at one noise level and recognition is performed at different, highly variable noise levels. Finally, we introduce a pre-processing technique called 'twicing' and a simple 3-point moving-average post-processor. The twicing algorithm, while still maintaining high performance in a quiet background, reduces the error rate in noisy environments by a significant amount. The results indicate that smoothing in addition to twicing provides a significant performance improvement in high-noise backgrounds.
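
The abstract does not spell out its 'twicing' pre-processor, but Tukey's twicing (re-smoothing the residual and adding it back) combined with the 3-point moving average mentioned above can be sketched as follows; treating their 'twicing' as Tukey's construction is an assumption on my part.

```python
import numpy as np


def moving_average_3(x):
    """Simple 3-point moving average (edges are zero-padded)."""
    return np.convolve(x, np.ones(3) / 3.0, mode="same")


def twice(x):
    """Tukey's 'twicing': smooth, then add the smoothed residual back,
    recovering detail that the first smoothing pass removed."""
    s = moving_average_3(x)
    return s + moving_average_3(x - s)
```

On slowly varying data the residual is near zero and twicing reduces to the plain smoother; on data with sharp structure it restores part of what smoothing alone would blur.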

Proceedings ArticleDOI
01 Mar 1984
TL;DR: In this paper, an automatic gain control (AGC) is used to self-adjust the front-end gain of the LPC analyzer in such a way that the speech waveform is more accurately quantized by the analog-to-digital converter.
Abstract: The purpose of an automatic gain control (AGC) is to self-adjust the front-end gain of the LPC analyzer in such a way that the speech waveform is more accurately quantized by the analog-to-digital converter. Tests in the past have indicated that properly amplified speech produces higher intelligibility scores at the narrowband LPC output because both filter and excitation parameters are more accurately estimated. In addition, properly amplified input speech results in properly amplified speech at the receiver, which is highly desirable for listening in a noisy environment.
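
A front-end AGC of the sort described, which scales the input so the A/D converter's range is well used, can be sketched as follows; the frame-based structure, target level, and gain limit are illustrative assumptions.

```python
import numpy as np


def agc(x, target_rms=0.25, frame=160, max_gain=20.0):
    """Frame-by-frame gain so each frame's RMS approaches target_rms
    (a fraction of full scale), keeping the waveform well inside the
    quantizer's range without clipping; gain is capped for silence."""
    out = np.copy(x).astype(float)
    for start in range(0, len(x), frame):
        seg = out[start : start + frame]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
        out[start : start + frame] = seg * min(target_rms / rms, max_gain)
    return out
```

A production AGC would smooth the gain across frames to avoid audible pumping; this sketch applies it per frame for clarity.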

Proceedings ArticleDOI
01 Mar 1984
TL;DR: Modifications are presented which improve the quality of the synthesized speech without requiring the transmission of any additional data and show an increase of up to 5 points in overall speech quality with the implementation of each of these improvements.
Abstract: The major weakness of the current narrowband LPC synthesizer lies in the use of a "canned" invariant excitation signal. The use of such an excitation signal is based on three primary assumptions, namely, (1) that the amplitude spectrum of the excitation signal is flat and time-invariant, (2) that the phase spectrum of the voiced excitation signal is a time-invariant function of frequency, and (3) that the probability density function (PDF) of the phase spectrum of the unvoiced excitation signal is also time-invariant. This paper critically examines these assumptions and presents modifications which improve the quality of the synthesized speech without requiring the transmission of any additional data. Diagnostic Acceptability Measure (DAM) tests show an increase of up to 5 points in overall speech quality with the implementation of each of these improvements.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: This paper proposes a recognition scheme which adapts itself to mild degradations in speech, and suggests techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculations.
Abstract: The performance of an isolated word speech recognition (IWSR) system is known to drop rapidly as the degradation of the input speech increases. In this paper we propose a recognition scheme which adapts itself to mild degradations in speech. The scheme does not need a priori information regarding the nature and extent of the noise. We suggest techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculations. A suitable index is used to study the performance of the recognition system for small data sets. Our scheme lends itself to greater flexibility in handling degradations in the speech input than do existing recognition schemes. We illustrate our scheme by simulating adaptive differential pulse code modulated (ADPCM) speech, where the main distortion is contributed by quantization noise.

Journal ArticleDOI
J. Asmuth1, Jerry D. Gibson
TL;DR: Subjective listening tests and spectrograms show the Kalman algorithm to be a viable alternative to the block-adaptive algorithms.
Abstract: The output speech from a fixed-tap differential pulse code modulation system with adaptive quantization and adaptive noise spectral shaping (NSS) is compared for block-adaptive and sequentially adaptive NSS filters. Block-adaptive systems incorporate a delay which can build up in analog communications systems and cause echoing problems. The buildup of delay can be eliminated by implementing the sequentially adaptive Kalman algorithm in the NSS filter. Simulations are performed for 4-, 8-, and 16-level quantizers with fourth- and ninth-order Kalman adaptation of the NSS filter. A block-adaptive system is implemented as a reference. Subjective listening tests and spectrograms show the Kalman algorithm to be a viable alternative to the block-adaptive algorithms. Signal-to-noise ratios are also given.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: The lossy electric transmission-line analog model of the vocal tract is used to study the acoustics of speech produced in a hyperbaric helium-oxygen atmosphere and shows that the classic Fant and Lindquist formula somewhat overstates the formant frequency shift when glottal and radiation effects are included.
Abstract: We use the lossy electric transmission-line analog model of the vocal tract to study the acoustics of speech produced in a hyperbaric helium-oxygen atmosphere. The analysis extends previous work by including more completely the effects of the wall vibration, glottal, and radiation impedances, and by analyzing the formant bandwidths and amplitudes in addition to the formant frequencies. It shows that (1) the classic Fant and Lindquist formula somewhat overstates the formant frequency shift when glottal and radiation effects are included; (2) the lower formant bandwidths increase by much more than commonly assumed; and (3) the upper formant amplitudes are higher relative to the lower formants in helium speech than in normal speech. These results are useful in developing advanced helium speech enhancement algorithms.

15 Nov 1984
TL;DR: In this article, a new application of Widrow's adaptive noise cancellation (ANC) algorithm is presented, where the ambient environment is generalized to include the case where an acoustic barrier exists between the primary and reference microphones.
Abstract: A new application of Widrow's Adaptive Noise Cancelling (ANC) algorithm is presented. Specifically, the ambient environment is generalized to include the case where an acoustic barrier exists between the primary and reference microphones. By updating the coefficients of the noise estimation filter only during silence, it is shown that the ANC technique can provide substantial noise reduction with little speech distortion even when the acoustic barrier provides only moderate attenuation of acoustic signals. The use of the modified ANC method is evaluated using an oxygen facemask worn by fighter aircraft pilots. Experiments demonstrate that if a noise field is created using a single source, an 11 dB signal-to-noise ratio improvement can be achieved by attaching the reference microphone to the exterior of the facemask. The ANC filter required for this particular environment is only about 50 taps long.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: This paper presents the results of a two-year effort to develop an adaptive predictive coder for transmission of high-quality speech over 16 Kbps channels with up to a 5% bit error rate, maximizing intelligibility and maintaining bandwidth compatibility with existing speech compression systems.
Abstract: This paper presents the results of a two-year effort to develop an adaptive predictive coder for transmission of high-quality speech over 16 Kbps channels with up to a 5% bit error rate. To maximize intelligibility and maintain bandwidth compatibility with existing speech compression systems, an 8 KHz A-D sampling rate constraint was imposed. Our algorithm, called APC-HQ for "hybrid quantization," concentrates on improving the residual coding and uses both segmental quantization and center-clipping quantization in a perceptually optimum manner. The result is an algorithm running in real time on an AP-120B [4] which yields high quality under the input bandwidth and noisy channel constraints, and whose speech quality exceeds that attainable with either technique alone.


Proceedings ArticleDOI
01 Mar 1984
TL;DR: A new approach to the recognition of consonants is presented based on the modelling of the vocal tract by two acoustic cavities with the voicing source at the input of the first cavity and the fricative noise or pulse sources at the junction of the two cavities.
Abstract: In this paper we present a new approach to the recognition of consonants. This approach is based on the modelling of the vocal tract by two acoustic cavities with the voicing source at the input of the first cavity and the fricative noise or pulse source at the junction of the two cavities. The separation of the system into these dual cavities results in an ARMA structure for the acoustic signal.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: The performance results of the adaptive DPCM speech codecs which have been used in conjunction with an Adaptive Frequency Mapping system (AFMAP) significantly enhances the recovered speech compared to other bandwidth compression systems.
Abstract: The performance results of adaptive DPCM speech codecs used in conjunction with an Adaptive Frequency Mapping system (AFMAP) are presented. The AFMAP preprocessor operates on wideband speech (0.3 - 7.6 KHz) and compresses the speech signal into a 0.3 - 3.3 KHz telephone channel. This signal is subsequently digitized by DPCM codecs employing simple but efficient prediction algorithms in order to reproduce wideband speech from the AFMAP postprocessor output. Informal listening tests and SNRSEG measures indicate that the AFMAP system in tandem with DPCM codecs significantly enhances the recovered speech compared to other bandwidth compression systems. The reproduced AFMAP speech is also preferable to 0.3 - 3.3 KHz telephone speech digitized by the same codecs at the same transmission bit rates.