scispace - formally typeset
Search or ask a question
Journal ArticleDOI

A statistical model-based voice activity detection

01 Jan 1999-IEEE Signal Processing Letters (IEEE)-Vol. 6, Iss: 1, pp 1-3
TL;DR: An effective hang-over scheme which considers the previous observations by a first-order Markov process modeling of speech occurrences is proposed which shows significantly better performances than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.
Abstract: In this letter, we develop a robust voice activity detector (VAD) for the application to variable-rate speech coding. The developed VAD employs the decision-directed parameter estimation method for the likelihood ratio test. In addition, we propose an effective hang-over scheme which considers the previous observations by a first-order Markov process modeling of speech occurrences. According to our simulation results, the proposed VAD shows significantly better performances than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.
Citations
More filters
Journal ArticleDOI
TL;DR: In this article, an improved minima controlled recursive averaging (IMCRA) approach is proposed for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR).
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach, for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.

902 citations

01 Jan 2002
TL;DR: It is shown that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper, we present an Improved Minima Con- trolled Recursive Averaging (IMCRA) approach, for noise es- timation in adverse environments involving non-stationary noise, weak speech components, and low input signal-to- noise ratio (SNR). The noise estimate is obtained by av- eraging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iter- ations of smoothing and minimum tracking. The rst it- eration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in non-stationary noise environments and under low SNR conditions, the IMCRA approach is very eectiv e. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.

834 citations


Cites background from "A statistical model-based voice act..."

  • ...Additionally, VADs are generally difficult to tune and their reliability severely deteriorates for weak speech components and low input SNR [15], [16], [20]....

    [...]

Journal ArticleDOI
TL;DR: A noisy speech corpus is developed suitable for evaluation of speech enhancement algorithms encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based and Wiener-type algorithms.

634 citations


Cites methods from "A statistical model-based voice act..."

  • ...More precisely, a statistical-model based voice activity detector (VAD) (Sohn et al., 1999) was used to update the noise spectrum during speech-absent periods....

    [...]

  • ...This was surprising at first, but close analysis indicated that the logMMSE-SPU algorithm was sensitive to the noise spectrum estimate, which in our case was obtained with a VAD algorithm....

    [...]

  • ...The frame windowing scheme proposed in (Jabloun and Champagne, 2003) was adopted in both VAD methods....

    [...]

  • ...The following VAD decision rule was used: 1 L XL 1 k¼1 log Kk ?...

    [...]

  • ...(3) Incorporating noise estimation algorithms in place of VAD algorithms for updating the noise spectrum did not produce significant improvements in performance....

    [...]

Journal ArticleDOI
TL;DR: An optimally-modi#ed log-spectral amplitude (OM-LSA) speech estimator and a minima controlled recursive averaging (MCRA) noise estimation approach for robust speech enhancement are presented.

569 citations


Cites methods from "A statistical model-based voice act..."

  • ...soft-decision speech pause detection is either implemented on a frame-by-frame basis [12, 22 ] or estimated independentlyfor individual subbands using an a posteriori signal-to-noise ratio (SNR) [11,13]....

    [...]

Journal ArticleDOI
TL;DR: A comparative study of human versus machine speaker recognition is concluded, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems.
Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition?with ever-improving performance?to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts?the first part involves forensic speaker-recognition methods, and the second illustrates how a na?ve listener performs this task from a neuroscience perspective. We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.

554 citations

References
More filters
Book
01 Jan 1993
TL;DR: This book presents a meta-modelling framework for speech recognition that automates the very labor-intensive and therefore time-heavy and therefore expensive and expensive process of manually modeling speech.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations


"A statistical model-based voice act..." refers methods in this paper

  • ...…frame state model, the current state depends on the previous observations as well as the current one, which is reflected on the decision rule in the following way: Based on the above formulations, a recursive formula for is obtained as (11) where denotes the likelihood ratio in (4) atth frame....

    [...]

Journal ArticleDOI
TL;DR: In this article, a system which utilizes a minimum mean square error (MMSE) estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm.
Abstract: This paper focuses on the class of speech enhancement systems which capitalize on the major importance of the short-time spectral amplitude (STSA) of the speech signal in its perception. A system which utilizes a minimum mean-square error (MMSE) STSA estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm. In this paper we derive the MMSE STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables. We analyze the performance of the proposed STSA estimator and compare it with a STSA estimator derived from the Wiener estimator. We also examine the MMSE STSA estimator under uncertainty of signal presence in the noisy observations. In constructing the enhanced signal, the MMSE STSA estimator is combined with the complex exponential of the noisy phase. It is shown here that the latter is the MMSE estimator of the complex exponential of the original phase, which does not affect the STSA estimation. The proposed approach results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise. The complexity of the proposed algorithm is approximately that of other systems in the discussed class.

3,905 citations

Journal Article
TL;DR: This paper derives a minimum mean-square error STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables, which results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise.
Abstract: Absstroct-This paper focuses on the class of speech enhancement systems which capitalize on the major importance of the short-time spectral amplitude (STSA) of the speech signal in its perception. A system which utilizes a minimum mean-square error (MMSE) STSA estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the \" spectral subtraction \" algorithm. In this paper we derive the MMSE STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables. We analyze the performance of the proposed STSA estimator and compare it with a STSA estimator derived from the Wiener estimator. We also examine the MMSE STSA estimator under uncertainty of signal presence in the noisy observations. In constructing the enhanced signal, the MMSE STSA estimator is combined with the complex exponential of the noisy phase. It is shown here that the latter is the MMSE estimator of the complex exponential of the original phase, which does not affect the STSA estimation. The proposed approach results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise. The complexity of the proposed algorithm is approximately that of other systems in the discussed class.

2,714 citations


"A statistical model-based voice act..." refers background or methods in this paper

  • ...…as follows: (5) Substituting (5) into (4) and applying the LRT yields the Itakura–Saito distortion (ISD) based decision rule [2], i.e., (6) Note that the left-hand side of (6) can not be smaller than zero, which is the well-known property of ISD and implies that the likelihood ratio is biased to ....

    [...]

  • ...The likelihood ratio for theth frequency band is (3) where and , and they are called thea priori and a posteriori signal-to-noise ratios (SNR’s), respectively [3]....

    [...]

  • ...In this letter, we further optimize the decision rule by employing the decision-directed (DD) method for the estimation of the unknown parameters [3]....

    [...]

  • ...We adopt the Gaussian statistical model that the DFT coefficients of each process are asymptotically independent Gaussian random variables [3]....

    [...]

Journal ArticleDOI
TL;DR: A study of the noise suppression technique proposed by Ephraim and Malah (1984,1985) and it is demonstrated how this artifact is actually eliminated without bringing distortion to the recorded signal even if the noise is only poorly stationary.
Abstract: Presents a study of the noise suppression technique proposed by Ephraim and Malah (1984,1985). This technique has been used previously for the restoration of degraded audio recordings because it is free of the frequently encountered 'musical noise' artifact. It is demonstrated how this artifact is actually eliminated without bringing distortion to the recorded signal even if the noise is only poorly stationary. >

578 citations


"A statistical model-based voice act..." refers methods in this paper

  • ...The DD method of (7) provides smoother estimates of the a priori SNR than the ML method [4], and consequently reduces the fluctuation of the estimated likelihood ratios during noise-only periods....

    [...]

Proceedings ArticleDOI
12 May 1998
TL;DR: A novel noise spectrum adaptation algorithm using the soft decision information of the proposed decision rule is developed, which is robust, especially for the time-varying noise such as babble noise.
Abstract: In this paper, a voice activity detector (VAD) for variable rate speech coding is decomposed into two parts, a decision rule and a background noise statistic estimator, which are analysed separately by applying a statistical model. A robust decision rule is derived from the generalized likelihood ratio test by assuming that the noise statistics are known a priori. To estimate the time-varying noise statistics, allowing for the occasional presence of the speech signal, a novel noise spectrum adaptation algorithm using the soft decision information of the proposed decision rule is developed. The algorithm is robust, especially for the time-varying noise such as babble noise.

196 citations