A statistical model-based voice activity detection

doi:10.1109/97.736233

Journal Article•DOI•

Jongseo Sohn¹, Nam Soo Kim, Wonyong Sung¹•Institutions (1)

Seoul National University¹

01 Jan 1999-IEEE Signal Processing Letters (IEEE)-Vol. 6, Iss: 1, pp 1-3

TL;DR: A robust voice activity detector (VAD) is developed that uses decision-directed parameter estimation for the likelihood ratio test, together with an effective hang-over scheme that accounts for previous observations through first-order Markov process modeling of speech occurrences. The proposed VAD performs significantly better than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.

Abstract: In this letter, we develop a robust voice activity detector (VAD) for application to variable-rate speech coding. The developed VAD employs the decision-directed parameter estimation method for the likelihood ratio test. In addition, we propose an effective hang-over scheme which considers the previous observations by a first-order Markov process modeling of speech occurrences. According to our simulation results, the proposed VAD shows significantly better performance than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.

Citations

Journal Article•DOI•

Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging

Israel Cohen¹•Institutions (1)

Technion – Israel Institute of Technology¹

26 Aug 2003-IEEE Transactions on Speech and Audio Processing

TL;DR: In this article, an improved minima controlled recursive averaging (IMCRA) approach is proposed for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR).

Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach, for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.
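
The recursive averaging step described in this abstract can be sketched as follows. This is a minimal illustration rather than the full IMCRA procedure: it assumes a per-bin speech presence probability is already available, and it leaves out the two iterations of smoothing and minimum tracking that produce that probability. The function name `update_noise_psd` and the base smoothing constant `alpha_d` are assumptions made for the example.

```python
def update_noise_psd(noise_psd, noisy_power, speech_prob, alpha_d=0.85):
    """One recursive-averaging step of the noise estimate, per frequency bin.

    noise_psd   : previous noise power estimate for each bin
    noisy_power : squared magnitude of the current noisy spectrum for each bin
    speech_prob : speech presence probability for each bin, in [0, 1]
    alpha_d     : base smoothing constant (hypothetical value for this sketch)
    """
    updated = []
    for lam, y2, p in zip(noise_psd, noisy_power, speech_prob):
        # The smoothing parameter is time-varying and frequency-dependent:
        # it moves toward 1 as speech becomes more likely in this bin,
        # which freezes the noise estimate there.
        alpha = alpha_d + (1.0 - alpha_d) * p
        updated.append(alpha * lam + (1.0 - alpha) * y2)
    return updated


if __name__ == "__main__":
    # Bin 0: speech absent, so the estimate moves toward the observed power.
    # Bin 1: speech present, so the estimate is held.
    print(update_noise_psd([1.0, 1.0], [4.0, 4.0], [0.0, 1.0]))
```

When the probability is 1 the previous estimate is kept unchanged, and when it is 0 the update reduces to plain exponential averaging with constant `alpha_d`.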

902 citations

Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging

Israel Cohen

01 Jan 2002

TL;DR: It is shown that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective: compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system it achieves improved speech quality and lower residual noise.

Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper, we present an Improved Minima Controlled Recursive Averaging (IMCRA) approach, for noise estimation in adverse environments involving non-stationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in non-stationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.

834 citations

Cites background from "A statistical model-based voice act..."

...Additionally, VADs are generally difficult to tune and their reliability severely deteriorates for weak speech components and low input SNR [15], [16], [20]....

Journal Article•DOI•

Subjective comparison and evaluation of speech enhancement algorithms

Yi Hu¹, Philipos C. Loizou¹•Institutions (1)

University of Texas at Dallas¹

01 Jul 2007-Speech Communication

TL;DR: A noisy speech corpus suitable for the evaluation of speech enhancement algorithms is developed; the evaluation encompasses four classes of algorithms: spectral subtractive, subspace, statistical-model based, and Wiener-type.

634 citations

Cites methods from "A statistical model-based voice act..."

...More precisely, a statistical-model based voice activity detector (VAD) (Sohn et al., 1999) was used to update the noise spectrum during speech-absent periods....
...This was surprising at first, but close analysis indicated that the logMMSE-SPU algorithm was sensitive to the noise spectrum estimate, which in our case was obtained with a VAD algorithm....
...The frame windowing scheme proposed in (Jabloun and Champagne, 2003) was adopted in both VAD methods....
...The following VAD decision rule was used: $\frac{1}{L}\sum_{k=1}^{L-1} \log \Lambda_k \gtrless$ ...
...(3) Incorporating noise estimation algorithms in place of VAD algorithms for updating the noise spectrum did not produce significant improvements in performance....
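
The excerpts above describe averaging per-bin log likelihood ratios into a frame-level decision and then updating the noise spectrum only during speech-absent frames. A minimal sketch of that flow is given below. The threshold, the smoothing factor `beta`, and both function names are assumptions for illustration, since the excerpts do not state them, and the average here runs over whatever bins are passed in rather than reproducing the exact summation limits of the cited rule.

```python
def vad_decision(log_lr_per_bin, threshold):
    """Return True (speech present) when the mean per-bin log likelihood
    ratio exceeds the threshold. The threshold value is a tuning choice."""
    mean_log_lr = sum(log_lr_per_bin) / len(log_lr_per_bin)
    return mean_log_lr > threshold


def update_noise_if_silent(noise_psd, noisy_power, speech_present, beta=0.98):
    """Update the noise spectrum only in speech-absent frames.

    beta is a hypothetical exponential smoothing factor; in speech-present
    frames the previous estimate is returned unchanged.
    """
    if speech_present:
        return list(noise_psd)
    return [beta * lam + (1.0 - beta) * y2
            for lam, y2 in zip(noise_psd, noisy_power)]


if __name__ == "__main__":
    frame_log_lrs = [0.0, 0.1]            # weak evidence for speech
    speech = vad_decision(frame_log_lrs, threshold=0.15)
    print(speech, update_noise_if_silent([2.0], [4.0], speech, beta=0.5))
```

Because the noise estimate is frozen whenever the detector reports speech, any VAD error feeds directly into the noise spectrum, which is consistent with the sensitivity noted in the second excerpt.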

Journal Article•DOI•

Speech enhancement for non-stationary noise environments

Israel Cohen, Baruch Berdugo

01 Nov 2001-Signal Processing

TL;DR: An optimally-modified log-spectral amplitude (OM-LSA) speech estimator and a minima controlled recursive averaging (MCRA) noise estimation approach for robust speech enhancement are presented.

569 citations

Cites methods from "A statistical model-based voice act..."

...soft-decision speech pause detection is either implemented on a frame-by-frame basis [12, 22] or estimated independently for individual subbands using an a posteriori signal-to-noise ratio (SNR) [11, 13]....

Journal Article•DOI•

Speaker Recognition by Machines and Humans: A tutorial review

John H. L. Hansen¹, Taufiq Hasan¹•Institutions (1)

University of Texas at Dallas¹

14 Oct 2015-IEEE Signal Processing Magazine

TL;DR: A review of the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems, concluding with a comparative study of human versus machine speaker recognition.

Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition, with ever-improving performance, to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts: the first part involves forensic speaker-recognition methods, and the second illustrates how a naïve listener performs this task from a neuroscience perspective. We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.

554 citations

References

Book•

Fundamentals of speech recognition

Lawrence R. Rabiner, Biing-Hwang Juang

01 Jan 1993

TL;DR: This book covers the fundamentals of speech recognition, from the speech signal and signal processing and analysis methods through pattern comparison techniques, hidden Markov models, connected word models, and large vocabulary continuous speech recognition, to task-oriented applications.

Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations

"A statistical model-based voice act..." refers methods in this paper

...frame state model, the current state depends on the previous observations as well as the current one, which is reflected on the decision rule in the following way: Based on the above formulations, a recursive formula for is obtained as (11) where denotes the likelihood ratio in (4) at the $n$th frame....
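
The recursion referred to in this excerpt is not reproduced on this page, so the sketch below is derived independently from the standard forward recursion for a two-state, first-order Markov chain, and should be read as an illustration under that assumption rather than as the paper's equation (11). The quantity tracked is the ratio of forward probabilities for the speech and non-speech states; the transition probabilities `a01` (non-speech to speech) and `a10` (speech to non-speech), the initial ratio `gamma0`, and the function name are all hypothetical.

```python
def hangover_ratios(frame_lrs, a01, a10, gamma0=1.0):
    """Propagate a state-probability ratio across frames.

    frame_lrs : per-frame likelihood ratios p(X_n | speech) / p(X_n | noise)
    a01       : assumed probability of moving from non-speech to speech
    a10       : assumed probability of moving from speech to non-speech
    gamma0    : assumed initial ratio before the first frame
    """
    a00 = 1.0 - a01
    a11 = 1.0 - a10
    gamma = gamma0
    ratios = []
    for lr in frame_lrs:
        # Forward recursion for both states, written directly as a ratio:
        # the prior odds carried over from the previous frame are
        # multiplied by the current frame's likelihood ratio.
        gamma = (a01 + a11 * gamma) / (a00 + a10 * gamma) * lr
        ratios.append(gamma)
    return ratios


if __name__ == "__main__":
    # One strongly speech-like frame followed by two ambiguous frames.
    print(hangover_ratios([10.0, 1.0, 1.0], a01=0.1, a10=0.1))
```

With "sticky" states (small `a01` and `a10`), the ratio stays above 1 for several frames after a strong speech frame and decays gradually, which is the hang-over behavior. Setting both transition probabilities to 0.5 removes the memory, and the output equals the per-frame likelihood ratios.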

Journal Article•DOI•

Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator

Yariv Ephraim¹, David Malah²•Institutions (2)

Stanford University¹, Technion – Israel Institute of Technology²

01 Dec 1984-IEEE Transactions on Acoustics, Speech, and Signal Processing

TL;DR: In this article, a system which utilizes a minimum mean square error (MMSE) estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm.

Abstract: This paper focuses on the class of speech enhancement systems which capitalize on the major importance of the short-time spectral amplitude (STSA) of the speech signal in its perception. A system which utilizes a minimum mean-square error (MMSE) STSA estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm. In this paper we derive the MMSE STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables. We analyze the performance of the proposed STSA estimator and compare it with a STSA estimator derived from the Wiener estimator. We also examine the MMSE STSA estimator under uncertainty of signal presence in the noisy observations. In constructing the enhanced signal, the MMSE STSA estimator is combined with the complex exponential of the noisy phase. It is shown here that the latter is the MMSE estimator of the complex exponential of the original phase, which does not affect the STSA estimation. The proposed approach results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise. The complexity of the proposed algorithm is approximately that of other systems in the discussed class.

3,905 citations

Journal Article•

Speech enhancement using a minimum mean square error short-time spectral amplitude estimator

Ephraim

01 Jan 1984-IEEE Transactions on Acoustics, Speech, and Signal Processing

TL;DR: This paper derives a minimum mean-square error STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables, which results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise.

Abstract: This paper focuses on the class of speech enhancement systems which capitalize on the major importance of the short-time spectral amplitude (STSA) of the speech signal in its perception. A system which utilizes a minimum mean-square error (MMSE) STSA estimator is proposed and then compared with other widely used systems which are based on Wiener filtering and the "spectral subtraction" algorithm. In this paper we derive the MMSE STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian random variables. We analyze the performance of the proposed STSA estimator and compare it with a STSA estimator derived from the Wiener estimator. We also examine the MMSE STSA estimator under uncertainty of signal presence in the noisy observations. In constructing the enhanced signal, the MMSE STSA estimator is combined with the complex exponential of the noisy phase. It is shown here that the latter is the MMSE estimator of the complex exponential of the original phase, which does not affect the STSA estimation. The proposed approach results in a significant reduction of the noise, and provides enhanced speech with colorless residual noise. The complexity of the proposed algorithm is approximately that of other systems in the discussed class.

2,714 citations

"A statistical model-based voice act..." refers background or methods in this paper

...as follows: (5) Substituting (5) into (4) and applying the LRT yields the Itakura–Saito distortion (ISD) based decision rule [2], i.e., (6) Note that the left-hand side of (6) cannot be smaller than zero, which is the well-known property of ISD and implies that the likelihood ratio is biased to ....
...The likelihood ratio for the $k$th frequency band is (3) where and , and they are called the a priori and a posteriori signal-to-noise ratios (SNRs), respectively [3]....
...In this letter, we further optimize the decision rule by employing the decision-directed (DD) method for the estimation of the unknown parameters [3]....
...We adopt the Gaussian statistical model that the DFT coefficients of each process are asymptotically independent Gaussian random variables [3]....
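
The equations referenced in these excerpts are elided on this page. As a hedged illustration, the sketch below uses the textbook form of the per-bin likelihood ratio under a complex Gaussian model, written in terms of an a priori SNR `xi` and an a posteriori SNR `gamma`, and assumes the maximum-likelihood a priori SNR estimate takes the usual form `gamma - 1`. Under those assumptions the averaged log likelihood ratio becomes an Itakura–Saito-type statistic that is never negative, matching the property noted in the first excerpt. Function names are hypothetical.

```python
import math


def log_likelihood_ratio(gamma, xi):
    """Log likelihood ratio for one frequency bin under a complex Gaussian
    model: log( exp(gamma * xi / (1 + xi)) / (1 + xi) )."""
    return gamma * xi / (1.0 + xi) - math.log1p(xi)


def isd_statistic(gammas):
    """Mean log likelihood ratio when xi is replaced by gamma - 1 in every
    bin; each term reduces to gamma - log(gamma) - 1, which is >= 0."""
    return sum(g - math.log(g) - 1.0 for g in gammas) / len(gammas)


if __name__ == "__main__":
    # Substituting xi = gamma - 1 gives the Itakura-Saito-type term.
    print(log_likelihood_ratio(3.0, 2.0), 3.0 - math.log(3.0) - 1.0)
    # The statistic is zero only when every a posteriori SNR equals 1.
    print(isd_statistic([1.0, 1.0]), isd_statistic([0.5, 2.0]))
```

Since the statistic is nonnegative even for noise-only frames, a threshold at zero would always favor speech, which is one motivation for replacing the maximum-likelihood estimate with a smoother one.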

Journal Article•DOI•

Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor

Olivier Cappé¹•Institutions (1)

Télécom ParisTech¹

01 Apr 1994-IEEE Transactions on Speech and Audio Processing

TL;DR: A study of the noise suppression technique proposed by Ephraim and Malah (1984, 1985) demonstrates how the "musical noise" artifact is eliminated without bringing distortion to the recorded signal, even if the noise is only poorly stationary.

Abstract: Presents a study of the noise suppression technique proposed by Ephraim and Malah (1984, 1985). This technique has been used previously for the restoration of degraded audio recordings because it is free of the frequently encountered 'musical noise' artifact. It is demonstrated how this artifact is actually eliminated without bringing distortion to the recorded signal even if the noise is only poorly stationary.

578 citations

"A statistical model-based voice act..." refers methods in this paper

...The DD method of (7) provides smoother estimates of the a priori SNR than the ML method [4], and consequently reduces the fluctuation of the estimated likelihood ratios during noise-only periods....
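
The smoothing effect mentioned in this excerpt can be illustrated with a small sketch. It follows the general shape of a decision-directed rule: a weighted sum of the previous frame's estimated clean-speech SNR and the current floored value `gamma - 1`. Two simplifications are assumptions of this example and not taken from the source: the previous amplitude estimate is obtained with a Wiener gain instead of an MMSE amplitude estimator, and the noise variance is treated as constant. The weight `alpha` and the function names are hypothetical.

```python
def ml_prior_snr(gammas):
    """Floored ML-style a priori SNR estimate for each frame of one bin."""
    return [max(g - 1.0, 0.0) for g in gammas]


def dd_prior_snr(gammas, alpha=0.98, xi_init=0.0):
    """Decision-directed-style a priori SNR estimate for one bin.

    gammas : a posteriori SNR of this bin in successive frames
    alpha  : weight given to the previous frame's estimate (assumed value)
    """
    xi_prev, gamma_prev = xi_init, 0.0
    estimates = []
    for g in gammas:
        # Wiener gain stands in for the amplitude estimator (assumption).
        gain = xi_prev / (1.0 + xi_prev)
        # Previous clean-speech power divided by the noise variance.
        prev_clean_snr = gain * gain * gamma_prev
        xi = alpha * prev_clean_snr + (1.0 - alpha) * max(g - 1.0, 0.0)
        estimates.append(xi)
        xi_prev, gamma_prev = xi, g
    return estimates


if __name__ == "__main__":
    noise_only = [0.2, 3.0, 0.5, 2.5, 0.1, 4.0]  # fluctuating, no speech
    print(ml_prior_snr(noise_only))
    print(dd_prior_snr(noise_only))
```

On a noise-only sequence like the one above, the floored ML-style values jump between 0 and several units from frame to frame, while the decision-directed-style values stay close to zero, so likelihood ratios computed from them fluctuate far less.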

Proceedings Article•DOI•

A voice activity detector employing soft decision based noise spectrum adaptation

Jongseo Sohn¹, Wonyong Sung¹•Institutions (1)

Seoul National University¹

12 May 1998

TL;DR: A novel noise spectrum adaptation algorithm using the soft decision information of the proposed decision rule is developed, which is robust, especially for the time-varying noise such as babble noise.

Abstract: In this paper, a voice activity detector (VAD) for variable rate speech coding is decomposed into two parts, a decision rule and a background noise statistic estimator, which are analysed separately by applying a statistical model. A robust decision rule is derived from the generalized likelihood ratio test by assuming that the noise statistics are known a priori. To estimate the time-varying noise statistics, allowing for the occasional presence of the speech signal, a novel noise spectrum adaptation algorithm using the soft decision information of the proposed decision rule is developed. The algorithm is robust, especially for the time-varying noise such as babble noise.

196 citations