Author

Kun Jung Park

Bio: Kun Jung Park is an academic researcher. The author has contributed to research in topics: Noise & Frequency domain. The author has an h-index of 1 and has co-authored 1 publication receiving 144 citations.

Papers
Journal ArticleDOI
TL;DR: A new voice activity detection (VAD) algorithm is proposed for estimating the spectrum of car noise in which noise is filtered out in the frequency domain, which can prevent incorrect detections caused by unvoiced or nasal sounds with high frequency components being covered by car noise with low frequency components.
Abstract: A new voice activity detection (VAD) algorithm is proposed for estimating the spectrum of car noise in which noise is filtered out in the frequency domain. The proposed algorithm uses the log energy parameters which are composed of two parts in the critical band. The algorithm detects the noise period by applying two adaptive thresholds to each part. Using the noise period we can reliably estimate the time-varying noise characteristics. The advantage of the proposed technique is that it can prevent incorrect detections caused by unvoiced or nasal sounds with high frequency components being covered by car noise with low frequency components. The algorithm is suitable for real time implementation with one microphone. Also, a speaker independent speech recognition system has been implemented for car navigation using a fixed point Oak DSP system, which incorporates the proposed VAD algorithm. The system enhanced the recognition rates for 12 isolated command words to 94.52%, compared with the 80.7% of the baseline recogniser.

157 citations
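
The abstract above describes detecting noise periods from log energies split into two parts, each with its own adaptive threshold. The following is only a minimal sketch of that general idea, not the authors' implementation: the function names, the rule that seeds the noise floors from the first few frames, and the values of `margin_db`, `alpha` and `init_frames` are all assumptions made for illustration.

```python
import math


def log_energy(samples):
    """Log energy (dB) of one frame; the small floor avoids log(0)."""
    return 10.0 * math.log10(sum(s * s for s in samples) / len(samples) + 1e-12)


def detect_noise_frames(low_db, high_db, margin_db=6.0, alpha=0.9, init_frames=5):
    """Return a list of booleans, True where a frame is judged noise-only.

    low_db / high_db: per-frame log energies of the low and high parts.
    margin_db, alpha, init_frames: illustrative values, not from the paper.
    """
    if not low_db:
        return []
    # Assumption: the first few frames are noise-only and seed both floors.
    n0 = min(init_frames, len(low_db))
    floor_low = sum(low_db[:n0]) / n0
    floor_high = sum(high_db[:n0]) / n0
    is_noise = []
    for lo, hi in zip(low_db, high_db):
        # A frame counts as noise only if BOTH parts stay under their
        # thresholds, so a sound with mostly high-frequency energy is still
        # treated as speech when the low part is dominated by car noise.
        noise = lo < floor_low + margin_db and hi < floor_high + margin_db
        if noise:
            # Track the time-varying noise level only during noise periods.
            floor_low = alpha * floor_low + (1 - alpha) * lo
            floor_high = alpha * floor_high + (1 - alpha) * hi
        is_noise.append(noise)
    return is_noise
```

Keeping a separate floor for the high part is what lets a frame with weak low-band energy but raised high-band energy escape the noise label in this sketch.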


Cited by
Journal ArticleDOI
TL;DR: A new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems is presented, which formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a high discriminating decision rule and minimizing the average number of decision errors.

412 citations
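
One way to read "comparing the long-term spectral envelope to the average noise spectrum" is sketched below. This is an assumption-laden illustration rather than the paper's code: the envelope is taken as the per-bin maximum magnitude over a window of neighbouring frames, and the comparison is an average power ratio in dB. The function name, the `order` parameter and the requirement that every noise bin be positive are hypothetical choices.

```python
import math


def long_term_spectral_divergence(spectra, noise, t, order=2):
    """Divergence (dB) between a long-term envelope around frame t and noise.

    spectra: list of per-frame magnitude spectra (lists of equal length).
    noise:   average noise magnitude spectrum, every bin assumed > 0.
    order:   frames taken on each side of t (illustrative default).
    """
    lo = max(0, t - order)
    hi = min(len(spectra), t + order + 1)
    n_bins = len(noise)
    # Long-term spectral envelope: per-bin maximum over the window.
    envelope = [max(spectra[j][k] for j in range(lo, hi)) for k in range(n_bins)]
    # Average ratio of envelope power to noise power across bins.
    ratio = sum(envelope[k] ** 2 / noise[k] ** 2 for k in range(n_bins)) / n_bins
    return 10.0 * math.log10(ratio)
```

A frame would then be labelled speech when this value exceeds a threshold chosen for the expected noise level.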

Book ChapterDOI
01 Jun 2007
TL;DR: This chapter shows a comprehensive approximation to the main challenges in voice activity detection, the different solutions that have been reported in a complete review of the state of the art and the evaluation frameworks that are normally used.
Abstract: An important drawback affecting most of the speech processing systems is the environmental noise and its harmful effect on the system performance. Examples of such systems are the new wireless communications voice services or digital hearing aid devices. In speech recognition, there are still technical barriers inhibiting such systems from meeting the demands of modern applications. Numerous noise reduction techniques have been developed to palliate the effect of the noise on the system performance and often require an estimate of the noise statistics obtained by means of a precise voice activity detector (VAD). Speech/non-speech detection is an unsolved problem in speech processing and affects numerous applications including robust speech recognition (Karray and Marting, 2003; Ramirez et al., 2003), discontinuous transmission (ITU, 1996; ETSI, 1999), real-time speech transmission on the Internet (Sangwan et al., 2002) or combined noise reduction and echo cancellation schemes in the context of telephony (Basbug et al., 2004; Gustafsson et al., 2002). The speech/non-speech classification task is not as trivial as it appears, and most of the VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech on a noisy signal (Sohn et al., 1999; Cho and Kondoz, 2001; Gazor and Zhang, 2003; Armani et al., 2003) and have evaluated the influence of the VAD effectiveness on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1995). Most of the approaches have focussed on the development of robust algorithms with special attention being paid to the derivation and study of noise robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002).
The different VAD methods include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measure (Tucker, 1992), higher order statistics in the LPC residual domain (Nemer et al., 2001) or combinations of different features (ITU, 1993; ETSI, 1999; Tanyer and Ozer, 2000). This chapter shows a comprehensive approximation to the main challenges in voice activity detection, the different solutions that have been reported in a complete review of the state of the art and the evaluation frameworks that are normally used. The application of VADs for speech coding, speech enhancement and robust speech recognition systems is shown and discussed. Three different VAD methods are described and compared to standardized and

256 citations

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel, data-driven approach to voice activity detection based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features is presented, clearly outperforming three state-of-the-art reference algorithms under the same conditions.
Abstract: A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye corpora, and adding real long term recordings of diverse noise types. The approach is evaluated on unseen synthetically mixed test data as well as a real-life test set consisting of four full-length Hollywood movies. A frame-wise Equal Error Rate (EER) of 33.2% is obtained for the four movies and an EER of 9.6% is obtained for the synthetic test data at a peak SNR of 0 dB, clearly outperforming three state-of-the-art reference algorithms under the same conditions.

236 citations
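
The results above are reported as a frame-wise Equal Error Rate (EER), the operating point at which the miss rate equals the false-alarm rate. As a hedged illustration of how such a figure can be approximated from per-frame scores, here is a simple threshold sweep; the function name and the convention that higher scores mean "more speech-like" are assumptions, and this is not the evaluation code used in the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from per-frame scores and labels (1 = speech, 0 = non-speech)."""
    n_speech = sum(labels)
    n_non = len(labels) - n_speech
    if n_speech == 0 or n_non == 0:
        raise ValueError("need both speech and non-speech frames")
    best_gap, eer = None, None
    # Try every observed score as a threshold, plus one above all scores.
    for thr in sorted(set(scores)) + [float("inf")]:
        # A frame is declared speech when its score is >= thr.
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        miss_rate = misses / n_speech
        fa_rate = false_alarms / n_non
        gap = abs(miss_rate - fa_rate)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss_rate + fa_rate) / 2
    return eer
```

With a finite set of frames the two rates rarely coincide exactly, so the sketch returns their average at the closest crossing point.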

Journal ArticleDOI
TL;DR: This letter presents a new voice activity detector (VAD) for improving speech detection robustness in noisy environments and the performance of speech recognition systems using an optimum likelihood ratio test (LRT) involving multiple and independent observations.
Abstract: Currently, there are technology barriers inhibiting speech processing systems that work in extremely noisy conditions from meeting the demands of modern applications. This letter presents a new voice activity detector (VAD) for improving speech detection robustness in noisy environments and the performance of speech recognition systems. The algorithm defines an optimum likelihood ratio test (LRT) involving multiple and independent observations. The so-defined decision rule reports significant improvements in speech/nonspeech discrimination accuracy over existing VAD methods that are defined on a single observation and need empirically tuned hangover mechanisms. The algorithm has an inherent delay that, for several applications, including robust speech recognition, does not represent a serious implementation obstacle. An analysis of the overlap between the distributions of the decision variable shows the improved robustness of the proposed approach by means of a clear reduction of the classification error as the number of observations is increased. The proposed strategy is also compared to different VAD methods, including the G.729, AMR, and AFE standards, as well as recently reported algorithms showing a sustained advantage in speech/nonspeech detection accuracy and speech recognition performance.

191 citations
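
For independent observations, a joint likelihood ratio is the product of the individual ratios, so its logarithm is a sum. The sketch below applies that generic fact to a sliding window of per-frame log-likelihood ratios. It is an illustration under stated assumptions, not the letter's algorithm: the per-frame ratios are taken as given inputs, and the function name, window half-width `m` and `threshold` are hypothetical.

```python
def multiple_observation_lrt(frame_llrs, m=2, threshold=0.0):
    """Decide speech/non-speech per frame from a window of 2*m + 1 frames.

    frame_llrs: per-frame log-likelihood ratios (speech vs. non-speech),
                assumed to be computed elsewhere and treated as independent.
    m:          frames used on each side; the look-ahead of m frames is the
                source of the inherent delay mentioned in the abstract.
    """
    n = len(frame_llrs)
    decisions = []
    for t in range(n):
        lo = max(0, t - m)
        hi = min(n, t + m + 1)
        # Independence assumption: joint log ratio = sum of per-frame values.
        joint = sum(frame_llrs[lo:hi])
        decisions.append(joint > threshold)
    return decisions
```

Setting `m=0` reduces this to a single-observation test, which makes the smoothing effect of the additional observations easy to compare on the same input.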

Proceedings ArticleDOI
04 May 2014
TL;DR: CNNs are used as acoustic models for speech activity detection (SAD) on data collected over noisy radio communication channels to illustrate that CNNs have a considerable advantage in fast adaptation for acoustic modeling in these settings.
Abstract: Convolutional neural networks (CNN) are extensions to deep neural networks (DNN) which are used as alternate acoustic models with state-of-the-art performances for speech recognition. In this paper, CNNs are used as acoustic models for speech activity detection (SAD) on data collected over noisy radio communication channels. When these SAD models are tested on audio recorded from radio channels not seen during training, there is severe performance degradation. We attribute this degradation to mismatches between the two dimensional filters learnt in the initial CNN layers and the novel channel data. Using a small amount of supervised data from the novel channels, the filters can be adapted to provide significant improvements in SAD performance. In mismatched acoustic conditions, the adapted models provide significant improvements (about 10-25%) relative to conventional DNN-based SAD systems. These results illustrate that CNNs have a considerable advantage in fast adaptation for acoustic modeling in these settings.

110 citations