SciSpace (formerly Typeset)
Author

Tae Young Yang

Bio: Tae Young Yang is an academic researcher from Yonsei University. The author has contributed to research in topics: Formant & Codebook. The author has an h-index of 1 and has co-authored 5 publications receiving 145 citations.

Papers
Journal ArticleDOI
TL;DR: A new voice activity detection (VAD) algorithm is proposed that estimates the spectrum of car noise so the noise can be filtered out in the frequency domain; it prevents incorrect detections caused by unvoiced or nasal sounds whose high-frequency components would otherwise be masked by the low-frequency components of car noise.
Abstract: A new voice activity detection (VAD) algorithm is proposed for estimating the spectrum of car noise so that the noise can be filtered out in the frequency domain. The proposed algorithm uses log energy parameters computed over two parts of the critical band. The algorithm detects noise periods by applying two adaptive thresholds, one to each part. Using the noise periods, the time-varying noise characteristics can be reliably estimated. The advantage of the proposed technique is that it prevents incorrect detections caused by unvoiced or nasal sounds whose high-frequency components would otherwise be masked by the low-frequency components of car noise. The algorithm is suitable for real-time implementation with one microphone. A speaker-independent speech recognition system incorporating the proposed VAD algorithm has also been implemented for car navigation using a fixed-point Oak DSP system. The system raised the recognition rate for 12 isolated command words to 94.52%, compared with 80.7% for the baseline recogniser.
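The two-band, adaptive-threshold idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the 1 kHz band split, the threshold margin, and the smoothing factor are all assumptions.

```python
import numpy as np

def band_log_energies(frame, sr=8000, split_hz=1000):
    """Split a frame's power spectrum at split_hz; return low/high log energies."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    low = spec[freqs < split_hz].sum()
    high = spec[freqs >= split_hz].sum()
    return np.log(low + 1e-12), np.log(high + 1e-12)

def detect_noise_frames(frames, sr=8000, margin=2.0, alpha=0.95):
    """Label frames as noise when BOTH band energies stay below their thresholds.

    A frame counts as speech if EITHER band exceeds its threshold, so a
    high-frequency unvoiced sound is detected even when the low band is
    dominated by car noise.
    """
    labels = []
    thr_low = thr_high = None
    for frame in frames:
        e_low, e_high = band_log_energies(frame, sr)
        if thr_low is None:  # initialise thresholds from the first frame
            thr_low, thr_high = e_low + margin, e_high + margin
        is_noise = e_low < thr_low and e_high < thr_high
        if is_noise:  # track the noise floor only during noise periods
            thr_low = alpha * thr_low + (1 - alpha) * (e_low + margin)
            thr_high = alpha * thr_high + (1 - alpha) * (e_high + margin)
        labels.append(is_noise)
    return labels
```

The frames labelled as noise can then feed a running estimate of the noise spectrum, as the abstract describes.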

157 citations

Journal Article
TL;DR: The spectral subtraction technique is combined with noise-robust features for further performance enhancement; the experimental results show that SMC and root-based mel cepstrum (root_mel cepstrum) give 9.86% and 12.68% recognition enhancement at 10 dB compared with the LPCC (Linear Prediction Cepstral Coefficient).
Abstract: This paper compares the recognition performance of feature vectors known to be robust to environmental noise. The spectral subtraction technique is then combined with the noise-robust features for further performance enhancement. Experiments using SMC (Short-time Modified Coherence) analysis, root cepstral analysis, LDA (Linear Discriminant Analysis), PLP (Perceptual Linear Prediction), and RASTA (RelAtive SpecTrAl) processing are carried out. An isolated word recognition system is built using semi-continuous HMMs. Noisy-environment experiments using two types of noise (exhibition hall and computer room) are carried out at 0, 10, and 20 dB SNR. The experimental results show that SMC and root-based mel cepstrum (root_mel cepstrum) give 9.86% and 12.68% recognition enhancement at 10 dB compared with the LPCC (Linear Prediction Cepstral Coefficient). When combined with spectral subtraction, mel cepstrum and root_mel cepstrum reach enhanced recognition rates of 94.91% and 94.28% at 10 dB, improvements of 16.7% and 8.4%.
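The spectral subtraction step combined with the features above can be sketched as a single-frame, magnitude-domain version. This is a simplified illustration; the spectral-floor value and the way the noise estimate is obtained are assumptions, not the paper's settings.

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from a noisy frame.

    noisy:     time-domain frame
    noise_mag: estimated noise magnitude spectrum (e.g. averaged over
               frames flagged as noise by a VAD)
    floor:     spectral floor that limits musical-noise artefacts
    """
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    # rectify the subtracted magnitudes against a small fraction of the input
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))
```

The cleaned frames would then be passed through whichever robust front end (SMC, root_mel cepstrum, ...) is being evaluated.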

1 citation

Proceedings ArticleDOI
03 Oct 1996
TL;DR: The proposed method is effective particularly when there exists a large mismatch between the reference codebook and a target speaker in feature space and the average recognition rate of Bayesian adaptation is improved.
Abstract: The paper describes a codebook adaptation process improving the performance of speaker adaptation. The proposed method is performed prior to the Bayesian speaker adaptation method, using the formant distribution of the adaptation data. The reference codebook is adapted to represent the formant distribution of a new speaker. The average recognition rate of Bayesian adaptation is improved from 91.4% to 95.1% using the proposed method. The proposed method is effective particularly when there is a large mismatch between the reference codebook and a target speaker in feature space. In such cases the average recognition rate is 95.0%, while 89.9% is obtained when only Bayesian adaptation is performed.

Proceedings Article
01 Jan 1998
TL;DR: To alleviate the problems due to fast or slow speech, a modification to the bounded duration modeling which accounts for speaking rate is described and the effectiveness of the proposed duration modeling scheme and the speaking rate compensation technique is shown.
Abstract: A duration modeling scheme and a speaking rate compensation technique are presented for an HMM-based connected digit recognizer. The proposed duration modeling technique uses a cumulative duration probability. The cumulative duration probability can also be used to obtain the duration bounds for bounded duration modeling. One advantage of the proposed technique is that the cumulative duration probability can be applied directly in the Viterbi decoding procedure without additional postprocessing; it thus governs the state and word transitions at each frame. To alleviate the problems caused by fast or slow speech, a modification to the bounded duration modeling which accounts for speaking rate is described. Experimental results on Korean connected digit recognition show the effectiveness of the proposed duration modeling scheme and the speaking rate compensation technique.
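The cumulative duration probability, and the duration bounds it yields for bounded duration modeling, can be sketched as follows. This is an illustrative reconstruction from the abstract; the tail probabilities used for the bounds are assumptions.

```python
import numpy as np

def cumulative_duration(counts):
    """Cumulative duration probability P(D <= d) from observed state-duration counts.

    counts[d] is how often a duration of d+1 frames was observed in training.
    """
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return np.cumsum(p)

def duration_bounds(cdf, lo=0.05, hi=0.95):
    """Duration bounds (in frames) for bounded duration modeling, from the CDF tails."""
    d_min = int(np.searchsorted(cdf, lo)) + 1
    d_max = int(np.searchsorted(cdf, hi)) + 1
    return d_min, d_max
```

During Viterbi decoding, the CDF value for the time already spent in a state could be folded into the transition score at each frame, which matches the abstract's point that no postprocessing pass is needed.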
Journal ArticleDOI
TL;DR: The proposed speaker adaptation algorithm using formant frequencies was implemented in two schemes and evaluated by speaker-independent, male speaker dependent, and female speaker-dependent recognition experiments.
Abstract: A speaker adaptation algorithm using formant frequencies is proposed. The formants extracted from the cepstral means in the reference codebook are iteratively shifted toward the formants of a test speaker, and the number of cepstral means selected at each iteration decreases as the iterations proceed. A rule for choosing the number of selected cepstral means and a formant-based distance measure are formulated. The proposed algorithm was implemented in two schemes and evaluated in speaker-independent, male speaker-dependent, and female speaker-dependent recognition experiments. A scheme combined with Bayesian adaptation obtained a 9.7% enhancement in average recognition accuracy in the speaker-independent experiments and 52.6% in the speaker-dependent experiments.
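The iterative shift of codebook formants toward a new speaker, with a shrinking selection, can be sketched roughly as below. The halving schedule, the step size, and the farthest-first selection here are assumptions for illustration, not the paper's actual selection rule or distance measure.

```python
import numpy as np

def adapt_formants(ref_formants, target_formants, iters=5, step=0.5):
    """Iteratively shift reference formant frequencies toward a target speaker's.

    ref_formants:    (n_codewords, n_formants) formant frequencies in Hz
    target_formants: (n_formants,) formants estimated from adaptation data
    """
    f = ref_formants.astype(float).copy()
    n = len(f)
    for it in range(iters):
        # distance of each codeword's formants to the target speaker's
        d = np.abs(f - target_formants).sum(axis=1)
        k = max(1, n >> it)          # shrink the selection each iteration
        sel = np.argsort(d)[-k:]     # adapt the most mismatched codewords
        f[sel] += step * (target_formants - f[sel])
    return f
```

In the paper the shifted formants would be mapped back into cepstral means before recognition; this sketch only shows the shrinking-selection shift itself.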

Cited by
Journal ArticleDOI
TL;DR: A new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems is presented, which formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a highly discriminating decision rule and minimizing the average number of decision errors.

412 citations

Book ChapterDOI
01 Jun 2007
TL;DR: This chapter gives a comprehensive overview of the main challenges in voice activity detection, the different solutions reported in a complete review of the state of the art, and the evaluation frameworks normally used.
Abstract: An important drawback affecting most speech processing systems is environmental noise and its harmful effect on system performance. Examples of such systems are the new wireless communications voice services or digital hearing aid devices. In speech recognition, there are still technical barriers inhibiting such systems from meeting the demands of modern applications. Numerous noise reduction techniques have been developed to palliate the effect of noise on system performance, and they often require an estimate of the noise statistics obtained by means of a precise voice activity detector (VAD). Speech/non-speech detection is an unsolved problem in speech processing and affects numerous applications including robust speech recognition (Karray and Martin, 2003; Ramirez et al., 2003), discontinuous transmission (ITU, 1996; ETSI, 1999), real-time speech transmission on the Internet (Sangwan et al., 2002) and combined noise reduction and echo cancellation schemes in the context of telephony (Basbug et al., 2004; Gustafsson et al., 2002). The speech/non-speech classification task is not as trivial as it appears, and most VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech in a noisy signal (Sohn et al., 1999; Cho and Kondoz, 2001; Gazor and Zhang, 2003; Armani et al., 2003) and have evaluated the influence of VAD effectiveness on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1995). Most of the approaches have focussed on the development of robust algorithms, with special attention paid to the derivation and study of noise-robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002).
The different VAD methods include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measures (Tucker, 1992), higher-order statistics in the LPC residual domain (Nemer et al., 2001) or combinations of different features (ITU, 1993; ETSI, 1999; Tanyer and Ozer, 2000). This chapter gives a comprehensive overview of the main challenges in voice activity detection, the different solutions reported in a complete review of the state of the art, and the evaluation frameworks normally used. The application of VADs to speech coding, speech enhancement and robust speech recognition systems is shown and discussed. Three different VAD methods are described and compared to standardized and

256 citations

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel, data-driven approach to voice activity detection based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features clearly outperforming three state-of-the-art reference algorithms under the same conditions.
Abstract: A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye corpora, and adding real long term recordings of diverse noise types. The approach is evaluated on unseen synthetically mixed test data as well as a real-life test set consisting of four full-length Hollywood movies. A frame-wise Equal Error Rate (EER) of 33.2% is obtained for the four movies and an EER of 9.6% is obtained for the synthetic test data at a peak SNR of 0 dB, clearly outperforming three state-of-the-art reference algorithms under the same conditions.

236 citations

Journal ArticleDOI
TL;DR: This letter presents a new voice activity detector (VAD) for improving speech detection robustness in noisy environments and the performance of speech recognition systems using an optimum likelihood ratio test (LRT) involving multiple and independent observations.
Abstract: Currently, there are technology barriers inhibiting speech processing systems that work in extremely noisy conditions from meeting the demands of modern applications. This letter presents a new voice activity detector (VAD) for improving speech detection robustness in noisy environments and the performance of speech recognition systems. The algorithm defines an optimum likelihood ratio test (LRT) involving multiple and independent observations. The so-defined decision rule reports significant improvements in speech/nonspeech discrimination accuracy over existing VAD methods that are defined on a single observation and need empirically tuned hangover mechanisms. The algorithm has an inherent delay that, for several applications, including robust speech recognition, does not represent a serious implementation obstacle. An analysis of the overlap between the distributions of the decision variable shows the improved robustness of the proposed approach by means of a clear reduction of the classification error as the number of observations is increased. The proposed strategy is also compared to different VAD methods, including the G.729, AMR, and AFE standards, as well as recently reported algorithms showing a sustained advantage in speech/nonspeech detection accuracy and speech recognition performance.
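The multiple-observation LRT can be sketched under the standard Gaussian model, in which the per-bin log likelihood ratio is a function of the a-priori and a-posteriori SNRs. This is a simplified illustration; the window length, the threshold, and the equal-weight aggregation over frames are assumptions, not the letter's exact formulation.

```python
import numpy as np

def multi_observation_lrt(snr_post, snr_prior, window=3, threshold=0.5):
    """Speech/non-speech decisions from a multiple-observation LRT.

    snr_post, snr_prior: (n_frames, n_bins) a-posteriori / a-priori SNRs.
    Per-bin log likelihood ratios under the Gaussian model are averaged
    over bins, then summed over 2*window+1 neighbouring frames so that
    each decision rests on multiple observations.
    """
    llr = snr_post * snr_prior / (1.0 + snr_prior) - np.log1p(snr_prior)
    frame_llr = llr.mean(axis=1)
    padded = np.pad(frame_llr, window, mode="edge")
    win = 2 * window + 1
    multi = np.array([padded[i:i + win].sum()
                      for i in range(len(frame_llr))])
    return multi > threshold * win
```

Summing neighbouring log likelihood ratios plays the role of the empirically tuned hangover in single-observation detectors, at the cost of the inherent delay the abstract mentions.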

191 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: CNNs are used as acoustic models for speech activity detection (SAD) on data collected over noisy radio communication channels to illustrate that CNNs have a considerable advantage in fast adaptation for acoustic modeling in these settings.
Abstract: Convolutional neural networks (CNN) are extensions to deep neural networks (DNN) which are used as alternate acoustic models with state-of-the-art performances for speech recognition. In this paper, CNNs are used as acoustic models for speech activity detection (SAD) on data collected over noisy radio communication channels. When these SAD models are tested on audio recorded from radio channels not seen during training, there is severe performance degradation. We attribute this degradation to mismatches between the two dimensional filters learnt in the initial CNN layers and the novel channel data. Using a small amount of supervised data from the novel channels, the filters can be adapted to provide significant improvements in SAD performance. In mismatched acoustic conditions, the adapted models provide significant improvements (about 10-25%) relative to conventional DNN-based SAD systems. These results illustrate that CNNs have a considerable advantage in fast adaptation for acoustic modeling in these settings.

110 citations