
Showing papers on "Voice activity detection published in 2012"


Book
01 Jan 2012
TL;DR: This book provides a broad treatment of speech communications, from the mathematics of speech processing, speech production, and perception through coding of speech signals, speech enhancement, speech synthesis, and automatic speech and speaker recognition.
Abstract: Preface. Acknowledgments. Acronyms in Speech Communications. Important Developments in Speech Communications. Introduction. Review of Mathematics for Speech Processing. Speech Production and Acoustic Phonetics. Hearing. Speech Perception. Speech Analysis. Coding of Speech Signals. Speech Enhancement. Speech Synthesis. Automatic Speech Recognition. Speaker Recognition. Appendix: Computer Sites for Help on Speech Communication. References. Index. About the Author.

271 citations


Journal ArticleDOI
TL;DR: For a number of unexplored but important applications, distant microphones are a prerequisite; distant-talking speech recognition is therefore essential for extending the availability of speech recognizers as well as enhancing the convenience of existing speech recognition applications.
Abstract: Speech recognition technology has left the research laboratory and is increasingly coming into practical use, enabling a wide spectrum of innovative and exciting voice-driven applications that are radically changing our way of accessing digital services and information. Most of today's applications still require a microphone located near the talker. However, almost all of these applications would benefit from distant-talking speech capturing, where talkers are able to speak at some distance from the microphones without the encumbrance of handheld or body-worn equipment [1]. For example, applications such as meeting speech recognition, automatic annotation of consumer-generated videos, speech-to-speech translation in teleconferencing, and hands-free interfaces for controlling consumer-products, like interactive TV, will greatly benefit from distant-talking operation. Furthermore, for a number of unexplored but important applications, distant microphones are a prerequisite. This means that distant talking speech recognition technology is essential for extending the availability of speech recognizers as well as enhancing the convenience of existing speech recognition applications.

251 citations


Journal ArticleDOI
TL;DR: In this paper, five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers.
Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires, however, that the precise locations of the glottal closure instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliability and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.
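To give a flavor of how such detectors work, here is a minimal sketch of the smoothing step used in SEDREAMS: a mean-based signal whose local minima delimit short intervals, each expected to contain one GCI (a full detector then refines these intervals against the linear-prediction residual). The window length of roughly 1.75 average pitch periods follows the SEDREAMS literature; the synthetic test signal and all names are illustrative assumptions.

```python
import numpy as np

def mean_based_signal(x, fs, mean_f0=120.0):
    """Slide a Blackman window of ~1.75 average pitch periods over the
    waveform; minima of the result anchor candidate GCI intervals."""
    half = int(round(0.875 * fs / mean_f0))   # half-width in samples
    w = np.blackman(2 * half + 1)
    w /= w.sum()
    return np.convolve(x, w, mode="same")

def candidate_minima(y):
    """Indices of local minima of the mean-based signal."""
    return np.flatnonzero((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])) + 1

# Toy usage: a 120 Hz periodic signal should yield ~120 minima per second.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t)
print(len(candidate_minima(mean_based_signal(x, fs))))
```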

241 citations


Journal ArticleDOI
TL;DR: A new feature based on relative phase shift (RPS) is proposed, reliable detection of synthetic speech is demonstrated, and it is shown how this classifier can be used to improve the security of SV systems.
Abstract: In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.

229 citations


Patent
09 Jul 2012
TL;DR: A method for processing speech is presented, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system.
Abstract: A method for processing speech, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system; determining if the parsed speech input unambiguously corresponds to a command and is sufficiently complete for reliable processing, then processing the command; if the speech input ambiguously corresponds to a single command or is not sufficiently complete for reliable processing, then prompting a user for further speech input to reduce ambiguity or increase completeness, in dependence on a relationship of previously received speech input and at least one command grammar of the plurality of predetermined command grammars, reparsing the further speech input in conjunction with previously parsed speech input, and iterating as necessary. The system also monitors abort, fail or cancel conditions in the speech input.
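A minimal control-flow sketch of the claimed parse/prompt/reparse loop. The regex "grammars", slot names, and prompts are hypothetical stand-ins for the patent's command grammars, not its implementation:

```python
import re

# Hypothetical command grammars: name -> regex with named slots.
GRAMMARS = {
    "call": re.compile(r"call (?P<contact>\w+)"),
    "play": re.compile(r"play (?P<track>.+)"),
}

def parse(utterance):
    """Return (command, slots) for every grammar the utterance matches."""
    hits = []
    for name, grammar in GRAMMARS.items():
        m = grammar.search(utterance.lower())
        if m:
            hits.append((name, m.groupdict()))
    return hits

def process_speech(get_input, max_turns=3):
    """Parse speech; while the input is ambiguous (several grammars match)
    or incomplete (none match), prompt for more speech and reparse the new
    input together with what was said before."""
    utterance = get_input("Say a command: ")
    for _ in range(max_turns):
        hits = parse(utterance)
        if len(hits) == 1:                     # unambiguous and complete
            return hits[0]
        prompt = ("Did you mean " + " or ".join(name for name, _ in hits) + "? "
                  if hits else "Please say more: ")
        utterance += " " + get_input(prompt)   # reparse combined input
    return None                                # abort/fail condition

canned = iter(["please", "play take five"])
print(process_speech(lambda prompt: next(canned)))
# -> ('play', {'track': 'take five'})
```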

226 citations


Patent
06 Jan 2012
TL;DR: The results of the local and remote speech recognition engines are combined based, at least in part, on logic stored by one or more components of the client/server architecture.
Abstract: Techniques for combining the results of multiple recognizers in a distributed speech recognition architecture. Speech data input to a client device is encoded and processed both locally and remotely by different recognizers configured to be proficient at different speech recognition tasks. The client/server architecture is configurable to enable network providers to specify a policy directed to a trade-off between reducing recognition latency perceived by a user and usage of network resources. The results of the local and remote speech recognition engines are combined based, at least in part, on logic stored by one or more components of the client/server architecture.
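One way such a policy-driven combination could look. The utility function and the `latency_weight` policy knob are assumptions for illustration; the patent leaves the combination logic to the architecture's components:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    text: str
    confidence: float    # recognizer-reported score in [0, 1]
    latency_ms: float

def combine(local: Result, remote: Optional[Result], latency_weight: float) -> Result:
    """Pick between local and remote hypotheses. latency_weight in [0, 1]
    encodes the provider policy: 0 trusts confidence only, 1 minimizes
    perceived latency (and network usage) by favoring the fast local path."""
    if remote is None:                         # network unavailable or timed out
        return local
    def utility(r: Result) -> float:
        return (1 - latency_weight) * r.confidence - latency_weight * r.latency_ms / 1000
    return max(local, remote, key=utility)

local = Result("call bob", confidence=0.72, latency_ms=40)
remote = Result("call rob", confidence=0.81, latency_ms=350)
print(combine(local, remote, latency_weight=0.2).text)   # remote wins
print(combine(local, remote, latency_weight=0.8).text)   # local wins
```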

192 citations


Journal ArticleDOI
TL;DR: Three speaking-aid systems are proposed that enhance three different types of EL speech signals: EL speech, EL speech using an air-pressure sensor (EL-air speech), and silent EL speech, which is produced with a new sound source unit that generates signals with extremely low energy.

190 citations


Journal ArticleDOI
TL;DR: Voice conversion methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) are proposed, where the acoustic features of body-conducted unvoiced speech are converted into those of natural voices in a probabilistic manner using Gaussian mixture models (GMMs).
Abstract: In this paper, we present statistical approaches to enhance body-conducted unvoiced speech for silent speech communication. A body-conductive microphone called nonaudible murmur (NAM) microphone is effectively used to detect very soft unvoiced speech such as NAM or a whispered voice while keeping speech sounds emitted outside almost inaudible. However, body-conducted unvoiced speech is difficult to use in human-to-human speech communication because it sounds unnatural and less intelligible owing to the acoustic change caused by body conduction. To address this issue, voice conversion (VC) methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) are proposed, where the acoustic features of body-conducted unvoiced speech are converted into those of natural voices in a probabilistic manner using Gaussian mixture models (GMMs). Moreover, these methods are extended to convert not only NAM but also a body-conducted whispered voice (BCW) as another type of body-conducted unvoiced speech. Several experimental evaluations are conducted to demonstrate the effectiveness of the proposed methods. The experimental results show that 1) NAM-to-Speech effectively improves intelligibility but it causes degradation of naturalness owing to the difficulty of estimating natural fundamental frequency contours from unvoiced speech; 2) NAM-to-Whisper significantly outperforms NAM-to-Speech in terms of both intelligibility and naturalness; and 3) a single conversion model capable of converting both NAM and BCW is effectively developed in our proposed VC methods.
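At the heart of GMM-based conversion is the minimum mean-square-error mapping E[y|x] under a GMM fit on joint source/target vectors. Below is a compact frame-by-frame sketch on synthetic features; the paper's systems add time alignment, spectral analysis/synthesis, and refinements such as dynamic features that this sketch omits:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, K=4):
    """Fit a GMM on stacked [source; target] vectors (time-aligned frames)."""
    return GaussianMixture(n_components=K, covariance_type="full",
                           random_state=0).fit(np.hstack([X, Y]))

def convert(gmm, X, d):
    """MMSE mapping: E[y|x] = sum_k p(k|x) (mu_yk + Syx_k Sxx_k^{-1} (x - mu_xk))."""
    K = gmm.n_components
    post = np.empty((len(X), K))
    for k in range(K):                      # p(k|x) under the x-marginal
        post[:, k] = gmm.weights_[k] * multivariate_normal.pdf(
            X, gmm.means_[k, :d], gmm.covariances_[k][:d, :d])
    post /= post.sum(axis=1, keepdims=True)
    Yhat = np.zeros((len(X), gmm.means_.shape[1] - d))
    for k in range(K):
        mu_x, mu_y = gmm.means_[k, :d], gmm.means_[k, d:]
        Sxx, Syx = gmm.covariances_[k][:d, :d], gmm.covariances_[k][d:, :d]
        Yhat += post[:, [k]] * (mu_y + (X - mu_x) @ np.linalg.solve(Sxx, Syx.T))
    return Yhat

d = 3                                          # feature dimension per frame
rng = np.random.default_rng(0)
X = rng.normal(size=(500, d))                  # "source speaker" features
Y = 0.8 * X + 0.1 * rng.normal(size=(500, d))  # synthetic "target" features
gmm = fit_joint_gmm(X, Y)
print("MSE:", np.mean((convert(gmm, X, d) - Y) ** 2))   # small if mapping works
```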

187 citations


Proceedings Article
01 Jan 2012
TL;DR: Experiments show that features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
Abstract: Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech from synthetic/converted speech. Motivated by research on the phase spectrum in speech perception, in this study we propose to use features derived from the phase spectrum to detect converted speech. The features are tested under three different training situations of the converted speech detector: a) only Gaussian mixture model (GMM) based converted speech data are available; b) only unit-selection based converted speech data are available; c) no converted speech data are available for training the converted speech model. Experiments conducted on the National Institute of Standards and Technology (NIST) 2006 speaker recognition evaluation (SRE) corpus show that features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
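As a toy version of training situation (c), the sketch below fits a GMM to "phase" features of natural speech only and uses its log-likelihood as the detection score. The frame-wise unwrapped-phase-difference feature is a deliberately crude stand-in for the paper's phase-spectrum features, and both signals are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def phase_features(x, n_fft=512, hop=128):
    """Frame-wise difference of unwrapped STFT phase across frequency --
    a crude stand-in for the paper's phase-spectrum features."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=1)
    phase = np.unwrap(np.angle(spec), axis=1)
    return np.diff(phase, axis=1)[:, :40]    # keep the low-frequency bins

rng = np.random.default_rng(1)
natural = rng.normal(size=16000 * 4)               # stand-in for real speech
converted = np.convolve(natural, np.ones(8) / 8)   # stand-in: phase-altered copy

# Situation (c): train on natural speech only; score by log-likelihood.
model = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(phase_features(natural))
for name, sig in [("natural", natural), ("converted", converted)]:
    print(name, model.score(phase_features(sig)))
# A detector thresholds this score: low likelihood under the natural-speech
# model flags the utterance as synthetic/converted.
```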

170 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper presents the strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework, and shows that DNNs provide the flexibility of using arbitrary features.
Abstract: Context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using the Mel-scale log-filter bank features we not only achieve higher recognition accuracy than using MFCCs, but also can formulate the mixed-bandwidth training problem as a missing feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data an easy task since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for the wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data CD-DNN-HMM outperforms fMPE+BMMI trained GMM-HMM, which cannot benefit from using narrowband data, by 18.4%.
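The missing-feature formulation is easiest to see in the feature layout: narrowband (8 kHz) frames fill only the lower log mel-filterbank dimensions, and the upper dimensions are simply marked absent, so no bandwidth extension is needed. The filter counts below (29 wideband, 22 narrowband) are illustrative assumptions, and the sketch reuses an independent narrowband bank rather than carefully aligning the two banks as the paper's design implies:

```python
import numpy as np

def mel_points(fmax, n):
    """n + 2 mel-spaced corner frequencies from 0 Hz to fmax."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    return 700 * (10 ** (np.linspace(0, mel(fmax), n + 2) / 2595) - 1)

def log_mel(power_spectrum, fs, n_filters):
    """Log mel-filterbank energies for one power-spectrum frame."""
    freqs = np.linspace(0, fs / 2, len(power_spectrum))
    pts = mel_points(fs / 2, n_filters)
    out = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = pts[i], pts[i + 1], pts[i + 2]
        tri = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                 (hi - freqs) / (hi - mid)), 0, None)
        out[i] = np.log(tri @ power_spectrum + 1e-10)
    return out

N_WB, N_NB = 29, 22          # assumed filter counts for 16 kHz / 8 kHz audio

def mixed_bandwidth_features(power_spectrum, fs):
    """Wideband frames fill all N_WB dims; narrowband frames fill the first
    N_NB dims and leave the rest marked missing (NaN here; in training they
    are treated as absent inputs to the DNN)."""
    if fs == 16000:
        return log_mel(power_spectrum, fs, N_WB)
    feats = np.full(N_WB, np.nan)
    feats[:N_NB] = log_mel(power_spectrum, fs, N_NB)
    return feats

spec = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
print(mixed_bandwidth_features(spec, 16000).shape)           # (29,)
print(np.isnan(mixed_bandwidth_features(spec, 8000)).sum())  # 7 missing dims
```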

143 citations


Patent
11 Dec 2012
TL;DR: Power consumption for a computing device may be managed by one or more keywords: if an audio input obtained by the device includes a keyword, a network interface module and/or an application processing module of the device may be activated.
Abstract: Power consumption for a computing device may be managed by one or more keywords. For example, if an audio input obtained by the computing device includes a keyword, a network interface module and/or an application processing module of the computing device may be activated. The audio input may then be transmitted via the network interface module to a remote computing device, such as a speech recognition server. Alternately, the computing device may be provided with a speech recognition engine configured to process the audio input for on-device speech recognition.
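The claimed flow reduces to a gate: a low-power keyword spotter decides whether to wake the network interface and ship the audio to a remote recognizer. A toy sketch with a text stand-in for the acoustic keyword detector (all names hypothetical):

```python
WAKE_WORDS = {"computer", "assistant"}       # hypothetical keywords

class Device:
    """Toy model of the claimed flow: an always-on keyword spotter gates
    activation of the network interface and application processor."""
    def __init__(self):
        self.network_up = False

    def keyword_spotter(self, transcript):
        # Text stand-in for a low-power acoustic keyword detector.
        return any(w in transcript.lower().split() for w in WAKE_WORDS)

    def handle_audio(self, transcript):
        if not self.keyword_spotter(transcript):
            return "modules stay powered down"
        self.network_up = True               # activate network interface
        # Audio would now be streamed to the remote recognition server
        # (or handed to an on-device recognition engine instead).
        return "audio sent to speech recognition server"

device = Device()
print(device.handle_audio("play some jazz"))            # no keyword
print(device.handle_audio("computer play some jazz"))   # keyword -> wake up
```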

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A voice conversion technique for noisy environments is presented, in which parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal; its effectiveness is confirmed by comparison with a conventional Gaussian mixture model (GMM)-based method.
Abstract: This paper presents a voice conversion (VC) technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of the source exemplars and target exemplars, having the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars obtained from the input signal, and their weights (activities). Then, by using the weights of the source exemplars, the converted signal is constructed from the target exemplars. We carried out speaker conversion tasks using clean speech data and noise-added speech data. The effectiveness of this method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.
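The encoding step is non-negative decomposition of each noisy source frame over concatenated source-speaker and noise exemplars; only the speech activations are then applied to the parallel target exemplars. A sketch using plain non-negative least squares per frame (the paper's activity estimation uses sparsity-penalized NMF-style updates instead), with random stand-ins for the dictionaries:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
d, n_ex, n_noise = 20, 40, 10       # spectrum dim, speech / noise exemplars

A_src = rng.random((d, n_ex))       # source-speaker exemplars (magnitudes)
A_tgt = rng.random((d, n_ex))       # parallel target exemplars (same texts)
A_noise = rng.random((d, n_noise))  # noise exemplars from the input signal

def convert_frame(x):
    """Decompose x ~ [A_src A_noise] h with h >= 0, then rebuild the frame
    from the target exemplars using only the speech activations."""
    h, _ = nnls(np.hstack([A_src, A_noise]), x)
    return A_tgt @ h[:n_ex]         # noise activations are discarded

# Sanity check with a frame built from known sparse activations.
h_true = rng.random(n_ex) * (rng.random(n_ex) > 0.8)
x_noisy = A_src @ h_true + 0.1 * (A_noise @ rng.random(n_noise))
y = convert_frame(x_noisy)
print(np.corrcoef(y, A_tgt @ h_true)[0, 1])   # near 1 when recovery succeeds
```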

Journal ArticleDOI
TL;DR: The aim of this article is to describe some of the technological underpinnings of modern LVCSR systems, which are not robust to mismatched training and test conditions and cannot handle context as well as human listeners despite being trained on thousands of hours of speech and billions of words of text.
Abstract: Over the past decade or so, several advances have been made to the design of modern large vocabulary continuous speech recognition (LVCSR) systems to the point where their application has broadened from early speaker dependent dictation systems to speaker-independent automatic broadcast news transcription and indexing, lectures and meetings transcription, conversational telephone speech transcription, open-domain voice search, medical and legal speech recognition, and call center applications, to name a few. The commercial success of these systems is an impressive testimony to how far research in LVCSR has come, and the aim of this article is to describe some of the technological underpinnings of modern systems. It must be said, however, that, despite the commercial success and widespread adoption, the problem of large-vocabulary speech recognition is far from being solved: background noise, channel distortions, foreign accents, casual and disfluent speech, or unexpected topic change can cause automated systems to make egregious recognition errors. This is because current LVCSR systems are not robust to mismatched training and test conditions and cannot handle context as well as human listeners despite being trained on thousands of hours of speech and billions of words of text.

Proceedings Article
01 Jan 2012
TL;DR: It is shown that significant gains in SAD accuracy can be obtained by careful design of acoustic front end, feature normalization, incorporation of long span features via data-driven dimensionality reducing transforms, and channel dependent modeling.
Abstract: This paper describes the speech activity detection (SAD) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present two approaches to SAD, one based on Gaussian mixture models, and one based on multi-layer perceptrons. We show that significant gains in SAD accuracy can be obtained by careful design of the acoustic front end, feature normalization, incorporation of long span features via data-driven dimensionality-reducing transforms, and channel-dependent modeling. We also present a novel technique for normalizing detection scores from different systems for the purpose of system combination.
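The GMM branch of such a SAD system boils down to a per-frame log-likelihood ratio between speech and nonspeech models, smoothed over time and thresholded. A self-contained sketch on synthetic two-dimensional features; the paper's gains come from the parts this omits (front-end design, normalization, long-span features, channel-dependent models):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic 2-D "features": speech frames are louder and more variable.
speech_train = rng.normal(loc=[3.0, 1.0], scale=1.5, size=(2000, 2))
nonspeech_train = rng.normal(loc=[0.0, 0.0], scale=0.7, size=(2000, 2))

gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_train)
gmm_nonspeech = GaussianMixture(n_components=4, random_state=0).fit(nonspeech_train)

def sad(frames, win=11, threshold=0.0):
    """Per-frame decision: smoothed log-likelihood ratio
    log p(x|speech) - log p(x|nonspeech) compared against a threshold."""
    llr = gmm_speech.score_samples(frames) - gmm_nonspeech.score_samples(frames)
    smoothed = np.convolve(llr, np.ones(win) / win, mode="same")
    return smoothed > threshold

test = np.vstack([rng.normal([0.0, 0.0], 0.7, size=(100, 2)),    # nonspeech
                  rng.normal([3.0, 1.0], 1.5, size=(100, 2))])   # speech
decisions = sad(test)
print(decisions[:100].sum(), "of 100 nonspeech frames flagged,",
      decisions[100:].sum(), "of 100 speech frames detected")
```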

Book
28 Nov 2012
TL;DR: A comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences can be found in this book.
Abstract: Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where the systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state-of-the-art in techniques to deal with such problems becomes critical to system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences. Key features: Reviews all the main noise robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing data techniques and recognition of reverberant speech. Acts as a timely exposition of the topic in light of more widespread use in the future of ASR technology in challenging environments. Addresses robustness issues and signal degradation which are both key requirements for practitioners of ASR. Includes contributions from top ASR researchers from leading research units in the field

Patent
06 Nov 2012
TL;DR: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation; SNR outlier filtering may be used, alone or in conjunction with weighting of the average SNR.
Abstract: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation. SNR outlier filtering may be used, alone or in conjunction with weighting the average SNR. Adaptive weights may be applied on the SNRs per band before computing the average SNR. The weighting function can be a function of noise level, noise type, and/or instantaneous SNR value. Another weighting mechanism applies a null filtering or outlier filtering which sets the weight in a particular band to be zero. This particular band may be characterized as the one that exhibits an SNR that is several times higher than the SNRs in other bands.
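A plain reading of the claims in code: per-band SNRs receive noise-dependent weights, and any band whose SNR is several times higher than the others has its weight nulled before averaging. The weighting curve and the "several times" factor are assumptions:

```python
import numpy as np

def average_snr(band_snr_db, band_noise_db, outlier_factor=3.0):
    """Weighted average of per-band SNRs with outlier (null) filtering."""
    snr = 10 ** (np.asarray(band_snr_db, float) / 10)      # to linear
    noise = np.asarray(band_noise_db, float)
    # Assumed weighting curve: bands with more noise get less weight.
    weights = 1.0 / (1.0 + 10 ** ((noise - noise.min()) / 20))
    # Null filtering: zero the weight of a band whose SNR is several
    # times higher than the others (here: > outlier_factor x median).
    weights[snr > outlier_factor * np.median(snr)] = 0.0
    return 10 * np.log10(np.average(snr, weights=weights))

band_snr = [6, 5, 7, 30, 6]          # dB; the 30 dB band is a sudden outlier
band_noise = [40, 42, 41, 40, 45]    # dB noise estimates per band
print(average_snr(band_snr, band_noise))                        # ~6 dB
print(10 * np.log10(np.mean(10 ** (np.array(band_snr) / 10))))  # naive: ~23 dB
```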

Journal ArticleDOI
TL;DR: The method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
Abstract: The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
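The separation step is sparse coding of the mixture spectrum over the composite dictionary [D_speech, D_interferer], keeping only the speech part of the reconstruction. In the lasso-based sketch below, the sparsity weight alpha plays exactly the role of the paper's distortion/confusion trade-off; the dictionaries here are random stand-ins for learned ones:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
d, n_s, n_i = 64, 30, 30
D_speech = np.abs(rng.normal(size=(d, n_s)))    # stand-in for learned atoms
D_interf = np.abs(rng.normal(size=(d, n_i)))
D = np.hstack([D_speech, D_interf])             # composite dictionary

def enhance(mixture, alpha=0.01):
    """Sparse-code the mixture over [D_speech, D_interf], reconstruct only
    from speech atoms. alpha is the distortion/confusion knob: too large
    -> too few speech atoms explain the speech (source distortion); too
    small -> interferer atoms explain speech and vice versa (confusion)."""
    coder = Lasso(alpha=alpha, positive=True, fit_intercept=False,
                  max_iter=10000)
    coder.fit(D, mixture)
    return D_speech @ coder.coef_[:n_s]

# Synthetic mixture with known sparse speech and interferer codes.
h_s = np.zeros(n_s); h_s[rng.choice(n_s, 3, replace=False)] = 1.0
h_i = np.zeros(n_i); h_i[rng.choice(n_i, 3, replace=False)] = 1.0
clean = D_speech @ h_s
estimate = enhance(clean + D_interf @ h_i)
print("speech-estimate SNR (dB):",
      10 * np.log10(np.sum(clean ** 2) / np.sum((clean - estimate) ** 2)))
```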

Journal ArticleDOI
TL;DR: A new algorithm is proposed for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding, thus maintaining synchronization between information hiding and speech encoding.
Abstract: Low bit-rate speech codecs have been widely used in audio communications like VoIP and mobile communications, so steganography in low bit-rate audio streams would have broad applications in practice. In this paper, the authors propose a new algorithm for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding. The proposed algorithm performs data embedding while pitch period prediction is conducted during low bit-rate speech encoding, thus maintaining synchronization between information hiding and speech encoding. The steganography algorithm not only achieves high speech quality and resists detection by steganalysis, but is also highly compatible with a standard low bit-rate speech codec, introducing no further delay through data embedding and extraction. Testing shows that, with the proposed algorithm, the data embedding rate of the secret message can reach 4 bits/frame (133.3 bits/second).
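The synchronization idea can be illustrated with a parity-style constraint: restrict the codec's pitch-lag search to lags whose four low bits equal the payload, so the decoder recovers 4 bits per frame simply by reading the transmitted lag (at 30 ms frames, 133.3 bits/second as in the paper). The toy autocorrelation search below illustrates the principle only; it is not the paper's actual codec integration:

```python
import numpy as np

def best_lag(frame, candidates):
    """Normalized autocorrelation pitch search restricted to `candidates`."""
    def score(lag):
        a, b = frame[lag:], frame[:-lag]
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(candidates, key=score)

def embed_bits(frame, bits, lo=32, hi=128):
    """Pick the pitch lag only among lags whose 4 low bits equal `bits`
    (0..15): the payload rides on the transmitted lag, 4 bits/frame."""
    return best_lag(frame, [lag for lag in range(lo, hi) if lag & 0xF == bits])

def extract_bits(lag):
    return lag & 0xF            # the decoder just reads the low bits back

fs = 8000
t = np.arange(240) / fs                     # one 30 ms frame
frame = np.sin(2 * np.pi * 100 * t)         # true pitch period: 80 samples
payload = 0b1011
lag = embed_bits(frame, payload)
print(lag, extract_bits(lag) == payload)    # nearest admissible lag; payload recovered
```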

Patent
Elizabeth V. Woodward1, Shunguo Yan1
26 Sep 2012
TL;DR: A speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the audio track corresponding to the speaker.
Abstract: Mechanisms for performing dynamic automatic speech recognition on a portion of multimedia content are provided. Multimedia content is segmented into homogeneous segments of content with regard to speakers and background sounds. For the at least one segment, a speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source. A speech profile for the speaker is generated using information retrieved from the social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the at least one segment based on the acoustic profile. Automatic speech recognition operations are performed on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.

Patent
11 May 2012
TL;DR: In this paper, a voice recognition system includes a microphone for receiving speech from a user and processing electronics that automatically determine and set an expertise level in response to and based on the evaluation.
Abstract: A voice recognition system includes a microphone for receiving speech from a user and processing electronics. The processing electronics are in communication with the microphone and are configured to use a plurality of rules to evaluate user interactions with the voice recognition system. The processing electronics automatically determine and set an expertise level in response to and based on the evaluation. The processing electronics are configured to automatically adjust at least one setting of the voice recognition system in response to the set expertise level.

Patent
Liang-yu (Tom) Chi1
05 Jul 2012
TL;DR: In this paper, a wearable computing device can be used for processing speech input at the user's request, either as a command or a search request, depending on the context of the speech-related text.
Abstract: Methods and apparatus related to processing speech input at a wearable computing device are disclosed. Speech input can be received at the wearable computing device. Speech-related text corresponding to the speech input can be generated. A context can be determined based on database(s) and/or a history of accessed documents. An action can be determined based on an evaluation of at least a portion of the speech-related text and the context. The action can be a command or a search request. If the action is a command, then the wearable computing device can generate output for the command. If the action is a search request, then the wearable computing device can: communicate the search request to a search engine, receive search results from the search engine, and generate output based on the search results. The output can be provided using output component(s) of the wearable computing device.

Patent
12 Dec 2012
TL;DR: In this paper, features for managing the use of speech recognition models and data in automated speech recognition systems are disclosed, including pre-caching and pre-processing models and statistics.
Abstract: Features are disclosed for managing the use of speech recognition models and data in automated speech recognition systems. Models and data may be retrieved asynchronously and used as they are received or after an utterance is initially processed with more general or different models. Once received, the models and statistics can be cached. Statistics needed to update models and data may also be retrieved asynchronously so that it may be used to update the models and data as it becomes available. The updated models and data may be immediately used to re-process an utterance, or saved for use in processing subsequently received utterances. User interactions with the automated speech recognition system may be tracked in order to predict when a user is likely to utilize the system. Models and data may be pre-cached based on such predictions.

Proceedings Article
01 Dec 2012
TL;DR: To reduce false acceptance rate caused by spoofing attack, a general anti-spoofing attack framework is proposed for the speaker verification systems, where a converted speech detector is adopted as a post-processing module for the Speaker verification system's acceptance decision.
Abstract: Voice conversion techniques, which modify one speaker's (source) voice to sound like another speaker's (target), present a threat to automatic speaker verification. In this paper, we first present new results of evaluating the vulnerability of current state-of-the-art speaker verification systems: Gaussian mixture model with joint factor analysis (GMM-JFA) and probabilistic linear discriminant analysis (PLDA) systems, against spoofing attacks. The spoofing attacks are simulated by two voice conversion techniques: Gaussian mixture model based conversion and unit selection based conversion. To reduce the false acceptance rate caused by spoofing attacks, we propose a general anti-spoofing framework for speaker verification systems, where a converted speech detector is adopted as a post-processing module for the speaker verification system's acceptance decision. The detector decides whether the accepted claim is human speech or converted speech. A subset of the core task in the NIST SRE 2006 corpus is used to evaluate the vulnerability of the speaker verification systems and the performance of the converted speech detector. The results indicate that both conversion techniques can increase the false acceptance rate of the GMM-JFA and PLDA systems, while the converted speech detector can reduce the false acceptance rate from 31.54% and 41.25% to 1.64% and 1.71% for the GMM-JFA and PLDA systems, respectively, on unit-selection based converted speech.
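The framework itself is a simple cascade: an accept from the verifier stands only if the converted-speech detector also labels the trial as human. Minimal decision logic, with hypothetical scorer placeholders standing in for the GMM-JFA/PLDA systems and the detector:

```python
from typing import Callable

def verify(trial: str,
           sv_score: Callable[[str], float], sv_threshold: float,
           cs_score: Callable[[str], float], cs_threshold: float) -> bool:
    """Anti-spoofing cascade: a claim is accepted only if the speaker
    verification system accepts it AND the converted-speech detector
    judges the audio to be human speech."""
    if sv_score(trial) < sv_threshold:
        return False                          # SV system rejects the claim
    return cs_score(trial) >= cs_threshold    # post-processing stage

# Hypothetical scorers standing in for a GMM-JFA/PLDA system and the detector.
accepted = verify("utterance.wav",
                  sv_score=lambda t: 2.1, sv_threshold=0.0,
                  cs_score=lambda t: -0.4, cs_threshold=0.0)
print(accepted)   # False: SV accepted, but the detector flagged conversion
```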

Journal ArticleDOI
TL;DR: This paper presents and analyzes an algorithm that estimates the noise correlation matrix without using a VAD, based on measuring the correlation of the noisy input and a noise reference which can be obtained, e.g., by steering a null towards the target source.
Abstract: For multi-channel noise reduction algorithms like the minimum variance distortionless response (MVDR) beamformer, or the multi-channel Wiener filter, an estimate of the noise correlation matrix is needed. For its estimation, it is often proposed in the literature to use a voice activity detector (VAD). However, using a VAD the estimated matrix can only be updated in speech absence. As a result, during speech presence the noise correlation matrix estimate does not follow changing noise fields with an appropriate accuracy. This effect is further increased, as in nonstationary noise voice activity detection is a rather difficult task, and false-alarms are likely to occur. In this paper, we present and analyze an algorithm that estimates the noise correlation matrix without using a VAD. This algorithm is based on measuring the correlation of the noisy input and a noise reference which can be obtained, e.g., by steering a null towards the target source. When applied in combination with an MVDR beamformer, it is shown that the proposed noise correlation matrix estimate results in a more accurate beamformer response, a larger signal-to-noise ratio improvement and a larger instrumentally predicted speech intelligibility when compared to competing algorithms such as the generalized sidelobe canceler, a VAD-based MVDR beamformer, and an MVDR based on the noisy correlation matrix.
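The identity the estimator exploits is that for snapshots y = s·d + n and a noise reference u = Bᵀy with Bᵀd = 0 (a null steered toward the target), the cross-correlation E[y uᵀ] equals R_n B: the target drops out, so noise statistics can be tracked during speech presence with no VAD. The sketch below verifies this numerically and includes the standard MVDR weight formula; it is a simplified illustration, not the paper's full estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
M = 4                                     # microphones
d = np.ones(M) / np.sqrt(M)               # steering vector toward the target
# Blocking matrix B: orthonormal columns with B.T @ d == 0, taken from the
# orthogonal complement of d (steers a null toward the target).
B = np.linalg.qr(np.eye(M) - np.outer(d, d))[0][:, : M - 1]

L = np.tril(rng.normal(size=(M, M))) + 2 * np.eye(M)
R_n = L @ L.T                             # true noise covariance
T = 200_000
s = rng.normal(size=T)                    # target active the whole time
Y = np.outer(d, s) + L @ rng.normal(size=(M, T))   # noisy snapshots

# Key identity: with u = B.T @ y a target-free noise reference,
# E[y u^T] = R_n @ B, even though the target is present in y.
U = B.T @ Y
C_hat = (Y @ U.T) / T
print("max estimation error:", np.abs(C_hat - R_n @ B).max())  # small

def mvdr_weights(R, d):
    """w = R^{-1} d / (d^H R^{-1} d): unit gain toward d, minimum noise power."""
    Rd = np.linalg.solve(R, d)
    return Rd / (d @ Rd)

w = mvdr_weights(R_n, d)
print("distortionless:", np.isclose(w @ d, 1.0))               # True
```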

Journal ArticleDOI
TL;DR: A binaural scene analyzer that is able to simultaneously localize, detect and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation is presented.
Abstract: In this study, we present a binaural scene analyzer that is able to simultaneously localize, detect and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation. In contrast to many other binaural cocktail party processors, the proposed system does not require a priori knowledge about the azimuth position of the target speakers. The proposed system consists of three main building blocks: binaural localization, speech source detection, and automatic speaker identification. First, a binaural front-end is used to robustly localize relevant sound source activity. Second, a speech detection module based on missing data classification is employed to determine whether detected sound source activity corresponds to a speaker or to an interfering noise source using a binary mask that is based on spatial evidence supplied by the binaural front-end. Third, a second missing data classifier is used to recognize the speaker identities of all detected speech sources. The proposed system is systematically evaluated in simulated adverse acoustic scenarios. Compared to state-of-the-art MFCC recognizers, the proposed model achieves significant speaker recognition accuracy improvements.

Journal ArticleDOI
TL;DR: The proposed VOP detection method shows significant performance improvement over the existing method for clean as well as coded speech, and its effectiveness is analyzed in CV recognition by using the VOP as an anchor point.
Abstract: In this paper, we propose a method for detecting the vowel onset points (VOPs) for low bit rate coded speech. The VOP is the instant at which the onset of the vowel takes place in the speech signal. VOPs play an important role in applications such as consonant-vowel (CV) unit recognition and speech rate modification. The proposed VOP detection method is based on the spectral energy present in the glottal closure region of the speech signal. Speech coders considered to carry out this study are Global System for Mobile Communications (GSM) full rate, code-excited linear prediction (CELP), and mixed-excitation linear prediction (MELP). The TIMIT database and CV units collected from the broadcast news corpus are used for evaluation. Performance of the proposed method is compared with an existing method, which uses the combination of evidence from the excitation source, spectral peaks energy, and modulation spectrum. The proposed VOP detection method has shown significant improvement in performance compared to the existing method under clean as well as coded cases. The effectiveness of the proposed VOP detection method is analyzed in CV recognition by using the VOP as an anchor point.

Journal ArticleDOI
TL;DR: The inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more relevant than any difference between the proposed confidence measures.
Abstract: The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of the stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper, we propose a new confidence measure, compare it to existing ones, and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more relevant than any difference between the proposed confidence measures.

Patent
13 Sep 2012
TL;DR: A signal processing apparatus includes a speech recognition system and a voice activity detection unit coupled to it; the unit detects whether an audio signal is a voice signal and accordingly generates a voice activity detection result that controls whether the speech recognition system should perform speech recognition upon the audio signal.
Abstract: A signal processing apparatus includes a speech recognition system and a voice activity detection unit. The voice activity detection unit is coupled to the speech recognition system, and arranged for detecting whether an audio signal is a voice signal and accordingly generating a voice activity detection result to the speech recognition system to control whether the speech recognition system should perform speech recognition upon the audio signal.
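The apparatus is essentially a gate in front of the recognizer. Below, a minimal frame-energy VAD stands in for the patent's detection unit (real detectors add spectral features, adaptive thresholds, and hangover smoothing), and recognition runs only when enough voiced frames are found:

```python
import numpy as np

def frame_energy_vad(audio, fs, frame_ms=20, threshold_db=-40.0):
    """One boolean per frame: True where frame energy (dB, full scale ~1.0)
    exceeds an absolute threshold. A stand-in for the patent's VAD unit."""
    n = int(fs * frame_ms / 1000)
    frames = audio[: len(audio) // n * n].reshape(-1, n)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > threshold_db

def recognize_if_voice(audio, fs, asr, min_voiced=0.1):
    """Gate the recognizer with the VAD result: speech recognition runs
    only when enough frames contain voice activity."""
    if frame_energy_vad(audio, fs).mean() > min_voiced:
        return asr(audio)
    return None                    # skip recognition entirely

fs = 16000
silence = 0.001 * np.random.randn(fs)                   # 1 s of faint hiss
tone = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)     # voice stand-in
print(recognize_if_voice(silence, fs, asr=lambda a: "text"))   # None
print(recognize_if_voice(np.concatenate([silence, tone]), fs,
                         asr=lambda a: "text"))                # "text"
```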

Journal ArticleDOI
TL;DR: The results of this study revealed that incorporating voice onset and offset information leads to efficient automatic detection of voice disorders.

Patent
06 Jun 2012
TL;DR: A method for improving speech recognition by a speech recognition system includes obtaining a voice sample from a speaker, storing it as a voice model, identifying an area from which sound matching the voice model is coming, and providing audio signals corresponding to sound received from the identified area to the system for processing.
Abstract: A method for improving speech recognition by a speech recognition system includes obtaining a voice sample from a speaker; storing the voice sample of the speaker as a voice model in a voice model database; identifying an area from which sound matching the voice model for the speaker is coming; providing one or more audio signals corresponding to sound received from the identified area to the speech recognition system for processing.