
Showing papers on "Voice activity detection published in 2012"


Book
01 Jan 2012
TL;DR: This book provides a broad treatment of speech communications, from the mathematics of speech processing, speech production, and perception through coding of speech signals, speech enhancement, speech synthesis, and automatic speech and speaker recognition.
Abstract: Preface. Acknowledgments. Acronyms in Speech Communications. Important Developments in Speech Communications. Introduction. Review of Mathematics for Speech Processing. Speech Production and Acoustic Phonetics. Hearing. Speech Perception. Speech Analysis. Coding of Speech Signals. Speech Enhancement. Speech Synthesis. Automatic Speech Recognition. Speaker Recognition. Appendix: Computer Sites for Help on Speech Communication. References. Index. About the Author.

271 citations


Journal ArticleDOI
TL;DR: For a number of unexplored but important applications, distant microphones are a prerequisite; distant-talking speech recognition is therefore essential for extending the availability of speech recognizers as well as enhancing the convenience of existing speech recognition applications.
Abstract: Speech recognition technology has left the research laboratory and is increasingly coming into practical use, enabling a wide spectrum of innovative and exciting voice-driven applications that are radically changing our way of accessing digital services and information. Most of today's applications still require a microphone located near the talker. However, almost all of these applications would benefit from distant-talking speech capturing, where talkers are able to speak at some distance from the microphones without the encumbrance of handheld or body-worn equipment [1]. For example, applications such as meeting speech recognition, automatic annotation of consumer-generated videos, speech-to-speech translation in teleconferencing, and hands-free interfaces for controlling consumer-products, like interactive TV, will greatly benefit from distant-talking operation. Furthermore, for a number of unexplored but important applications, distant microphones are a prerequisite. This means that distant talking speech recognition technology is essential for extending the availability of speech recognizers as well as enhancing the convenience of existing speech recognition applications.

251 citations


Journal ArticleDOI
TL;DR: In this paper, five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers.
Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires, however, that the precise locations of the glottal closure instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliability and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.
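To give a flavor of how such detectors work, here is a minimal sketch of the smoothing step used in SEDREAMS: a mean-based signal whose local minima delimit short intervals, each expected to contain one GCI (a full detector then refines these intervals against the linear-prediction residual). The window length of roughly 1.75 average pitch periods follows the SEDREAMS literature; the synthetic test signal and all names are illustrative assumptions.

```python
import numpy as np

def mean_based_signal(x, fs, mean_f0=120.0):
    """Slide a Blackman window of ~1.75 average pitch periods over the
    waveform; minima of the result anchor candidate GCI intervals."""
    half = int(round(0.875 * fs / mean_f0))   # half-width in samples
    w = np.blackman(2 * half + 1)
    w /= w.sum()
    return np.convolve(x, w, mode="same")

def candidate_minima(y):
    """Indices of local minima of the mean-based signal."""
    return np.flatnonzero((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])) + 1

# Toy usage: a 120 Hz periodic signal should yield ~120 minima per second.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t)
print(len(candidate_minima(mean_based_signal(x, fs))))
```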

241 citations


Journal ArticleDOI
TL;DR: A new feature based on relative phase shift (RPS) is proposed, reliable detection of synthetic speech is demonstrated, and it is shown how this classifier can be used to improve the security of SV systems.
Abstract: In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.

229 citations


Patent
09 Jul 2012
TL;DR: A method for processing speech is presented, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system.
Abstract: A method for processing speech, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system; determining if the parsed speech input unambiguously corresponds to a command and is sufficiently complete for reliable processing, then processing the command; if the speech input ambiguously corresponds to a single command or is not sufficiently complete for reliable processing, then prompting a user for further speech input to reduce ambiguity or increase completeness, in dependence on a relationship of previously received speech input and at least one command grammar of the plurality of predetermined command grammars, reparsing the further speech input in conjunction with previously parsed speech input, and iterating as necessary. The system also monitors abort, fail or cancel conditions in the speech input.
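A minimal control-flow sketch of the claimed parse/prompt/reparse loop. The regex "grammars", slot names, and prompts are hypothetical stand-ins for the patent's command grammars, not its implementation:

```python
import re

# Hypothetical command grammars: name -> regex with named slots.
GRAMMARS = {
    "call": re.compile(r"call (?P<contact>\w+)"),
    "play": re.compile(r"play (?P<track>.+)"),
}

def parse(utterance):
    """Return (command, slots) for every grammar the utterance matches."""
    hits = []
    for name, grammar in GRAMMARS.items():
        m = grammar.search(utterance.lower())
        if m:
            hits.append((name, m.groupdict()))
    return hits

def process_speech(get_input, max_turns=3):
    """Parse speech; while the input is ambiguous (several grammars match)
    or incomplete (none match), prompt for more speech and reparse the new
    input together with what was said before."""
    utterance = get_input("Say a command: ")
    for _ in range(max_turns):
        hits = parse(utterance)
        if len(hits) == 1:                     # unambiguous and complete
            return hits[0]
        prompt = ("Did you mean " + " or ".join(name for name, _ in hits) + "? "
                  if hits else "Please say more: ")
        utterance += " " + get_input(prompt)   # reparse combined input
    return None                                # abort/fail condition

canned = iter(["please", "play take five"])
print(process_speech(lambda prompt: next(canned)))
# -> ('play', {'track': 'take five'})
```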

226 citations


Patent
06 Jan 2012
TL;DR: The results of the local and remote speech recognition engines are combined based, at least in part, on logic stored by one or more components of the client/server architecture.
Abstract: Techniques for combining the results of multiple recognizers in a distributed speech recognition architecture. Speech data input to a client device is encoded and processed both locally and remotely by different recognizers configured to be proficient at different speech recognition tasks. The client/server architecture is configurable to enable network providers to specify a policy directed to a trade-off between reducing recognition latency perceived by a user and usage of network resources. The results of the local and remote speech recognition engines are combined based, at least in part, on logic stored by one or more components of the client/server architecture.
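One way such a policy-driven combination could look. The utility function and the `latency_weight` policy knob are assumptions for illustration; the patent leaves the combination logic to the architecture's components:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    text: str
    confidence: float    # recognizer-reported score in [0, 1]
    latency_ms: float

def combine(local: Result, remote: Optional[Result], latency_weight: float) -> Result:
    """Pick between local and remote hypotheses. latency_weight in [0, 1]
    encodes the provider policy: 0 trusts confidence only, 1 minimizes
    perceived latency (and network usage) by favoring the fast local path."""
    if remote is None:                         # network unavailable or timed out
        return local
    def utility(r: Result) -> float:
        return (1 - latency_weight) * r.confidence - latency_weight * r.latency_ms / 1000
    return max(local, remote, key=utility)

local = Result("call bob", confidence=0.72, latency_ms=40)
remote = Result("call rob", confidence=0.81, latency_ms=350)
print(combine(local, remote, latency_weight=0.2).text)   # remote wins
print(combine(local, remote, latency_weight=0.8).text)   # local wins
```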

192 citations


Journal ArticleDOI
TL;DR: Three speaking-aid systems are proposed that enhance three different types of EL speech signals: EL speech, EL speech using an air-pressure sensor (EL-air speech), and silent EL speech, which is produced with a new sound source unit that generates signals with extremely low energy.

190 citations


Journal ArticleDOI
TL;DR: Voice conversion methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) are proposed, where the acoustic features of body-conducted unvoiced speech are converted into those of natural voices in a probabilistic manner using Gaussian mixture models (GMMs).
Abstract: In this paper, we present statistical approaches to enhance body-conducted unvoiced speech for silent speech communication. A body-conductive microphone called nonaudible murmur (NAM) microphone is effectively used to detect very soft unvoiced speech such as NAM or a whispered voice while keeping speech sounds emitted outside almost inaudible. However, body-conducted unvoiced speech is difficult to use in human-to-human speech communication because it sounds unnatural and less intelligible owing to the acoustic change caused by body conduction. To address this issue, voice conversion (VC) methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) are proposed, where the acoustic features of body-conducted unvoiced speech are converted into those of natural voices in a probabilistic manner using Gaussian mixture models (GMMs). Moreover, these methods are extended to convert not only NAM but also a body-conducted whispered voice (BCW) as another type of body-conducted unvoiced speech. Several experimental evaluations are conducted to demonstrate the effectiveness of the proposed methods. The experimental results show that 1) NAM-to-Speech effectively improves intelligibility but it causes degradation of naturalness owing to the difficulty of estimating natural fundamental frequency contours from unvoiced speech; 2) NAM-to-Whisper significantly outperforms NAM-to-Speech in terms of both intelligibility and naturalness; and 3) a single conversion model capable of converting both NAM and BCW is effectively developed in our proposed VC methods.
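At the heart of GMM-based conversion is the minimum mean-square-error mapping E[y|x] under a GMM fit on joint source/target vectors. Below is a compact frame-by-frame sketch on synthetic features; the paper's systems add time alignment, spectral analysis/synthesis, and refinements such as dynamic features that this sketch omits:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, K=4):
    """Fit a GMM on stacked [source; target] vectors (time-aligned frames)."""
    return GaussianMixture(n_components=K, covariance_type="full",
                           random_state=0).fit(np.hstack([X, Y]))

def convert(gmm, X, d):
    """MMSE mapping: E[y|x] = sum_k p(k|x) (mu_yk + Syx_k Sxx_k^{-1} (x - mu_xk))."""
    K = gmm.n_components
    post = np.empty((len(X), K))
    for k in range(K):                      # p(k|x) under the x-marginal
        post[:, k] = gmm.weights_[k] * multivariate_normal.pdf(
            X, gmm.means_[k, :d], gmm.covariances_[k][:d, :d])
    post /= post.sum(axis=1, keepdims=True)
    Yhat = np.zeros((len(X), gmm.means_.shape[1] - d))
    for k in range(K):
        mu_x, mu_y = gmm.means_[k, :d], gmm.means_[k, d:]
        Sxx, Syx = gmm.covariances_[k][:d, :d], gmm.covariances_[k][d:, :d]
        Yhat += post[:, [k]] * (mu_y + (X - mu_x) @ np.linalg.solve(Sxx, Syx.T))
    return Yhat

d = 3                                          # feature dimension per frame
rng = np.random.default_rng(0)
X = rng.normal(size=(500, d))                  # "source speaker" features
Y = 0.8 * X + 0.1 * rng.normal(size=(500, d))  # synthetic "target" features
gmm = fit_joint_gmm(X, Y)
print("MSE:", np.mean((convert(gmm, X, d) - Y) ** 2))   # small if mapping works
```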

187 citations


Proceedings Article
01 Jan 2012
TL;DR: Experiments show that features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
Abstract: Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech from synthetic/converted speech. Motivated by research on the phase spectrum in speech perception, in this study we propose to use features derived from the phase spectrum to detect converted speech. The features are tested under three different training situations of the converted speech detector: a) only Gaussian mixture model (GMM) based converted speech data are available; b) only unit-selection based converted speech data are available; c) no converted speech data are available for training the converted speech model. Experiments conducted on the National Institute of Standards and Technology (NIST) 2006 speaker recognition evaluation (SRE) corpus show that features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
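As a toy version of training situation (c), the sketch below fits a GMM to "phase" features of natural speech only and uses its log-likelihood as the detection score. The frame-wise unwrapped-phase-difference feature is a deliberately crude stand-in for the paper's phase-spectrum features, and both signals are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def phase_features(x, n_fft=512, hop=128):
    """Frame-wise difference of unwrapped STFT phase across frequency --
    a crude stand-in for the paper's phase-spectrum features."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=1)
    phase = np.unwrap(np.angle(spec), axis=1)
    return np.diff(phase, axis=1)[:, :40]    # keep the low-frequency bins

rng = np.random.default_rng(1)
natural = rng.normal(size=16000 * 4)               # stand-in for real speech
converted = np.convolve(natural, np.ones(8) / 8)   # stand-in: phase-altered copy

# Situation (c): train on natural speech only; score by log-likelihood.
model = GaussianMixture(n_components=8, covariance_type="diag",
                        random_state=0).fit(phase_features(natural))
for name, sig in [("natural", natural), ("converted", converted)]:
    print(name, model.score(phase_features(sig)))
# A detector thresholds this score: low likelihood under the natural-speech
# model flags the utterance as synthetic/converted.
```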

170 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper presents the strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework, and shows that DNNs provide the flexibility of using arbitrary features.
Abstract: Context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using the Mel-scale log-filter bank features we not only achieve higher recognition accuracy than using MFCCs, but also can formulate the mixed-bandwidth training problem as a missing feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data an easy task since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for the wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data CD-DNN-HMM outperforms fMPE+BMMI trained GMM-HMM, which cannot benefit from using narrowband data, by 18.4%.
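The missing-feature formulation is easiest to see in the feature layout: narrowband (8 kHz) frames fill only the lower log mel-filterbank dimensions, and the upper dimensions are simply marked absent, so no bandwidth extension is needed. The filter counts below (29 wideband, 22 narrowband) are illustrative assumptions, and the sketch reuses an independent narrowband bank rather than carefully aligning the two banks as the paper's design implies:

```python
import numpy as np

def mel_points(fmax, n):
    """n + 2 mel-spaced corner frequencies from 0 Hz to fmax."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    return 700 * (10 ** (np.linspace(0, mel(fmax), n + 2) / 2595) - 1)

def log_mel(power_spectrum, fs, n_filters):
    """Log mel-filterbank energies for one power-spectrum frame."""
    freqs = np.linspace(0, fs / 2, len(power_spectrum))
    pts = mel_points(fs / 2, n_filters)
    out = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = pts[i], pts[i + 1], pts[i + 2]
        tri = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                 (hi - freqs) / (hi - mid)), 0, None)
        out[i] = np.log(tri @ power_spectrum + 1e-10)
    return out

N_WB, N_NB = 29, 22          # assumed filter counts for 16 kHz / 8 kHz audio

def mixed_bandwidth_features(power_spectrum, fs):
    """Wideband frames fill all N_WB dims; narrowband frames fill the first
    N_NB dims and leave the rest marked missing (NaN here; in training they
    are treated as absent inputs to the DNN)."""
    if fs == 16000:
        return log_mel(power_spectrum, fs, N_WB)
    feats = np.full(N_WB, np.nan)
    feats[:N_NB] = log_mel(power_spectrum, fs, N_NB)
    return feats

spec = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
print(mixed_bandwidth_features(spec, 16000).shape)           # (29,)
print(np.isnan(mixed_bandwidth_features(spec, 8000)).sum())  # 7 missing dims
```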

143 citations


Patent
11 Dec 2012
TL;DR: Power consumption for a computing device may be managed by one or more keywords: if an audio input obtained by the device includes a keyword, a network interface module and/or an application processing module of the device may be activated.
Abstract: Power consumption for a computing device may be managed by one or more keywords. For example, if an audio input obtained by the computing device includes a keyword, a network interface module and/or an application processing module of the computing device may be activated. The audio input may then be transmitted via the network interface module to a remote computing device, such as a speech recognition server. Alternately, the computing device may be provided with a speech recognition engine configured to process the audio input for on-device speech recognition.
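The claimed flow reduces to a gate: a low-power keyword spotter decides whether to wake the network interface and ship the audio to a remote recognizer. A toy sketch with a text stand-in for the acoustic keyword detector (all names hypothetical):

```python
WAKE_WORDS = {"computer", "assistant"}       # hypothetical keywords

class Device:
    """Toy model of the claimed flow: an always-on keyword spotter gates
    activation of the network interface and application processor."""
    def __init__(self):
        self.network_up = False

    def keyword_spotter(self, transcript):
        # Text stand-in for a low-power acoustic keyword detector.
        return any(w in transcript.lower().split() for w in WAKE_WORDS)

    def handle_audio(self, transcript):
        if not self.keyword_spotter(transcript):
            return "modules stay powered down"
        self.network_up = True               # activate network interface
        # Audio would now be streamed to the remote recognition server
        # (or handed to an on-device recognition engine instead).
        return "audio sent to speech recognition server"

device = Device()
print(device.handle_audio("play some jazz"))            # no keyword
print(device.handle_audio("computer play some jazz"))   # keyword -> wake up
```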

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A voice conversion technique for noisy environments is presented, in which parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal; its effectiveness is confirmed by comparison with a conventional Gaussian mixture model (GMM)-based method.
Abstract: This paper presents a voice conversion (VC) technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of the source exemplars and target exemplars, having the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars obtained from the input signal, and their weights (activities). Then, by using the weights of the source exemplars, the converted signal is constructed from the target exemplars. We carried out speaker conversion tasks using clean speech data and noise-added speech data. The effectiveness of this method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.
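The encoding step is non-negative decomposition of each noisy source frame over concatenated source-speaker and noise exemplars; only the speech activations are then applied to the parallel target exemplars. A sketch using plain non-negative least squares per frame (the paper's activity estimation uses sparsity-penalized NMF-style updates instead), with random stand-ins for the dictionaries:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
d, n_ex, n_noise = 20, 40, 10       # spectrum dim, speech / noise exemplars

A_src = rng.random((d, n_ex))       # source-speaker exemplars (magnitudes)
A_tgt = rng.random((d, n_ex))       # parallel target exemplars (same texts)
A_noise = rng.random((d, n_noise))  # noise exemplars from the input signal

def convert_frame(x):
    """Decompose x ~ [A_src A_noise] h with h >= 0, then rebuild the frame
    from the target exemplars using only the speech activations."""
    h, _ = nnls(np.hstack([A_src, A_noise]), x)
    return A_tgt @ h[:n_ex]         # noise activations are discarded

# Sanity check with a frame built from known sparse activations.
h_true = rng.random(n_ex) * (rng.random(n_ex) > 0.8)
x_noisy = A_src @ h_true + 0.1 * (A_noise @ rng.random(n_noise))
y = convert_frame(x_noisy)
print(np.corrcoef(y, A_tgt @ h_true)[0, 1])   # near 1 when recovery succeeds
```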

Journal ArticleDOI
TL;DR: The aim of this article is to describe some of the technological underpinnings of modern LVCSR systems, which are not robust to mismatched training and test conditions and cannot handle context as well as human listeners despite being trained on thousands of hours of speech and billions of words of text.
Abstract: Over the past decade or so, several advances have been made to the design of modern large vocabulary continuous speech recognition (LVCSR) systems to the point where their application has broadened from early speaker dependent dictation systems to speaker-independent automatic broadcast news transcription and indexing, lectures and meetings transcription, conversational telephone speech transcription, open-domain voice search, medical and legal speech recognition, and call center applications, to name a few. The commercial success of these systems is an impressive testimony to how far research in LVCSR has come, and the aim of this article is to describe some of the technological underpinnings of modern systems. It must be said, however, that, despite the commercial success and widespread adoption, the problem of large-vocabulary speech recognition is far from being solved: background noise, channel distortions, foreign accents, casual and disfluent speech, or unexpected topic change can cause automated systems to make egregious recognition errors. This is because current LVCSR systems are not robust to mismatched training and test conditions and cannot handle context as well as human listeners despite being trained on thousands of hours of speech and billions of words of text.

Proceedings Article
01 Jan 2012
TL;DR: It is shown that significant gains in SAD accuracy can be obtained by careful design of acoustic front end, feature normalization, incorporation of long span features via data-driven dimensionality reducing transforms, and channel dependent modeling.
Abstract: This paper describes the speech activity detection (SAD) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present two approaches to SAD, one based on Gaussian mixture models, and one based on multi-layer perceptrons. We show that significant gains in SAD accuracy can be obtained by careful design of the acoustic front end, feature normalization, incorporation of long span features via data-driven dimensionality-reducing transforms, and channel-dependent modeling. We also present a novel technique for normalizing detection scores from different systems for the purpose of system combination.
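The GMM branch of such a SAD system boils down to a per-frame log-likelihood ratio between speech and nonspeech models, smoothed over time and thresholded. A self-contained sketch on synthetic two-dimensional features; the paper's gains come from the parts this omits (front-end design, normalization, long-span features, channel-dependent models):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic 2-D "features": speech frames are louder and more variable.
speech_train = rng.normal(loc=[3.0, 1.0], scale=1.5, size=(2000, 2))
nonspeech_train = rng.normal(loc=[0.0, 0.0], scale=0.7, size=(2000, 2))

gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_train)
gmm_nonspeech = GaussianMixture(n_components=4, random_state=0).fit(nonspeech_train)

def sad(frames, win=11, threshold=0.0):
    """Per-frame decision: smoothed log-likelihood ratio
    log p(x|speech) - log p(x|nonspeech) compared against a threshold."""
    llr = gmm_speech.score_samples(frames) - gmm_nonspeech.score_samples(frames)
    smoothed = np.convolve(llr, np.ones(win) / win, mode="same")
    return smoothed > threshold

test = np.vstack([rng.normal([0.0, 0.0], 0.7, size=(100, 2)),    # nonspeech
                  rng.normal([3.0, 1.0], 1.5, size=(100, 2))])   # speech
decisions = sad(test)
print(decisions[:100].sum(), "of 100 nonspeech frames flagged,",
      decisions[100:].sum(), "of 100 speech frames detected")
```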

Book
28 Nov 2012
TL;DR: A comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences can be found in this book.
Abstract: Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where the systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state-of-the-art in techniques to deal with such problems becomes critical to system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state-of-the-art in techniques used to improve the robustness of speech recognition systems to these degrading external influences. Key features: Reviews all the main noise robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing data techniques and recognition of reverberant speech. Acts as a timely exposition of the topic in light of more widespread use in the future of ASR technology in challenging environments. Addresses robustness issues and signal degradation which are both key requirements for practitioners of ASR. Includes contributions from top ASR researchers from leading research units in the field

Patent
06 Nov 2012
TL;DR: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation; SNR outlier filtering may be used, alone or in conjunction with weighting of the average SNR.
Abstract: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation. SNR outlier filtering may be used, alone or in conjunction with weighting the average SNR. Adaptive weights may be applied on the SNRs per band before computing the average SNR. The weighting function can be a function of noise level, noise type, and/or instantaneous SNR value. Another weighting mechanism applies a null filtering or outlier filtering which sets the weight in a particular band to be zero. This particular band may be characterized as the one that exhibits an SNR that is several times higher than the SNRs in other bands.
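A plain reading of the claims in code: per-band SNRs receive noise-dependent weights, and any band whose SNR is several times higher than the others has its weight nulled before averaging. The weighting curve and the "several times" factor are assumptions:

```python
import numpy as np

def average_snr(band_snr_db, band_noise_db, outlier_factor=3.0):
    """Weighted average of per-band SNRs with outlier (null) filtering."""
    snr = 10 ** (np.asarray(band_snr_db, float) / 10)      # to linear
    noise = np.asarray(band_noise_db, float)
    # Assumed weighting curve: bands with more noise get less weight.
    weights = 1.0 / (1.0 + 10 ** ((noise - noise.min()) / 20))
    # Null filtering: zero the weight of a band whose SNR is several
    # times higher than the others (here: > outlier_factor x median).
    weights[snr > outlier_factor * np.median(snr)] = 0.0
    return 10 * np.log10(np.average(snr, weights=weights))

band_snr = [6, 5, 7, 30, 6]          # dB; the 30 dB band is a sudden outlier
band_noise = [40, 42, 41, 40, 45]    # dB noise estimates per band
print(average_snr(band_snr, band_noise))                        # ~6 dB
print(10 * np.log10(np.mean(10 ** (np.array(band_snr) / 10))))  # naive: ~23 dB
```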

Journal ArticleDOI
TL;DR: The method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
Abstract: The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
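The separation step is sparse coding of the mixture spectrum over the composite dictionary [D_speech, D_interferer], keeping only the speech part of the reconstruction. In the lasso-based sketch below, the sparsity weight alpha plays exactly the role of the paper's distortion/confusion trade-off; the dictionaries here are random stand-ins for learned ones:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
d, n_s, n_i = 64, 30, 30
D_speech = np.abs(rng.normal(size=(d, n_s)))    # stand-in for learned atoms
D_interf = np.abs(rng.normal(size=(d, n_i)))
D = np.hstack([D_speech, D_interf])             # composite dictionary

def enhance(mixture, alpha=0.01):
    """Sparse-code the mixture over [D_speech, D_interf], reconstruct only
    from speech atoms. alpha is the distortion/confusion knob: too large
    -> too few speech atoms explain the speech (source distortion); too
    small -> interferer atoms explain speech and vice versa (confusion)."""
    coder = Lasso(alpha=alpha, positive=True, fit_intercept=False,
                  max_iter=10000)
    coder.fit(D, mixture)
    return D_speech @ coder.coef_[:n_s]

# Synthetic mixture with known sparse speech and interferer codes.
h_s = np.zeros(n_s); h_s[rng.choice(n_s, 3, replace=False)] = 1.0
h_i = np.zeros(n_i); h_i[rng.choice(n_i, 3, replace=False)] = 1.0
clean = D_speech @ h_s
estimate = enhance(clean + D_interf @ h_i)
print("speech-estimate SNR (dB):",
      10 * np.log10(np.sum(clean ** 2) / np.sum((clean - estimate) ** 2)))
```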

Journal ArticleDOI
TL;DR: A new algorithm is proposed for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding, thus maintaining synchronization between information hiding and speech encoding.
Abstract: Low bit-rate speech codecs have been widely used in audio communications like VoIP and mobile communications, so steganography in low bit-rate audio streams would have broad applications in practice. In this paper, the authors propose a new algorithm for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding. The proposed algorithm performs data embedding while pitch period prediction is conducted during low bit-rate speech encoding, thus maintaining synchronization between information hiding and speech encoding. The steganography algorithm not only achieves high speech quality and resists detection by steganalysis, but is also highly compatible with a standard low bit-rate speech codec, introducing no further delay through data embedding and extraction. Testing shows that, with the proposed algorithm, the data embedding rate of the secret message can reach 4 bits/frame (133.3 bits/second).
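The synchronization idea can be illustrated with a parity-style constraint: restrict the codec's pitch-lag search to lags whose four low bits equal the payload, so the decoder recovers 4 bits per frame simply by reading the transmitted lag (at 30 ms frames, 133.3 bits/second as in the paper). The toy autocorrelation search below illustrates the principle only; it is not the paper's actual codec integration:

```python
import numpy as np

def best_lag(frame, candidates):
    """Normalized autocorrelation pitch search restricted to `candidates`."""
    def score(lag):
        a, b = frame[lag:], frame[:-lag]
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(candidates, key=score)

def embed_bits(frame, bits, lo=32, hi=128):
    """Pick the pitch lag only among lags whose 4 low bits equal `bits`
    (0..15): the payload rides on the transmitted lag, 4 bits/frame."""
    return best_lag(frame, [lag for lag in range(lo, hi) if lag & 0xF == bits])

def extract_bits(lag):
    return lag & 0xF            # the decoder just reads the low bits back

fs = 8000
t = np.arange(240) / fs                     # one 30 ms frame
frame = np.sin(2 * np.pi * 100 * t)         # true pitch period: 80 samples
payload = 0b1011
lag = embed_bits(frame, payload)
print(lag, extract_bits(lag) == payload)    # nearest admissible lag; payload recovered
```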

Patent
Elizabeth V. Woodward1, Shunguo Yan1
26 Sep 2012
TL;DR: A speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the audio track corresponding to the speaker.
Abstract: Mechanisms for performing dynamic automatic speech recognition on a portion of multimedia content are provided. Multimedia content is segmented into homogeneous segments of content with regard to speakers and background sounds. For the at least one segment, a speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source. A speech profile for the speaker is generated using information retrieved from the social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the at least one segment based on the acoustic profile. Automatic speech recognition operations are performed on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.

Patent
11 May 2012
TL;DR: In this paper, a voice recognition system includes a microphone for receiving speech from a user and processing electronics that automatically determine and set an expertise level in response to and based on the evaluation.
Abstract: A voice recognition system includes a microphone for receiving speech from a user and processing electronics. The processing electronics are in communication with the microphone and are configured to use a plurality of rules to evaluate user interactions with the voice recognition system. The processing electronics automatically determine and set an expertise level in response to and based on the evaluation. The processing electronics are configured to automatically adjust at least one setting of the voice recognition system in response to the set expertise level.

Patent
Liang-yu (Tom) Chi1
05 Jul 2012
TL;DR: In this paper, a wearable computing device can be used for processing speech input at the user's request, either as a command or a search request, depending on the context of the speech-related text.
Abstract: Methods and apparatus related to processing speech input at a wearable computing device are disclosed. Speech input can be received at the wearable computing device. Speech-related text corresponding to the speech input can be generated. A context can be determined based on database(s) and/or a history of accessed documents. An action can be determined based on an evaluation of at least a portion of the speech-related text and the context. The action can be a command or a search request. If the action is a command, then the wearable computing device can generate output for the command. If the action is a search request, then the wearable computing device can: communicate the search request to a search engine, receive search results from the search engine, and generate output based on the search results. The output can be provided using output component(s) of the wearable computing device.

Patent
12 Dec 2012
TL;DR: In this paper, features for managing the use of speech recognition models and data in automated speech recognition systems are disclosed, including pre-caching and pre-processing models and statistics.
Abstract: Features are disclosed for managing the use of speech recognition models and data in automated speech recognition systems. Models and data may be retrieved asynchronously and used as they are received or after an utterance is initially processed with more general or different models. Once received, the models and statistics can be cached. Statistics needed to update models and data may also be retrieved asynchronously so that it may be used to update the models and data as it becomes available. The updated models and data may be immediately used to re-process an utterance, or saved for use in processing subsequently received utterances. User interactions with the automated speech recognition system may be tracked in order to predict when a user is likely to utilize the system. Models and data may be pre-cached based on such predictions.

Proceedings Article
01 Dec 2012
TL;DR: To reduce false acceptance rate caused by spoofing attack, a general anti-spoofing attack framework is proposed for the speaker verification systems, where a converted speech detector is adopted as a post-processing module for the Speaker verification system's acceptance decision.
Abstract: Voice conversion techniques, which modify one speaker's (source) voice to sound like another speaker's (target), present a threat to automatic speaker verification. In this paper, we first present new results of evaluating the vulnerability of current state-of-the-art speaker verification systems: Gaussian mixture model with joint factor analysis (GMM-JFA) and probabilistic linear discriminant analysis (PLDA) systems, against spoofing attacks. The spoofing attacks are simulated by two voice conversion techniques: Gaussian mixture model based conversion and unit selection based conversion. To reduce the false acceptance rate caused by spoofing attacks, we propose a general anti-spoofing framework for speaker verification systems, where a converted speech detector is adopted as a post-processing module for the speaker verification system's acceptance decision. The detector decides whether the accepted claim is human speech or converted speech. A subset of the core task in the NIST SRE 2006 corpus is used to evaluate the vulnerability of the speaker verification systems and the performance of the converted speech detector. The results indicate that both conversion techniques can increase the false acceptance rate of the GMM-JFA and PLDA systems, while the converted speech detector can reduce the false acceptance rate from 31.54% and 41.25% to 1.64% and 1.71% for the GMM-JFA and PLDA systems, respectively, on unit-selection based converted speech.
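The framework itself is a simple cascade: an accept from the verifier stands only if the converted-speech detector also labels the trial as human. Minimal decision logic, with hypothetical scorer placeholders standing in for the GMM-JFA/PLDA systems and the detector:

```python
from typing import Callable

def verify(trial: str,
           sv_score: Callable[[str], float], sv_threshold: float,
           cs_score: Callable[[str], float], cs_threshold: float) -> bool:
    """Anti-spoofing cascade: a claim is accepted only if the speaker
    verification system accepts it AND the converted-speech detector
    judges the audio to be human speech."""
    if sv_score(trial) < sv_threshold:
        return False                          # SV system rejects the claim
    return cs_score(trial) >= cs_threshold    # post-processing stage

# Hypothetical scorers standing in for a GMM-JFA/PLDA system and the detector.
accepted = verify("utterance.wav",
                  sv_score=lambda t: 2.1, sv_threshold=0.0,
                  cs_score=lambda t: -0.4, cs_threshold=0.0)
print(accepted)   # False: SV accepted, but the detector flagged conversion
```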

Journal ArticleDOI
TL;DR: This paper presents and analyzes an algorithm that estimates the noise correlation matrix without using a VAD, based on measuring the correlation of the noisy input and a noise reference which can be obtained, e.g., by steering a null towards the target source.
Abstract: For multi-channel noise reduction algorithms like the minimum variance distortionless response (MVDR) beamformer, or the multi-channel Wiener filter, an estimate of the noise correlation matrix is needed. For its estimation, it is often proposed in the literature to use a voice activity detector (VAD). However, using a VAD the estimated matrix can only be updated in speech absence. As a result, during speech presence the noise correlation matrix estimate does not follow changing noise fields with an appropriate accuracy. This effect is further increased, as in nonstationary noise voice activity detection is a rather difficult task, and false-alarms are likely to occur. In this paper, we present and analyze an algorithm that estimates the noise correlation matrix without using a VAD. This algorithm is based on measuring the correlation of the noisy input and a noise reference which can be obtained, e.g., by steering a null towards the target source. When applied in combination with an MVDR beamformer, it is shown that the proposed noise correlation matrix estimate results in a more accurate beamformer response, a larger signal-to-noise ratio improvement and a larger instrumentally predicted speech intelligibility when compared to competing algorithms such as the generalized sidelobe canceler, a VAD-based MVDR beamformer, and an MVDR based on the noisy correlation matrix.
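The identity the estimator exploits is that for snapshots y = s·d + n and a noise reference u = Bᵀy with Bᵀd = 0 (a null steered toward the target), the cross-correlation E[y uᵀ] equals R_n B: the target drops out, so noise statistics can be tracked during speech presence with no VAD. The sketch below verifies this numerically and includes the standard MVDR weight formula; it is a simplified illustration, not the paper's full estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
M = 4                                     # microphones
d = np.ones(M) / np.sqrt(M)               # steering vector toward the target
# Blocking matrix B: orthonormal columns with B.T @ d == 0, taken from the
# orthogonal complement of d (steers a null toward the target).
B = np.linalg.qr(np.eye(M) - np.outer(d, d))[0][:, : M - 1]

L = np.tril(rng.normal(size=(M, M))) + 2 * np.eye(M)
R_n = L @ L.T                             # true noise covariance
T = 200_000
s = rng.normal(size=T)                    # target active the whole time
Y = np.outer(d, s) + L @ rng.normal(size=(M, T))   # noisy snapshots

# Key identity: with u = B.T @ y a target-free noise reference,
# E[y u^T] = R_n @ B, even though the target is present in y.
U = B.T @ Y
C_hat = (Y @ U.T) / T
print("max estimation error:", np.abs(C_hat - R_n @ B).max())  # small

def mvdr_weights(R, d):
    """w = R^{-1} d / (d^H R^{-1} d): unit gain toward d, minimum noise power."""
    Rd = np.linalg.solve(R, d)
    return Rd / (d @ Rd)

w = mvdr_weights(R_n, d)
print("distortionless:", np.isclose(w @ d, 1.0))               # True
```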

Journal ArticleDOI
TL;DR: A binaural scene analyzer that is able to simultaneously localize, detect and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation is presented.
Abstract: In this study, we present a binaural scene analyzer that is able to simultaneously localize, detect and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation. In contrast to many other binaural cocktail party processors, the proposed system does not require a priori knowledge about the azimuth position of the target speakers. The proposed system consists of three main building blocks: binaural localization, speech source detection, and automatic speaker identification. First, a binaural front-end is used to robustly localize relevant sound source activity. Second, a speech detection module based on missing data classification is employed to determine whether detected sound source activity corresponds to a speaker or to an interfering noise source using a binary mask that is based on spatial evidence supplied by the binaural front-end. Third, a second missing data classifier is used to recognize the speaker identities of all detected speech sources. The proposed system is systematically evaluated in simulated adverse acoustic scenarios. Compared to state-of-the-art MFCC recognizers, the proposed model achieves significant speaker recognition accuracy improvements.

Journal ArticleDOI
TL;DR: The proposed VOP detection method shows significant performance improvement over the existing method for clean as well as coded speech, and its effectiveness is analyzed in CV recognition by using the VOP as an anchor point.
Abstract: In this paper, we propose a method for detecting the vowel onset points (VOPs) for low bit rate coded speech. The VOP is the instant at which the onset of the vowel takes place in the speech signal. VOPs play an important role in applications such as consonant-vowel (CV) unit recognition and speech rate modification. The proposed VOP detection method is based on the spectral energy present in the glottal closure region of the speech signal. Speech coders considered to carry out this study are Global System for Mobile Communications (GSM) full rate, code-excited linear prediction (CELP), and mixed-excitation linear prediction (MELP). The TIMIT database and CV units collected from the broadcast news corpus are used for evaluation. Performance of the proposed method is compared with an existing method, which uses the combination of evidence from the excitation source, spectral peaks energy, and modulation spectrum. The proposed VOP detection method has shown significant improvement in performance compared to the existing method under clean as well as coded cases. The effectiveness of the proposed VOP detection method is analyzed in CV recognition by using the VOP as an anchor point.

Journal ArticleDOI
TL;DR: The inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more relevant than any difference between the proposed confidence measures.
Abstract: The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of the stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper, we propose a new confidence measure, compare it to existing ones, and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more relevant than any difference between the proposed confidence measures.

Patent
13 Sep 2012
TL;DR: A signal processing apparatus includes a speech recognition system and a voice activity detection unit coupled to it; the unit detects whether an audio signal is a voice signal and accordingly generates a voice activity detection result that controls whether the speech recognition system should perform speech recognition upon the audio signal.
Abstract: A signal processing apparatus includes a speech recognition system and a voice activity detection unit. The voice activity detection unit is coupled to the speech recognition system, and arranged for detecting whether an audio signal is a voice signal and accordingly generating a voice activity detection result to the speech recognition system to control whether the speech recognition system should perform speech recognition upon the audio signal.
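The apparatus is essentially a gate in front of the recognizer. Below, a minimal frame-energy VAD stands in for the patent's detection unit (real detectors add spectral features, adaptive thresholds, and hangover smoothing), and recognition runs only when enough voiced frames are found:

```python
import numpy as np

def frame_energy_vad(audio, fs, frame_ms=20, threshold_db=-40.0):
    """One boolean per frame: True where frame energy (dB, full scale ~1.0)
    exceeds an absolute threshold. A stand-in for the patent's VAD unit."""
    n = int(fs * frame_ms / 1000)
    frames = audio[: len(audio) // n * n].reshape(-1, n)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > threshold_db

def recognize_if_voice(audio, fs, asr, min_voiced=0.1):
    """Gate the recognizer with the VAD result: speech recognition runs
    only when enough frames contain voice activity."""
    if frame_energy_vad(audio, fs).mean() > min_voiced:
        return asr(audio)
    return None                    # skip recognition entirely

fs = 16000
silence = 0.001 * np.random.randn(fs)                   # 1 s of faint hiss
tone = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)     # voice stand-in
print(recognize_if_voice(silence, fs, asr=lambda a: "text"))   # None
print(recognize_if_voice(np.concatenate([silence, tone]), fs,
                         asr=lambda a: "text"))                # "text"
```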

Journal ArticleDOI
TL;DR: The results of this study revealed that incorporating voice onset and offset information leads to efficient automatic detection of voice disorders.

Patent
06 Jun 2012
TL;DR: A method for improving speech recognition by a speech recognition system includes obtaining a voice sample from a speaker, storing it as a voice model, identifying an area from which sound matching the voice model is coming, and providing audio signals corresponding to sound received from the identified area to the system for processing.
Abstract: A method for improving speech recognition by a speech recognition system includes obtaining a voice sample from a speaker; storing the voice sample of the speaker as a voice model in a voice model database; identifying an area from which sound matching the voice model for the speaker is coming; providing one or more audio signals corresponding to sound received from the identified area to the speech recognition system for processing.