Showing papers on "Voice activity detection published in 2013"

PDF

Open Access

Proceedings Article•DOI•

Recent developments in openSMILE, the munich open-source multimedia feature extractor

[...]

Florian Eyben¹, Felix Weninger¹, Florian Gross¹, Björn Schuller¹•Institutions (1)

21 Oct 2013

TL;DR: OpenSMILE 2.0 as mentioned in this paper unifies feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing, allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries).

...read moreread less

Abstract: We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.

...read moreread less

1,186 citations

Journal Article•DOI•

Deep Belief Networks Based Voice Activity Detection

[...]

Xiao-Lei Zhang¹, Ji Wu¹•Institutions (1)

Tsinghua University¹

01 Apr 2013-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Extensive experimental results on the AURORA2 corpus show that the DBN-based VAD not only outperforms eleven referenced VADs, but also can meet the real-time detection demand of VAD.

...read moreread less

Abstract: Fusing the advantages of multiple acoustic features is important for the robustness of voice activity detection (VAD). Recently, the machine-learning-based VADs have shown a superiority to traditional VADs on multiple feature fusion tasks. However, existing machine-learning-based VADs only utilize shallow models, which cannot explore the underlying manifold of the features. In this paper, we propose to fuse multiple features via a deep model, called deep belief network (DBN). DBN is a powerful hierarchical generative model for feature extraction. It can describe highly variant functions and discover the manifold of the features. We take the multiple serially-concatenated features as the input layer of DBN, and then extract a new feature by transferring these features through multiple nonlinear hidden layers. Finally, we predict the class of the new feature by a linear classifier. We further analyze that even a single-hidden-layer-based belief network is as powerful as the state-of-the-art models in the machine-learning-based VADs. In our empirical comparison, ten common features are used for performance analysis. Extensive experimental results on the AURORA2 corpus show that the DBN-based VAD not only outperforms eleven referenced VADs, but also can meet the real-time detection demand of VAD. The results also show that the DBN-based VAD can fuse the advantages of multiple features effectively.

...read moreread less

326 citations

Patent•

System and method for improving speech recognition accuracy in a work environment

[...]

David R. DiGregorio

14 Mar 2013

TL;DR: In this article, a microprocessor or other application specific integrated circuit provides a mechanism for comparing the relative transit times between a user's voice, a primary speech microphone, and a secondary compliance microphone to determine if the speech microphone is placed in an appropriate proximity to the user's mouth.

...read moreread less

Abstract: Apparatus and method that improves speech recognition accuracy, by monitoring the position of a user's headset-mounted speech microphone, and prompting the user to reconfigure the speech microphone's orientation if required. A microprocessor or other application specific integrated circuit provides a mechanism for comparing the relative transit times between a user's voice, a primary speech microphone, and a secondary compliance microphone. The difference in transit times may be used to determine if the speech microphone is placed in an appropriate proximity to the user's mouth. If required, the user is automatically prompted to reposition the speech microphone.

...read moreread less

308 citations

Proceedings Article•DOI•

Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies

[...]

Florian Eyben, Felix Weninger, Stefano Squartini, Björn Schuller

26 May 2013

TL;DR: A novel, data-driven approach to voice activity detection based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features clearly outperforming three state-of-the-art reference algorithms under the same conditions.

...read moreread less

Abstract: A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye corpora, and adding real long term recordings of diverse noise types. The approach is evaluated on unseen synthetically mixed test data as well as a real-life test set consisting of four full-length Hollywood movies. A frame-wise Equal Error Rate (EER) of 33.2% is obtained for the four movies and an EER of 9.6% is obtained for the synthetic test data at a peak SNR of 0 dB, clearly outperforming three state-of-the-art reference algorithms under the same conditions.

...read moreread less

236 citations

Patent•

Embedded system for construction of small footprint speech recognition with user-definable constraints

[...]

Michael J. Newman, Robert Roth, William D. Alexander, Paul van Mulbregt

23 Apr 2013

TL;DR: In this paper, the authors present techniques and methods that enable a voice trigger that wakes up an electronic device or causes the device to make additional voice commands active, without manual initiation of voice command functionality.

...read moreread less

Abstract: Techniques disclosed herein include systems and methods that enable a voice trigger that wakes-up an electronic device or causes the device to make additional voice commands active, without manual initiation of voice command functionality. In addition, such a voice trigger is dynamically programmable or customizable. A speaker can program or designate a particular phrase as the voice trigger. In general, techniques herein execute a voice-activated wake-up system that operates on a digital signal processor (DSP) or other low-power, secondary processing unit of an electronic device instead of running on a central processing unit (CPU). A speech recognition manager runs two speech recognition systems on an electronic device. The CPU dynamically creates a compact speech system for the DSP. Such a compact system can be continuously run during a standby mode, without quickly exhausting a battery supply.

...read moreread less

210 citations

Proceedings Article•DOI•

Recurrent neural networks for voice activity detection

[...]

Thad Hughes¹, Keir Banks Mierle¹•Institutions (1)

Google¹

26 May 2013

TL;DR: This work presents a novel recurrent neural network model for voice activity detection, in which nodes compute quadratic polynomials and outperforms a much larger baseline system composed of Gaussian mixture models and a hand-tuned state machine for temporal smoothing.

...read moreread less

Abstract: We present a novel recurrent neural network (RNN) model for voice activity detection. Our multi-layer RNN model, in which nodes compute quadratic polynomials, outperforms a much larger baseline system composed of Gaussian mixture models (GMMs) and a hand-tuned state machine (SM) for temporal smoothing. All parameters of our RNN model are optimized together, so that it properly weights its preference for temporal continuity against the acoustic features in each frame. Our RNN uses one tenth the parameters and outperforms the GMM+SM baseline system by 26% reduction in false alarms, reducing overall speech recognition computation time by 17% while reducing word error rate by 1% relative.

...read moreread less

193 citations

Patent•DOI•

Method and apparatus for the provision of information signals based upon speech recognition

[...]

Ira A. Gerson¹•Institutions (1)

BlackBerry Limited¹

10 May 2013-Journal of the Acoustical Society of America

TL;DR: In this paper, a wireless system comprises at least one subscriber unit in wireless communication with an infrastructure, and each subscriber unit implements a speech recognition client, and the infrastructure comprises a Speech Recognition Server.

...read moreread less

Abstract: A wireless system comprises at least one subscriber unit in wireless communication with an infrastructure. Each subscriber unit implements a speech recognition client, and the infrastructure comprises a speech recognition server. A given subscriber unit takes as input an unencoded speech signal that is subsequently parameterized by the speech recognition client. The parameterized speech is then provided to the speech recognition server that, in turn, performs speech recognition analysis on the parameterized speech. Information signals, based in part upon any recognized utterances identified by the speech recognition analysis, are subsequently provided to the subscriber unit. The information signals may be used to control the subscriber unit itself; to control one or more devices coupled to the subscriber unit, or may be operated upon by the subscriber unit or devices coupled thereto.

...read moreread less

191 citations

Journal Article•DOI•

Unsupervised Speech Activity Detection Using Voicing Measures and Perceptual Spectral Flux

[...]

Seyed Omid Sadjadi¹, John H. L. Hansen¹•Institutions (1)

University of Texas at Dallas¹

04 Jan 2013-IEEE Signal Processing Letters

TL;DR: Experimental results indicate that the proposed SAD scheme is highly effective and provides superior and consistent performance across various noise types and distortion levels.

...read moreread less

Abstract: Effective speech activity detection (SAD) is a necessary first step for robust speech applications. In this letter, we propose a robust and unsupervised SAD solution that leverages four different speech voicing measures combined with a perceptual spectral flux feature, for audio-based surveillance and monitoring applications. Effectiveness of the proposed technique is evaluated and compared against several commonly adopted unsupervised SAD methods under simulated and actual harsh acoustic conditions with varying distortion levels. Experimental results indicate that the proposed SAD scheme is highly effective and provides superior and consistent performance across various noise types and distortion levels.

...read moreread less

186 citations

Proceedings Article•DOI•

A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion

[...]

Li Deng¹, Ossama Abdel-Hamid², Dong Yu¹•Institutions (2)

Microsoft¹, York University²

26 May 2013

TL;DR: A novel deep convolutional neural network architecture is developed, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance.

...read moreread less

Abstract: We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, lowest on this standard task reported in the literature with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.

...read moreread less

185 citations

Proceedings Article•DOI•

Audio-visual deep learning for noise robust speech recognition

[...]

Jing Huang¹, Brian Kingsbury¹•Institutions (1)

IBM¹

26 May 2013

TL;DR: This work uses DBNs for audio-visual speech recognition; in particular, it uses deep learning from audio and visual features for noise robust speech recognition and test two methods for using DBN’s in a multimodal setting.

...read moreread less

Abstract: Deep belief networks (DBN) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we use DBNs for audio-visual speech recognition; in particular, we use deep learning from audio and visual features for noise robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative over a baseline multi-stream audio-visual GMM/HMM system.

...read moreread less

182 citations

Book•

DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art

[...]

Richard C. Hendriks¹, Timo Gerkmann², Jesper Jensen³•Institutions (3)

Delft University of Technology¹, University of Oldenburg², Aalborg University³

19 Feb 2013

TL;DR: This survey wishes to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement.

...read moreread less

Abstract: As speech processing devices like mobile phones, voice controlled devices, and hearing aids have increased in popularity, people expect them to work anywhere and at any time without user intervention However, the presence of acoustical disturbances limits the use of these applications, degrades their performance, or causes the user difficulties in understanding the conversation or appreciating the device A common way to reduce the effects of such disturbances is through the use of single-microphone noise reduction algorithms for speech enhancement The field of single-microphone noise reduction for speech enhancement comprises a history of more than 30 years of research In this survey, we wish to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancementFurthermore, our goal is to provide a concise description of a state-of-the-art speech enhancement system, and demonstrate the relative importance of the various building blocks of such a system This allows the non-expert DSP practitioner to judge the relevance of each building block and to implement a close-to-optimal enhancement system for the particular application at hand Table of Contents: Introduction / Single Channel Speech Enhancement: General Principles / DFT-Based Speech Enhancement Methods: Signal Model and Notation / Speech DFT Estimators / Speech Presence Probability Estimation / Noise PSD Estimation / Speech PSD Estimation / Performance Evaluation Methods / Simulation Experiments with Single-Channel Enhancement Systems / Future Directions

...read moreread less

Proceedings Article•DOI•

Synthetic speech detection using temporal modulation feature

[...]

Zhizheng Wu¹, Xiong Xiao¹, Eng Siong Chng¹, Haizhou Li¹•Institutions (1)

Nanyang Technological University¹

26 May 2013

TL;DR: From the synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features, and the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate.

...read moreread less

Abstract: Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attack and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the facts that current analysis-synthesis techniques operate on frame level and make the frame-by-frame independence assumption, we proposed to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation features derived from magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of speech signal. From our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features.

...read moreread less

Proceedings Article•DOI•

Speech activity detection on youtube using deep neural networks.

[...]

Neville Ryant¹, Mark Liberman¹, Jiahong Yuan¹•Institutions (1)

University of Pennsylvania¹

25 Aug 2013

TL;DR: It is demonstrated that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates on YouTube videos compared to a conventional GMM based system.

...read moreread less

Abstract: Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) which are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper we demonstrate that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates (19.6%) on YouTube videos compared to a conventional GMM based system (40%).

...read moreread less

Journal Article•DOI•

Evaluating the intelligibility benefit of speech modifications in known noise conditions

[...]

Martin Cooke¹, Catherine Mayo², Cassia Valentini-Botinhao², Yannis Stylianou, Bastian Sauert³, Yan Tang¹ - Show less +2 more•Institutions (3)

University of the Basque Country¹, University of Edinburgh², RWTH Aachen University³

01 May 2013-Speech Communication

TL;DR: The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech.

...read moreread less

Proceedings Article•DOI•

A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data

[...]

Tomi Kinnunen¹, Padmanabhan Rajan¹•Institutions (1)

University of Eastern Finland¹

21 Oct 2013

TL;DR: This work studies an alternative, likelihood ratio based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs) and provides open-source implementation of the method.

...read moreread less

Abstract: A voice activity detector (VAD) plays a vital role in robust speaker verification, where energy VAD is most commonly used. Energy VAD works well in noise-free conditions but deteriorates in noisy conditions. One way to tackle this is to introduce speech enhancement preprocessing. We study an alternative, likelihood ratio based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs). The training labels are obtained from enhanced energy VAD. As the speech and nonspeech models are re-trained for each utterance, minimum assumptions of the background noise are made. According to both VAD error analysis and speaker verification results utilizing state-of-the-art i-vector system, the proposed method outperforms energy VAD variants by a wide margin. We provide open-source implementation of the method.

...read moreread less

Patent•

Background audio identification for speech disambiguation

[...]

Jason Sanders¹, Gabriel Taubman¹, John J. Lee¹•Institutions (1)

Google¹

14 Mar 2013

TL;DR: In this paper, a computer-implemented method for providing context-dependent search results is described, which includes receiving an audio stream at a computing device during a time interval, the audio stream comprising user speech data and background audio.

...read moreread less

Abstract: Implementations relate to techniques for providing context-dependent search results. A computer-implemented method includes receiving an audio stream at a computing device during a time interval, the audio stream comprising user speech data and background audio, separating the audio stream into a first substream that includes the user speech data and a second substream that includes the background audio, identifying concepts related to the background audio, generating a set of terms related to the identified concepts, influencing a speech recognizer based on at least one of the terms related to the background audio, and obtaining a recognized version of the user speech data using the speech recognizer.

...read moreread less

Patent•

Mobile terminal and method for recognizing voice thereof

[...]

Juhee Kim¹, Hyun-Seob Lee¹, Joonyup Lee¹, Jungkyu Choi¹•Institutions (1)

LG Electronics¹

13 Jun 2013

TL;DR: In this paper, a mobile terminal and a voice recognition method thereof are described, which includes receiving a user's voice, providing the received voice to a first voice recognition engine provided in the server and a second voice recognition system provided by the mobile terminal.

...read moreread less

Abstract: The present disclosure relates to a mobile terminal and a voice recognition method thereof. The voice recognition method may include receiving a user's voice; providing the received voice to a first voice recognition engine provided in the server and a second voice recognition engine provided in the mobile terminal; acquiring first voice recognition data as a result of recognizing the received voice by the first voice recognition engine; acquiring second voice recognition data as a result of recognizing the received voice by the second voice recognition engine; estimating a function corresponding to the user's intention based on at least one of the first and the second voice recognition data; calculating a similarity between the first and the second voice recognition data when personal information is required for the estimated function; and selecting either one of the first and the second voice recognition data based on the calculated similarity.

...read moreread less

Proceedings Article•DOI•

A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns

[...]

Federico Alegre¹, Asmaa Amehraye¹, Nicholas Evans¹•Institutions (1)

Institut Eurécom¹

01 Sep 2013

TL;DR: A novel countermeasure based on the analysis of speech signals using local binary patterns followed by a one-class classification approach is presented, which captures differences in the spectro-temporal texture of genuine and spoofed speech, but relies only on a model of the former.

...read moreread less

Abstract: The vulnerability of automatic speaker verification systems to spoofing is now well accepted. While recent work has shown the potential to develop countermeasures capable of detecting spoofed speech signals, existing solutions typically function well only for specific attacks on which they are optimised. Since the exact nature of spoofing attacks can never be known in practice, there is thus a need for generalised countermeasures which can detect previously unseen spoofing attacks. This paper presents a novel countermeasure based on the analysis of speech signals using local binary patterns followed by a one-class classification approach. The new countermeasure captures differences in the spectro-temporal texture of genuine and spoofed speech, but relies only on a model of the former. We report experiments with three different approaches to spoofing and with a state-of-the-art i-vector speaker verification system which uses probabilistic linear discriminant analysis for intersession compensation. While a support vector machine classifier is tuned with examples of converted voice, it delivers reliable detection of spoofing attacks using synthesized speech and artificial signals, attacks for which it is not optimised.

...read moreread less

Journal Article•DOI•

Acoustic Environment Identification and Its Applications to Audio Forensics

[...]

Hafiz Malik¹•Institutions (1)

University of Michigan¹

01 Nov 2013-IEEE Transactions on Information Forensics and Security

TL;DR: A statistical technique to model and estimate the amount of reverberation and background noise variance in an audio recording is described and an energy-based voice activity detection method is proposed for automatic decaying-tail-selection from anaudio recording.

...read moreread less

Abstract: An audio recording is subject to a number of possible distortions and artifacts. Consider, for example, artifacts due to acoustic reverberation and background noise. The acoustic reverberation depends on the shape and the composition of a room, and it causes temporal and spectral smearing of the recorded sound. The background noise, on the other hand, depends on the secondary audio source activities present in the evidentiary recording. Extraction of acoustic cues from an audio recording is an important but challenging task. Temporal changes in the estimated reverberation and background noise can be used for dynamic acoustic environment identification (AEI), audio forensics, and ballistic settings. We describe a statistical technique to model and estimate the amount of reverberation and background noise variance in an audio recording. An energy-based voice activity detection method is proposed for automatic decaying-tail-selection from an audio recording. Effectiveness of the proposed method is tested using a data set consisting of speech recordings. The performance of the proposed method is also evaluated for both speaker-dependent and speaker-independent scenarios.

...read moreread less

Journal Article•DOI•

A Voice-Input Voice-Output Communication Aid for People With Severe Speech Impairment

[...]

Mark S. Hawley¹, Stuart Cunningham¹, Phil D. Green¹, Pam Enderby¹, Rebecca Palmer¹, Siddharth Sehgal¹, P. O'Neill - Show less +3 more•Institutions (1)

University of Sheffield¹

01 Jan 2013

TL;DR: The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input, with mean recognition accuracy of 67% in these circumstances.

...read moreread less

Abstract: A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment-the voice-input voice-output communication aid (VIVOCA)-is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data, was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.

...read moreread less

Patent•

Speech recognition platforms

[...]

Gregory M. Hart¹, Frederic Johan Georges Deramat¹, Vikram Kumar Gundeti¹, John Daniel Thimsen¹, Allan Timothy Lindsay¹, Peter Paul Henri Carbon¹, Scott Ian Blanksteen¹ - Show less +3 more•Institutions (1)

Wilmington University¹

15 Mar 2013

TL;DR: In this paper, a speech recognition platform is configured to receive an audio signal that includes speech from a user and perform automatic speech recognition (ASR) on the audio signal to identify ASR results.

...read moreread less

Abstract: A speech recognition platform configured to receive an audio signal that includes speech from a user and perform automatic speech recognition (ASR) on the audio signal to identify ASR results. The platform may identify: (i) a domain of a voice command within the speech based on the ASR results and based on context information associated with the speech or the user, and (ii) an intent of the voice command. In response to identifying the intent, the platform may perform a corresponding action, such as streaming audio to the device, setting a reminder for the user, purchasing an item on behalf of the user, making a reservation for the user or launching an application for the user. The speech recognition platform, in combination with the device, may therefore facilitate efficient interactions between the user and a voice-controlled device.

...read moreread less

Proceedings Article•DOI•

The IBM speech activity detection system for the DARPA RATS program.

[...]

George Saon¹, Samuel Thomas², Hagen Soltau¹, Sriram Ganapathy², Brian Kingsbury³ - Show less +1 more•Institutions (3)

IBM¹, Johns Hopkins University², Nuance Communications³

25 Aug 2013

Abstract: In this paper we describe improvements to the IBM speech activity detection (SAD) system for the third phase of the DARPA RATS program. The progress during this final phase comes from jointly training convolutional and regular deep neural networks with rich time-frequency representations of speech. With these additions, the phase 3 system reduces the equal error rate (EER) significantly on both of the program's development sets (relative improvements of 20% on dev1 and 7% on dev2) compared to an earlier phase 2 system. For the final program evaluation, the newly developed system also performs well past the program target of 3% Pmiss at 1% Pfa with a performance of 1.2% Pmiss at 1% Pfa and 0.3% Pfa at 3% Pmiss.

...read moreread less

Patent•

Outcome-oriented dialogs on a speech recognition platform

[...]

Jeff Bradley Beal¹, Sumedha Arvind Kshirsagar¹, Nishant Kumar¹, Ajay Gopalakrishnan¹, Kevin Robert Charter¹ - Show less +1 more•Institutions (1)

Amazon.com¹

17 Dec 2013

...read moreread less

Abstract: A speech recognition platform configured to receive an audio signal that includes speech from a user and perform automatic speech recognition (ASR) on the audio signal to identify ASR results. The platform may identify: (i) a domain of a voice command within the speech based on the ASR results and based on context information associated with the speech or the user, and (ii) an intent of the voice command. In response to identifying the intent, the platform may perform multiple actions corresponding to this intent. The platform may select a target action to perform, and may engage in a back-and-forth dialog to obtain information for completing the target action. The action may include streaming audio to the device, setting a reminder for the user, purchasing an item on behalf of the user, making a reservation for the user or launching an application for the user.

...read moreread less

Patent•

Natural language understanding automatic speech recognition post processing

[...]

Darrin Kenneth John Fry

20 Jun 2013

TL;DR: In this article, the speech recognition results are post processed for use in a specified context, where a portion of the results are compared to keywords that are sensitive to the specified context.

...read moreread less

Abstract: In an automatic speech recognition post processing system, speech recognition results are received from an automatic speech recognition service. The speech recognition results may include transcribed speech, an intent classification and/or extracted fields of intent parameters. The speech recognition results are post processed for use in a specified context. All or a portion of the speech recognition results are compared to keywords that are sensitive to the specified context. The post processed speech recognition results are provided to an appropriate application which is operable to utilize the context sensitive product of post processing.

...read moreread less

Journal Article•DOI•

Efficient voice activity detection algorithm using long-term spectral flatness measure

[...]

Yanna Ma¹, Akinori Nishihara¹•Institutions (1)

Tokyo Institute of Technology¹

01 Dec 2013-Eurasip Journal on Audio, Speech, and Music Processing

TL;DR: A novel and robust voice activity detection (VAD) algorithm utilizing long-term spectral flatness measure (LSFM) which is capable of working at 10 dB and lower signal-to-noise ratios(SNRs).

...read moreread less

Abstract: This paper proposes a novel and robust voice activity detection (VAD) algorithm utilizing long-term spectral flatness measure (LSFM) which is capable of working at 10 dB and lower signal-to-noise ratios(SNRs). This new LSFM-based VAD improves speech detection robustness in various noisy environments by employing a low-variance spectrum estimate and an adaptive threshold. The discriminative power of the new LSFM feature is shown by conducting an analysis of the speech/non-speech LSFM distributions. The proposed algorithm was evaluated under 12 types of noises (11 from NOISEX-92 and speech-shaped noise) and five types of SNR in core TIMIT test corpus. Comparisons with three modern standardized algorithms (ETSI adaptive multi-rate (AMR) options AMR1 and AMR2 and ITU-T G.729) demonstrate that our proposed LSFM-based VAD scheme achieved the best average accuracy rate. A long-term signal variability (LTSV)-based VAD scheme is also compared with our proposed method. The results show that our proposed algorithm outperforms the LTSV-based VAD scheme for most of the noises considered including difficult noises like machine gun noise and speech babble noise.

...read moreread less

Patent•

Speech recognition apparatus, speech recognition method, and television set

[...]

Tomohiro Koganei¹•Institutions (1)

Panasonic¹

26 Sep 2013

TL;DR: A speech recognition apparatus includes a speech acquisition unit which acquires speech uttered by a user, a recognition result acquisition unit that acquires a result of recognition performed on the acquired speech, an extraction unit which, when the recognition result includes a keyword and a selection command that is used for selecting one of selectable information items, extracts a selection candidate that includes the keyword, and a display control unit which changes a display manner of the display information according to the second selection mode switched from the first selection mode.

...read moreread less

Abstract: A speech recognition apparatus includes: a speech acquisition unit which acquires speech uttered by a user; a recognition result acquisition unit which acquires a result of recognition performed on the acquired speech; an extraction unit which, when the recognition result includes a keyword and a selection command that is used for selecting one of selectable information items, extracts a selection candidate that includes the keyword; a selection mode switching unit which, when more than one selection candidate is extracted, switches a selection mode from a first selection mode that allows selection among the selectable information items to a second selection that allows selection among the selection candidates; a display control unit which changes a display manner of the display information, according to the second selection mode switched from the first selection mode; and a selection unit which selects one of the selection candidates, according to an entry from the user.

...read moreread less

Journal Article•DOI•

Small-vocabulary speech recognition using a silent speech interface based on magnetic sensing

[...]

Robin Hofe¹, Stephen R. Ell², Michael J. Fagan³, James M. Gilbert³, Phil D. Green¹, Roger K. Moore¹, S. I. Rybchenko³ - Show less +3 more•Institutions (3)

University of Sheffield¹, Castle Hill Hospital², University of Hull³

01 Jan 2013-Speech Communication

TL;DR: Both small vocabulary isolated word recognition and connected digit recognition experiments are presented, demonstrating the ability of the system to capture phonetic detail at a level that is surprising for a device without any direct access to voicing information.

...read moreread less

Proceedings Article•DOI•

Denoising deep neural networks based voice activity detection

[...]

Xiao-Lei Zhang¹, Ji Wu¹•Institutions (1)

Tsinghua University¹

26 May 2013

TL;DR: Experimental results show that the proposed denoising-deep-neural-network (DDNN) based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over shallower layers.

...read moreread less

Abstract: Recently, the deep-belief-networks (DBN) based voice activity detection (VAD) has been proposed. It is powerful in fusing the advantages of multiple features, and achieves the state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority to the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address the aforementioned problem. Specifically, we pre-train a deep neural network in a special unsupervised denoising greedy layer-wise mode, and then fine-tune the whole network in a supervised way by the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and its corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over shallower layers.

...read moreread less

Patent•

Distributed Speech Recognition System

[...]

Ojas A. Bapat¹•Institutions (1)

Spansion¹

08 Jan 2013

TL;DR: In this article, an apparatus, method, and system for speech recognition of a voice command is described. But the method can include receiving data representing a voice commands, generating a list of targets based on the state information of each target within the system, and selecting a target from the list of target candidates based on a given voice command.

...read moreread less

Abstract: Embodiments of the present invention include an apparatus, method, and system for speech recognition of a voice command. The method can include receiving data representing a voice command, generating a list of targets based on the state information of each target within the system, and selecting a target from the list of targets, based on the voice command.

...read moreread less

Journal Article•DOI•

An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition

[...]

Chin-Hui Lee¹, Sabato Marco Siniscalchi²•Institutions (2)

Georgia Institute of Technology¹, Kore University of Enna²

07 Feb 2013

TL;DR: The recently proposed automatic speech attribute transcription (ASAT) framework is an attempt to mimic some HSR capabilities with asynchronous speech event detection followed by bottom-up knowledge integration and verification.

...read moreread less

Abstract: The field of automatic speech recognition (ASR) has enjoyed more than 30 years of technology advances due to the extensive utilization of the hidden Markov model (HMM) framework and a concentrated effort by the speech community to make available a vast amount of speech and language resources, known today as the Big Data Paradigm. State-of-the-art ASR systems achieve a high recognition accuracy for well-formed utterances of a variety of languages by decoding speech into the most likely sequence of words among all possible sentences represented by a finite-state network (FSN) approximation of all the knowledge sources required by the ASR task. However, the ASR problem is still far from being solved because not all information available in the speech knowledge hierarchy can be directly integrated into the FSN to improve the ASR performance and enhance system robustness. It is believed that some of the current issues of integrating various knowledge sources in top-down integrated search can be partially addressed by processing techniques that take advantage of the full set of acoustic and language information in speech. It has long been postulated that human speech recognition (HSR) determines the linguistic identity of a sound based on detected evidence that exists at various levels of the speech knowledge hierarchy, ranging from acoustic phonetics to syntax and semantics. This calls for a bottom-up attribute detection and knowledge integration framework that links speech processing with information extraction, by spotting speech cues with a bank of attribute detectors, weighting and combining acoustic evidence to form cognitive hypotheses, and verifying these theories until a consistent recognition decision can be reached. The recently proposed automatic speech attribute transcription (ASAT) framework is an attempt to mimic some HSR capabilities with asynchronous speech event detection followed by bottom-up knowledge integration and verification. In the last few years, ASAT has demonstrated good potential and has been applied to a variety of existing applications in speech processing and information extraction.

...read moreread less

Collapse