
Showing papers on "Voice activity detection published in 2015"


Proceedings ArticleDOI
06 Sep 2015
TL;DR: This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
Abstract: Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
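
For illustration, the speed change described above amounts to resampling the waveform while keeping the nominal sample rate, which shifts both tempo and pitch. A minimal Python sketch of that idea (not the authors' Kaldi recipe; the helper name and resampling ratios are illustrative):

import numpy as np
from scipy.signal import resample_poly

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Return a copy of `wave` played back `factor` times faster (same sample rate)."""
    # Resampling by 1/factor compresses or stretches the waveform in time,
    # shifting both tempo and pitch, as a speed change does.
    up, down = 100, int(round(100 * factor))
    return resample_poly(wave, up, down)

# The three training copies used in the paper (speed factors 0.9, 1.0, 1.1):
# augmented = [speed_perturb(x, f) for f in (0.9, 1.0, 1.1)]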

1,093 citations


Posted Content
TL;DR: This report introduces a new corpus of music, speech, and noise suitable for training models for voice activity detection (VAD) and music/speech discrimination and demonstrates use of this corpus on Broadcast news and VAD for speaker identification.
Abstract: This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.

855 citations


Proceedings ArticleDOI
01 Dec 2015
TL;DR: The design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially-motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array, are presented.
Abstract: The CHiME challenge series aims to advance far field speech recognition technology by promoting research at the interface of signal processing and automatic speech recognition. This paper presents the design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially-motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array. The paper describes the data collection, the task definition and the baseline systems for data simulation, enhancement and recognition. The paper then presents an overview of the 26 systems that were submitted to the challenge focusing on the strategies that proved to be most successful relative to the MVDR array processing and DNN acoustic modeling reference system. Challenge findings related to the role of simulated data in system training and evaluation are discussed.
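
For reference, the MVDR beamformer used in the challenge baseline minimizes output noise power subject to a distortionless response in the look direction. The following is a textbook sketch of the weight computation (not the challenge's actual baseline code; variable names are illustrative):

import numpy as np

def mvdr_weights(Phi_nn: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Textbook MVDR weights for one frequency bin: Phi_nn is the [C, C] noise
    covariance matrix, d the [C] steering vector toward the target speaker."""
    num = np.linalg.solve(Phi_nn, d)      # Phi_nn^{-1} d
    return num / (d.conj() @ num)         # w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d)

# The enhanced STFT bin is then w^H y for each multi-channel observation y:
# x_hat = mvdr_weights(Phi_nn, d).conj() @ y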

726 citations


Proceedings ArticleDOI
18 May 2015
TL;DR: This work proposes to learn distributed low-dimensional representations of comments using recently proposed neural language models, that can then be fed as inputs to a classification algorithm, resulting in highly efficient and effective hate speech detectors.
Abstract: We address the problem of hate speech detection in online user comments. Hate speech, defined as an "abusive speech targeting specific group characteristics, such as ethnicity, religion, or gender", is an important problem plaguing websites that allow users to leave feedback, having a negative impact on their online business and overall user experience. We propose to learn distributed low-dimensional representations of comments using recently proposed neural language models, that can then be fed as inputs to a classification algorithm. Our approach addresses issues of high-dimensionality and sparsity that impact the current state-of-the-art, resulting in highly efficient and effective hate speech detectors.
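
The pipeline described above, paragraph-vector style comment embeddings fed to a linear classifier, can be sketched as follows. The corpus variables, hyperparameters, and the choice of gensim's Doc2Vec and scikit-learn are illustrative assumptions, not the authors' exact setup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

comments = [["some", "tokenized", "comment"], ["another", "one"]]  # placeholder tokens
labels = [1, 0]                                                    # 1 = hate speech

docs = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(comments)]
d2v = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=20)

# Low-dimensional comment representations become inputs to the classifier.
X = [d2v.infer_vector(toks) for toks in comments]
clf = LogisticRegression().fit(X, labels)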

630 citations


Journal ArticleDOI
30 Apr 2015
TL;DR: The goal of the research is to create a model classifier that uses sentiment analysis techniques and in particular subjectivity detection to not only detect that a given sentence is subjective but also to identify and rate the polarity of sentiment expressions.
Abstract: We explore the idea of creating a classifier that can be used to detect the presence of hate speech in web discourse such as web forums and blogs. In this work, the hate speech problem is abstracted into three main thematic areas: race, nationality, and religion. The goal of our research is to create a model classifier that uses sentiment analysis techniques, and in particular subjectivity detection, not only to detect that a given sentence is subjective but also to identify and rate the polarity of sentiment expressions. We begin by whittling down the document size by removing objective sentences. Then, using subjectivity and semantic features related to hate speech, we create a lexicon that is employed to build a classifier for hate speech detection. Experiments with a hate corpus show significant practical application for real-world web discourse.

362 citations


Proceedings ArticleDOI
06 Sep 2015
TL;DR: Comparative results indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful for the synthetic speech detection task.

Abstract: The performance of biometric systems based on automatic speaker recognition technology is severely degraded by spoofing attacks with synthetic speech generated using different voice conversion (VC) and speech synthesis (SS) techniques. Various countermeasures have been proposed to detect this type of attack, and in this context, choosing an appropriate feature extraction technique for capturing relevant information from speech is an important issue. This paper presents a concise experimental review of different features for the synthetic speech detection task. The wide variety of features considered in this study includes previously investigated features as well as some other potentially useful features for characterizing real and synthetic speech. The experiments are conducted on the recently released ASVspoof 2015 corpus containing speech data from a large number of VC and SS techniques. Comparative results using two different classifiers indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful in detecting synthetic speech. Index Terms: anti-spoofing, ASVspoof 2015, feature extraction, countermeasures

313 citations


Proceedings ArticleDOI
Heiga Zen1, Hasim Sak1
19 Apr 2015
TL;DR: Experimental results in subjective listening tests show that the proposed architecture can synthesize natural sounding speech without requiring utterance-level batch processing.
Abstract: Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications, including acoustic modeling for statistical parametric speech synthesis. One of the concerns in applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. The use of a unidirectional RNN architecture allows frame-synchronous streaming inference of output acoustic features given input linguistic features. The recurrent output layer further encourages smooth transitions between acoustic features at consecutive frames. Experimental results in subjective listening tests show that the proposed architecture can synthesize natural-sounding speech without requiring utterance-level batch processing.
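
A minimal sketch of the architectural idea, unidirectional LSTM layers topped by a recurrent output layer so that acoustic frames can be emitted frame-synchronously, is given below in PyTorch. Layer sizes and the exact stack are illustrative, not the paper's configuration:

import torch
import torch.nn as nn

class StreamingAcousticModel(nn.Module):
    """Unidirectional LSTM stack with a recurrent output layer (illustrative sizes)."""
    def __init__(self, n_linguistic: int = 300, n_acoustic: int = 127, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_linguistic, hidden, num_layers=2, batch_first=True)
        # Recurrent output layer: its state links consecutive acoustic frames,
        # encouraging smooth transitions without utterance-level batch processing.
        self.out_rnn = nn.RNN(hidden, n_acoustic, batch_first=True)

    def forward(self, linguistic_feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(linguistic_feats)   # no future context required
        y, _ = self.out_rnn(h)               # frame-synchronous acoustic features
        return y

# acoustic = StreamingAcousticModel()(torch.randn(1, 50, 300))  # [batch, frames, features]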

278 citations


Proceedings ArticleDOI
01 Dec 2015
TL;DR: NTT's CHiME-3 system is described, which integrates advanced speech enhancement and recognition techniques and achieves a 3.45% development error rate and a 5.83% evaluation error rate.
Abstract: CHiME-3 is a research community challenge organised in 2015 to evaluate speech recognition systems for mobile multi-microphone devices used in noisy daily environments. This paper describes NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques. Newly developed techniques include the use of spectral masks for acoustic beam-steering vector estimation and acoustic modelling with deep convolutional neural networks based on the "network in network" concept. In addition to these improvements, our system has several key differences from the official baseline system. The differences include multi-microphone training, dereverberation, and cross adaptation of neural networks with different architectures. The impacts that these techniques have on recognition performance are investigated. By combining these advanced techniques, our system achieves a 3.45% development error rate and a 5.83% evaluation error rate. Three simpler systems are also developed to perform evaluations with constrained set-ups.

259 citations


Journal ArticleDOI
TL;DR: It is shown for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic recordings, and this approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones.
Abstract: It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones, or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.

228 citations


Proceedings ArticleDOI
19 Apr 2015
TL;DR: In this article, a multimodal learning approach was proposed for fusing speech and visual modalities for audio-visual automatic speech recognition (AV-ASR) using uni-modal deep networks.
Abstract: In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean conditions on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating the tremendous value of the visual channel in phone classification even in audio with a high signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class-specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of 34.03%.
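
The feature-fusion variant described first, concatenating the final hidden layers of separately trained audio and visual networks and training a joint network on top, might look like the following sketch. Dimensions and layer sizes are placeholders, not the paper's:

import torch
import torch.nn as nn

audio_hidden, visual_hidden, n_phones = 1024, 512, 42  # illustrative dimensions

# Joint network built on the concatenated uni-modal hidden activations.
fusion_net = nn.Sequential(
    nn.Linear(audio_hidden + visual_hidden, 1024), nn.ReLU(),
    nn.Linear(1024, n_phones),
)

def fuse(a_hidden: torch.Tensor, v_hidden: torch.Tensor) -> torch.Tensor:
    """Concatenate per-frame uni-modal hidden activations and score phone classes."""
    return fusion_net(torch.cat([a_hidden, v_hidden], dim=-1))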

220 citations


Journal ArticleDOI
TL;DR: This work focuses on single-channel speech enhancement algorithms which rely on spectrotemporal properties, and can be employed when the miniaturization of devices only allows for using a single microphone.
Abstract: With the advancement of technology, both assisted listening devices and speech communication devices are becoming more portable and also more frequently used. As a consequence, users of devices such as hearing aids, cochlear implants, and mobile telephones expect their devices to work robustly anywhere and at any time. This holds in particular for challenging noisy environments like a cafeteria, a restaurant, a subway, a factory, or in traffic. One way to make assisted listening devices robust to noise is to apply speech enhancement algorithms. To improve the corrupted speech, spatial diversity can be exploited by a constructive combination of microphone signals (so-called beamforming), and by exploiting the different spectro-temporal properties of speech and noise. Here, we focus on single-channel speech enhancement algorithms which rely on spectro-temporal properties. On the one hand, these algorithms can be employed when the miniaturization of devices only allows for using a single microphone. On the other hand, when multiple microphones are available, single-channel algorithms can be employed as a postprocessor at the output of a beamformer. To exploit the short-term stationary properties of natural sounds, many of these approaches process the signal in a time-frequency representation, most frequently the short-time discrete Fourier transform (STFT) domain. In this domain, the coefficients of the signal are complex-valued, and can therefore be represented by their absolute value (referred to in the literature both as STFT magnitude and STFT amplitude) and their phase. While the modeling and processing of the STFT magnitude has been the center of interest in the past three decades, phase has been largely ignored.
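
The single-channel processing chain the article focuses on, modify the STFT magnitude, keep the noisy phase, and resynthesize, can be illustrated with a deliberately simple gain rule. The noise estimate (first frames) and spectral floor below are placeholders, not a recommended algorithm:

import numpy as np
from scipy.signal import stft, istft

def enhance(noisy: np.ndarray, fs: int, noise_frames: int = 10) -> np.ndarray:
    """Toy magnitude-domain enhancement: Wiener-style gain, noisy phase reused."""
    f, t, Y = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Y), np.angle(Y)
    # Crude noise PSD estimate from the first few (assumed speech-free) frames.
    noise_psd = (mag[:, :noise_frames] ** 2).mean(axis=1, keepdims=True)
    gain = np.maximum(1.0 - noise_psd / np.maximum(mag ** 2, 1e-12), 0.1)
    _, enhanced = istft(gain * mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced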

01 Jan 2015
TL;DR: This article systematically reviews emerging speech generation approaches with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation.
Abstract: Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. This article systematically reviews these emerging speech generation approaches, with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation. In speech signal and information processing, many applications have been formulated as machine-learning tasks. ASR is a typical classification task that predicts word sequences from speech waveforms or feature sequences. There are also many regression tasks in speech processing that are aimed to generate speech signals from various types of inputs. They are referred to as speech generation tasks in this article. Speech generation covers a wide range of research topics in speech processing, such as text-to-speech (TTS) synthesis (generating speech from text), voice conversion (modifying nonlinguistic information of the input speech), speech enhancement (improving speech quality by noise reduction or other processing), and articulatory-to-acoustic mapping (converting articulatory movements to acoustic features). These

Patent
21 Sep 2015
TL;DR: In this paper, a system may use multiple speech interface devices to interact with a user by speech and arbitration is employed to select one of the multiple devices to respond to the user utterance.
Abstract: A system may use multiple speech interface devices to interact with a user by speech. All or a portion of the speech interface devices may detect a user utterance and may initiate speech processing to determine a meaning or intent of the utterance. Within the speech processing, arbitration is employed to select one of the multiple speech interface devices to respond to the user utterance. Arbitration may be based in part on metadata that directly or indirectly indicates the proximity of the user to the devices, and the device that is deemed to be nearest the user may be selected to respond to the user utterance.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: An approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks.
Abstract: We present an approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure. This approach eliminates much of the complex infrastructure of modern speech recognition systems, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks. The system naturally handles out of vocabulary words and spoken word fragments. We demonstrate our approach using the challenging Switchboard telephone conversation transcription task, achieving a word error rate competitive with existing baseline systems. To our knowledge, this is the first entirely neural-network-based system to achieve strong speech transcription results on a conversational speech task. We analyze qualitative differences between transcriptions produced by our lexicon-free approach and transcriptions produced by a standard speech recognition system. Finally, we evaluate the impact of large context neural network character language models as compared to standard n-gram models within our framework.
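
A toy sketch of the character-level beam search that combines the network's per-frame character scores with a character language model is shown below. It is a simplification (for instance, it ignores the blank/repeat handling a real CTC-style decoder needs), and the function names and toy alphabet are hypothetical:

import math

def beam_search(char_logprobs, lm_score, beam_width=8, alpha=0.5):
    """char_logprobs: [T, V] per-frame character log-probabilities from the network.
    lm_score(prefix, c): character LM log-probability of c following prefix."""
    beams = {"": 0.0}                          # prefix -> accumulated score
    for t in range(len(char_logprobs)):
        candidates = {}
        for prefix, score in beams.items():
            for c, acoustic_lp in enumerate(char_logprobs[t]):
                ch = chr(ord('a') + c)         # toy alphabet for illustration
                cand = prefix + ch
                new_score = score + acoustic_lp + alpha * lm_score(prefix, ch)
                candidates[cand] = max(candidates.get(cand, -math.inf), new_score)
        # Keep only the highest-scoring prefixes.
        beams = dict(sorted(candidates.items(), key=lambda kv: kv[1],
                            reverse=True)[:beam_width])
    return max(beams, key=beams.get)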

Proceedings ArticleDOI
01 Dec 2015
TL;DR: A new beamformer front-end for Automatic Speech Recognition that leverages the power of a bi-directional Long Short-Term Memory network to robustly estimate soft masks for a subsequent beamforming step and achieves a 53% relative reduction of the word error rate over the best baseline enhancement system for the relevant test data set.
Abstract: We present a new beamformer front-end for Automatic Speech Recognition and apply it to the 3rd CHiME Speech Separation and Recognition Challenge. Without any further modification of the back-end, we achieve a 53% relative reduction of the word error rate over the best baseline enhancement system for the relevant test data set. Our approach leverages the power of a bi-directional Long Short-Term Memory network to robustly estimate soft masks for a subsequent beamforming step. The utilized Generalized Eigenvalue beamforming operation with an optional Blind Analytic Normalization does not rely on a Direction-of-Arrival estimate and can cope with multi-path sound propagation, while at the same time only introducing very limited speech distortions. Our quite simple setup exploits the possibilities provided by simulated training data while still being able to generalize well to the fairly different real data. Finally, combining our front-end with data augmentation and another language model yields nearly a 64% reduction of the word error rate on the real data test set.
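
Per frequency bin, the mask-based GEV beamforming step amounts to taking the principal generalized eigenvector of the speech and noise spatial covariance matrices estimated with the network's masks. A sketch of that step (omitting Blind Analytic Normalization; not the authors' code) is:

import numpy as np
from scipy.linalg import eigh

def gev_beamformer(Y, speech_mask, noise_mask):
    """Mask-based GEV beamforming for one frequency bin.
    Y: [channels, frames] complex STFT coefficients; masks: [frames] in [0, 1]."""
    # Spatial covariance (PSD) matrices weighted by the estimated masks.
    Phi_xx = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()
    Phi_nn = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()
    # The principal generalized eigenvector maximizes the output SNR.
    vals, vecs = eigh(Phi_xx, Phi_nn)
    w = vecs[:, -1]
    return w.conj() @ Y          # beamformed single-channel output for this bin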

Patent
30 Sep 2015
TL;DR: A system and method for providing a voice assistant are described, including: receiving, at a first device, a first audio input from a user requesting a first action; performing automatic speech recognition on the first audio input; obtaining a context of the user; performing natural language understanding based on the speech recognition of the first audio input; and taking the first action based on the context of the user and the natural language understanding.

Abstract: A system and method for providing a voice assistant including receiving, at a first device, a first audio input from a user requesting a first action; performing automatic speech recognition on the first audio input; obtaining a context of the user; performing natural language understanding based on the speech recognition of the first audio input; and taking the first action based on the context of the user and the natural language understanding.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: It is found that the DNN-expanded speech signals give excellent objective quality measures in terms of segmental signal-to-noise ratio and log-spectral distortion when compared with conventional BWE based on Gaussian mixture models.
Abstract: We propose a deep neural network (DNN) approach to speech bandwidth expansion (BWE) by estimating the spectral mapping function from narrowband (4 kHz in bandwidth) to wideband (8 kHz in bandwidth). Log-spectrum power is used as the input and output features to perform the required nonlinear transformation, and DNNs are trained to realize this high-dimensional mapping function. When evaluating the proposed approach on a large-scale 10-hour test set, we found that the DNN-expanded speech signals give excellent objective quality measures in terms of segmental signal-to-noise ratio and log-spectral distortion when compared with conventional BWE based on Gaussian mixture models (GMMs). Subjective listening tests also give a 69% preference score for DNN-expanded speech over 31% for GMM when the phase information is assumed known. For tests in real operation, when the phase information is imaged from the given narrowband signal, the preference comparison goes up to 84% versus 16%. A correct phase recovery can further increase the BWE performance for the proposed DNN method.
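
A minimal sketch of the spectral mapping network, an MLP from narrowband log-power spectra to wideband log-power spectra trained with a mean-squared-error objective, is shown below. The layer sizes and feature dimensions are illustrative, not the paper's:

import torch
import torch.nn as nn

nb_dim, wb_dim = 129, 257          # e.g. half-spectra of 256- and 512-point FFTs
bwe_net = nn.Sequential(
    nn.Linear(nb_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, wb_dim),       # predicted wideband log-power spectrum
)
# Trained by regressing predicted onto true wideband log-spectra:
loss_fn = nn.MSELoss()
# loss = loss_fn(bwe_net(narrowband_logspec), wideband_logspec)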

Proceedings ArticleDOI
06 Sep 2015
TL;DR: This paper proposes to use high dimensional magnitude and phase based features and long term temporal information for the detection of synthetic speech (called spoofing speech) and tests the effectiveness of the 7 features used.
Abstract: Recent improvements in text-to-speech (TTS) and voice conversion (VC) techniques present a threat to automatic speaker verification (ASV) systems. An attacker can use TTS or VC systems to impersonate a target speaker’s voice. To overcome such a challenge, we study the detection of such synthetic speech (called spoofing speech) in this paper. We propose to use high-dimensional magnitude- and phase-based features and long-term temporal information for the task. In total, 2 types of magnitude-based features and 5 types of phase-based features are used. For each feature type, we build a component system using a multilayer perceptron to predict the posterior probabilities of the input features extracted from spoofing speech. The probabilities of all component systems are averaged to produce the score for the final decision. When tested on the ASVspoof 2015 benchmarking task, an equal error rate (EER) of 0.29% is obtained for known spoofing types, which demonstrates the high effectiveness of the 7 features used. For unknown spoofing types, the EER is much higher at 5.23%, suggesting that future research should focus on improving the generalization of the techniques.

Patent
11 Dec 2015
TL;DR: In this paper, a speech-based system includes an audio device in a user premises and a network-based service that supports use of the audio device by multiple applications, such as music, audio books, etc.
Abstract: A speech-based system includes an audio device in a user premises and a network-based service that supports use of the audio device by multiple applications. The audio device may be directed to play audio content such as music, audio books, etc. The audio device may also be directed to interact with a user through speech. The network-based service monitors event messages received from the audio device to determine which of the multiple applications currently has speech focus. When receiving speech from a user, the service first offers the corresponding meaning to the application, if any, that currently has primary speech focus. If there is no application that currently has primary speech focus, or if the application having primary speech focus is not able to respond to the meaning, the service then offers the user meaning to the application that currently has secondary speech focus.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A new deep network is proposed that directly reconstructs the time-domain clean signal through an inverse fast Fourier transform layer and significantly outperforms a recent non-negative matrix factorization based separation system in both objective speech intelligibility and quality.
Abstract: Supervised speech separation has achieved considerable success recently. Typically, a deep neural network (DNN) is used to estimate an ideal time-frequency mask, and clean speech is produced by feeding the mask-weighted output to a resynthesizer in a subsequent step. So far, the success of DNN-based separation lies mainly in improving human speech intelligibility. In this work, we propose a new deep network that directly reconstructs the time-domain clean signal through an inverse fast Fourier transform layer. The joint training of speech resynthesis and mask estimation yields improved objective quality while maintaining the objective intelligibility performance. The proposed system significantly outperforms a recent non-negative matrix factorization based separation system in both objective speech intelligibility and quality.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: Experimental results show that certain multi-channel features outperform both a monaural DAE and a conventional time-frequency-mask-based speech enhancement method.
Abstract: This paper investigates a multi-channel denoising autoencoder (DAE)-based speech enhancement approach. In recent years, deep neural network (DNN)-based monaural speech enhancement and robust automatic speech recognition (ASR) approaches have attracted much attention due to their high performance. Although multi-channel speech enhancement usually outperforms single channel approaches, there has been little research on the use of multi-channel processing in the context of DAE. In this paper, we explore the use of several multi-channel features as DAE input to confirm whether multi-channel information can improve performance. Experimental results show that certain multi-channel features outperform both a monaural DAE and a conventional time-frequency-mask-based speech enhancement method.

Patent
Yoon Kim1
24 Aug 2015
TL;DR: In this paper, the suitability of an acoustic environment for speech recognition is evaluated using a visual representation of the speech recognition suitability to indicate the likelihood that a spoken user input will be interpreted correctly.
Abstract: This relates to providing an indication of the suitability of an acoustic environment for performing speech recognition. One process can include receiving an audio input and determining a speech recognition suitability based on the audio input. The speech recognition suitability can include a numerical, textual, graphical, or other representation of the suitability of an acoustic environment for performing speech recognition. The process can further include displaying a visual representation of the speech recognition suitability to indicate the likelihood that a spoken user input will be interpreted correctly. This allows a user to determine whether to proceed with the performance of a speech recognition process, or to move to a different location having a better acoustic environment before performing the speech recognition process. In some examples, the user device can disable operation of a speech recognition process in response to determining that the speech recognition suitability is below a threshold suitability.

Patent
26 Jun 2015
TL;DR: In this article, a language model is modified for a local speech recognition system using remote speech recognition sources, and text results corresponding to the utterance are generated using local vocabulary using the received text results and the generated text result are compared to determine words that are out of the local vocabulary.
Abstract: A language model is modified for a local speech recognition system using remote speech recognition sources. In one example, a speech utterance is received. The speech utterance is sent to at least one remote speech recognition system. Text results corresponding to the utterance are received from the remote speech recognition system. A local text result is generated using local vocabulary. The received text results and the generated text result are compared to determine words that are out of the local vocabulary and the local vocabulary is updated using the out of vocabulary words.
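
In the simplest case, the comparison step in the claim reduces to a set difference between the remote and local transcripts. A hypothetical sketch (names and whitespace tokenization are assumptions, not from the patent) is:

def find_oov_words(remote_text: str, local_text: str, local_vocab: set) -> set:
    """Words produced by the remote recognizer that the local system missed
    and that are not yet in the local vocabulary."""
    remote_words = set(remote_text.lower().split())
    local_words = set(local_text.lower().split())
    return (remote_words - local_words) - local_vocab

local_vocab = {"play", "some", "music"}
oov = find_oov_words("play some jazz music", "play some music", local_vocab)
local_vocab |= oov    # adds {"jazz"}; the local language model is then rebuilt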

Proceedings ArticleDOI
19 Apr 2015
TL;DR: An overview of the underlying architecture as well as the novel technologies in the EVS codec are given and listening test results showing the performance of the new codec in terms of compression and speech/audio quality are presented.
Abstract: The recently standardized 3GPP codec for Enhanced Voice Services (EVS) offers new features and improvements for low-delay real-time communication systems. Based on a novel, switched low-delay speech/audio codec, the EVS codec contains various tools for better compression efficiency and higher quality for clean/noisy speech, mixed content and music, including support for wideband, super-wideband and full-band content. The EVS codec operates in a broad range of bitrates, is highly robust against packet loss and provides an AMR-WB interoperable mode for compatibility with existing systems. This paper gives an overview of the underlying architecture as well as the novel technologies in the EVS codec and presents listening test results showing the performance of the new codec in terms of compression and speech/audio quality.

Patent
07 Jan 2015
TL;DR: In this article, a first speech input can be received from a user and a second speech input that is a repetition of the first input can then be processed using a second automatic speech recognition system to produce a second recognition result.
Abstract: Systems and processes for processing speech in a digital assistant are provided. In one example process, a first speech input can be received from a user. The first speech input can be processed using a first automatic speech recognition system to produce a first recognition result. An input indicative of a potential error in the first recognition result can be received. The input can be used to improve the first recognition result. For example, the input can include a second speech input that is a repetition of the first speech input. The second speech input can be processed using a second automatic speech recognition system to produce a second recognition result.

Journal ArticleDOI
TL;DR: This paper presents a synthetic speech detector that can be connected at the front-end or at the back-end of a standard SV system, and that will protect it from spoofing attacks coming from state-of-the-art statistical Text to Speech (TTS) systems.
Abstract: In the field of speaker verification (SV) it is nowadays feasible and relatively easy to create a synthetic voice to deceive a speech driven biometric access system. This paper presents a synthetic speech detector that can be connected at the front-end or at the back-end of a standard SV system, and that will protect it from spoofing attacks coming from state-of-the-art statistical Text to Speech (TTS) systems. The system described is a Gaussian Mixture Model (GMM) based binary classifier that uses natural and copy-synthesized signals obtained from the Wall Street Journal database to train the system models. Three different state-of-the-art vocoders are chosen and modeled using two sets of acoustic parameters: 1) relative phase shift and 2) canonical Mel Frequency Cepstral Coefficients (MFCC) parameters, as baseline. The vocoder dependency of the system and multivocoder modeling features are thoroughly studied. Additional phase-aware vocoders are also tested. Several experiments are carried out, showing that the phase-based parameters perform better and are able to cope with new unknown attacks. The final evaluations, testing synthetic TTS signals obtained from the Blizzard challenge, validate our proposal.
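
At its core, the detector described is a two-class GMM log-likelihood-ratio test over per-frame features. A hedged sketch using scikit-learn (mixture sizes, feature extraction, and decision threshold are illustrative, not the paper's configuration) is:

import numpy as np
from sklearn.mixture import GaussianMixture

# One GMM per class, fit on frame-level features (e.g. relative phase shift or MFCCs).
gmm_natural = GaussianMixture(n_components=64, covariance_type='diag')
gmm_synthetic = GaussianMixture(n_components=64, covariance_type='diag')
# gmm_natural.fit(natural_train_feats)      # placeholders: [n_frames, n_dims] arrays
# gmm_synthetic.fit(synthetic_train_feats)

def is_synthetic(utterance_feats: np.ndarray, threshold: float = 0.0) -> bool:
    """Average per-frame log-likelihood ratio; a low ratio flags a spoofed utterance."""
    llr = gmm_natural.score(utterance_feats) - gmm_synthetic.score(utterance_feats)
    return llr < threshold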

Proceedings ArticleDOI
19 Apr 2015
TL;DR: It is shown that the word error rate (WER) of the jointly trained system could be significantly reduced by the fusion of multiple DNN pre-processing systems, which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary.

Abstract: Based on the recently proposed speech pre-processing front-end with deep neural networks (DNNs), we first investigate different feature mappings directly from noisy speech via a DNN for robust speech recognition. Next, we propose to jointly train a single DNN for both feature mapping and acoustic modeling. In the end, we show that the word error rate (WER) of the jointly trained system could be significantly reduced by the fusion of multiple DNN pre-processing systems, which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary. Testing on the Aurora4 noisy speech recognition task, our best system with multi-condition training achieves an average WER of 10.3%, yielding a relative reduction of 16.3% over our previous DNN pre-processing-only system with a WER of 12.3%. To the best of our knowledge, this represents the best published result on the Aurora4 task without using any adaptation techniques.

Journal ArticleDOI
TL;DR: A recognition system that was developed at the laboratory to deal with reverberant speech consists of a speech enhancement front-end that employs long-term linear prediction-based dereverberation followed by noise reduction and an ASR back-end that uses neural networks for acoustic and language modeling.
Abstract: Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

Journal ArticleDOI
TL;DR: This article presents a structured overview of several established VAD features that target different properties of speech, categorizes the features with respect to the properties they exploit, such as power, harmonicity, or modulation, and evaluates the performance of some dedicated features.

Abstract: In many speech signal processing applications, voice activity detection (VAD) plays an essential role in separating an audio stream into time intervals that contain speech activity and time intervals where speech is absent. Many features that reflect the presence of speech have been introduced in the literature. However, to our knowledge, no extensive comparison has been provided yet. In this article, we therefore present a structured overview of several established VAD features that target different properties of speech. We categorize the features with respect to the properties that are exploited, such as power, harmonicity, or modulation, and evaluate the performance of some dedicated features. The importance of temporal context is discussed in relation to latency restrictions imposed by different applications. Our analyses allow for selecting promising VAD features and finding a reasonable trade-off between performance and complexity.
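
As a point of reference for the feature categories discussed, the simplest power-based VAD can be sketched in a few lines. Real systems add harmonicity and modulation features, adaptive thresholds, and temporal smoothing (hangover), so the fixed threshold below is purely illustrative:

import numpy as np

def energy_vad(x: np.ndarray, fs: int, frame_ms: float = 25.0,
               hop_ms: float = 10.0, threshold_db: float = -35.0) -> np.ndarray:
    """Frame-wise log energy compared against a threshold relative to the utterance peak.
    Returns a boolean array, True for frames flagged as speech."""
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame) // hop          # assumes x is longer than one frame
    energy_db = np.array([
        10 * np.log10(np.mean(x[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    return energy_db > energy_db.max() + threshold_db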

Patent
Seok Yeong Jung1, Kyung Tae Kim1
07 Apr 2015
TL;DR: In this paper, an electronic device includes a processor configured to perform automatic speech recognition (ASR) on a speech input by using a speech recognition model that is stored in a memory and a communication module configured to provide the speech input to a server and receive a speech instruction, which corresponds to the input, from the server.
Abstract: An electronic device is provided. The electronic device includes a processor configured to perform automatic speech recognition (ASR) on a speech input by using a speech recognition model that is stored in a memory, and a communication module configured to provide the speech input to a server and receive a speech instruction, which corresponds to the speech input, from the server. The electronic device may perform different operations according to a confidence score of a result of the ASR. In addition, various other embodiments inferred from the specification are also possible.