Showing papers on "Word error rate published in 2017"

PDF

Open Access

Proceedings Article•DOI•

Joint CTC-attention based end-to-end speech recognition using multi-task learning

[...]

Suyoun Kim¹, Takaaki Hori¹, Shinji Watanabe¹•Institutions (1)

Mitsubishi Electric Research Laboratories¹

05 Mar 2017

TL;DR: This paper proposed a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder framework for end-to-end speech recognition, which can improve robustness and achieve fast convergence by using a joint CTC-attention model.

...read moreread less

Abstract: Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve the performance over another end-to-end approach, the Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target character without any conditional independence assumptions. However, we observed that the performance of the attention has shown poor results in noisy condition and is hard to learn in the initial training stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases due to the lack of left-to-right constraints as used in CTC. This paper presents a novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. An experiment on the WSJ and CHiME-4 tasks demonstrates its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4–14.6% relative improvements in Character Error Rate (CER).

...read moreread less

645 citations

Proceedings Article•DOI•

Very deep convolutional networks for end-to-end speech recognition

[...]

Yu Zhang¹, William Chan², Navdeep Jaitly³•Institutions (3)

Massachusetts Institute of Technology¹, Carnegie Mellon University², Google³

05 Mar 2017

TL;DR: This work successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models, and applies network-in-network principles, batch normalization, residual connections and convolutionAL LSTMs to build very deep recurrent and Convolutional structures.

...read moreread less

Abstract: Sequence-to-sequence models have shown success in end-to-end speech recognition. However these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the feature space and add computational depth without overfitting issues. We experiment with the WSJ ASR task and achieve 10.5% word error rate without any dictionary or language model using a 15 layer deep network.

...read moreread less

439 citations

Proceedings Article•DOI•

English Conversational Telephone Speech Recognition by Humans and Machines

[...]

George Saon¹, Gakuto Kurata¹, Tom Sercu¹, Kartik Audhkhasi¹, Samuel Thomas¹, Dimitrios Dimitriadis², Xiaodong Cui¹, Bhuvana Ramabhadran¹, Michael Picheny¹, Lynn-Li Lim, Bergul Roomi, Phil Hall - Show less +8 more•Institutions (2)

IBM¹, AT&T²

20 Aug 2017

TL;DR: In this article, a set of acoustic and language modeling techniques were used to lower the word error rate of a conversational telephone LVCSR system to 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation.

...read moreread less

Abstract: One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.

...read moreread less

330 citations

Proceedings Article•DOI•

The microsoft 2016 conversational speech recognition system

[...]

Wayne Xiong¹, Jasha Droppo¹, Xuedong Huang¹, Frank Seide¹, Michael L. Seltzer¹, Andreas Stolcke¹, Dong Yu¹, Geoffrey Zweig¹ - Show less +4 more•Institutions (1)

Microsoft¹

05 Mar 2017

TL;DR: Microsoft's conversational speech recognition system is described, in which recent developments in neural-network-based acoustic and language modeling are combined to advance the state of the art on the Switchboard recognition task.

...read moreread less

Abstract: We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.

...read moreread less

322 citations

Posted Content•

Evolving Deep Convolutional Neural Networks for Image Classification

[...]

Yanan Sun¹, Bing Xue², Mengjie Zhang², Gary G. Yen³•Institutions (3)

Sichuan University¹, Victoria University of Wellington², Oklahoma State University–Stillwater³

30 Oct 2017-arXiv: Neural and Evolutionary Computing

TL;DR: A new method using genetic algorithms for evolving the architectures and connection weight initialization values of a deep convolutional neural network to address image classification problems and a novel fitness evaluation method is proposed to speed up the heuristic search with substantially less computational resource.

...read moreread less

Abstract: Evolutionary computation methods have been successfully applied to neural networks since two decades ago, while those methods cannot scale well to the modern deep neural networks due to the complicated architectures and large quantities of connection weights. In this paper, we propose a new method using genetic algorithms for evolving the architectures and connection weight initialization values of a deep convolutional neural network to address image classification problems. In the proposed algorithm, an efficient variable-length gene encoding strategy is designed to represent the different building blocks and the unpredictable optimal depth in convolutional neural networks. In addition, a new representation scheme is developed for effectively initializing connection weights of deep convolutional neural networks, which is expected to avoid networks getting stuck into local minima which is typically a major issue in the backward gradient-based optimization. Furthermore, a novel fitness evaluation method is proposed to speed up the heuristic search with substantially less computational resource. The proposed algorithm is examined and compared with 22 existing algorithms on nine widely used image classification tasks, including the state-of-the-art methods. The experimental results demonstrate the remarkable superiority of the proposed algorithm over the state-of-the-art algorithms in terms of classification error rate and the number of parameters (weights).

...read moreread less

291 citations

Proceedings Article•DOI•

Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer

[...]

Kanishka Rao¹, Hasim Sak¹, Rohit Prabhavalkar¹•Institutions (1)

Google¹

01 Dec 2017

TL;DR: In this article, a recurrent neural network transducer (RNN-T) is proposed to jointly learn acoustic and language model components from transcribed acoustic data, which achieves state-of-the-art performance for end-to-end speech recognition.

...read moreread less

Abstract: We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an ‘encoder’, which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a ‘decoder’ which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% voice-dictation.

...read moreread less

270 citations

Posted Content•

The Microsoft 2017 Conversational Speech Recognition System

[...]

Wayne Xiong¹, Lingfeng Wu¹, Fileno A. Alleva¹, Jasha Droppo¹, Xuedong Huang¹, Andreas Stolcke¹ - Show less +2 more•Institutions (1)

Microsoft¹

21 Aug 2017-arXiv: Computation and Language

TL;DR: The 2017 version of Microsoft's conversational speech recognition system is described in this article, which adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring.

...read moreread less

Abstract: We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by a word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1\% word error rate on the 2000 Switchboard evaluation set.

...read moreread less

266 citations

Proceedings Article•

Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data

[...]

Wei-Ning Hsu¹, Yu Zhang¹, James Glass¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 2017

TL;DR: A factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision by formulating it explicitly within a factorsized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables.

...read moreread less

Abstract: We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.

...read moreread less

256 citations

Journal Article•DOI•

Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

[...]

Tara N. Sainath¹, Ron Weiss¹, Kevin W. Wilson¹, Bo Li¹, Arun Narayanan¹, Ehsan Variani¹, Michiel Bacchiani¹, Izhak Shafran¹, Andrew W. Senior¹, Kean Chin¹, Ananya Misra¹, Chanwoo Kim¹ - Show less +8 more•Institutions (1)

Google¹

01 May 2017-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper introduces a neural network architecture, which performs multichannel filtering in the first layer of the network, and shows that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target Speaker direction.

...read moreread less

Abstract: Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture, which performs multichannel filtering in the first layer of the network, and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.

...read moreread less

221 citations

Journal Article•DOI•

Toward Human Parity in Conversational Speech Recognition

[...]

Wayne Xiong¹, Jasha Droppo¹, Xuedong Huang¹, Frank Seide¹, Michael L. Seltzer¹, Andreas Stolcke¹, Dong Yu¹, Geoffrey Zweig¹ - Show less +4 more•Institutions (1)

Microsoft¹

01 Dec 2017-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A human error rate on the widely used NIST 2000 test set for commercial bulk transcription is measured, suggesting that, given sufficient matched training data, conversational speech transcription engines are approximating human parity in both quantitative and qualitative terms.

...read moreread less

Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure a human error rate on the widely used NIST 2000 test set for commercial bulk transcription. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion, where friends and family members have open-ended conversations. In both cases, our automated system edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and long-short-term memory acoustic model architectures, combined with a novel spatial smoothing method and lattice-free discriminative acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination. Comparing frequent errors in our human and machine transcripts, we find them to be remarkably similar, and highly correlated as a function of the speaker. Human subjects find it very difficult to tell which errorful transcriptions come from humans. Overall, this suggests that, given sufficient matched training data, conversational speech transcription engines are approximating human parity in both quantitative and qualitative terms.

...read moreread less

194 citations

Proceedings Article•DOI•

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

[...]

Daniel Michelsanti, Zheng-Hua Tan

20 Aug 2017

TL;DR: In this article, a conditional generative adversarial network (cGAN) was proposed to improve the performance of speech enhancement in noisy environments by learning a mapping from the spectrogram of noisy speech to an enhanced counterpart, which was trained in an adversarial manner to distinguish between enhanced spectrograms provided by the generator and clean ones from the database using the noisy spectrogram as a condition.

...read moreread less

Abstract: Improving speech system performance in noisy environments remains a challenging task, and speech enhancement (SE) is one of the effective techniques to solve the problem. Motivated by the promising results of generative adversarial networks (GANs) in a variety of image processing tasks, we explore the potential of conditional GANs (cGANs) for SE, and in particular, we make use of the image processing framework proposed by Isola et al. [1] to learn a mapping from the spectrogram of noisy speech to an enhanced counterpart. The SE cGAN consists of two networks, trained in an adversarial manner: a generator that tries to enhance the input noisy spectrogram, and a discriminator that tries to distinguish between enhanced spectrograms provided by the generator and clean ones from the database using the noisy spectrogram as a condition. We evaluate the performance of the cGAN method in terms of perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and equal error rate (EER) of speaker verification (an example application). Experimental results show that the cGAN method overall outperforms the classical short-time spectral amplitude minimum mean square error (STSA-MMSE) SE algorithm, and is comparable to a deep neural network-based SE approach (DNN-SE).

...read moreread less

Proceedings Article•DOI•

Data Augmentation for Recognition of Handwritten Words and Lines Using a CNN-LSTM Network

[...]

Curtis Wigington¹, Seth Stewart¹, Brian Davis¹, Bill Barrett¹, Brian Price², Scott Cohen² - Show less +2 more•Institutions (2)

Brigham Young University¹, Adobe Systems²

01 Nov 2017

TL;DR: Two data augmentation and normalization techniques are introduced, used with a CNN-LSTM, which significantly reduce Word Error Rate (WER) and Character Error rate (CER) beyond best-reported results on handwriting recognition tasks.

...read moreread less

Abstract: We introduce two data augmentation and normalization techniques, which, used with a CNN-LSTM, significantly reduce Word Error Rate (WER) and Character Error Rate (CER) beyond best-reported results on handwriting recognition tasks (1) We apply a novel profile normalization technique to both word and line images (2) We augment existing text images using random perturbations on a regular grid We apply our normalization and augmentation to both training and test images Our approach achieves low WER and CER over hundreds of authors, multiple languages and a variety of collections written centuries apart Image augmentation in this manner achieves state-of-the-art recognition accuracy on several popular handwritten word benchmarks

...read moreread less

Journal Article•DOI•

An unsupervised deep domain adaptation approach for robust speech recognition

[...]

Sining Sun¹, Binbin Zhang¹, Lei Xie¹, Yanning Zhang¹•Institutions (1)

Northwestern Polytechnical University¹

27 Sep 2017-Neurocomputing

TL;DR: An unsupervised deep domain adaptation (DDA) approach to acoustic modeling is introduced in order to eliminate the training–testing mismatch that is common in real-world use of speech recognition.

...read moreread less

Proceedings Article•DOI•

A comprehensive study of deep bidirectional LSTM RNNS for acoustic modeling in speech recognition

[...]

Albert Zeyer¹, Patrick Doetsch¹, Paul Voigtlaender¹, Ralf Schlüter¹, Hermann Ney¹ - Show less +1 more•Institutions (1)

RWTH Aachen University¹

05 Mar 2017

TL;DR: In this paper, a comprehensive overview of various bidirectional long short-term memory (BLSTM) training aspects and their interplay within ASR has been provided, which has been missing so far in the literature.

...read moreread less

Abstract: Recent experiments show that deep bidirectional long short-term memory (BLSTM) recurrent neural network acoustic models outperform feedforward neural networks for automatic speech recognition (ASR). However, their training requires a lot of tuning and experience. In this work, we provide a comprehensive overview over various BLSTM training aspects and their interplay within ASR, which has been missing so far in the literature. We investigate on different variants of optimization methods, batching, truncated backpropagation, and regularization techniques such as dropout, and we study the effect of size and depth, training models of up to 10 layers. This includes a comparison of computation times vs. recognition performance. Furthermore, we introduce a pretraining scheme for LSTMs with layer-wise construction of the network showing good improvements especially for deep networks. The experimental analysis mainly was performed on the Quaero task, with additional results on Switchboard. The best BLSTM model gave a relative improvement in word error rate of over 15% compared to our best feed-forward baseline on our Quaero 50h task. All experiments were done using RETURNN and RASR, RWTH's extensible training framework for universal recurrent neural networks and ASR toolkit. The training configuration files are publicly available.

...read moreread less

Proceedings Article•DOI•

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

[...]

Kartik Audhkhasi¹, Bhuvana Ramabhadran¹, George Saon¹, Michael Picheny¹, David Nahamoo¹ - Show less +1 more•Institutions (1)

IBM¹

22 Mar 2017

TL;DR: This paper presents the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome, and presents rescoring results on CTC word model lattices to quantify the performance benefits of a LM, and contrast the performance of word and phone C TC models.

...read moreread less

Abstract: Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of a LM, and contrast the performance of word and phone CTC models.

...read moreread less

Proceedings Article•DOI•

Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system

[...]

Jahn Heymann¹, Lukas Drude¹, Christoph Boeddeker¹, Patrick Hanebrink¹, Reinhold Haeb-Umbach¹ - Show less +1 more•Institutions (1)

University of Paderborn¹

05 Mar 2017

TL;DR: This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system, where a neural network which estimates masks for a statistically optimum beamformer is jointly trained with a network for acoustic modeling.

...read moreread less

Abstract: This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly trained with a network for acoustic modeling. To update its parameters, we propagate the gradients from the acoustic model all the way through feature extraction and the complex valued beamforming operation. Besides avoiding a mismatch between the front-end and the back-end, this approach also eliminates the need for stereo data, i.e., the parallel availability of clean and noisy versions of the signals. Instead, it can be trained with real noisy multi-channel data only. Also, relying on the signal statistics for beamforming, the approach makes no assumptions on the configuration of the microphone array. We further observe a performance gain through joint training in terms of word error rate in an evaluation of the system on the CHiME 4 dataset.

...read moreread less

Journal Article•DOI•

The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes

[...]

Jon Barker¹, Ricard Marxer¹, Emmanuel Vincent², Shinji Watanabe³•Institutions (3)

University of Sheffield¹, French Institute for Research in Computer Science and Automation², Mitsubishi Electric Research Laboratories³

01 Nov 2017-Computer Speech & Language

TL;DR: The design and outcomes of the CHiME-3 challenge are presented, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario, and strong evidence of a dependence on signal-to-noise ratio and channel quality is found.

...read moreread less

Journal Article•DOI•

Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

[...]

Kun Li¹, Xiaojun Qian¹, Helen Meng¹•Institutions (1)

The Chinese University of Hong Kong¹

01 Jan 2017-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: An acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN, whose input features include acoustic features, as well as corresponding graphemes and canonical transcriptions (encoded as binary vectors), which develops a unified MDD framework which works much like free-phone recognition.

...read moreread less

Abstract: This paper investigates the use of multidistribution deep neural networks DNNs for mispronunciation detection and diagnosis MDD, to circumvent the difficulties encountered in an existing approach based on extended recognition networks ERNs. The ERNs leverage existing automatic speech recognition technology by constraining the search space via including the likely phonetic error patterns of the target words in addition to the canonical transcriptions. MDDs are achieved by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has the following issues: 1 Learning the error patterns of the target words to generate the ERNs remains a challenging task. Phones or phone errors missing from the ERNs cannot be recognized even if we have well-trained acoustic models; and 2 acoustic models and phonological rules are trained independently, and hence, contextual information is lost. To address these issues, we propose an acoustic-graphemic-phonemic model AGPM using a multidistribution DNN, whose input features include acoustic features, as well as corresponding graphemes and canonical transcriptions encoded as binary vectors. The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified MDD framework, which works much like free-phone recognition. Experiments show that our method achieves a phone error rate PER of 11.1%. The false rejection rate FRR, false acceptance rate FAR, and diagnostic error rate DER for MDD are 4.6%, 30.5%, and 13.5%, respectively. It outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR, and DER are 16.8%, 11.0%, 43.6%, and 32.3%, respectively.

...read moreread less

Posted Content•

Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

[...]

Rohit Prabhavalkar¹, Tara N. Sainath¹, Yonghui Wu¹, Patrick Nguyen¹, Zhifeng Chen¹, Chung-Cheng Chiu¹, Anjuli Kannan¹ - Show less +3 more•Institutions (1)

Google¹

05 Dec 2017-arXiv: Computation and Language

TL;DR: In this paper, two loss functions are proposed to approximate the expected number of word errors: either sampling from the model, or by using N-best lists of decoded hypotheses, which are more effective than the sampling-based method.

...read moreread less

Abstract: Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.

...read moreread less

Journal Article•DOI•

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

[...]

Amirsina Torfi¹, Seyed Mehdi Iranmanesh¹, Nasser M. Nasrabadi¹, Jeremy Dawson¹•Institutions (1)

University College of Engineering¹

09 Oct 2017-IEEE Access

TL;DR: This paper proposes the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio–visual streams using the learned multimodal features.

...read moreread less

Abstract: Audio–visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this paper. We propose the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio–visual streams using the learned multimodal features. The proposed architecture will incorporate both spatial and temporal information jointly to effectively find the correlation between temporal information for different modalities. By using a relatively small network architecture and much smaller data set for training, our proposed method surpasses the performance of the existing similar methods for audio–visual matching, which use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements over 20% on the equal error rate and over 7% on the average precision in comparison to the state-of-the-art method.

...read moreread less

Journal Article•DOI•

A Segmental Framework for Fully-Unsupervised Large-Vocabulary Speech Recognition

[...]

Herman Kamper¹, Aren Jansen², Sharon Goldwater¹•Institutions (2)

University of Edinburgh¹, Google²

01 Nov 2017-Computer Speech & Language

TL;DR: This article presents the first attempt to apply a Bayesian modelling framework with segmental word representations to large-vocabulary multi-speaker data and shows that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speakers and multi-Speaker versions of this system outperform a purely bottom- up single- Speaker syllable-based approach.

...read moreread less

Posted Content•

English Conversational Telephone Speech Recognition by Humans and Machines

[...]

IBM¹, AT&T²

06 Mar 2017-arXiv: Computation and Language

TL;DR: An independent set of human performance measurements on two conversational tasks are performed and it is found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve.

...read moreread less

Posted Content•

Large-Scale Domain Adaptation via Teacher-Student Learning

[...]

Jinyu Li¹, Michael L. Seltzer¹, Xi Wang, Rui Zhao¹, Yifan Gong¹ - Show less +1 more•Institutions (1)

Microsoft¹

17 Aug 2017-arXiv: Computation and Language

TL;DR: This work proposes an approach to domain adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain of the well-trained model and the desired target domain, to perform adaptation.

...read moreread less

Abstract: High accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to domain adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain of the well-trained model and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which the posterior probabilities generated by the source-domain model can be used in lieu of labels to train the target-domain model. We evaluate the proposed approach in two scenarios, adapting a clean acoustic model to noisy speech and adapting an adults speech acoustic model to children speech. Significant improvements in accuracy are obtained, with reductions in word error rate of up to 44% over the original source model without the need for transcribed data in the target domain. Moreover, we show that increasing the amount of unlabeled data results in additional model robustness, which is particularly beneficial when using simulated training data in the target-domain.

...read moreread less

Journal Article•DOI•

Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming

[...]

Tsubasa Ochiai¹, Shinji Watanabe², Takaaki Hori², John R. Hershey², Xiong Xiao³ - Show less +1 more•Institutions (3)

Doshisha University¹, Mitsubishi Electric Research Laboratories², Nanyang Technological University³

18 Oct 2017-IEEE Journal of Selected Topics in Signal Processing

TL;DR: This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end- to-end framework and elaborate the effectiveness of this proposed method on the multichannel ASR benchmarks in noisy environments.

...read moreread less

Abstract: This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end-to-end framework. Recently, the end-to-end ASR paradigm has attracted great research interest as an alternative to conventional hybrid paradigms with deep neural networks and hidden Markov models. Using this novel paradigm, we simplify ASR architecture by integrating such ASR components as acoustic, phonetic, and language models with a single neural network and optimize the overall components for the end-to-end ASR objective: generating a correct label sequence. Although most existing end-to-end frameworks have mainly focused on ASR in clean environments, our aim is to build more realistic end-to-end systems in noisy environments. To handle such challenging noisy ASR tasks, we study multichannel end-to-end ASR architecture, which directly converts multichannel speech signal to text through speech enhancement. This architecture allows speech enhancement and ASR components to be jointly optimized to improve the end-to-end ASR objective and leads to an end-to-end framework that works well in the presence of strong background noise. We elaborate the effectiveness of our proposed method on the multichannel ASR benchmarks in noisy environments (CHiME-4 and AMI). The experimental results show that our proposed multichannel end-to-end system obtained performance gains over the conventional end-to-end baseline with enhanced inputs from a delay-and-sum beamformer (i.e., BeamformIT) in terms of character error rate. In addition, further analysis shows that our neural beamformer, which is optimized only with the end-to-end ASR objective, successfully learned a noise suppression function.

...read moreread less

Journal Article•DOI•

Online MVDR Beamformer Based on Complex Gaussian Mixture Model With Spatial Prior for Noise Robust ASR

[...]

Takuya Higuchi¹, Nobutaka Ito¹, Shoko Araki¹, Takuya Yoshioka¹, Marc Delcroix¹, Tomohiro Nakatani¹ - Show less +2 more•Institutions (1)

Nippon Telegraph and Telephone¹

01 Apr 2017-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A probabilistic prior distribution for a spatial correlation matrix (a CGMM parameter), which enables more stable steering vector estimation in the presence of interfering speakers, is introduced in this paper.

...read moreread less

Abstract: This paper considers acoustic beamforming for noise robust automatic speech recognition. A beamformer attenuates background noise by enhancing sound components coming from a direction specified by a steering vector. Hence, accurate steering vector estimation is paramount for successful noise reduction. Recently, time-frequency masking has been proposed to estimate the steering vectors that are used for a beamformer. In particular, we have developed a new form of this approach, which uses a speech spectral model based on a complex Gaussian mixture model CGMM to estimate the time-frequency masks needed for steering vector estimation, and extended the CGMM-based beamformer to an online speech enhancement scenario. Our previous experiments showed that the proposed CGMM-based approach outperforms a recently proposed mask estimator based on a Watson mixture model and the baseline speech enhancement system of the CHiME-3 challenge. This paper provides additional experimental results for our online processing, which achieves performance comparable to that of batch processing with a suitable block-batch size. This online version reduces the CHiME-3 word error rate WER on the evaluation set from 8.37% to 8.06%. Moreover, in this paper, we introduce a probabilistic prior distribution for a spatial correlation matrix a CGMM parameter, which enables more stable steering vector estimation in the presence of interfering speakers. In practice, the performance of the proposed online beamformer degrades with observations that contain only noise or/and interference because of the failure of the CGMM parameter estimation. The introduced spatial prior enables the target speaker's parameter to avoid overfitting to noise or/and interference. Experimental results show that the spatial prior reduces the WER from 38.4% to 29.2% in a conversation recognition task compared with the CGMM-based approach without the prior, and outperforms a conventional online speech enhancement approach.

...read moreread less

Proceedings Article•DOI•

Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation

[...]

Wei-Ning Hsu¹, Yu Zhang¹, James Glass¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Dec 2017

TL;DR: In this article, the authors address the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech.

...read moreread less

Abstract: Domain mismatch between training and testing can lead to significant degradation in performance in many machine learning scenarios. Unfortunately, this is not a rare situation for automatic speech recognition deployments in real-world applications. Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue. In this paper, we address the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech. We present novel augmentation-based methods that transform speech in a way that does not change the transcripts. Specifically, we first train a variational autoencoder on both source and target domain data (without supervision) to learn a latent representation of speech. We then transform nuisance attributes of speech that are irrelevant to recognition by modifying the latent representations, in order to augment labeled training data with additional data whose distribution is more similar to the target domain. The proposed method is evaluated on the CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as 35% compared to the non-adapted baseline.

...read moreread less

Journal Article•DOI•

Towards a Continuous Biometric System Based on ECG Signals Acquired on the Steering Wheel

[...]

Joao Ribeiro Pinto¹, Jaime S. Cardoso¹, André Lourenço², Carlos Carreiras•Institutions (2)

University of Porto¹, Instituto Superior de Engenharia de Lisboa²

28 Sep 2017-Sensors

TL;DR: The enhancement of the unprecedented lesser quality of electrocardiogram signals through the combination of Savitzky-Golay and moving average filters, followed by outlier detection and removal based on normalised cross-correlation and clustering was able to render ensemble heartbeats of significantly higher quality.

...read moreread less

Abstract: Electrocardiogram signals acquired through a steering wheel could be the key to seamless, highly comfortable, and continuous human recognition in driving settings. This paper focuses on the enhancement of the unprecedented lesser quality of such signals, through the combination of Savitzky-Golay and moving average filters, followed by outlier detection and removal based on normalised cross-correlation and clustering, which was able to render ensemble heartbeats of significantly higher quality. Discrete Cosine Transform (DCT) and Haar transform features were extracted and fed to decision methods based on Support Vector Machines (SVM), k-Nearest Neighbours (kNN), Multilayer Perceptrons (MLP), and Gaussian Mixture Models - Universal Background Models (GMM-UBM) classifiers, for both identification and authentication tasks. Additional techniques of user-tuned authentication and past score weighting were also studied. The method's performance was comparable to some of the best recent state-of-the-art methods (94.9% identification rate (IDR) and 2.66% authentication equal error rate (EER)), despite lesser results with scarce train data (70.9% IDR and 11.8% EER). It was concluded that the method was suitable for biometric recognition with driving electrocardiogram signals, and could, with future developments, be used on a continuous system in seamless and highly noisy settings.

...read moreread less

Posted Content•

The CAPIO 2017 Conversational Speech Recognition System

[...]

Kyu Jeong Han, Akshay Chandrashekaran, Jungsuk Kim, Ian R. Lane

29 Dec 2017-arXiv: Computation and Language

TL;DR: This paper shows how the state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set is achieved, and proposes an acoustic model adaptation scheme that simply averages the parameters of a seed neural network acoustic model and its adapted version.

...read moreread less

Abstract: In this paper we show how we have achieved the state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set. We explore densely connected LSTMs, inspired by the densely connected convolutional networks recently introduced for image classification tasks. We also propose an acoustic model adaptation scheme that simply averages the parameters of a seed neural network acoustic model and its adapted version. This method was applied with the CallHome training corpus and improved individual system performances by on average 6.1% (relative) against the CallHome portion of the evaluation set with no performance loss on the Switchboard portion. With RNN-LM rescoring and lattice combination on the 5 systems trained across three different phone sets, our 2017 speech recognition system has obtained 5.0% and 9.1% on Switchboard and CallHome, respectively, both of which are the best word error rates reported thus far. According to IBM in their latest work to compare human and machine transcriptions, our reported Switchboard word error rate can be considered to surpass the human parity (5.1%) of transcribing conversational telephone speech.

...read moreread less

Journal Article•DOI•

Duration-Controlled LSTM for Polyphonic Sound Event Detection

[...]

Tomoki Hayashi¹, Shinji Watanabe², Tomoki Toda¹, Takaaki Hori², Jonathan Le Roux², Kazuya Takeda¹ - Show less +2 more•Institutions (2)

Nagoya University¹, Mitsubishi Electric Research Laboratories²

01 Nov 2017-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper builds upon a state-of-the-art SED method that performs frame-by-frame detection using a bidirectional LSTM recurrent neural network, and incorporates a duration-controlled modeling technique based on a hidden semi-Markov model that makes it possible to model the duration of each sound event precisely and to perform sequence- by-sequence detection without having to resort to thresholding.

...read moreread less

Abstract: This paper presents a new hybrid approach called duration-controlled long short-term memory (LSTM) for polyphonic sound event detection (SED). It builds upon a state-of-the-art SED method that performs frame-by-frame detection using a bidirectional LSTM recurrent neural network (BLSTM), and incorporates a duration-controlled modeling technique based on a hidden semi-Markov model. The proposed approach makes it possible to model the duration of each sound event precisely and to perform sequence-by-sequence detection without having to resort to thresholding, as in conventional frame-by-frame methods. Furthermore, to effectively reduce sound event insertion errors, which often occur under noisy conditions, we also introduce a binary-mask-based postprocessing that relies on a sound activity detection network to identify segments with any sound event activity, an approach inspired by the well-known benefits of voice activity detection in speech recognition systems. We conduct an experiment using the DCASE2016 task 2 dataset to compare our proposed method with typical conventional methods, such as nonnegative matrix factorization and standard BLSTM. Our proposed method outperforms the conventional methods both in an event-based evaluation, achieving a 75.3% F1 score and a 44.2% error rate, and in a segment-based evaluation, achieving an 81.1% F1 score, and a 32.9% error rate, outperforming the best results reported in the DCASE2016 task 2 Challenge.

...read moreread less

Proceedings Article•DOI•

On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition

[...]

Xiong Xiao¹, Shengkui Zhao², Douglas L. Jones², Eng Siong Chng¹, Haizhou Li¹ - Show less +1 more•Institutions (2)

Nanyang Technological University¹, Agency for Science, Technology and Research²

05 Mar 2017

TL;DR: This paper focuses on the TF mask estimation using recurrent neural networks (RNN) and shows that the proposed methods improve the ASR performance individually and also work complementarily.

...read moreread less

Abstract: Acoustic beamforming has played a key role in the robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCM) are crucial for successfully applying the minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of the MVDR beamforming in ASR tasks. In this paper, we focus on the TF mask estimation using recurrent neural networks (RNN). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and overally on the CHiME-4 challenge. The results show that the proposed methods improve the ASR performance individually and also work complementarily. The overall performance achieves a word error rate of 8.9% with 6-microphone configuration, which is much better than 12.0% achieved with the state-of-the-art MVDR implementation.

...read moreread less

Collapse