Open Access Proceedings Article (DOI)

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

TLDR
This paper presents the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks, Switchboard and CallHome, presents rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrasts the performance of word and phone CTC models.
Abstract
Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
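Because the word model described in the abstract emits words directly, recognition without an LM or decoder reduces to greedy CTC decoding: take the most likely word (or blank) at every frame, merge consecutive repeats, and drop blanks. Below is a minimal PyTorch sketch of that step; the names (greedy_ctc_decode, id2word, BLANK) are illustrative and not taken from the paper's code.

import torch

BLANK = 0  # index reserved for the CTC blank symbol (illustrative choice)

def greedy_ctc_decode(logits, id2word):
    """Collapse per-frame argmax word IDs: merge repeats, then drop blanks.

    logits:  (T, V) tensor of per-frame scores over the word vocabulary plus blank.
    id2word: mapping from vocabulary index to word string.
    """
    ids = logits.argmax(dim=-1).tolist()   # best word ID for each frame
    words, prev = [], BLANK
    for i in ids:
        if i != prev and i != BLANK:       # new, non-blank symbol -> emit a word
            words.append(id2word[i])
        prev = i
    return " ".join(words)

# Training would use the standard CTC loss over word targets, e.g.
#   torch.nn.CTCLoss(blank=BLANK)(log_probs, targets, input_lengths, target_lengths)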


Citations
Proceedings Article (DOI)

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
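Because SpecAugment operates directly on the filter bank features, its masking component can be sketched in a few lines on a (time, frequency) log-mel array. The mask widths below are illustrative defaults, not the paper's exact policy, and the time-warping component is omitted.

import numpy as np

def spec_augment(spec, max_f=15, max_t=40, n_masks=2, rng=np.random):
    """Zero out random frequency bands and time spans of a (T, F) log-mel spectrogram."""
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(n_masks):
        f = rng.randint(0, max_f + 1)            # frequency-mask width
        f0 = rng.randint(0, max(1, F - f))       # frequency-mask start
        spec[:, f0:f0 + f] = 0.0
        t = rng.randint(0, max_t + 1)            # time-mask width
        t0 = rng.randint(0, max(1, T - t))       # time-mask start
        spec[t0:t0 + t, :] = 0.0
    return spec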
Proceedings Article (DOI)

Improved Training of End-to-end Attention Models for Speech Recognition

TL;DR: In this article, a sequence-to-sequence attention-based model on subword units was proposed to achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks.
Journal Article (DOI)

Recent progresses in deep learning based acoustic models

TL;DR: In this paper, the authors summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques, and further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation.
Proceedings Article (DOI)

Speech Model Pre-training for End-to-End Spoken Language Understanding

TL;DR: The authors proposed a method to reduce the data requirements of end-to-end spoken language understanding (SLU) in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU.
Proceedings Article (DOI)

A Comparison of Transformer and LSTM Encoder Decoder Models for ASR

TL;DR: Presents competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition that needs less training time than a similarly performing LSTM model, and observes that Transformer training is in general more stable than LSTM training, although the Transformer also seems to overfit more and thus shows more problems with generalization.
References
Proceedings Article (DOI)

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
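The "global log-bilinear regression" fits word and context vectors so that their dot products match the logarithm of the global co-occurrence counts X_ij; in the standard GloVe notation the weighted least-squares objective is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where f is a weighting function that down-weights rare and very frequent co-occurrences.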
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
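One of the two architectures is the skip-gram model; as a reminder of what is being optimized (notation from the word2vec papers, not from this page), its objective is the average log-probability of the context words within a window of size c around each training word:

\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)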
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose a model that softly searches for the parts of a source sentence relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
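In the paper's notation this soft search is an attention weighting over the encoder annotations h_j, conditioned on the previous decoder state s_{i-1}, producing a per-target-word context vector:

e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j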
Journal Article (DOI)

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Proceedings Article (DOI)

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, removing both the need for pre-segmented training data and the need for external post-processing to extract the label sequence.
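Concretely, CTC defines the probability of a labelling l as the sum over all per-frame paths \pi that collapse to l once repeats and blanks are removed (the mapping \mathcal{B}), where y^t_k is the network's softmax output for label k at time t:

p(\mathbf{l} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} \prod_{t=1}^{T} y^{t}_{\pi_t}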