Open Access Proceedings Article (DOI)

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

TLDR
This paper presents the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks, Switchboard and CallHome, presents rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrasts the performance of word and phone CTC models.
Abstract
Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
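Because the word model described in the abstract emits words directly, recognition without an LM or decoder reduces to greedy CTC decoding: take the most likely word (or blank) at every frame, merge consecutive repeats, and drop blanks. Below is a minimal PyTorch sketch of that step; the names (greedy_ctc_decode, id2word, BLANK) are illustrative and not taken from the paper's code.

import torch

BLANK = 0  # index reserved for the CTC blank symbol (illustrative choice)

def greedy_ctc_decode(logits, id2word):
    """Collapse per-frame argmax word IDs: merge repeats, then drop blanks.

    logits:  (T, V) tensor of per-frame scores over the word vocabulary plus blank.
    id2word: mapping from vocabulary index to word string.
    """
    ids = logits.argmax(dim=-1).tolist()   # best word ID for each frame
    words, prev = [], BLANK
    for i in ids:
        if i != prev and i != BLANK:       # new, non-blank symbol -> emit a word
            words.append(id2word[i])
        prev = i
    return " ".join(words)

# Training would use the standard CTC loss over word targets, e.g.
#   torch.nn.CTCLoss(blank=BLANK)(log_probs, targets, input_lengths, target_lengths)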


Citations
Proceedings Article (DOI)

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
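Because SpecAugment operates directly on the filter bank features, its masking component can be sketched in a few lines on a (time, frequency) log-mel array. The mask widths below are illustrative defaults, not the paper's exact policy, and the time-warping component is omitted.

import numpy as np

def spec_augment(spec, max_f=15, max_t=40, n_masks=2, rng=np.random):
    """Zero out random frequency bands and time spans of a (T, F) log-mel spectrogram."""
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(n_masks):
        f = rng.randint(0, max_f + 1)            # frequency-mask width
        f0 = rng.randint(0, max(1, F - f))       # frequency-mask start
        spec[:, f0:f0 + f] = 0.0
        t = rng.randint(0, max_t + 1)            # time-mask width
        t0 = rng.randint(0, max(1, T - t))       # time-mask start
        spec[t0:t0 + t, :] = 0.0
    return spec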
Proceedings Article (DOI)

Improved Training of End-to-end Attention Models for Speech Recognition

TL;DR: In this article, a sequence-to-sequence attention-based model on subword units was proposed to achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks.
Journal Article (DOI)

Recent progresses in deep learning based acoustic models

TL;DR: In this paper, the authors summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques, and further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation.
Proceedings Article (DOI)

Speech Model Pre-training for End-to-End Spoken Language Understanding

TL;DR: The authors proposed a method to reduce the data requirements of end-to-end spoken language understanding (SLU) in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU.
Proceedings Article (DOI)

A Comparison of Transformer and LSTM Encoder Decoder Models for ASR

TL;DR: Presents competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition that needs less training time than a similarly performing LSTM model, and observes that Transformer training is in general more stable than LSTM training, although the Transformer also seems to overfit more and thus shows more problems with generalization.
References
Proceedings Article (DOI)

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
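The "global log-bilinear regression" fits word and context vectors so that their dot products match the logarithm of the global co-occurrence counts X_ij; in the standard GloVe notation the weighted least-squares objective is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where f is a weighting function that down-weights rare and very frequent co-occurrences.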
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
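One of the two architectures is the skip-gram model; as a reminder of what is being optimized (notation from the word2vec papers, not from this page), its objective is the average log-probability of the context words within a window of size c around each training word:

\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)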
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose a model that softly searches for the parts of a source sentence relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
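In the paper's notation this soft search is an attention weighting over the encoder annotations h_j, conditioned on the previous decoder state s_{i-1}, producing a per-target-word context vector:

e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j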
Journal Article (DOI)

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Proceedings Article (DOI)

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, removing both the need for pre-segmented training data and the need for external post-processing to extract the label sequence.
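Concretely, CTC defines the probability of a labelling l as the sum over all per-frame paths \pi that collapse to l once repeats and blanks are removed (the mapping \mathcal{B}), where y^t_k is the network's softmax output for label k at time t:

p(\mathbf{l} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} \prod_{t=1}^{T} y^{t}_{\pi_t}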