Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.

doi:10.21437/INTERSPEECH.2016-595

Proceedings Article

Daniel Povey, +7 more

- pp 2751-2755

TLDR

A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.

Abstract:

In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with the cross-entropy and sMBR objective functions. A further relative reduction of ∼2.5% can be obtained by fine-tuning these models with the word-lattice based sMBR objective function.
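The objective described in the abstract can be illustrated with a toy computation. The following is a minimal sketch, not the authors' implementation: it assumes the MMI objective for one utterance is the log-likelihood of a numerator (reference) graph minus that of a denominator (phone language model) graph, each obtained with the forward algorithm. The graphs, the transition weights, the random scores, and the names `log_forward` and `mmi_objective` are all hypothetical, and taking every third row of full-rate scores is only a stand-in for a network that emits outputs at one third the frame rate.

```python
import numpy as np

def log_forward(log_obs, log_trans, log_init, log_final):
    """Log of the total weight of all paths through a small graph.

    log_obs:   (T, S) per-frame log scores for each graph state
    log_trans: (S, S) log transition weights, entry [i, j] is i -> j
    log_init:  (S,)   log initial weights
    log_final: (S,)   log final weights
    """
    alpha = log_init + log_obs[0]
    for t in range(1, log_obs.shape[0]):
        # Log-sum-exp over predecessor states i for every successor state j.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_obs[t]
    return float(np.logaddexp.reduce(alpha + log_final))

def mmi_objective(nn_log_out, num_graph, den_graph):
    """log p(X | numerator graph) - log p(X | denominator graph)."""
    scores = []
    for pdf_ids, log_trans, log_init, log_final in (num_graph, den_graph):
        log_obs = nn_log_out[:, pdf_ids]  # map graph states to network outputs
        scores.append(log_forward(log_obs, log_trans, log_init, log_final))
    return scores[0] - scores[1]

# Hypothetical network scores: 12 frames, 3 output units (toy "phones").
rng = np.random.default_rng(0)
full_rate = rng.normal(size=(12, 3))
low_rate = full_rate[::3]  # stand-in for outputs at one third the frame rate

# Denominator: every phone can follow every phone (toy phone bigram weights).
den_pdfs = np.arange(3)
den_trans = np.log(np.array([[0.6, 0.2, 0.2],
                             [0.3, 0.4, 0.3],
                             [0.1, 0.1, 0.8]]))
den_init = np.log(np.full(3, 1.0 / 3.0))
den_final = np.zeros(3)

# Numerator: left-to-right chain for the reference phone sequence 0, 2, 1.
num_pdfs = np.array([0, 2, 1])
num_trans = np.full((3, 3), -np.inf)
for i in range(3):
    num_trans[i, i] = np.log(0.5)          # self-loop
    if i + 1 < 3:
        num_trans[i, i + 1] = np.log(0.5)  # advance to the next phone
num_init = np.array([0.0, -np.inf, -np.inf])
num_final = np.array([-np.inf, -np.inf, 0.0])

objective = mmi_objective(
    low_rate,
    (num_pdfs, num_trans, num_init, num_final),
    (den_pdfs, den_trans, den_init, den_final),
)
print(objective)
```

In this sketch the forward recursion runs over 4 frames instead of 12, which shows where the reduced frame rate saves computation; training would additionally need the backward pass to obtain gradients with respect to the network outputs.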

Citations

Proceedings Article

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Daniel S. Park, +6 more

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.

Proceedings Article

ESPNet: End-to-end speech processing toolkit

Shinji Watanabe, +11 more

TL;DR: This article introduces ESPnet, a new open-source platform for end-to-end speech processing that mainly focuses on automatic speech recognition (ASR) and adopts the widely used dynamic neural network toolkits Chainer and PyTorch as its main deep learning engines.

Proceedings Article

A study on data augmentation of reverberant speech for robust speech recognition

Tom Ko, +4 more

TL;DR: It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.

Journal Article

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Shinji Watanabe, +4 more

- 16 Oct 2017 -

IEEE Journal of Selected Topics in Signa...

TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.

Posted Content

Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces

Alice Coucke, +11 more

- 25 May 2018 -

arXiv: Computation and Language

TL;DR: The machine learning architecture of the Snips Voice Platform is presented, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices that is fast and accurate while enforcing privacy by design, as no personal user data is ever collected.

References

Proceedings Article

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Alex Graves, +3 more

TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.

Posted Content

Deep Speech: Scaling up end-to-end speech recognition

Awni Hannun, +10 more

- 17 Dec 2014 -

arXiv: Computation and Language

TL;DR: Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.

Proceedings Article

Audio augmentation for speech recognition.

Tom Ko, +3 more

TL;DR: This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.

Proceedings Article

A time delay neural network architecture for efficient modeling of long temporal contexts.

Vijayaditya Peddinti, +2 more

TL;DR: This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.

Book Chapter

An n log n algorithm for minimizing states in a finite automaton

John E. Hopcroft

TL;DR: An algorithm is given for minimizing the number of states in a finite automaton or for determining if two finite automata are equivalent and the running time is bounded by k n log n.

Related Papers (5)

The Kaldi Speech Recognition Toolkit

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Librispeech: An ASR corpus based on public domain audio books

Attention is All you Need

Long short-term memory