Proceedings ArticleDOI

Audio augmentation for speech recognition.

TL;DR
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
Abstract
Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
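The speed perturbation described in the abstract can be sketched as resampling the waveform so it plays faster or slower (the paper's own implementation uses standard audio tools in a Kaldi recipe; the function and signal below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def speed_perturb(signal, factor):
    """Change playback speed by linear-interpolation resampling.

    factor > 1.0 speeds up the audio (fewer output samples);
    factor < 1.0 slows it down (more output samples).
    """
    n_out = int(round(len(signal) / factor))
    # positions in the original signal that each output sample reads from
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)

# one second of a 5 Hz test tone at 16 kHz, standing in for real speech
signal = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 16000))

# the three training copies described in the paper: 0.9x, 1.0x, 1.1x
copies = [speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)]
```

Note that changing speed this way shifts both duration and pitch, which is what distinguishes this technique from tempo-only perturbation.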



Citations
Proceedings ArticleDOI

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Proceedings ArticleDOI

A time delay neural network architecture for efficient modeling of long temporal contexts.

TL;DR: This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.
Proceedings ArticleDOI

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.

TL;DR: A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.
Proceedings ArticleDOI

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

TL;DR: This paper proposes easy data augmentation (EDA) techniques for boosting performance on text classification tasks, consisting of synonym replacement, random insertion, random swap, and random deletion, and shows that EDA improves performance for both convolutional and recurrent neural networks.
References
Journal ArticleDOI

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Proceedings ArticleDOI

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Posted Content

Deep Speech: Scaling up end-to-end speech recognition

TL;DR: Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Proceedings ArticleDOI

A time delay neural network architecture for efficient modeling of long temporal contexts.

TL;DR: This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.
Journal ArticleDOI

Deep Scattering Spectrum

TL;DR: A scattering transform defines a locally translation invariant representation which is stable to time-warping deformation and extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators.