Proceedings ArticleDOI

Audio augmentation for speech recognition.

TL;DR
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
Abstract
Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
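The speed perturbation described in the abstract can be sketched as resampling the waveform so it plays faster or slower (the paper's own implementation uses standard audio tools in a Kaldi recipe; the function and signal below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def speed_perturb(signal, factor):
    """Change playback speed by linear-interpolation resampling.

    factor > 1.0 speeds up the audio (fewer output samples);
    factor < 1.0 slows it down (more output samples).
    """
    n_out = int(round(len(signal) / factor))
    # positions in the original signal that each output sample reads from
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)

# one second of a 5 Hz test tone at 16 kHz, standing in for real speech
signal = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 16000))

# the three training copies described in the paper: 0.9x, 1.0x, 1.1x
copies = [speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)]
```

Note that changing speed this way shifts both duration and pitch, which is what distinguishes this technique from tempo-only perturbation.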



Citations
Proceedings ArticleDOI

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Proceedings ArticleDOI

A time delay neural network architecture for efficient modeling of long temporal contexts.

TL;DR: This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.
Proceedings ArticleDOI

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.

TL;DR: A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.
Proceedings ArticleDOI

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

TL;DR: This paper proposes easy data augmentation (EDA) techniques for boosting performance on text classification tasks, consisting of synonym replacement, random insertion, random swap, and random deletion, and shows that EDA improves performance for both convolutional and recurrent neural networks.
References
Journal ArticleDOI

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Proceedings ArticleDOI

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Posted Content

Deep Speech: Scaling up end-to-end speech recognition

TL;DR: Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Proceedings ArticleDOI

A time delay neural network architecture for efficient modeling of long temporal contexts.

TL;DR: This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.
Journal ArticleDOI

Deep Scattering Spectrum

TL;DR: A scattering transform defines a locally translation invariant representation which is stable to time-warping deformation and extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators.