Proceedings ArticleDOI

Reverberation robust acoustic modeling using i-vectors with time delay neural networks.

TLDR
iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing a 10% relative improvement in word error rate; subsampling the outputs at TDNN layers across time steps reduces training time.
Abstract
In reverberant environments there are long-term interactions between speech and corrupting sources. In this paper, a time delay neural network (TDNN) architecture, capable of learning long-term temporal relationships and translation-invariant representations, is used for reverberation-robust acoustic modeling. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing a 10% relative improvement in word error rate. By subsampling the outputs at TDNN layers across time steps, training time is reduced. Using a parallel training algorithm, we show that the TDNN can be trained on ~5500 hours of speech data in 3 days using up to 32 GPUs. The TDNN is shown to provide results competitive with state-of-the-art systems in the IARPA ASpIRE challenge, with 27.7% WER on the dev test set.
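
The effect of subsampling TDNN layer outputs across time steps can be illustrated with a small sketch. The abstract does not give the layer configuration, so the splice offsets below (`DENSE`, `SUBSAMPLED`) and the helper `required_frames` are hypothetical and chosen only for illustration. The sketch counts how many activations each layer must compute to produce one output frame, with and without subsampling, while the overall input context stays the same.

```python
def required_frames(layer_offsets, t=0):
    """For one output at time t, list the time indices needed at each layer's input.

    layer_offsets is ordered from the input layer to the last layer; each entry
    is the tuple of time offsets that layer splices together.
    The result is ordered the same way (index 0 = raw input frames).
    """
    needed = {t}
    per_layer = []
    # Walk backwards from the output, expanding the set of required time steps.
    for offsets in reversed(layer_offsets):
        needed = {u + o for u in needed for o in offsets}
        per_layer.append(sorted(needed))
    return per_layer[::-1]


# Hypothetical configurations (not taken from the paper).
# Dense splicing: every layer uses all frames in its context window.
DENSE = [tuple(range(-2, 3)), tuple(range(-1, 3)), tuple(range(-3, 4))]
# Subsampled splicing: upper layers use only the two ends of their window.
SUBSAMPLED = [tuple(range(-2, 3)), (-1, 2), (-3, 3)]

dense = required_frames(DENSE)
sub = required_frames(SUBSAMPLED)

# Both variants see the same input context (frames -6 .. 7) ...
print(dense[0] == sub[0])
# ... but the subsampled network evaluates far fewer hidden activations.
print([len(frames) for frames in dense])  # [14, 10, 7]
print([len(frames) for frames in sub])    # [14, 4, 2]
```

Under these assumed offsets, both networks cover 14 input frames, but the subsampled one needs 6 hidden-layer activations per output frame instead of 17, which is the kind of saving that would reduce training time.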


Citations
Journal ArticleDOI

An unsupervised deep domain adaptation approach for robust speech recognition

TL;DR: An unsupervised deep domain adaptation (DDA) approach to acoustic modeling is introduced in order to eliminate the training–testing mismatch that is common in real-world use of speech recognition.
Proceedings ArticleDOI

Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting.

TL;DR: This paper proposes to apply singular value decomposition (SVD) to further reduce TDNN complexity, and results show that the full-rank TDNN achieves a 19.7% DET AUC reduction compared to a similar-size deep neural network baseline.
Proceedings ArticleDOI

JHU ASpIRE system: Robust LVCSR with TDNNS, iVector adaptation and RNN-LMS

TL;DR: This paper tackles the problem of reverberant speech recognition using 5500 hours of simulated reverberant data and a time-delay neural network (TDNN) architecture, which is capable of tackling long-term interactions between speech and corrupting sources in reverberant environments.
Proceedings ArticleDOI

Deep-FSMN for Large Vocabulary Continuous Speech Recognition

TL;DR: This paper introduces DFSMN, which adds skip connections between memory blocks in adjacent layers; these enable information flow across different layers and thus alleviate the vanishing-gradient problem when building very deep structures.
Proceedings ArticleDOI

Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning.

TL;DR: A multi-task architecture that jointly learns an accent classifier and a multi-accent acoustic model is proposed, and augmenting the speech input with accent information in the form of embeddings extracted by a separate network is also considered.
References
Proceedings Article

SRILM – An Extensible Language Modeling Toolkit

TL;DR: The functionality of the SRILM toolkit is summarized and its design and implementation are discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Journal ArticleDOI

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Journal ArticleDOI

Front-End Factor Analysis for Speaker Verification

TL;DR: Extending previous work, this paper proposes a new speaker representation for speaker verification: a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis and named the total variability space because it models both speaker and channel variabilities.
Book

Phoneme recognition using time-delay neural networks

TL;DR: The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation.
Journal ArticleDOI

Phoneme recognition using time-delay neural networks

TL;DR: In this article, the authors present a time-delay neural network (TDNN) approach to phoneme recognition, which is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time, and therefore not blurred by temporal shifts in the input.