Under review as a conference paper at ICLR 2017
WAV2LETTER: AN END-TO-END CONVNET-BASED
SPEECH RECOGNITION SYSTEM
Ronan Collobert
Facebook AI Research, Menlo Park
locronan@fb.com
Christian Puhrsch
Facebook AI Research, Menlo Park
cpuhrsch@fb.com
Gabriel Synnaeve
Facebook AI Research, New York
gab@fb.com
ABSTRACT
This paper presents a simple end-to-end model for speech recognition, combining
a convolutional network based acoustic model and graph decoding. It is trained
to output letters, from transcribed speech, without the need for forced alignment of
phonemes. We introduce an automatic segmentation criterion for training from
sequence annotation without alignment that is on par with CTC (Graves et al.,
2006) while being simpler. We show competitive results in word error rate on the
Librispeech corpus (Panayotov et al., 2015) with MFCC features, and promising
results from raw waveform.
1 INTRODUCTION
We present an end-to-end system for speech recognition, going from the speech signal (e.g. Mel-Frequency Cepstral Coefficients (MFCCs), power spectrum, or raw waveform) to the transcription.
The acoustic model is trained using letters (graphemes) directly, which removes the need for an
intermediate (human or automatic) phonetic transcription. Indeed, the classical pipeline for building
state-of-the-art speech recognition systems consists in first training an HMM/GMM model to
force-align the units on which the final acoustic model operates (most often context-dependent phone
states). This approach takes its roots in HMM/GMM training (Woodland & Young, 1993). The
improvements brought by deep neural networks (DNNs) (Mohamed et al., 2012; Hinton et al., 2012)
and convolutional neural networks (CNNs) (Sercu et al., 2015; Soltau et al., 2014) for acoustic
modeling only extend this training pipeline.
The current state of the art on Librispeech (the dataset that we used for our evaluations) uses this
approach too (Panayotov et al., 2015; Peddinti et al., 2015b), with an additional step of speaker
adaptation (Saon et al., 2013; Peddinti et al., 2015a). Recently, Senior et al. (2014) proposed GMM-
free training, but the approach still requires generating a forced alignment. An approach that cut ties
with the HMM/GMM pipeline (and with forced alignment) was to train a recurrent neural network
(RNN) (Graves et al., 2013) for phoneme transcription. There are now competitive end-to-end
approaches with acoustic models topped with RNN layers, as in (Hannun et al., 2014; Miao et al.,
2015; Saon et al., 2015; Amodei et al., 2015), trained with a sequence criterion (Graves et al., 2006).
However, these models are computationally expensive, and thus take a long time to train.
Compared to classical approaches that need phonetic annotation (often derived from a phonetic
dictionary, rules, and generative training), we propose to train the model end-to-end, using graphemes
directly. Compared to sequence-criterion-based approaches that train directly from speech signal to
graphemes (Miao et al., 2015), we propose a simple(r) architecture (23 million parameters for our
best model, vs. 100 million parameters in (Amodei et al., 2015)) based on convolutional networks
for the acoustic model, topped with a graph transformer network (Bottou et al., 1997), trained with
a simpler sequence criterion. Our word error rate on clean speech is slightly better than (Hannun
et al., 2014), and slightly worse than (Amodei et al., 2015), in particular considering that they train on
12,000 hours while we only train on the 960h available in LibriSpeech's train set. Finally, some of
our models are also trained on the raw waveform, as in (Palaz et al., 2013; 2015; Sainath et al., 2015).
The rest of the paper is structured as follows: the next section presents the convolutional networks
used for acoustic modeling, along with the automatic segmentation criterion. The following section
shows experimental results comparing different features, the criterion, and our current best word error
rates on LibriSpeech.
2 ARCHITECTURE
Our speech recognition system is a standard convolutional neural network (LeCun & Bengio, 1995)
fed with various input features, trained through an alternative to the Connectionist Temporal
Classification (CTC) (Graves et al., 2006), and coupled with a simple beam-search decoder. In the
following sub-sections, we detail each of these components.
2.1 FEATURES
Figure 1: Our neural network architecture for raw wave. Layers, from input to output: CONV kw=250, dw=160 (1 → 250); CONV kw=48, dw=2 (250 → 250); 7 × CONV kw=7 (250 → 250); CONV kw=32 (250 → 2000); CONV kw=1 (2000 → 2000); CONV kw=1 (2000 → 40). The first two layers are convolutions with strides. The last two layers are convolutions with kw = 1, which are equivalent to fully connected layers. Power spectrum and MFCC based networks do not have the first layer.
We consider three types of input features for our model: MFCCs,
power spectrum, and raw wave. MFCCs are carefully designed
speech-specific features, often found in classical HMM/GMM
speech systems (Woodland & Young, 1993) because of their dimensionality
compression (13 coefficients are often enough to span
speech frequencies). Power-spectrum features are found in most
recent deep learning acoustic models (Amodei et al., 2015). Raw wave
has been somewhat explored in a few recent works (Palaz et al., 2013; 2015).
ConvNets have the advantage of being flexible enough to be used with
any of these input feature types. Our acoustic models output letter scores
(one score per letter, given a dictionary L).
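For illustration only (this is not the paper's feature pipeline), the three input types could be computed with standard tools such as librosa; the file name, FFT size and hop length below are our own assumptions:

```python
import numpy as np
import librosa

# Load an utterance at the 16 kHz sampling rate used by LibriSpeech.
wave, sr = librosa.load("utterance.flac", sr=16000)

# Raw wave: the samples themselves, fed directly to the network.
raw = wave

# Power spectrum: squared magnitude of a short-time Fourier transform
# (window and hop sizes here are illustrative, not taken from the paper).
power_spectrum = np.abs(librosa.stft(wave, n_fft=512, hop_length=160)) ** 2

# MFCCs: 13 coefficients are often enough to span speech frequencies.
mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13)
```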
2.2 CONVNET ACOUSTIC MODEL
The acoustic models we considered in this paper are all based on
standard 1D convolutional neural networks (ConvNets). ConvNets
interleave convolution operations with pointwise non-linearity operations.
ConvNets often also include pooling layers: this type of layer allows the
network to "see" a larger context, without increasing the number of parameters,
by locally aggregating the output of the previous convolution operation.
Instead, our networks leverage striding convolutions. Given an input sequence
$(x_t)_{t=1 \dots T_x}$ with $T_x$ frames of $d_x$-dimensional vectors, a
convolution with kernel width $kw$, stride $dw$ and output frame size $d_y$
computes the following:

$$ y_t^i = b_i + \sum_{j=1}^{d_x} \sum_{k=1}^{kw} w_{i,j,k}\, x^j_{dw \times (t-1)+k}\,, \quad \forall\, 1 \le i \le d_y, \qquad (1) $$

where $b \in \mathbb{R}^{d_y}$ and $w \in \mathbb{R}^{d_y \times d_x \times kw}$ are the parameters of the convolution (to be learned).
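As a purely illustrative sketch of equation (1) (not the Torch7 implementation used in the paper; the function name and array layouts are our own), a naive NumPy version reads:

```python
import numpy as np

def strided_conv1d(x, w, b, dw):
    """Naive 1D convolution of equation (1).
    x: (T_x, d_x) input frames; w: (d_y, d_x, kw) weights; b: (d_y,) biases;
    dw: stride. Returns y of shape (T_y, d_y) with T_y = (T_x - kw) // dw + 1."""
    T_x, d_x = x.shape
    d_y, _, kw = w.shape
    T_y = (T_x - kw) // dw + 1
    y = np.empty((T_y, d_y))
    for t in range(T_y):
        window = x[t * dw : t * dw + kw]   # the kw frames seen by output step t
        # y[t, i] = b[i] + sum_{j, k} w[i, j, k] * window[k, j]
        y[t] = b + np.einsum("ijk,kj->i", w, window)
    return y
```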
Pointwise non-linear layers are added after convolutional layers.
In our experience, we surprisingly found that using hyperbolic
tangents, their piecewise-linear counterpart HardTanh (as in (Palaz
et al., 2015)), or ReLU units leads to similar results.
There are some slight variations between the architectures, depending
on the input features. MFCC-based networks need less striding,
as standard MFCC filters are applied with large strides on the input
raw sequence. With power-spectrum-based and raw-wave-based networks, we observed that the
overall stride of the network was more important than where the strided convolutions were placed.
We thus found it preferable to set the strided convolutions near the first input layers of the network, as
this leads to the fastest architectures: with power spectrum features or raw wave, the input sequences
are very long and the first convolutions are thus the most expensive ones.

The last layer of our convolutional network outputs one score per letter in the letter dictionary
($d_y = |L|$). Our architecture for raw wave is shown in Figure 1 and is inspired by (Palaz et al., 2015).
The architectures for both power spectrum and MFCC features do not include the first layer. The
full network can be seen as a non-linear convolution, with a kernel width of 31280 and a stride
equal to 320; given that the sample rate of our data is 16 kHz, label scores are produced using a window
of 1955 ms, with steps of 20 ms.
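As a quick sanity check of these figures (a trivial sketch; the kernel width and stride values are the ones quoted above):

```python
# Convert the overall kernel width and stride of the composed network
# into durations, given the 16 kHz sample rate of the data.
SAMPLE_RATE = 16000   # Hz
overall_kw = 31280    # samples spanned by one output label score
overall_dw = 320      # samples between two consecutive label scores

window_ms = 1000 * overall_kw / SAMPLE_RATE  # 31280 / 16000 s = 1955 ms
step_ms = 1000 * overall_dw / SAMPLE_RATE    # 320 / 16000 s = 20 ms
print(window_ms, step_ms)                    # 1955.0 20.0
```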
2.3 INFERRING SEGMENTATION WITH AUTOSEGCRITERION
Most large labeled speech databases provide only a text transcription for each audio file. In a
classification framework (and given that our acoustic model produces letter predictions), one would
need the segmentation of each letter in the transcription to properly train the model. Unfortunately,
manually labeling the segmentation of each letter would be tedious. Several solutions have been
explored in the speech community to alleviate this issue. HMM/GMM models use an iterative EM
procedure: (i) during the Estimation step, the best segmentation is inferred, according to the current
model, by maximizing the joint probability of the letter (or any sub-word unit) transcription and input
sequence; (ii) during the Maximization step, the model is optimized by minimizing a frame-level
criterion, based on the (now fixed) inferred segmentation. This approach is also often used to bootstrap
the training of neural network-based acoustic models.
Other alternatives have been explored in the context of hybrid HMM/NN systems, such as the MMI
criterion (Bahl et al., 1986), which maximizes the mutual information between the acoustic sequence
and word sequences, or the Minimum Bayes Risk (MBR) criterion (Gibson & Hain, 2006).
More recently, standalone neural network architectures have been trained using criteria which
jointly infer the segmentation of the transcription while increasing the overall score of the right
transcription (Graves et al., 2006; Palaz et al., 2014). The most popular one is certainly the Connectionist
Temporal Classification (CTC) criterion, which is at the core of Baidu's Deep Speech architecture
(Amodei et al., 2015). CTC assumes that the network outputs probability scores, normalized
at the frame level. It considers all possible sequences of letters (or any sub-word units) which can
lead to a given transcription. CTC also allows a special "blank" state to be optionally inserted
between letters. The rationale behind the blank state is two-fold: (i) modeling "garbage" frames
which might occur between letters and (ii) identifying the separation between two identical
consecutive letters in a transcription. Figure 2a shows an example of the sequences accepted by CTC
for a given transcription. In practice, this graph is unfolded as shown in Figure 2b, over the available
frames output by the acoustic model. We denote $G_{ctc}(\theta, T)$ an unfolded graph over $T$ frames for a
given transcription $\theta$, and $\pi = \pi_1, \dots, \pi_T \in G_{ctc}(\theta, T)$ a path in this graph representing a (valid)
sequence of letters for this transcription. At each time step $t$, each node of the graph is assigned
the corresponding log-probability letter score (which we denote $f_t(\cdot)$) output by the acoustic model.
CTC aims at maximizing the "overall" score of paths in $G_{ctc}(\theta, T)$; for that purpose, it minimizes the
Forward score:

$$ CTC(\theta, T) = -\operatorname{logadd}_{\pi \in G_{ctc}(\theta, T)} \sum_{t=1}^{T} f_{\pi_t}(x)\,, \qquad (2) $$

where the "logadd" operation, also often called "log-sum-exp", is defined as
$\operatorname{logadd}(a, b) = \log(\exp(a) + \exp(b))$. This overall score can be efficiently computed with the Forward algorithm. To
put things in perspective, if one were to replace the $\operatorname{logadd}(\cdot)$ by a $\max(\cdot)$ in (2) (which can then be
efficiently computed by the Viterbi algorithm, the counterpart of the Forward algorithm), one would
then maximize the score of the best path, according to the model belief. The $\operatorname{logadd}(\cdot)$ can be seen
as a smooth version of the $\max(\cdot)$: paths with similar scores will be attributed the same weight in the
overall score (and hence receive the same gradient), while paths with much larger scores will have much
more overall weight than paths with low scores. In practice, using the $\operatorname{logadd}(\cdot)$ works much better
than the $\max(\cdot)$. It is also worth noting that minimizing (2) does not diverge, as the acoustic model
is assumed to output normalized scores (log-probabilities) $f_i(\cdot)$.
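A minimal sketch of the logadd operation (our own illustration, not the paper's code), using the usual numerically stable form and showing its "smooth max" behaviour:

```python
import math

def logadd(a, b):
    """log(exp(a) + exp(b)), computed stably by factoring out the maximum."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

print(logadd(2.0, 2.0))   # 2.693..., similar scores contribute equally
print(logadd(10.0, 2.0))  # 10.00033..., dominated by the larger score (close to max)
```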
Figure 2: The CTC criterion graph. (a) Graph which represents all the acceptable sequences of letters (with the blank state denoted "∅") for the transcription "cat". (b) The same graph unfolded over 5 frames. There are no transition scores. At each time step, nodes are assigned a conditional probability output by the neural network acoustic model.
In this paper, we explore an alternative to CTC, with three differences: (i) there are no blank labels,
(ii) nodes carry un-normalized scores (and edges may carry un-normalized transition scores), and
(iii) normalization is global instead of per-frame:
The advantage of (i) is that it produces a much simpler graph (see Figure 3a and Figure 3b).
We found that in practice there was no advantage in having a blank class to model the
possible "garbage" frames between letters. Modeling letter repetitions (which is also an
important quality of the blank label in CTC) can easily be replaced by repetition character
labels (we used two extra labels for two and three repetitions). For example, "caterpillar"
could be written as "caterpil2ar", where "2" is a label representing the repetition of the
previous letter (see the encoding sketch after this list). Not having blank labels also simplifies the decoder.
With (ii) one can easily plug in an external language model, which would insert transition
scores on the edges of the graph. This could be particularly useful in future work, if one
wanted to model representations at a higher level than letters. In that respect, avoiding
normalized transitions is important to alleviate the problem of "label bias" (Bottou, 1991;
Lafferty et al., 2001). In this work, we limited ourselves to transition scalars, which are
learned together with the acoustic model.
The normalization evoked in (iii) is necessary when using un-normalized scores on nodes or
edges; it ensures that incorrect transcriptions will have a low confidence.
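A small sketch of the repetition-label encoding mentioned in (i) above (the function name is our own; runs longer than three letters, which would need an extra convention, are assumed not to occur):

```python
import itertools

def encode_repetitions(word):
    """Replace runs of 2 or 3 identical letters by the letter followed by a
    "2" or "3" repetition label, as described in Section 2.3."""
    out = []
    for letter, run in itertools.groupby(word):
        n = len(list(run))
        out.append(letter if n == 1 else letter + str(n))
    return "".join(out)

print(encode_repetitions("caterpillar"))  # -> "caterpil2ar"
```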
In the following, we name our criterion the "Auto Segmentation Criterion" (ASG). Considering the
same notations as for CTC in (2), an unfolded graph $G_{asg}(\theta, T)$ over $T$ frames for a given
transcription $\theta$ (as in Figure 3b), as well as a fully connected graph $G_{full}(\theta, T)$ over $T$ frames
(representing all possible sequences of letters, as in Figure 3c), ASG aims at minimizing:

$$ ASG(\theta, T) = -\operatorname{logadd}_{\pi \in G_{asg}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big) + \operatorname{logadd}_{\pi \in G_{full}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big)\,, \qquad (3) $$
where $g_{i,j}(\cdot)$ is a transition score model for jumping from label $i$ to label $j$. The left-hand part of
(3) promotes sequences of letters leading to the right transcription, and the right-hand part demotes all
sequences of letters. As for CTC, these two parts can be efficiently computed with the Forward
algorithm. Derivatives with respect to $f_i(\cdot)$ and $g_{i,j}(\cdot)$ can be obtained (the maths are a bit tedious) by
applying the chain rule through the Forward recursion.
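To make the two terms of (3) concrete, here is a minimal NumPy sketch of the Forward recursion over the graphs of Figure 3 (a simplified illustration under our own naming and conventions, not the C implementation described in Section 3.1); f is a T×L array of per-frame letter scores and g an L×L array of transition scores:

```python
import numpy as np
from scipy.special import logsumexp

def full_score(f, g):
    """logadd over all length-T letter sequences of sum_t f[t, pi_t] + g[pi_{t-1}, pi_t]
    (the G_full term of (3)). f: (T, L) emissions, g: (L, L) transitions."""
    alpha = f[0].copy()                        # no transition into the first frame
    for t in range(1, len(f)):
        alpha = f[t] + logsumexp(alpha[:, None] + g, axis=0)
    return logsumexp(alpha)

def asg_score(f, g, targets):
    """Same quantity restricted to paths spelling `targets` (a list of letter
    indices), with self-loops on each letter (the G_asg term of (3)).
    Repetition labels are assumed to have removed consecutive duplicates."""
    targets = np.asarray(targets)
    T, N = len(f), len(targets)
    alpha = np.full(N, -np.inf)
    alpha[0] = f[0, targets[0]]                # paths must start on the first letter
    for t in range(1, T):
        stay = alpha + g[targets, targets]                    # stay on the same letter
        move = np.full(N, -np.inf)
        move[1:] = alpha[:-1] + g[targets[:-1], targets[1:]]  # advance to the next letter
        alpha = f[t, targets] + np.logaddexp(stay, move)
    return alpha[-1]                           # paths must end on the last letter

def asg_criterion(f, g, targets):
    """Equation (3): demote all sequences, promote those matching the transcription."""
    return -asg_score(f, g, targets) + full_score(f, g)
```

In practice one would minimize asg_criterion over f and g; the derivatives mentioned above correspond to differentiating through these two Forward recursions.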
2.4 BEAM-SEARCH DECODER
We wrote our own one-pass decoder, which performs a simple beam search with beam thresholding,
histogram pruning and language model smearing (Steinbiss et al., 1994). We kept the decoder as
Figure 3: The ASG criterion graph. (a) Graph which represents all the acceptable sequences of letters for the transcription "cat". (b) The same graph unfolded over 5 frames. (c) The corresponding fully connected graph, which describes all possible sequences of letters; this graph is used for normalization purposes. Un-normalized transition scores are possible on the edges. At each time step, nodes are assigned a conditional un-normalized score, output by the neural network acoustic model.
simple as possible (under 1000 lines of C code). We did not implement any sort of model adaptation
before decoding, nor any word graph rescoring. Our decoder relies on KenLM (Heafield et al., 2013)
for the language modeling part. It also accepts un-normalized acoustic scores (transitions and
emissions from the acoustic model) as input. The decoder attempts to maximize the following:
$$ \mathcal{L}(\theta) = \operatorname{logadd}_{\pi \in G_{asg}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big) + \alpha \log P_{lm}(\theta) + \beta |\theta|\,, \qquad (4) $$

where $P_{lm}(\theta)$ is the probability of the language model given a transcription $\theta$, and $\alpha$ and $\beta$ are two
hyper-parameters which control the weight of the language model and the word insertion penalty,
respectively.
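As an illustration of how a beam hypothesis is ranked under (4) (a sketch with hypothetical names and toy numbers, not the one-pass C decoder itself):

```python
def hypothesis_score(acoustic_path_score, lm_logprob, num_words, alpha, beta):
    """Score of one decoder hypothesis, following equation (4):
    un-normalized ASG path score + alpha * log P_lm(theta) + beta * |theta|."""
    return acoustic_path_score + alpha * lm_logprob + beta * num_words

# Toy example: two competing transcriptions of the same audio.
beam = [
    {"words": ["the", "cat", "sat"], "acoustic": -12.3, "lm": -8.1},
    {"words": ["the", "cats", "at"], "acoustic": -12.0, "lm": -11.4},
]
alpha, beta = 1.0, 0.5  # in practice tuned on the validation set
best = max(
    beam,
    key=lambda h: hypothesis_score(h["acoustic"], h["lm"], len(h["words"]), alpha, beta),
)
print(" ".join(best["words"]))  # -> "the cat sat"
```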
3 EXPERIMENTS
3.1 SETUP
We implemented everything using Torch7.¹ The ASG criterion as well as the decoder were implemented
in C (and then interfaced into Torch).

We consider LibriSpeech as our benchmark, a large speech database freely available for download (Panayotov
et al., 2015). LibriSpeech comes with its own train, validation and test sets. Except when
specified otherwise, we used all the available data (about 1000h of audio files) for training and validating our
models. We use the original 16 kHz sampling rate. The vocabulary $L$ contains 30 graphemes: the
standard English alphabet plus the apostrophe, silence, and two special "repetition" graphemes which
encode the duplication (once or twice) of the previous letter (see Section 2.3).

The architecture hyper-parameters, as well as the decoder ones, were tuned using the validation set. In
the following, we report either letter error rates (LERs) or word error rates (WERs). WERs have
been obtained by using our own decoder (see Section 2.4), with the standard 4-gram language model
provided with LibriSpeech.²
¹ http://www.torch.ch.
² http://www.openslr.org/11.
REFERENCES

Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.