Under review as a conference paper at ICLR 2017
WAV2LETTER: AN END-TO-END CONVNET-BASED
SPEECH RECOGNITION SYSTEM
Ronan Collobert
Facebook AI Research, Menlo Park
locronan@fb.com
Christian Puhrsch
Facebook AI Research, Menlo Park
cpuhrsch@fb.com
Gabriel Synnaeve
Facebook AI Research, New York
gab@fb.com
ABSTRACT
This paper presents a simple end-to-end model for speech recognition, combining
a convolutional network based acoustic model and graph decoding. It is trained
to output letters, from transcribed speech, without the need for forced alignment of
phonemes. We introduce an automatic segmentation criterion for training from
sequence annotation without alignment that is on par with CTC (Graves et al.,
2006) while being simpler. We show competitive results in word error rate on the
Librispeech corpus (Panayotov et al., 2015) with MFCC features, and promising
results from raw waveform.
1 INTRODUCTION
We present an end-to-end system for speech recognition, going from the speech signal (e.g. Mel-Frequency Cepstral Coefficients (MFCCs), power spectrum, or raw waveform) to the transcription.
The acoustic model is trained using letters (graphemes) directly, which removes the need for an
intermediate (human or automatic) phonetic transcription. Indeed, the classical pipeline for building
state-of-the-art speech recognition systems consists in first training an HMM/GMM model to
force-align the units on which the final acoustic model operates (most often context-dependent phone
states). This approach takes its roots in HMM/GMM training (Woodland & Young, 1993). The
improvements brought by deep neural networks (DNNs) (Mohamed et al., 2012; Hinton et al., 2012)
and convolutional neural networks (CNNs) (Sercu et al., 2015; Soltau et al., 2014) for acoustic
modeling only extend this training pipeline.
The current state of the art on Librispeech (the dataset that we used for our evaluations) uses this
approach too (Panayotov et al., 2015; Peddinti et al., 2015b), with an additional step of speaker
adaptation (Saon et al., 2013; Peddinti et al., 2015a). Recently, Senior et al. (2014) proposed GMM-
free training, but the approach still requires generating a forced alignment. An approach that cut ties
with the HMM/GMM pipeline (and with forced alignment) was to train a recurrent neural network
(RNN) (Graves et al., 2013) for phoneme transcription. There are now competitive end-to-end
approaches with acoustic models topped with RNN layers, as in (Hannun et al., 2014; Miao et al.,
2015; Saon et al., 2015; Amodei et al., 2015), trained with a sequence criterion (Graves et al., 2006).
However, these models are computationally expensive, and thus take a long time to train.
Compared to classical approaches that need phonetic annotation (often derived from a phonetic
dictionary, rules, and generative training), we propose to train the model end-to-end, using graphemes
directly. Compared to sequence-criterion-based approaches that train directly from speech signal to
graphemes (Miao et al., 2015), we propose a simple(r) architecture (23 million parameters for our
best model, vs. 100 million parameters in (Amodei et al., 2015)) based on convolutional networks
for the acoustic model, topped with a graph transformer network (Bottou et al., 1997), trained with
a simpler sequence criterion. Our word error rate on clean speech is slightly better than (Hannun
et al., 2014), and slightly worse than (Amodei et al., 2015), in particular considering that they train on
12,000 hours while we only train on the 960h available in LibriSpeech's train set. Finally, some of
our models are also trained on the raw waveform, as in (Palaz et al., 2013; 2015; Sainath et al., 2015).
The rest of the paper is structured as follows: the next section presents the convolutional networks
used for acoustic modeling, along with the automatic segmentation criterion. The following section
shows experimental results comparing different features, the criterion, and our current best word error
rates on LibriSpeech.
2 ARCHITECTURE
Our speech recognition system is a standard convolutional neural network (LeCun & Bengio, 1995)
fed with various input features, trained through an alternative to the Connectionist Temporal
Classification (CTC) (Graves et al., 2006), and coupled with a simple beam-search decoder. In the
following sub-sections, we detail each of these components.
2.1 FEATURES
Figure 1: Our neural network architecture for raw wave. Layers, from input to output: CONV kw=250, dw=160 (1 → 250); CONV kw=48, dw=2 (250 → 250); 7 × CONV kw=7 (250 → 250); CONV kw=32 (250 → 2000); CONV kw=1 (2000 → 2000); CONV kw=1 (2000 → 40). The first two layers are convolutions with strides. The last two layers are convolutions with kw = 1, which are equivalent to fully connected layers. Power spectrum and MFCC based networks do not have the first layer.
We consider three types of input features for our model: MFCCs,
power spectrum, and raw wave. MFCCs are carefully designed
speech-specific features, often found in classical HMM/GMM
speech systems (Woodland & Young, 1993) because of their dimensionality
compression (13 coefficients are often enough to span
speech frequencies). Power-spectrum features are found in most
recent deep learning acoustic models (Amodei et al., 2015). Raw wave
has been somewhat explored in a few recent works (Palaz et al., 2013; 2015).
ConvNets have the advantage of being flexible enough to be used with
any of these input feature types. Our acoustic models output letter scores
(one score per letter, given a dictionary L).
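For illustration only (this is not the paper's feature pipeline), the three input types could be computed with standard tools such as librosa; the file name, FFT size and hop length below are our own assumptions:

```python
import numpy as np
import librosa

# Load an utterance at the 16 kHz sampling rate used by LibriSpeech.
wave, sr = librosa.load("utterance.flac", sr=16000)

# Raw wave: the samples themselves, fed directly to the network.
raw = wave

# Power spectrum: squared magnitude of a short-time Fourier transform
# (window and hop sizes here are illustrative, not taken from the paper).
power_spectrum = np.abs(librosa.stft(wave, n_fft=512, hop_length=160)) ** 2

# MFCCs: 13 coefficients are often enough to span speech frequencies.
mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13)
```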
2.2 CONVNET ACOUSTIC MODEL
The acoustic models we considered in this paper are all based on
standard 1D convolutional neural networks (ConvNets). ConvNets
interleave convolution operations with pointwise non-linearity operations.
ConvNets often also include pooling layers: this type of layer allows the
network to "see" a larger context, without increasing the number of parameters,
by locally aggregating the output of the previous convolution operation.
Instead, our networks leverage striding convolutions. Given an input sequence
$(x_t)_{t=1 \dots T_x}$ with $T_x$ frames of $d_x$-dimensional vectors, a
convolution with kernel width $kw$, stride $dw$ and output frame size $d_y$
computes the following:

$$ y_t^i = b_i + \sum_{j=1}^{d_x} \sum_{k=1}^{kw} w_{i,j,k}\, x^j_{dw \times (t-1)+k}\,, \quad \forall\, 1 \le i \le d_y, \qquad (1) $$

where $b \in \mathbb{R}^{d_y}$ and $w \in \mathbb{R}^{d_y \times d_x \times kw}$ are the parameters of the convolution (to be learned).
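As a purely illustrative sketch of equation (1) (not the Torch7 implementation used in the paper; the function name and array layouts are our own), a naive NumPy version reads:

```python
import numpy as np

def strided_conv1d(x, w, b, dw):
    """Naive 1D convolution of equation (1).
    x: (T_x, d_x) input frames; w: (d_y, d_x, kw) weights; b: (d_y,) biases;
    dw: stride. Returns y of shape (T_y, d_y) with T_y = (T_x - kw) // dw + 1."""
    T_x, d_x = x.shape
    d_y, _, kw = w.shape
    T_y = (T_x - kw) // dw + 1
    y = np.empty((T_y, d_y))
    for t in range(T_y):
        window = x[t * dw : t * dw + kw]   # the kw frames seen by output step t
        # y[t, i] = b[i] + sum_{j, k} w[i, j, k] * window[k, j]
        y[t] = b + np.einsum("ijk,kj->i", w, window)
    return y
```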
Pointwise non-linear layers are added after convolutional layers.
In our experience, we surprisingly found that using hyperbolic
tangents, their piecewise-linear counterpart HardTanh (as in (Palaz
et al., 2015)), or ReLU units leads to similar results.
There are some slight variations between the architectures, depending
on the input features. MFCC-based networks need less striding,
as standard MFCC filters are applied with large strides on the input
raw sequence. With power-spectrum-based and raw-wave-based networks, we observed that the
overall stride of the network was more important than where the strided convolutions were placed.
We thus found it preferable to set the strided convolutions near the first input layers of the network, as
this leads to the fastest architectures: with power spectrum features or raw wave, the input sequences
are very long and the first convolutions are thus the most expensive ones.

The last layer of our convolutional network outputs one score per letter in the letter dictionary
($d_y = |L|$). Our architecture for raw wave is shown in Figure 1 and is inspired by (Palaz et al., 2015).
The architectures for both power spectrum and MFCC features do not include the first layer. The
full network can be seen as a non-linear convolution, with a kernel width of 31280 and a stride
equal to 320; given that the sample rate of our data is 16 kHz, label scores are produced using a window
of 1955 ms, with steps of 20 ms.
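As a quick sanity check of these figures (a trivial sketch; the kernel width and stride values are the ones quoted above):

```python
# Convert the overall kernel width and stride of the composed network
# into durations, given the 16 kHz sample rate of the data.
SAMPLE_RATE = 16000   # Hz
overall_kw = 31280    # samples spanned by one output label score
overall_dw = 320      # samples between two consecutive label scores

window_ms = 1000 * overall_kw / SAMPLE_RATE  # 31280 / 16000 s = 1955 ms
step_ms = 1000 * overall_dw / SAMPLE_RATE    # 320 / 16000 s = 20 ms
print(window_ms, step_ms)                    # 1955.0 20.0
```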
2.3 INFERRING SEGMENTATION WITH AUTOSEGCRITERION
Most large labeled speech databases provide only a text transcription for each audio file. In a
classification framework (and given that our acoustic model produces letter predictions), one would
need the segmentation of each letter in the transcription to properly train the model. Unfortunately,
manually labeling the segmentation of each letter would be tedious. Several solutions have been
explored in the speech community to alleviate this issue. HMM/GMM models use an iterative EM
procedure: (i) during the Estimation step, the best segmentation is inferred, according to the current
model, by maximizing the joint probability of the letter (or any sub-word unit) transcription and input
sequence; (ii) during the Maximization step, the model is optimized by minimizing a frame-level
criterion, based on the (now fixed) inferred segmentation. This approach is also often used to bootstrap
the training of neural network-based acoustic models.
Other alternatives have been explored in the context of hybrid HMM/NN systems, such as the MMI
criterion (Bahl et al., 1986), which maximizes the mutual information between the acoustic sequence
and word sequences, or the Minimum Bayes Risk (MBR) criterion (Gibson & Hain, 2006).
More recently, standalone neural network architectures have been trained using criteria which
jointly infer the segmentation of the transcription while increasing the overall score of the right
transcription (Graves et al., 2006; Palaz et al., 2014). The most popular one is certainly the Connectionist
Temporal Classification (CTC) criterion, which is at the core of Baidu's Deep Speech architecture
(Amodei et al., 2015). CTC assumes that the network outputs probability scores, normalized
at the frame level. It considers all possible sequences of letters (or any sub-word units) which can
lead to a given transcription. CTC also allows a special "blank" state to be optionally inserted
between letters. The rationale behind the blank state is two-fold: (i) modeling "garbage" frames
which might occur between letters and (ii) identifying the separation between two identical
consecutive letters in a transcription. Figure 2a shows an example of the sequences accepted by CTC
for a given transcription. In practice, this graph is unfolded as shown in Figure 2b, over the available
frames output by the acoustic model. We denote $G_{ctc}(\theta, T)$ an unfolded graph over $T$ frames for a
given transcription $\theta$, and $\pi = \pi_1, \dots, \pi_T \in G_{ctc}(\theta, T)$ a path in this graph representing a (valid)
sequence of letters for this transcription. At each time step $t$, each node of the graph is assigned
the corresponding log-probability letter score (which we denote $f_t(\cdot)$) output by the acoustic model.
CTC aims at maximizing the "overall" score of paths in $G_{ctc}(\theta, T)$; for that purpose, it minimizes the
Forward score:

$$ CTC(\theta, T) = -\operatorname{logadd}_{\pi \in G_{ctc}(\theta, T)} \sum_{t=1}^{T} f_{\pi_t}(x)\,, \qquad (2) $$

where the "logadd" operation, also often called "log-sum-exp", is defined as
$\operatorname{logadd}(a, b) = \log(\exp(a) + \exp(b))$. This overall score can be efficiently computed with the Forward algorithm. To
put things in perspective, if one were to replace the $\operatorname{logadd}(\cdot)$ by a $\max(\cdot)$ in (2) (which can then be
efficiently computed by the Viterbi algorithm, the counterpart of the Forward algorithm), one would
then maximize the score of the best path, according to the model belief. The $\operatorname{logadd}(\cdot)$ can be seen
as a smooth version of the $\max(\cdot)$: paths with similar scores will be attributed the same weight in the
overall score (and hence receive the same gradient), while paths with much larger scores will have much
more overall weight than paths with low scores. In practice, using the $\operatorname{logadd}(\cdot)$ works much better
than the $\max(\cdot)$. It is also worth noting that minimizing (2) does not diverge, as the acoustic model
is assumed to output normalized scores (log-probabilities) $f_i(\cdot)$.
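A minimal sketch of the logadd operation (our own illustration, not the paper's code), using the usual numerically stable form and showing its "smooth max" behaviour:

```python
import math

def logadd(a, b):
    """log(exp(a) + exp(b)), computed stably by factoring out the maximum."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

print(logadd(2.0, 2.0))   # 2.693..., similar scores contribute equally
print(logadd(10.0, 2.0))  # 10.00033..., dominated by the larger score (close to max)
```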
Figure 2: The CTC criterion graph. (a) Graph which represents all the acceptable sequences of letters (with the blank state denoted "∅") for the transcription "cat". (b) The same graph unfolded over 5 frames. There are no transition scores. At each time step, nodes are assigned a conditional probability output by the neural network acoustic model.
In this paper, we explore an alternative to CTC, with three differences: (i) there are no blank labels,
(ii) nodes carry un-normalized scores (and edges may carry un-normalized transition scores), and
(iii) normalization is global instead of per-frame:
The advantage of (i) is that it produces a much simpler graph (see Figure 3a and Figure 3b).
We found that in practice there was no advantage in having a blank class to model the
possible "garbage" frames between letters. Modeling letter repetitions (which is also an
important quality of the blank label in CTC) can easily be replaced by repetition character
labels (we used two extra labels for two and three repetitions). For example, "caterpillar"
could be written as "caterpil2ar", where "2" is a label representing the repetition of the
previous letter (see the encoding sketch after this list). Not having blank labels also simplifies the decoder.
With (ii) one can easily plug in an external language model, which would insert transition
scores on the edges of the graph. This could be particularly useful in future work, if one
wanted to model representations at a higher level than letters. In that respect, avoiding
normalized transitions is important to alleviate the problem of "label bias" (Bottou, 1991;
Lafferty et al., 2001). In this work, we limited ourselves to transition scalars, which are
learned together with the acoustic model.
The normalization evoked in (iii) is necessary when using un-normalized scores on nodes or
edges; it ensures that incorrect transcriptions will have a low confidence.
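A small sketch of the repetition-label encoding mentioned in (i) above (the function name is our own; runs longer than three letters, which would need an extra convention, are assumed not to occur):

```python
import itertools

def encode_repetitions(word):
    """Replace runs of 2 or 3 identical letters by the letter followed by a
    "2" or "3" repetition label, as described in Section 2.3."""
    out = []
    for letter, run in itertools.groupby(word):
        n = len(list(run))
        out.append(letter if n == 1 else letter + str(n))
    return "".join(out)

print(encode_repetitions("caterpillar"))  # -> "caterpil2ar"
```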
In the following, we name our criterion the "Auto Segmentation Criterion" (ASG). Considering the
same notations as for CTC in (2), an unfolded graph $G_{asg}(\theta, T)$ over $T$ frames for a given
transcription $\theta$ (as in Figure 3b), as well as a fully connected graph $G_{full}(\theta, T)$ over $T$ frames
(representing all possible sequences of letters, as in Figure 3c), ASG aims at minimizing:

$$ ASG(\theta, T) = -\operatorname{logadd}_{\pi \in G_{asg}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big) + \operatorname{logadd}_{\pi \in G_{full}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big)\,, \qquad (3) $$
where $g_{i,j}(\cdot)$ is a transition score model for jumping from label $i$ to label $j$. The left-hand part of
(3) promotes sequences of letters leading to the right transcription, and the right-hand part demotes all
sequences of letters. As for CTC, these two parts can be efficiently computed with the Forward
algorithm. Derivatives with respect to $f_i(\cdot)$ and $g_{i,j}(\cdot)$ can be obtained (the maths are a bit tedious) by
applying the chain rule through the Forward recursion.
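To make the two terms of (3) concrete, here is a minimal NumPy sketch of the Forward recursion over the graphs of Figure 3 (a simplified illustration under our own naming and conventions, not the C implementation described in Section 3.1); f is a T×L array of per-frame letter scores and g an L×L array of transition scores:

```python
import numpy as np
from scipy.special import logsumexp

def full_score(f, g):
    """logadd over all length-T letter sequences of sum_t f[t, pi_t] + g[pi_{t-1}, pi_t]
    (the G_full term of (3)). f: (T, L) emissions, g: (L, L) transitions."""
    alpha = f[0].copy()                        # no transition into the first frame
    for t in range(1, len(f)):
        alpha = f[t] + logsumexp(alpha[:, None] + g, axis=0)
    return logsumexp(alpha)

def asg_score(f, g, targets):
    """Same quantity restricted to paths spelling `targets` (a list of letter
    indices), with self-loops on each letter (the G_asg term of (3)).
    Repetition labels are assumed to have removed consecutive duplicates."""
    targets = np.asarray(targets)
    T, N = len(f), len(targets)
    alpha = np.full(N, -np.inf)
    alpha[0] = f[0, targets[0]]                # paths must start on the first letter
    for t in range(1, T):
        stay = alpha + g[targets, targets]                    # stay on the same letter
        move = np.full(N, -np.inf)
        move[1:] = alpha[:-1] + g[targets[:-1], targets[1:]]  # advance to the next letter
        alpha = f[t, targets] + np.logaddexp(stay, move)
    return alpha[-1]                           # paths must end on the last letter

def asg_criterion(f, g, targets):
    """Equation (3): demote all sequences, promote those matching the transcription."""
    return -asg_score(f, g, targets) + full_score(f, g)
```

In practice one would minimize asg_criterion over f and g; the derivatives mentioned above correspond to differentiating through these two Forward recursions.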
2.4 BEAM-SEARCH DECODER
We wrote our own one-pass decoder, which performs a simple beam search with beam thresholding,
histogram pruning and language model smearing (Steinbiss et al., 1994). We kept the decoder as
Figure 3: The ASG criterion graph. (a) Graph which represents all the acceptable sequences of letters for the transcription "cat". (b) The same graph unfolded over 5 frames. (c) The corresponding fully connected graph, which describes all possible sequences of letters; this graph is used for normalization purposes. Un-normalized transition scores are possible on the edges. At each time step, nodes are assigned a conditional un-normalized score, output by the neural network acoustic model.
simple as possible (under 1000 lines of C code). We did not implement any sort of model adaptation
before decoding, nor any word graph rescoring. Our decoder relies on KenLM (Heafield et al., 2013)
for the language modeling part. It also accepts un-normalized acoustic scores (transitions and
emissions from the acoustic model) as input. The decoder attempts to maximize the following:
$$ \mathcal{L}(\theta) = \operatorname{logadd}_{\pi \in G_{asg}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big) + \alpha \log P_{lm}(\theta) + \beta |\theta|\,, \qquad (4) $$

where $P_{lm}(\theta)$ is the probability of the language model given a transcription $\theta$, and $\alpha$ and $\beta$ are two
hyper-parameters which control the weight of the language model and the word insertion penalty,
respectively.
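As an illustration of how a beam hypothesis is ranked under (4) (a sketch with hypothetical names and toy numbers, not the one-pass C decoder itself):

```python
def hypothesis_score(acoustic_path_score, lm_logprob, num_words, alpha, beta):
    """Score of one decoder hypothesis, following equation (4):
    un-normalized ASG path score + alpha * log P_lm(theta) + beta * |theta|."""
    return acoustic_path_score + alpha * lm_logprob + beta * num_words

# Toy example: two competing transcriptions of the same audio.
beam = [
    {"words": ["the", "cat", "sat"], "acoustic": -12.3, "lm": -8.1},
    {"words": ["the", "cats", "at"], "acoustic": -12.0, "lm": -11.4},
]
alpha, beta = 1.0, 0.5  # in practice tuned on the validation set
best = max(
    beam,
    key=lambda h: hypothesis_score(h["acoustic"], h["lm"], len(h["words"]), alpha, beta),
)
print(" ".join(best["words"]))  # -> "the cat sat"
```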
3 EXPERIMENTS
3.1 SETUP
We implemented everything using Torch7.¹ The ASG criterion as well as the decoder were implemented
in C (and then interfaced into Torch).

We consider LibriSpeech as our benchmark, a large speech database freely available for download (Panayotov
et al., 2015). LibriSpeech comes with its own train, validation and test sets. Except when
specified otherwise, we used all the available data (about 1000h of audio files) for training and validating our
models. We use the original 16 kHz sampling rate. The vocabulary $L$ contains 30 graphemes: the
standard English alphabet plus the apostrophe, silence, and two special "repetition" graphemes which
encode the duplication (once or twice) of the previous letter (see Section 2.3).

The architecture hyper-parameters, as well as the decoder ones, were tuned using the validation set. In
the following, we report either letter error rates (LERs) or word error rates (WERs). WERs have
been obtained by using our own decoder (see Section 2.4), with the standard 4-gram language model
provided with LibriSpeech.²
¹ http://www.torch.ch.
² http://www.openslr.org/11.
REFERENCES

Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.