
IMPROVING CONNECTED LETTER RECOGNITION BY LIPREADING

Christoph Bregler*, Hermann Hild, Stefan Manke, and Alex Waibel

University of Karlsruhe, Department of Computer Science,
Am Fasanengarten 5, 7500 Karlsruhe 1, Germany
bregler@ira.uka.de, manke@ira.uka.de

Carnegie Mellon University, School of Computer Science,
Pittsburgh, Pennsylvania 15213, U.S.A.
hhild@cs.cmu.edu, ahw@cs.cmu.edu

* The author is now with the International Computer Science Institute,
1947 Center Street, Berkeley, CA 94704
ABSTRACT

In this paper we show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called "speechreading". We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined to acoustic-visual hypotheses for the Dynamic Time Warping algorithm. This is shown on a connected word recognition problem, the notoriously difficult letter spelling task. With speechreading, the error rate could be reduced to as little as half that of pure acoustic recognition.
1. INTRODUCTION

Automated speech perception systems still perform poorly when it comes to real-world applications. Most approaches are very sensitive to background noise or fail totally when more than one speaker talks simultaneously (cocktail party effect), as often happens in offices, cockpits, outdoors, and other real-world environments.

Humans deal with these distortions by considering additional sources of information. Very often misclassified acoustic signals can be corrected with the use of higher-level context information. In recognition systems this is partly covered by language models or grammars. Psychological studies have shown [3] that on the lower level additional information contributes to human hearing as well. Besides the acoustic signal from both ears, visual information, mostly lip movements, is subconsciously involved in the recognition process. This source is even more important for hearing-impaired people, but it also contributes significantly to normal-hearing recognition.
We investigate this phenomenon on the letter spelling task. No grammars or other higher-level information are employed. If visual information is missing as well, even humans perform poorly; just remember how hard it is to recognize spelled names over the telephone.

The spelling task is seen as a connected word recognition problem. As words we take the highly ambiguous 26 German letters. A test person in front of a microphone and video camera spells names and random letter sequences in German. We did not aim for high-quality recordings; we even degraded the acoustic signal with artificial noise to simulate real-world conditions.

As speech recognition system we present an extension of an existing Multi-State Time Delay Neural Network architecture (MS-TDNN) [6] for handling both modalities, acoustic and visual sensor input. It is shown how recognition performance with integrated acoustic and visual information achieves significant improvements over acoustic input only.
2. BIMODAL ACQUISITION AND PREPROCESSING

Our recording setup consists of a conventional NTSC camera and a microphone. The video images are grabbed in real-time (30 full frames/sec) into our workstation and are saved as 256x256 pixel images with 8-bit grey-level information per pixel. This square region covers the full face of the speaker. In parallel, the acoustic data is sampled at a 16 kHz rate with 12-bit resolution. Timestamps were also saved, because the correct synchronization between audio and video signals is critical for later processing.

For acoustic preprocessing we follow the established approach of applying an FFT to the Hamming-windowed speech data in order to get 16 mel-scale Fourier coefficients at a 10 msec frame rate.
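As a rough illustration of this acoustic front end (a sketch, not the authors' code), the following Python fragment frames the signal at a 10 ms rate, applies a Hamming window and FFT, and reduces each frame to 16 log mel-scale coefficients. The 25 ms analysis window and the triangular filterbank construction are assumptions the paper does not specify.

```python
# Minimal sketch of a mel-scale FFT front end (assumptions: 25 ms window,
# triangular mel filterbank); the paper only fixes 16 coefficients / 10 ms.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters equally spaced on the mel scale up to Nyquist.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def melscale_frames(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_coeffs=16):
    frame_len = int(sample_rate * frame_ms / 1000)   # assumed 25 ms analysis window
    hop = int(sample_rate * hop_ms / 1000)           # 10 ms frame rate as in the text
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_coeffs, frame_len, sample_rate)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = np.abs(np.fft.rfft(signal[start:start + frame_len] * window)) ** 2
        frames.append(np.log(fbank @ power + 1e-10))  # 16 log mel coefficients
    return np.array(frames)
```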
For visual preprocessing there is still an active discussion about how much preprocessing heuristics are appropriate before a connectionist classification scheme is applied to the data. We follow the idea of allowing only transformations with fairly low information reduction. In other preprocessing algorithms like edge
detection, some "hard decisions" are made, which may hide useful information from the later learning scheme. In fact it has been reported [10] that such edge detectors are learned automatically in cases where it is necessary.

We apply two alternative preprocessing techniques: histogram-normalized grey-value coding, or two-dimensional Fourier transformation. In both cases we consider only an area of interest (AOI) centered around the lips, and low-pass filter these AOIs. The AOIs were initially segmented by hand, but an automatic procedure is now also available [11].
Grey-value coding: We found that a 24x16 pixel resolution is enough to recognize lip shapes and movements (Figure 1). Each of these AOI pixels is the average grey-value of a small square in the original image (low-pass filter). The grey-levels are rescaled in such a way that the darkest/brightest 5% in the histogram are coded with -1.0/1.0. The remaining 90% are scaled linearly between -1.0 and 1.0.
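A minimal sketch of this grey-value coding follows (our interpretation, not the original implementation); the block-averaging low-pass filter and the use of the 5th/95th percentiles as clipping points are assumptions.

```python
# Sketch of histogram-normalized grey-value coding: a 24x16 AOI whose
# darkest/brightest 5% map to -1.0/+1.0 and the remaining 90% map linearly.
import numpy as np

def grey_value_code(aoi_image, out_shape=(16, 24)):
    # Low-pass by block-averaging the AOI down to 24x16 pixels (rows, cols).
    h, w = aoi_image.shape
    bh, bw = h // out_shape[0], w // out_shape[1]
    small = aoi_image[:bh * out_shape[0], :bw * out_shape[1]]
    small = small.reshape(out_shape[0], bh, out_shape[1], bw).mean(axis=(1, 3))
    # Histogram normalization: clip at the 5th/95th percentiles, scale to [-1, 1].
    lo, hi = np.percentile(small, [5, 95])
    return np.clip(2.0 * (small - lo) / (hi - lo + 1e-10) - 1.0, -1.0, 1.0)
```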
2D-FFT: The AOI is rescaled to a 64x64 pixel image so that the two-dimensional FFT also results in 64x64 coefficients. We consider only the log magnitudes of the first 13x13 FFT coefficients and rescale them to [-1.0, 1.0]. (After multiplying the complex FFT space with a 13x13 window and applying the inverse FFT, we could still recognize the distinct lip shapes and movements in the resulting low-passed original image.) The motivation for considering the FFT is that this coding is spatially shift invariant, which makes the recognition more stable against inaccurate AOI positioning.
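The corresponding 2D-FFT coding could be sketched as follows (again an interpretation; the nearest-neighbour rescaling and the per-image min/max rescaling to [-1.0, 1.0] are assumptions).

```python
# Sketch of the 2D-FFT coding: rescale the AOI to 64x64, take the log
# magnitudes of the first 13x13 FFT coefficients, rescale to [-1, 1].
import numpy as np

def fft_code(aoi_image):
    # Nearest-neighbour rescale to 64x64 (an assumption; any resampler works).
    h, w = aoi_image.shape
    rows = np.arange(64) * h // 64
    cols = np.arange(64) * w // 64
    img64 = aoi_image[np.ix_(rows, cols)].astype(float)
    logmag = np.log(np.abs(np.fft.fft2(img64))[:13, :13] + 1e-10)
    lo, hi = logmag.min(), logmag.max()
    return 2.0 * (logmag - lo) / (hi - lo + 1e-10) - 1.0
```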
Figure 1: Typical AOIs
3. SYSTEM ARCHITECTURE

As recognition system we use a modular MS-TDNN [6]. Figure 2 shows the architecture. The preprocessed acoustic and visual data are fed into two front-end TDNNs [14], respectively. Each TDNN consists of an input layer, one hidden layer, and the phone-state layer. Backpropagation was applied to train the networks in a bootstrapping phase to fit phoneme targets.
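To make the front-end structure concrete, here is a hedged sketch of one such TDNN expressed as 1-D convolutions over the frame sequence. The time-delay window sizes (3 and 5) and the activation functions are assumptions; the 16 input coefficients, 65 phone-states, and 15 hidden units for the acoustic subnet are taken from the text.

```python
# Rough sketch (not the original implementation) of one front-end TDNN:
# time-delay layers as 1-D convolutions ending in per-frame phone-state scores.
import torch
import torch.nn as nn

class FrontEndTDNN(nn.Module):
    def __init__(self, n_inputs=16, n_hidden=15, n_phone_states=65):
        super().__init__()
        # Each unit sees a small time window of the layer below (the "time delay").
        self.hidden = nn.Conv1d(n_inputs, n_hidden, kernel_size=3)
        self.phone_states = nn.Conv1d(n_hidden, n_phone_states, kernel_size=5)

    def forward(self, frames):                        # frames: (batch, n_inputs, time)
        h = torch.tanh(self.hidden(frames))
        return torch.sigmoid(self.phone_states(h))    # (batch, 65, time')
```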
Above the two phone-state layers, the Dynamic Time Warping algorithm [8] is applied (in the DTW layer) to find the optimal path of phone hypotheses for the word models (the German alphabet). In the letter layer the activations of the phone-state units along the optimal paths are accumulated; the highest score of the letter units represents the recognized letter. In a second phase the networks are trained to fit letter targets. The error derivatives are backpropagated from the letter units through the best path in the DTW layer down to the front-end TDNNs, ensuring that the network is optimized for the actual evaluation task, which is letter and not phoneme recognition. As before, the acoustic and visual subnets are trained individually.
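The role of the DTW layer can be illustrated with a generic one-stage dynamic-programming alignment (a simplified sketch, not the exact algorithm of [8]): each letter model is a fixed sequence of phone-state indices, and the activations along the best monotone path through it are accumulated.

```python
# Sketch of the DTW-layer idea: accumulate phone-state activations along the
# best monotone alignment of the frame sequence with one word (letter) model.
import numpy as np

def dtw_word_score(activations, word_states):
    # activations: (time, n_phone_states); word_states: list of state indices.
    T, _ = activations.shape
    S = len(word_states)
    score = np.full((T, S), -np.inf)
    score[0, 0] = activations[0, word_states[0]]
    for t in range(1, T):
        for s in range(S):
            best_prev = score[t - 1, s]                          # stay in state
            if s > 0:
                best_prev = max(best_prev, score[t - 1, s - 1])  # advance a state
            score[t, s] = best_prev + activations[t, word_states[s]]
    return score[T - 1, S - 1]   # accumulated score for this letter model
```

The recognized letter would then be the arg max of this score over the 26 letter models.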
In the final "combined mode" of the recognizer, a combined phone-state layer is included between the front-end TDNNs and the DTW layer. The activation of each combined phone-state unit is the weighted sum of the corresponding acoustic phone-state unit and visual phone-state unit. We call these weights "entropy weights", because their values are proportional to the relative entropy between all acoustic phone-state activations and all visual phone-state activations. Hypotheses with higher uncertainty (higher entropy) are weighted lower than hypotheses with lower uncertainty.
Figure 2: Neural Network Architecture (DTW layer, phoneme/viseme-state layer, hidden layer, input layer with FFT or AOI inputs)
4. PHONEMES & VISEMES

For the acoustic classification we use a set of 65 phoneme states (phoneme-to-phoneme transition states included). They represent a reasonable choice of the smallest acoustically distinguishable units in German speech, and the TDNN architecture is very well suited to be trained as a classifier for them.

For visual features this is different. Distinct sounds are generated by distinct vocal tract positions and voiced/unvoiced excitations. External features of the vocal tract like the lips, part of the tongue, and the teeth contribute only in part to the sound generation. For example, /b/ and /p/ are generated by similar lip movements and cannot be distinguished with pure visual information. Training a TDNN to
classify /b/ and /p/ based only on visual information would lead to recognition rates no better than guessing, or the net might become sensitive to features which are uncorrelated with the produced speech. This leads to the design of a smaller set of visually distinguishable units in speech, so-called "visemes". We investigate a new set of 42 visemes and a 1-to-n mapping from the viseme set to the phoneme set. The mapping is necessary for the combined layer, in order to calculate the combined acoustic and visual hypotheses for the DTW layer. For example, the hypotheses for /b/ and /p/ are built out of the same viseme /b-or-p/ but the different phonemes /b/ and /p/, respectively.
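As an illustration of the 1-to-n mapping and the combination step, a small sketch follows; the viseme names and mapping entries are invented examples, not the paper's actual 42-viseme set, and the weights correspond to the entropy weights described in Section 5.

```python
# Illustrative 1-to-n viseme-to-phoneme mapping and combined hypothesis formation.
VISEME_TO_PHONEMES = {
    "b_or_p": ["b", "p"],     # same lip closure, acoustically distinct
    "f_or_v": ["f", "v"],     # invented example entries
}

def combined_hypotheses(acoustic, visual, weight_a, weight_v):
    # acoustic: phoneme -> activation, visual: viseme -> activation.
    combined = {}
    for viseme, phonemes in VISEME_TO_PHONEMES.items():
        for ph in phonemes:
            combined[ph] = (weight_a * acoustic.get(ph, 0.0)
                            + weight_v * visual.get(viseme, 0.0))
    return combined
```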
5. SIMULATION RESULTS
Our database consists of 114 and 350 letter sequences spelled by two male speakers. They consist of names and random sequences. The first data set was split into 75 training and 39 test sequences (speaker msm). The second data set was split into 200 training and 150 test sequences (speaker mcb).

Best results were achieved with 15 hidden units in the acoustic subnet and 7 hidden units in the visual subnet. Obviously visual speech data contains less information than acoustic data; therefore better generalization was achieved with as few as 7 hidden units.
Backpropagation was applied with a learning rate of 0.05 and a momentum of 0.5. We applied different error functions to compute the error derivatives: for bootstrapping the McClelland error measure was used, and for the global training on letter targets the Classification Figure of Merit [16] was used.
Table 1: Sentence-level recognition results (only fragments of the table are legible: mcb/noisy, 59.0%, 46.9%, 69.6%)
Table 1 summarizes the recognition performance results on the sentence level. Errors are misclassified words, insertion errors, and deletion errors. For speaker "msm", we get an error reduction on clean data from 11.2% (acoustic only) down to 6.8% with additional visual data. With noise added to the acoustic data, the error rate was 52.8%, and it could be reduced to 24.4% with lipreading, which means an error reduction to less than half of that of the pure acoustic recognition. For speaker "mcb" we could not get the same error reduction. Obviously the pronunciation of speaker "mcb" was better, but he was not moving his lips as much.
It should also be noted that in the pure visual recognition many of the errors are caused by insertions and deletions. When we presented the letters with known boundaries, we obtained visual recognition rates of up to 50.2%. The results of Table 1 were achieved with histogram-normalized grey-value images. Experiments with 2D-FFT images are still in progress; in our initial 2D-FFT simulations the visual recognition errors are on average about 8% higher than with the grey-level coding.
We also took a closer look at the dynamic behavior of the entropy weights. Figure 3 shows the weights from the acoustic and visual TDNNs to the combined layer over time while the letter sequence M-I-E was spoken. The upper dots represent the acoustic weight A and the lower dots the visual weight V, where A = 0.5 + (entropy(visual TDNN) - entropy(acoustic TDNN)) / 2K and V = 1.0 - A. Big white dots represent weights close to 1.0 and big black dots weights close to 0.0. K is the maximum entropy difference in the training set.

Figure 3: Entropy weights A (acoustic, upper row) and V (visual, lower row) over the phoneme segments of the spoken sequence M-I-E

At the end of the /m/-phoneme, when the lips are closed, V is higher than A. Obviously the visual hypotheses are more certain there than the acoustic ones. During the /e/-phoneme the acoustic hypotheses are more certain than the visual ones, which also makes sense.
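A small sketch of how these entropy weights could be computed from the two phone-state activation vectors is given below. It reflects our reading of the formula above; normalizing the activations to a probability distribution before taking the entropy is an assumption.

```python
# Sketch of the entropy-based weighting: the less certain (higher-entropy)
# modality gets the lower weight.  K is the maximum entropy difference
# observed on the training set, so A stays in [0, 1].
import numpy as np

def entropy(activations):
    p = np.asarray(activations, dtype=float)
    p = p / (p.sum() + 1e-10)             # treat activations as a distribution
    return -np.sum(p * np.log(p + 1e-10))

def entropy_weights(acoustic_act, visual_act, K):
    diff = entropy(visual_act) - entropy(acoustic_act)
    A = 0.5 + diff / (2.0 * K)            # acoustic weight
    V = 1.0 - A                           # visual weight
    return A, V
```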
6. RELATED WORK

The interest in automated speechreading (or lipreading) has been growing recently. As a non-connectionist approach, the work of Petajan et al. [9] should be mentioned. Yuhas et al. [15] used a neural network for vowel recognition, working on static images. Stork et al. [13] used a conventional TDNN (without DTW) for speechreading. They limited
the task to recognizing 10 isolated letters and used artificial markers on the lips. No visual feature extraction was integrated into their model.

Also of interest are some psychological studies about human speechreading and their approaches to describing human performance. These measurements could also be applied to the performance analysis of automated speechreading systems. Dodd and Campbell [3], and Demorest and Bernstein [2], did valuable work in this area.
7. CONCLUSION AND FUTURE WORK

We have shown how a state-of-the-art speech recognition system can be improved by considering additional visual information in the recognition process. This is true for optimal recording conditions, but even more so for non-optimal recording conditions as they usually exist in real-world applications. Experiments were performed on the connected letter recognition task, but similar results can be expected for continuous speech recognition as well.

Work is in progress to integrate not only time-independent weight sharing but also position-independent weight sharing for the visual TDNN, in order to locate and track the lips. We are also in the process of substantially increasing our database in order to achieve better recognition rates and to train speaker-independently. Investigations of different approaches are still in progress in order to combine visual and acoustic features and to apply different preprocessing to the visual data.
ACKNOWLEDGEMENTS
We appreciate the help from the DEC on-campus research center (CEC) for the initial data acquisition. This research is sponsored in part by the Land Baden-Württemberg (Landesschwerpunktprogramm Neuroinformatik) and the National Science Foundation.
REFERENCES

[1] Christian Benoit, Tahar Lallouache, Tayeb Mohamadi, and Christian Abry. A Set of French Visemes for Visual Speech Synthesis. Talking Machines: Theories, Models, and Designs, 1992.

[2] M.E. Demorest and L.E. Bernstein. Computational Explorations of Speechreading. In submission.

[3] B. Dodd and R. Campbell. Hearing by Eye: The Psychology of Lipreading. Lawrence Erlbaum Press, 1987.

[4] C.G. Fischer. Confusion among visually perceived consonants. J. Speech Hearing Res., 11, 1968.

[5] P. Haffner and A. Waibel. Multi-State Time Delay Neural Networks for Continuous Speech Recognition. In Neural Information Processing Systems (NIPS 4). Morgan Kaufmann, April 1992.

[6] H. Hild and A. Waibel. Connected Letter Recognition with a Multi-State Time Delay Neural Network. To appear in Neural Information Processing Systems (NIPS 5).

[7] K. Mase and A. Pentland. LIP READING: Automatic Visual Recognition of Spoken Words. Proc. Image Understanding and Machine Vision, Optical Society of America, June 1989.

[8] H. Ney. The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1983.

[9] E. Petajan, B. Bischoff, D. Bodoff, and N.M. Brooke. An Improved Automatic Lipreading System to Enhance Speech Recognition. In ACM SIGCHI, 1988.

[10] D.A. Pomerleau. Neural Network Perception for Mobile Robot Guidance. PhD Thesis, CMU, CMU-CS-92-115, February 1992.

[11] P.W. Rander. Face Tracking Using a Template-Based Approach. Personal communication.

[12] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning Internal Representations by Error Propagation. Parallel Distributed Processing, Vol. 1. MIT Press, 1986.

[13] David G. Stork, Greg Wolff, and Earl Levine. Neural Network Lipreading System for Improved Speech Recognition. In IJCNN, June 1992.

[14] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328-339, March 1989.

[15] B.P. Yuhas, M.H. Goldstein, and T.J. Sejnowski. Integration of Acoustic and Visual Speech Signals Using Neural Networks. IEEE Communications Magazine.

[16] John B. Hampshire II and Alexander H. Waibel. A Novel Objective Function for Improved Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Neural Networks, 1(2), June 1990.