
IMPROVING CONNECTED LETTER RECOGNITION BY LIPREADING

Christoph Bregler*, Hermann Hild, Stefan Manke, and Alex Waibel

University of Karlsruhe, Department of Computer Science,
Am Fasanengarten 5, 7500 Karlsruhe 1, Germany
bregler@ira.uka.de, manke@ira.uka.de

Carnegie Mellon University, School of Computer Science,
Pittsburgh, Pennsylvania 15213, U.S.A.
hhild@cs.cmu.edu, ahw@cs.cmu.edu

* The author is now with the International Computer Science Institute,
1947 Center Street, Berkeley, CA 94704
ABSTRACT

In this paper we show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called "speechreading". We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined to acoustic-visual hypotheses for the Dynamic Time Warping algorithm. This is shown on a connected word recognition problem, the notoriously difficult letter spelling task. With speechreading, the error rate could be reduced to as little as half that of pure acoustic recognition.
1. INTRODUCTION

Automated speech perception systems still perform poorly when it comes to real-world applications. Most approaches are very sensitive to background noise or fail totally when more than one speaker talks simultaneously (cocktail party effect), as often happens in offices, cockpits, outdoors, and other real-world environments.

Humans deal with these distortions by considering additional sources of information. Very often misclassified acoustic signals can be corrected with the use of higher-level context information. In recognition systems this is partly covered by language models or grammars. Psychological studies have shown [3] that on the lower level additional information contributes to human hearing as well. Besides the acoustic signal from both ears, visual information, mostly lip movements, is subconsciously involved in the recognition process. This source is even more important for hearing-impaired people, but it also contributes significantly to normal-hearing recognition.
We investigate this phenomenon on the letter spelling task. No grammars or other higher-level information are employed. If visual information is missing as well, even humans perform poorly; just remember how hard it is to recognize spelled names over the telephone.

The spelling task is seen as a connected word recognition problem. As words we take the highly ambiguous 26 German letters. A test person in front of a microphone and video camera spells names and random letter sequences in German. We did not aim for high-quality recordings; we even degraded the acoustic signal with artificial noise to simulate real-world conditions.

As speech recognition system we present an extension of an existing Multi-State Time Delay Neural Network architecture (MS-TDNN) [6] for handling both modalities, acoustic and visual sensor input. It is shown how recognition performance with integrated acoustic and visual information achieves significant improvements over acoustic input only.
2. BIMODAL ACQUISITION AND PREPROCESSING

Our recording setup consists of a conventional NTSC camera and a microphone. The video images are grabbed in real-time (30 full frames/sec) into our workstation and are saved as 256x256 pixel images with 8-bit grey-level information per pixel. This square region covers the full face of the speaker. In parallel, the acoustic data is sampled at a 16 kHz rate with 12-bit resolution. Timestamps were also saved, because the correct synchronization between audio and video signals is critical for later processing.

For acoustic preprocessing we follow the established approach of applying an FFT to the Hamming-windowed speech data in order to get 16 mel-scale Fourier coefficients at a 10 msec frame rate.
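As a rough illustration of this acoustic front end (a sketch, not the authors' code), the following Python fragment frames the signal at a 10 ms rate, applies a Hamming window and FFT, and reduces each frame to 16 log mel-scale coefficients. The 25 ms analysis window and the triangular filterbank construction are assumptions the paper does not specify.

```python
# Minimal sketch of a mel-scale FFT front end (assumptions: 25 ms window,
# triangular mel filterbank); the paper only fixes 16 coefficients / 10 ms.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters equally spaced on the mel scale up to Nyquist.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def melscale_frames(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_coeffs=16):
    frame_len = int(sample_rate * frame_ms / 1000)   # assumed 25 ms analysis window
    hop = int(sample_rate * hop_ms / 1000)           # 10 ms frame rate as in the text
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_coeffs, frame_len, sample_rate)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = np.abs(np.fft.rfft(signal[start:start + frame_len] * window)) ** 2
        frames.append(np.log(fbank @ power + 1e-10))  # 16 log mel coefficients
    return np.array(frames)
```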
For visual preprocessing there is still an active discussion about how much preprocessing heuristics are appropriate before a connectionist classification scheme is applied to the data. We follow the idea of allowing only transformations with fairly low information reduction. In other preprocessing algorithms like edge
detection, some "hard decisions" are made, which may hide useful information from the later learning scheme. In fact it has been reported [10] that such edge detectors are learned automatically in cases where it is necessary.

We apply two alternative preprocessing techniques: histogram-normalized grey-value coding, or two-dimensional Fourier transformation. In both cases we consider only an area of interest (AOI) centered around the lips, and low-pass filter these AOIs. The AOIs were initially segmented by hand, but an automatic procedure is now also available [11].
Grey-value coding: We found that a 24x16 pixel resolution is enough to recognize lip shapes and movements (Figure 1). Each of these AOI pixels is the average grey-value of a small square in the original image (low-pass filter). The grey-levels are rescaled in such a way that the darkest/brightest 5% in the histogram are coded with -1.0/1.0. The remaining 90% are scaled linearly between -1.0 and 1.0.
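A minimal sketch of this grey-value coding follows (our interpretation, not the original implementation); the block-averaging low-pass filter and the use of the 5th/95th percentiles as clipping points are assumptions.

```python
# Sketch of histogram-normalized grey-value coding: a 24x16 AOI whose
# darkest/brightest 5% map to -1.0/+1.0 and the remaining 90% map linearly.
import numpy as np

def grey_value_code(aoi_image, out_shape=(16, 24)):
    # Low-pass by block-averaging the AOI down to 24x16 pixels (rows, cols).
    h, w = aoi_image.shape
    bh, bw = h // out_shape[0], w // out_shape[1]
    small = aoi_image[:bh * out_shape[0], :bw * out_shape[1]]
    small = small.reshape(out_shape[0], bh, out_shape[1], bw).mean(axis=(1, 3))
    # Histogram normalization: clip at the 5th/95th percentiles, scale to [-1, 1].
    lo, hi = np.percentile(small, [5, 95])
    return np.clip(2.0 * (small - lo) / (hi - lo + 1e-10) - 1.0, -1.0, 1.0)
```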
2D-FFT: The AOI is rescaled to a 64x64 pixel image so that the two-dimensional FFT also results in 64x64 coefficients. We consider only the log magnitudes of the first 13x13 FFT coefficients and rescale them to [-1.0, 1.0]. (After multiplying the complex FFT space with a 13x13 window and applying the inverse FFT, we could still recognize the distinct lip shapes and movements in the resulting low-passed original image.) The motivation for considering the FFT is that this coding is spatially shift invariant, which makes the recognition more stable against inaccurate AOI positioning.
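The corresponding 2D-FFT coding could be sketched as follows (again an interpretation; the nearest-neighbour rescaling and the per-image min/max rescaling to [-1.0, 1.0] are assumptions).

```python
# Sketch of the 2D-FFT coding: rescale the AOI to 64x64, take the log
# magnitudes of the first 13x13 FFT coefficients, rescale to [-1, 1].
import numpy as np

def fft_code(aoi_image):
    # Nearest-neighbour rescale to 64x64 (an assumption; any resampler works).
    h, w = aoi_image.shape
    rows = np.arange(64) * h // 64
    cols = np.arange(64) * w // 64
    img64 = aoi_image[np.ix_(rows, cols)].astype(float)
    logmag = np.log(np.abs(np.fft.fft2(img64))[:13, :13] + 1e-10)
    lo, hi = logmag.min(), logmag.max()
    return 2.0 * (logmag - lo) / (hi - lo + 1e-10) - 1.0
```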
Figure 1: Typical AOIs
3. SYSTEM ARCHITECTURE

As recognition system we use a modular MS-TDNN [6]. Figure 2 shows the architecture. The preprocessed acoustic and visual data are fed into two front-end TDNNs [14], respectively. Each TDNN consists of an input layer, one hidden layer, and the phone-state layer. Backpropagation was applied to train the networks in a bootstrapping phase to fit phoneme targets.
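To make the front-end structure concrete, here is a hedged sketch of one such TDNN expressed as 1-D convolutions over the frame sequence. The time-delay window sizes (3 and 5) and the activation functions are assumptions; the 16 input coefficients, 65 phone-states, and 15 hidden units for the acoustic subnet are taken from the text.

```python
# Rough sketch (not the original implementation) of one front-end TDNN:
# time-delay layers as 1-D convolutions ending in per-frame phone-state scores.
import torch
import torch.nn as nn

class FrontEndTDNN(nn.Module):
    def __init__(self, n_inputs=16, n_hidden=15, n_phone_states=65):
        super().__init__()
        # Each unit sees a small time window of the layer below (the "time delay").
        self.hidden = nn.Conv1d(n_inputs, n_hidden, kernel_size=3)
        self.phone_states = nn.Conv1d(n_hidden, n_phone_states, kernel_size=5)

    def forward(self, frames):                        # frames: (batch, n_inputs, time)
        h = torch.tanh(self.hidden(frames))
        return torch.sigmoid(self.phone_states(h))    # (batch, 65, time')
```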
Above the two phone-state layers, the Dynamic Time Warping algorithm [8] is applied (in the DTW layer) to find the optimal path of phone hypotheses for the word models (the German alphabet). In the letter layer the activations of the phone-state units along the optimal paths are accumulated; the highest score of the letter units represents the recognized letter. In a second phase the networks are trained to fit letter targets. The error derivatives are backpropagated from the letter units through the best path in the DTW layer down to the front-end TDNNs, ensuring that the network is optimized for the actual evaluation task, which is letter and not phoneme recognition. As before, the acoustic and visual subnets are trained individually.
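The role of the DTW layer can be illustrated with a generic one-stage dynamic-programming alignment (a simplified sketch, not the exact algorithm of [8]): each letter model is a fixed sequence of phone-state indices, and the activations along the best monotone path through it are accumulated.

```python
# Sketch of the DTW-layer idea: accumulate phone-state activations along the
# best monotone alignment of the frame sequence with one word (letter) model.
import numpy as np

def dtw_word_score(activations, word_states):
    # activations: (time, n_phone_states); word_states: list of state indices.
    T, _ = activations.shape
    S = len(word_states)
    score = np.full((T, S), -np.inf)
    score[0, 0] = activations[0, word_states[0]]
    for t in range(1, T):
        for s in range(S):
            best_prev = score[t - 1, s]                          # stay in state
            if s > 0:
                best_prev = max(best_prev, score[t - 1, s - 1])  # advance a state
            score[t, s] = best_prev + activations[t, word_states[s]]
    return score[T - 1, S - 1]   # accumulated score for this letter model
```

The recognized letter would then be the arg max of this score over the 26 letter models.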
In the final "combined mode" of the recognizer, a combined phone-state layer is included between the front-end TDNNs and the DTW layer. The activation of each combined phone-state unit is the weighted sum of the corresponding acoustic phone-state unit and visual phone-state unit. We call these weights "entropy weights", because their values are proportional to the relative entropy between all acoustic phone-state activations and all visual phone-state activations. Hypotheses with higher uncertainty (higher entropy) are weighted lower than hypotheses with lower uncertainty.
Figure 2: Neural Network Architecture (DTW layer, phoneme/viseme-state layer, hidden layer, input layer with FFT or AOI inputs)
4. PHONEMES & VISEMES

For the acoustic classification we use a set of 65 phoneme states (phoneme-to-phoneme transition states included). They represent a reasonable choice of the smallest acoustically distinguishable units in German speech, and the TDNN architecture is very well suited to be trained as a classifier for them.

For visual features this is different. Distinct sounds are generated by distinct vocal tract positions and voiced/unvoiced excitations. External features of the vocal tract like the lips, part of the tongue, and the teeth contribute only in part to the sound generation. For example, /b/ and /p/ are generated by similar lip movements and cannot be distinguished with pure visual information. Training a TDNN to
classify /b/ and /p/ based only on visual information would lead to recognition rates no better than guessing, or the net might become sensitive to features which are uncorrelated with the produced speech. This leads to the design of a smaller set of visually distinguishable units in speech, so-called "visemes". We investigate a new set of 42 visemes and a 1-to-n mapping from the viseme set to the phoneme set. The mapping is necessary for the combined layer, in order to calculate the combined acoustic and visual hypotheses for the DTW layer. For example, the hypotheses for /b/ and /p/ are built out of the same viseme /b-or-p/ but the different phonemes /b/ and /p/, respectively.
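As an illustration of the 1-to-n mapping and the combination step, a small sketch follows; the viseme names and mapping entries are invented examples, not the paper's actual 42-viseme set, and the weights correspond to the entropy weights described in Section 5.

```python
# Illustrative 1-to-n viseme-to-phoneme mapping and combined hypothesis formation.
VISEME_TO_PHONEMES = {
    "b_or_p": ["b", "p"],     # same lip closure, acoustically distinct
    "f_or_v": ["f", "v"],     # invented example entries
}

def combined_hypotheses(acoustic, visual, weight_a, weight_v):
    # acoustic: phoneme -> activation, visual: viseme -> activation.
    combined = {}
    for viseme, phonemes in VISEME_TO_PHONEMES.items():
        for ph in phonemes:
            combined[ph] = (weight_a * acoustic.get(ph, 0.0)
                            + weight_v * visual.get(viseme, 0.0))
    return combined
```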
5. SIMULATION RESULTS
Our database consists of 114 and 350 letter sequences spelled by two male speakers. They consist of names and random sequences. The first data set was split into 75 training and 39 test sequences (speaker msm). The second data set was split into 200 training and 150 test sequences (speaker mcb).

Best results were achieved with 15 hidden units in the acoustic subnet and 7 hidden units in the visual subnet. Obviously visual speech data contains less information than acoustic data; therefore better generalization was achieved with as few as 7 hidden units.
Backpropagation was applied with a learning rate of 0.05 and a momentum of 0.5. We applied different error functions to compute the error derivatives: for bootstrapping the McClelland error measure was used, and for the global training on letter targets the Classification Figure of Merit [16] was used.
Table 1: Sentence-level recognition results (only fragments of the table are legible: mcb/noisy, 59.0%, 46.9%, 69.6%)
Table 1 summarizes the recognition performance results on the sentence level. Errors are misclassified words, insertion errors, and deletion errors. For speaker "msm", we get an error reduction on clean data from 11.2% (acoustic only) down to 6.8% with additional visual data. With noise added to the acoustic data, the error rate was 52.8%, and it could be reduced to 24.4% with lipreading, which means an error reduction to less than half of that of the pure acoustic recognition. For speaker "mcb" we could not get the same error reduction. Obviously the pronunciation of speaker "mcb" was better, but he was not moving his lips as much.
It should also be noted that in the pure visual recognition many of the errors are caused by insertions and deletions. When we presented the letters with known boundaries, we obtained visual recognition rates of up to 50.2%. The results of Table 1 were achieved with histogram-normalized grey-value images. Experiments with 2D-FFT images are still in progress; in our initial 2D-FFT simulations the visual recognition errors are on average about 8% higher than with the grey-level coding.
We also took a closer look at the dynamic behavior of the entropy weights. Figure 3 shows the weights from the acoustic and visual TDNNs to the combined layer over time while the letter sequence M-I-E was spoken. The upper dots represent the acoustic weight A and the lower dots the visual weight V, where A = 0.5 + (entropy(visual TDNN) - entropy(acoustic TDNN)) / 2K and V = 1.0 - A. Big white dots represent weights close to 1.0 and big black dots weights close to 0.0. K is the maximum entropy difference in the training set.

Figure 3: Entropy weights A (acoustic, upper row) and V (visual, lower row) over the phoneme segments of the spoken sequence M-I-E

At the end of the /m/-phoneme, when the lips are closed, V is higher than A. Obviously the visual hypotheses are more certain there than the acoustic ones. During the /e/-phoneme the acoustic hypotheses are more certain than the visual ones, which also makes sense.
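A small sketch of how these entropy weights could be computed from the two phone-state activation vectors is given below. It reflects our reading of the formula above; normalizing the activations to a probability distribution before taking the entropy is an assumption.

```python
# Sketch of the entropy-based weighting: the less certain (higher-entropy)
# modality gets the lower weight.  K is the maximum entropy difference
# observed on the training set, so A stays in [0, 1].
import numpy as np

def entropy(activations):
    p = np.asarray(activations, dtype=float)
    p = p / (p.sum() + 1e-10)             # treat activations as a distribution
    return -np.sum(p * np.log(p + 1e-10))

def entropy_weights(acoustic_act, visual_act, K):
    diff = entropy(visual_act) - entropy(acoustic_act)
    A = 0.5 + diff / (2.0 * K)            # acoustic weight
    V = 1.0 - A                           # visual weight
    return A, V
```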
6. RELATED WORK

The interest in automated speechreading (or lipreading) has been growing recently. As a non-connectionist approach, the work of Petajan et al. [9] should be mentioned. Yuhas et al. [15] used a neural network for vowel recognition, working on static images. Stork et al. [13] used a conventional TDNN (without DTW) for speechreading. They limited
the task to recognizing 10 isolated letters and used artificial markers on the lips. No visual feature extraction was integrated into their model.

Also of interest are some psychological studies about human speechreading and their approaches to describing human performance. These measurements could also be applied to the performance analysis of automated speechreading systems. Dodd and Campbell [3], and Demorest and Bernstein [2], did valuable work in this area.
7. CONCLUSION AND FUTURE WORK

We have shown how a state-of-the-art speech recognition system can be improved by considering additional visual information in the recognition process. This is true for optimal recording conditions, but even more so for non-optimal recording conditions as they usually exist in real-world applications. Experiments were performed on the connected letter recognition task, but similar results can be expected for continuous speech recognition as well.

Work is in progress to integrate not only time-independent weight sharing but also position-independent weight sharing for the visual TDNN, in order to locate and track the lips. We are also in the process of substantially increasing our database in order to achieve better recognition rates and to train speaker-independently. Investigations of different approaches are still in progress in order to combine visual and acoustic features and to apply different preprocessing to the visual data.
ACKNOWLEDGEMENTS
We appreciate the help from the DEC on-campus research center (CEC) for the initial data acquisition. This research is sponsored in part by the Land Baden-Württemberg (Landesschwerpunktprogramm Neuroinformatik) and the National Science Foundation.
REFERENCES

[1] Christian Benoit, Tahar Lallouache, Tayeb Mohamadi, and Christian Abry. A Set of French Visemes for Visual Speech Synthesis. Talking Machines: Theories, Models, and Designs, 1992.

[2] M.E. Demorest and L.E. Bernstein. Computational Explorations of Speechreading. In submission.

[3] B. Dodd and R. Campbell. Hearing by Eye: The Psychology of Lipreading. Lawrence Erlbaum Press, 1987.

[4] C.G. Fischer. Confusion among visually perceived consonants. J. Speech Hearing Res., 11, 1968.

[5] P. Haffner and A. Waibel. Multi-State Time Delay Neural Networks for Continuous Speech Recognition. In Neural Information Processing Systems (NIPS 4). Morgan Kaufmann, April 1992.

[6] H. Hild and A. Waibel. Connected Letter Recognition with a Multi-State Time Delay Neural Network. To appear in Neural Information Processing Systems (NIPS 5).

[7] K. Mase and A. Pentland. LIP READING: Automatic Visual Recognition of Spoken Words. Proc. Image Understanding and Machine Vision, Optical Society of America, June 1989.

[8] H. Ney. The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1983.

[9] E. Petajan, B. Bischoff, D. Bodoff, and N.M. Brooke. An Improved Automatic Lipreading System to Enhance Speech Recognition. In ACM SIGCHI, 1988.

[10] D.A. Pomerleau. Neural Network Perception for Mobile Robot Guidance. PhD Thesis, CMU, CMU-CS-92-115, February 1992.

[11] P.W. Rander. Face Tracking Using a Template-Based Approach. Personal communication.

[12] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning Internal Representations by Error Propagation. Parallel Distributed Processing, Vol. 1. MIT Press, 1986.

[13] David G. Stork, Greg Wolff, and Earl Levine. Neural Network Lipreading System for Improved Speech Recognition. In IJCNN, June 1992.

[14] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328-339, March 1989.

[15] B.P. Yuhas, M.H. Goldstein, and T.J. Sejnowski. Integration of Acoustic and Visual Speech Signals Using Neural Networks. IEEE Communications Magazine.

[16] John B. Hampshire II and Alexander H. Waibel. A Novel Objective Function for Improved Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Neural Networks, 1(2), June 1990.