Brigham Young University
BYU ScholarsArchive
Faculty Publications
1996-05-01

Error Correction Via a Post-Processor for Continuous Speech Recognition

Eric K. Ringger
ringger@cs.byu.edu
James F. Allen

Follow this and additional works at: https://scholarsarchive.byu.edu/facpub
Part of the Computer Sciences Commons

Original Publication Citation
Eric K. Ringger and James F. Allen. May 1996. "Error Correction via a Post-Processor for Continuous Speech Recognition." Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96). Atlanta, GA.

BYU ScholarsArchive Citation
Ringger, Eric K. and Allen, James F., "Error Correction Via a Post-Processor for Continuous Speech Recognition" (1996). Faculty Publications. 679.
https://scholarsarchive.byu.edu/facpub/679

This Peer-Reviewed Article is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Faculty Publications by an authorized administrator of BYU ScholarsArchive. For more information, please contact ellen_amatangelo@byu.edu.
ERROR CORRECTION VIA A POST-PROCESSOR
FOR CONTINUOUS SPEECH RECOGNITION*

Eric K. Ringger    James F. Allen
Department of Computer Science; University of Rochester; Rochester, New York 14627-0226
{ringger, james}@cs.rochester.edu
ABSTRACT

This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The primary advantage of the post-processing approach over existing approaches for overcoming SR errors lies in its ability to introduce options that are not available in the SR module's output. This work provides evidence for the claim that a modern continuous speech recognizer can be used successfully in "black-box" fashion for robustly interpreting spontaneous utterances in a dialogue with a human.
1. INTRODUCTION

Existing methods for continuous speech recognition do not perform as well on spontaneous speech as we would hope. Even state-of-the-art recognizers such as Sphinx-II [7]¹ and a recognizer built using HTK [14]² achieve less than 60% word accuracy on fluent speech collected from conversations about a specific problem with the TRAINS-95 system [1].

Here are a few examples of the kinds of errors that occur when recognizing spontaneous utterances.³ They are drawn from problem-solving dialogues that we have collected from users interacting with the TRAINS-95 system. Some errors are simple one-for-one replacements, such as this one:

REF: RIGHT SEND THE  TRAIN FROM MONTREAL TO CHARLESTON
HYP: RATE  SEND THAT TRAIN FROM MONTREAL TO CHARLESTON

Here is an utterance with a replacement of a single word by multiple smaller words:

REF: GO FROM CHICAGO TO TOLEDO
HYP: GO FROM CHICAGO TO TO LEAVE AT

The following utterance contains a more complex example in which adjacent words are misrecognized and in which the hypothesized words overlap the boundary between the reference words:

REF:     GREAT OKAY NOW WE   COULD GO FROM SAY  -- MONTREAL TO WASHINGTON
HYP: I'M GREAT OKAY NOW WEEK IT    GO FROM CITY -- MONTREAL TO WASHINGTON

---
* This work was supported by the University of Rochester CS Department and ONR/ARPA research grant number N00014-92-J-1512.
¹ For this experiment involving Sphinx-II, the acoustic model and the class-based language model were trained on ATIS data. Hence, some of the error is attributable to the moderate occurrence of out-of-vocabulary (OOV) words.
² For this experiment involving the HTK-based recognizer, the acoustic model and the word-based language model were trained on the Trains Dialogue Corpus [6] (collected prior to the creation of the TRAINS-95 system).
³ In the examples, the HYP tag indicates the SR system's hypothesis, and the REF tag indicates the reference transcription.
In addition, speech recognizers are increasingly being used as "black boxes," having a clearly specified function and well-defined inputs and outputs but otherwise providing no hooks for altering or tuning internal operations, with the notable exception of the ability to add words to the recognizer's vocabulary. As an example of speech recognition as a black box, several research labs have announced plans to make speech recognition available to the research community by running publicly accessible speech servers on the Internet. Such servers would likely employ a general-purpose language model and acoustic model. In order to employ them for a task involving words not available to the server's language model, a remote user would need some way to correct the errors committed by the black-box SR server.
This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The goal of this work is to contribute to successful understanding of spontaneous spoken utterances in human-computer dialogue by a conversational planning assistant called the TRAINS-95 system.

Our objective is to reduce speech recognition errors by refining or even modifying the effective vocabulary of a speech recognizer. To achieve this, we regard the channel from the speaker to the output of the SR module as a noisy channel, and we adopt statistical techniques (some of them borrowed from statistical machine translation) for modeling that channel in order to correct some of the errors introduced there.
Why reduce recognition errors by post-processing the SR output? Why not simply better tune the SR's language model for the task? First, if the SR is a general-purpose black box (running either locally or on the other side of a network on someone else's machine), modifying the decoding algorithm to incorporate the post-processor's model might not be an option. Using a general-purpose SR engine makes sense because it allows a system to deal with diverse utterances. If needed, the post-processor can tune the general-purpose hypothesis in a domain-specific or user-specific way (there is also room for adapting to domains and users on-line if the engine was not designed to do so). Porting an entire system to new domains only requires tuning the post-processor, and the general-purpose component with its models can be reused with little or no change. Because the post-processor is light-weight by comparison, the savings may be significant.
Second, even if the SR engine's language model can be updated with new domain-specific data, the post-processor trained on the same new data can provide additional improvements in accuracy.

Third, several human speech phenomena are poorly modeled by current continuous SR technology, and recognition is accordingly impaired. This suggests that the SR module does indeed belong as a component of the noisy channel. One poorly modeled phenomenon is assimilation of phonetic features. Most SR engines model phonemes in a context-dependent fashion (e.g., see [10]), and some attempt to model cross-word co-articulation effects (cf. [10] also). However, as speaking speeds vary, the SR's models may not be well suited to the affected speech signal. Such errors can be corrected by the post-processing techniques discussed here.

Finally, the primary advantage of the post-processing approach over existing approaches for overcoming SR errors lies in its ability to introduce options that are not available in the SR module's output. Existing rescoring tactics cannot do so (cf. [4, 12]).
2. THE MODELS AND ALGORITHM

A statistical model for automatically translating individual sentences between two human languages was proposed by Brown et al. [3]. While this approach to translation has its critics, we can adapt the same idea to the process of transcribing a spoken utterance. We simply posit the existence of a string of English words (w_{1,n} = (w_1, w_2, ..., w_n)) in the mind of the speaker. Those words are uttered and transmitted to the listening system's microphone. The sounds are then transcribed as a string of English words (s_{1,m}) by the SR component of the system. The channel beginning at the speaker and ending at the output of the SR module is a noisy channel, in which errors are frequently introduced in all segments of the channel, including the SR module, essentially at the word level. We adapt the statistical MT techniques to recover the original string of words and thereby correct some of the errors introduced in the channel. Figure 1 illustrates the relationship of the speaker, the channel, and the error-correcting post-processor.

Brown et al. delineate their approach into three parts: a translation (or channel) model, a language model, and a search among possible source word sequences. We will describe each component for our approach to SR post-processing.
We adopt a channel model that describes some of the effects on utterances that pass through the noisy channel ending with the speech recognizer. Specifically, it accounts for frequent errors such as simple word/word confusions and short phrasal and segmentation problems (e.g., one-to-many word substitutions and many-to-one word concatenations). In addition to the channel model, we present a suitable search algorithm that uses the model (together with a source language model) to find the most likely correction for a given word sequence from the SR module. We have built a post-processor that employs these models and have wedged it into the interpretation pipeline of the TRAINS-95 system just behind the SR module. This implementation of the post-processor can receive input from the SR module incrementally as the SR decoder improves its primary hypothesis. The post-processor also communicates with the TRAINS-95 parser in an incremental fashion, backing up occasionally where partial solutions change on the fly.
The post-processor repairs utterances according to the probability estimates acquired from training data. If the training set consists of words from a task-specific vocabulary, then the post-processor will map the general-purpose vocabulary of the SR module to task-specific vocabulary. If the training set consists of words from another domain, then the post-processor will map the SR vocabulary to the vocabulary of the other domain. If the recognizer suggests a word that was not observed as a misrecognition in the post-processor's training set, then the post-processor will simply forward the unknown word to subsequent components. If, however, that word is known to be frequently misrecognized, then the post-processor will correct it to the appropriate in-domain word.
By applying Bayes' rule, we derive a simple expression for the most likely pre-channel sequence ŵ_{1,n}. The derivation is similar to the derivation of the statistical approach to SR (as explained in [2, 8]):

  ŵ_{1,n} = argmax_{w_{1,n}} P[w_{1,n}] · P[s_{1,m} | w_{1,n}]    (1)

The first factor, P[w_{1,n}], models the formation of English utterances by the speaker. It is the listener's model of the speaker's language. The second factor, P[s_{1,m} | w_{1,n}], models the behavior of the channel.
2.1. First Approximation

For a sizable vocabulary, adequately estimating the probability distributions that model the channel and the speaker's language requires mammoth amounts of data; therefore, it is necessary to approximate through independence assumptions. Several assumptions are possible, and we will begin with a basic set of assumptions before suggesting others. For a first approximation language model, we use a word-bigram model:

  P[w_{1,n}] ≈ ∏_{i=0}^{n-1} P[w_{i+1} | w_i]    (2)

As a first approximation channel model, we assume that each word in s_{1,n} is simply a transmitted version of the word with the corresponding position in w_{1,n}. Thus,

  P[s_{1,n} | w_{1,n}] ≈ ∏_{i=1}^{n} P[s_i | w_i]    (3)

We say that a word is aligned with the word it produces.
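Under these two approximations, scoring a candidate source sequence against the SR output reduces to a product of bigram and word-confusion probabilities. The following sketch illustrates this; the probability tables are illustrative toy values, not estimates from the paper:

```python
import math

# Toy tables: a bigram language model P[w_i | w_{i-1}] and a one-for-one
# channel model P[s | w] keyed by (intended, observed). All probabilities
# here are illustrative placeholders.
BIGRAM = {
    ("<s>", "go"): 0.4, ("go", "from"): 0.5, ("from", "chicago"): 0.3,
    ("chicago", "to"): 0.6, ("chicago", "toledo"): 0.1,
}
CHANNEL = {
    ("go", "go"): 0.9, ("from", "from"): 0.9, ("chicago", "chicago"): 0.9,
    ("to", "to"): 0.9, ("toledo", "to"): 0.05,
}

def log_score(source, observed):
    """Log of P[w_1..w_n] * P[s_1..s_n | w_1..w_n] under the
    bigram LM (eq. 2) and the word-for-word channel model (eq. 3)."""
    total, prev = 0.0, "<s>"
    for w, s in zip(source, observed):
        total += math.log(BIGRAM.get((prev, w), 1e-6))  # language model term
        total += math.log(CHANNEL.get((w, s), 1e-6))    # channel model term
        prev = w
    return total
```

With such a scorer, any two equal-length candidate source sequences can be compared for a given SR output; the search in the next section is what enumerates the candidates.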
We also require a method for searching among possible source utterances w_{1,n} for the most likely correction of the given word sequence, i.e., the one that yields the greatest value of P[w_{1,n}] · P[s_{1,m} | w_{1,n}]. We use a Viterbi beam-search for this purpose (cf. [5, 11]).
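Under the one-for-one channel assumption, such a search can be sketched as a per-word beam search over confusable alternatives. The function name, candidate-generation tables, and beam width below are hypothetical, intended only to make the search concrete:

```python
import math

def beam_search_correct(observed, channel, bigram, beam_width=3):
    """Viterbi-style beam search for the source sequence maximizing
    P[w] * P[s|w] under a bigram LM and a one-for-one channel model.
    `channel[s]` maps an observed word to {candidate_source: P[s|source]};
    `bigram[(prev, w)]` gives P[w | prev]. Both tables are placeholders."""
    beams = [(0.0, ["<s>"])]  # each entry: (log probability, partial source)
    for s in observed:
        candidates = channel.get(s, {s: 1.0})  # unseen words pass through
        scored = []
        for logp, words in beams:
            for w, p_channel in candidates.items():
                p_lm = bigram.get((words[-1], w), 1e-6)
                scored.append((logp + math.log(p_lm) + math.log(p_channel),
                               words + [w]))
        # Beam pruning: keep only the best few partial hypotheses
        scored.sort(key=lambda t: t[0], reverse=True)
        beams = scored[:beam_width]
    best_logp, best_words = beams[0]
    return best_words[1:]  # drop the <s> marker
```

For example, with toy tables in which RATE is confusable with RIGHT and THAT with THE, the search recovers "RIGHT SEND THE" from the observed "RATE SEND THAT" of the first example above.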
2.2. Enhancements to the Models

To improve the language model, we use higher-order n-grams, thereby assuming that each word in w_{1,n} is dependent on its n-1 predecessors. We also use back-off n-gram models for combating the problem of sparse training data [9].

[Figure 1. Recovering Word Sequences Corrupted in a Noisy Channel.]

For the channel model, we relax the constraint that replacement errors be aligned on a word-by-word basis, since not all recognition errors consist of simple replacement of one word by another. Some errors appear as the break-up of one word into shorter words. Other errors involve the erroneous concatenation of two or more words to make a longer word. We will use the following utterance from the TRAINS-95 dialogues as an example.

REF: GO FROM CHICAGO TO TOLEDO
HYP: GO FROM CHICAGO TO TO LEAVE AT
Following Brown et al., we refer to a picture such as Figure 2 as an alignment. We use an alignment to indicate the source words in the REF sequence for each of the words in the HYP sequence.

[Figure 2. Alignment of a Hypothesis and the Reference Transcription.]

For alignments, we use the following notation: we write the post-channel transcription (s_{1,m}) followed by the pre-channel transcription (w_{1,n}), separated by a vertical bar and enclosed in parentheses. We also refer to the number of post-channel words produced by a pre-channel word in a particular alignment as the fertility of that pre-channel word. Following each of the pre-channel words, we provide its fertility in the current alignment in parentheses. Alignments are easily computed using a dynamic programming algorithm for word sequence alignment. Returning to our example, we have the alignment:

(GO FROM CHICAGO TO TO LEAVE AT | GO(1) FROM(1) CHICAGO(1) TO(1) TOLEDO(3))
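The dynamic-programming alignment mentioned above can be sketched as a standard minimum-edit-distance alignment over words. This is an illustrative implementation, not the authors' code:

```python
def align(ref, hyp):
    """Align two word sequences by minimum edit distance (dynamic
    programming), as used to compute REF/HYP alignments. Returns a list
    of (ref_word_or_None, hyp_word_or_None) pairs: a pair with None on
    one side marks an insertion or deletion."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace to recover one optimal alignment
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1   # deletion
        else:
            pairs.append((None, hyp[j - 1])); j -= 1   # insertion
    return pairs[::-1]
```

On the example above, the five REF words align to the seven HYP words with one substitution and two insertions around TOLEDO.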
To augment our channel model, we require a fertility model P[k | w] that indicates how likely each word w in the pre-channel vocabulary will have a particular fertility k.⁴ When a word's fertility k is an integer value between two and five, it indicates that the pre-channel word resulted in multiple post-channel words. When a word's fertility is one, then the word accounts for exactly one post-channel word. When a word's fertility is a fraction 1/n (for 2 ≤ n ≤ 5), then the word and n-1 neighboring words have grouped together to result in a single post-channel word. We call this situation fractional fertility. For example, a word with k = 1/3 indicates the situation in which this word and two neighboring source words contribute to one word in the hypothesis; i.e., each word accounts for one-third of the post-channel word. When a word's fertility is a fraction m/n (for 2 ≤ m ≠ n ≤ 5), then the word and n-1 neighboring pre-channel words have grouped together to result in m post-channel words. The latter case can be used to handle arbitrary segmentation errors. For example, a word with k = 3/2 indicates that this word and a neighboring source word contribute to three words in the hypothesis; thus, we can imagine each word accounting for three-halves of the post-channel words. A concrete example of this alignment is (TO LEAVE DOING | TOLEDO(3/2) IN(3/2)).

⁴ Values higher than five are ignored, since they are very rare.
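Given alignments in which each pre-channel word is paired with the post-channel words it produced, the integer-fertility distribution P[k | w] can be tabulated as in this sketch. The function name and the plain relative-frequency estimate are assumptions for illustration; fractional fertilities are omitted:

```python
from collections import defaultdict

def fertility_model(alignments):
    """Estimate the fertility distribution P[k | w] from a list of
    alignments, each given as (source_word, [post-channel words]) pairs.
    Fertilities above five are ignored, following the paper's footnote.
    A minimal sketch: many-to-one (fractional) fertilities are not handled."""
    counts = defaultdict(lambda: defaultdict(int))
    for alignment in alignments:
        for w, produced in alignment:
            k = len(produced)
            if 1 <= k <= 5:
                counts[w][k] += 1
    model = {}
    for w, ks in counts.items():
        total = sum(ks.values())
        model[w] = {k: c / total for k, c in ks.items()}
    return model
```

For instance, if TOLEDO is seen once producing three words (TO LEAVE AT) and once producing itself, the model gives P[3 | TOLEDO] = P[1 | TOLEDO] = 0.5.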
To understand how fertility models are used, we need to extend the basic search algorithm. As before, the algorithm searches for an optimal source utterance w_{1,n}, modulo the beam pruning. This extended search builds possible sequences one word at a time using s_{1,m} for guidance as before. Each word in s_{1,m} is exploded (or collapsed with neighbors) using all possible combinations. The hypotheses are scored according to 1. the LM and 2. the channel model for one-for-one replacements or the fertility model for other kinds of replacements. As before, dynamic programming on partial source sentences and beam pruning will make the search efficient.

Observe that the fertility model scores only the number of words used to replace a particular word. It actually relies on the language model to score the contents of the replacement. This is motivated by the related approach of Brown et al., who appear to have taken this direction in order to avoid the problems of gathering statistics from hopelessly sparse data.
3. EXPERIMENTAL RESULTS

The post-processor has been implemented to use the simple one-for-one channel model and a back-off bigram language model. The channel model incorporating fertility is work in progress. The language model was trained on hand-transcribed utterances from the TRAINS-95 dialogues. The channel model was constructed by automatically aligning the output of Sphinx-II (having fixed language and acoustic models) with the hand transcriptions and by tabulating substitutions.
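The tabulation step described above can be sketched as follows. The function name and plain relative-frequency estimation are assumptions for illustration; the paper does not specify its smoothing of these counts:

```python
from collections import defaultdict

def train_channel_model(aligned_pairs):
    """Build the one-for-one channel model P[s | w] by tabulating
    substitutions from aligned (reference, hypothesis) word pairs, as in
    the alignment of recognizer output with hand transcriptions.
    An illustrative, unsmoothed sketch."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref_word, hyp_word in aligned_pairs:
        counts[ref_word][hyp_word] += 1
    channel = {}
    for w, subs in counts.items():
        total = sum(subs.values())
        # Relative frequency of each observed word given the intended word
        channel[w] = {s: c / total for s, c in subs.items()}
    return channel
```

The resulting table plugs directly into the first-approximation channel model of Section 2.1: a word never observed as a misrecognition simply has no entry and is forwarded unchanged.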
To test the post-processor, an independent set of utterances was held out for evaluation. The cross-validated performance of Sphinx-II alone and in tandem with the post-processor is depicted in Figure 3. Sphinx-II's class-based language model was trained only on data from the ATIS spoken language corpora. Also illustrated are the amounts of training data required by the post-processor to make a particular contribution to word recognition accuracy. This validates the claim that the post-processor can make a significant impact in tuning the SR if the SR cannot be modified as we have discussed. Also, equivalent amounts of training data can be used with comparable impact in the post-processor as in the language model of the SR. Furthermore, preliminary results indicate that if the language model of the SR can indeed be modified, then the post-processor can still significantly improve word recognition accuracy. Hence the post-processor is in neither case redundant.

[Figure 3. Influence of the post-processor with additional training data. Post-processor performance plotted against the number of TRAINS-95 words in the training set (0 to 12,000).]
4. FUTURE DIRECTIONS

We have presented models and methods for overcoming speech recognition errors. We have also provided evidence for the claim that modern speech recognition engines can be used successfully as black boxes for robustly interpreting utterances in a dialogue with a human.

Open issues include whether word lattices will provide better opportunities over simple word sequences for post-processor correction. For the word-lattice configuration, the post-processor must be modified to process the alternatives in the lattice. One point to consider here is the width of the lattice (i.e., the number of alternatives at a given point in the utterance). This factor can implicitly reflect the confidence of the SR in its hypotheses and may be useful as a parameter in the correction process.

In addition to the purely statistical mechanisms for recovering pre-channel word sequences outlined above, other cues may augment the search. For example, syllables and vowel nuclei may be usable for aligning pre-channel and post-channel words and phrases. Such alignments may be useful for further constraining the search algorithms and yielding better corrections.
REFERENCES

[1] J. F. Allen, G. Ferguson, B. Miller, and E. Ringger. Spoken Dialogue and Interactive Planning. In Proceedings of the ARPA SLST Workshop, San Mateo, California, January 1995. ARPA, Morgan Kaufmann.
[2] L. R. Bahl, F. Jelinek, and R. Mercer. A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190, March 1983.
[3] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79-85, June 1990.
[4] Y. Chow and R. Schwartz. The N-best Algorithm: An Efficient Procedure for Finding Top N Sentence Hypotheses. In Proceedings of the Second DARPA Workshop on Speech and Natural Language, pages 199-202, San Mateo, California, October 1989. DARPA, Morgan Kaufmann.
[5] G. D. Forney. The Viterbi Algorithm. In Proceedings of the IEEE, volume 61, pages 266-278. IEEE, 1973.
[6] P. Heeman and J. F. Allen. The TRAINS 93 Dialogues. TRAINS Technical Note 94-2, Department of Computer Science, University of Rochester, Rochester, NY, 14627, March 1995.
[7] X. D. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, K. F. Lee, and R. Rosenfeld. The Sphinx-II Speech Recognition System: An Overview. Computer, Speech and Language, 1993.
[8] F. Jelinek. Self-organized Language Modeling for Speech Recognition. Reprinted in [13]: 450-506, 1990.
[9] S. M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. In IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 400-401. IEEE, March 1987.
[10] K.-F. Lee. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic, Boston, London, 1989.
[11] B. Lowerre and R. Reddy. The Harpy Speech Understanding System. In Trends in Speech Recognition. Speech Science Publications, Apple Valley, Minnesota, 1986. Reprinted in [13]: 576-586.
[12] M. Rayner, D. Carter, V. Digalakis, and P. Price. Combining Knowledge Sources to Reorder N-best Speech Hypothesis Lists. In Proceedings ARPA Human Language Technology Workshop, pages 212-217. ARPA, March 1994.
[13] A. Waibel and K.-F. Lee, editors. Readings in Speech Recognition. Morgan Kaufmann, San Mateo, 1990.
[14] S. J. Young and P. C. Woodland. HTK: Hidden Markov Model Toolkit. Entropic Research Laboratory, Washington, D.C., 1993.