Brigham Young University
BYU ScholarsArchive
Faculty Publications
1996-05-01

Error Correction Via a Post-Processor for Continuous Speech Recognition

Eric K. Ringger
ringger@cs.byu.edu
James F. Allen

Follow this and additional works at: https://scholarsarchive.byu.edu/facpub
Part of the Computer Sciences Commons

Original Publication Citation
Eric K. Ringger and James F. Allen. May 1996. "Error Correction via a Post-Processor for Continuous Speech Recognition." Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96). Atlanta, GA.

BYU ScholarsArchive Citation
Ringger, Eric K. and Allen, James F., "Error Correction Via a Post-Processor for Continuous Speech Recognition" (1996). Faculty Publications. 679.
https://scholarsarchive.byu.edu/facpub/679

This Peer-Reviewed Article is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Faculty Publications by an authorized administrator of BYU ScholarsArchive. For more information, please contact ellen_amatangelo@byu.edu.
ERROR CORRECTION VIA A POST-PROCESSOR
FOR CONTINUOUS SPEECH RECOGNITION*

Eric K. Ringger    James F. Allen
Department of Computer Science; University of Rochester; Rochester, New York 14627-0226
{ringger, james}@cs.rochester.edu
ABSTRACT

This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The primary advantage of the post-processing approach over existing approaches for overcoming SR errors lies in its ability to introduce options that are not available in the SR module's output. This work provides evidence for the claim that a modern continuous speech recognizer can be used successfully in "black-box" fashion for robustly interpreting spontaneous utterances in a dialogue with a human.
1. INTRODUCTION

Existing methods for continuous speech recognition do not perform as well on spontaneous speech as we would hope. Even state-of-the-art recognizers such as Sphinx-II [7]¹ and a recognizer built using HTK [14]² achieve less than 60% word accuracy on fluent speech collected from conversations about a specific problem with the TRAINS-95 system [1].

Here are a few examples of the kinds of errors that occur when recognizing spontaneous utterances.³ They are drawn from problem-solving dialogues that we have collected from users interacting with the TRAINS-95 system. Some errors are simple one-for-one replacements, such as this one:

REF: RIGHT SEND THE  TRAIN FROM MONTREAL TO CHARLESTON
HYP: RATE  SEND THAT TRAIN FROM MONTREAL TO CHARLESTON

Here is an utterance with a replacement of a single word by multiple smaller words:

REF: GO FROM CHICAGO TO TOLEDO
HYP: GO FROM CHICAGO TO TO LEAVE AT

The following utterance contains a more complex example in which adjacent words are misrecognized and in which the hypothesized words overlap the boundary between the reference words:

REF:     GREAT OKAY NOW WE   COULD GO FROM SAY  -- MONTREAL TO WASHINGTON
HYP: I'M GREAT OKAY NOW WEEK IT    GO FROM CITY -- MONTREAL TO WASHINGTON

---
* This work was supported by the University of Rochester CS Department and ONR/ARPA research grant number N00014-92-J-1512.
¹ For this experiment involving Sphinx-II, the acoustic model and the class-based language model were trained on ATIS data. Hence, some of the error is attributable to the moderate occurrence of out-of-vocabulary (OOV) words.
² For this experiment involving the HTK-based recognizer, the acoustic model and the word-based language model were trained on the Trains Dialogue Corpus [6] (collected prior to the creation of the TRAINS-95 system).
³ In the examples, the HYP tag indicates the SR system's hypothesis, and the REF tag indicates the reference transcription.
In addition, speech recognizers are increasingly being used as "black boxes," having a clearly specified function and well-defined inputs and outputs but otherwise providing no hooks for altering or tuning internal operations, with the notable exception of the ability to add words to the recognizer's vocabulary. As an example of speech recognition as a black box, several research labs have announced plans to make speech recognition available to the research community by running publicly accessible speech servers on the Internet. Such servers would likely employ a general-purpose language model and acoustic model. In order to employ them for a task involving words not available to the server's language model, a remote user would need some way to correct the errors committed by the black-box SR server.
This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The goal of this work is to contribute to successful understanding of spontaneous spoken utterances in human-computer dialogue by a conversational planning assistant called the TRAINS-95 system.

Our objective is to reduce speech recognition errors by refining or even modifying the effective vocabulary of a speech recognizer. To achieve this, we regard the channel from the speaker to the output of the SR module as a noisy channel, and we adopt statistical techniques (some of them borrowed from statistical machine translation) for modeling that channel in order to correct some of the errors introduced there.
Why reduce recognition errors by post-processing the SR output? Why not simply better tune the SR's language model for the task? First, if the SR is a general-purpose black box (running either locally or on the other side of a network on someone else's machine), modifying the decoding algorithm to incorporate the post-processor's model might not be an option. Using a general-purpose SR engine makes sense because it allows a system to deal with diverse utterances. If needed, the post-processor can tune the general-purpose hypothesis in a domain-specific or user-specific way (there is also room for adapting to domains and users on-line if the engine was not designed to do so). Porting an entire system to new domains only requires tuning the post-processor, and the general-purpose component with its models can be reused with little or no change. Because the post-processor is light-weight by comparison, the savings may be significant.
Second, even if the SR engine's language model can be updated with new domain-specific data, the post-processor trained on the same new data can provide additional improvements in accuracy.

Third, several human speech phenomena are poorly modeled by current continuous SR technology, and recognition is accordingly impaired. This suggests that the SR module does indeed belong as a component of the noisy channel. One poorly modeled phenomenon is assimilation of phonetic features. Most SR engines model phonemes in a context-dependent fashion (e.g., see [10]), and some attempt to model cross-word co-articulation effects (cf. [10] also). However, as speaking speeds vary, the SR's models may not be well suited to the affected speech signal. Such errors can be corrected by the post-processing techniques discussed here.

Finally, the primary advantage of the post-processing approach over existing approaches for overcoming SR errors lies in its ability to introduce options that are not available in the SR module's output. Existing rescoring tactics cannot do so (cf. [4, 12]).
2. THE MODELS AND ALGORITHM

A statistical model for automatically translating individual sentences between two human languages was proposed by Brown et al. [3]. While this approach to translation has its critics, we can adapt the same idea to the process of transcribing a spoken utterance. We simply posit the existence of a string of English words (w_{1,n} = (w_1, w_2, ..., w_n)) in the mind of the speaker. Those words are uttered and transmitted to the listening system's microphone. The sounds are then transcribed as a string of English words (s_{1,m}) by the SR component of the system. The channel beginning at the speaker and ending at the output of the SR module is a noisy channel, in which errors are frequently introduced in all segments of the channel, including the SR module, essentially at the word level. We adapt the statistical MT techniques to recover the original string of words and thereby correct some of the errors introduced in the channel. Figure 1 illustrates the relationship of the speaker, the channel, and the error-correcting post-processor.

Brown et al. delineate their approach into three parts: a translation (or channel) model, a language model, and a search among possible source word sequences. We will describe each component for our approach to SR post-processing.
We adopt a channel model that describes some of the effects on utterances that pass through the noisy channel ending with the speech recognizer. Specifically, it accounts for frequent errors such as simple word/word confusions and short phrasal and segmentation problems (e.g., one-to-many word substitutions and many-to-one word concatenations). In addition to the channel model, we present a suitable search algorithm that uses the model (together with a source language model) to find the most likely correction for a given word sequence from the SR module. We have built a post-processor that employs these models and have wedged it into the interpretation pipeline of the TRAINS-95 system just behind the SR module. This implementation of the post-processor can receive input from the SR module incrementally as the SR decoder improves its primary hypothesis. The post-processor also communicates with the TRAINS-95 parser in an incremental fashion, backing up occasionally where partial solutions change on the fly.
The post-processor repairs utterances according to the probability estimates acquired from training data. If the training set consists of words from a task-specific vocabulary, then the post-processor will map the general-purpose vocabulary of the SR module to task-specific vocabulary. If the training set consists of words from another domain, then the post-processor will map the SR vocabulary to the vocabulary of the other domain. If the recognizer suggests a word that was not observed as a misrecognition in the post-processor's training set, then the post-processor will simply forward the unknown word to subsequent components. If, however, that word is known to be frequently misrecognized, then the post-processor will correct it to the appropriate in-domain word.
By applying Bayes' rule, we derive a simple expression for the most likely pre-channel sequence ŵ_{1,n}. The derivation is similar to the derivation of the statistical approach to SR (as explained in [2, 8]):

  ŵ_{1,n} = argmax_{w_{1,n}} P[w_{1,n}] · P[s_{1,m} | w_{1,n}]    (1)

The first factor, P[w_{1,n}], models the formation of English utterances by the speaker. It is the listener's model of the speaker's language. The second factor, P[s_{1,m} | w_{1,n}], models the behavior of the channel.
2.1. First Approximation

For a sizable vocabulary, adequately estimating the probability distributions that model the channel and the speaker's language requires mammoth amounts of data; therefore, it is necessary to approximate through independence assumptions. Several assumptions are possible, and we will begin with a basic set of assumptions before suggesting others. For a first approximation language model, we use a word-bigram model:

  P[w_{1,n}] ≈ ∏_{i=0}^{n-1} P[w_{i+1} | w_i]    (2)

As a first approximation channel model, we assume that each word in s_{1,n} is simply a transmitted version of the word with the corresponding position in w_{1,n}. Thus,

  P[s_{1,n} | w_{1,n}] ≈ ∏_{i=1}^{n} P[s_i | w_i]    (3)

We say that a word is aligned with the word it produces.
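Under these two approximations, scoring a candidate source sequence against the SR output reduces to a product of bigram and word-confusion probabilities. The following sketch illustrates this; the probability tables are illustrative toy values, not estimates from the paper:

```python
import math

# Toy tables: a bigram language model P[w_i | w_{i-1}] and a one-for-one
# channel model P[s | w] keyed by (intended, observed). All probabilities
# here are illustrative placeholders.
BIGRAM = {
    ("<s>", "go"): 0.4, ("go", "from"): 0.5, ("from", "chicago"): 0.3,
    ("chicago", "to"): 0.6, ("chicago", "toledo"): 0.1,
}
CHANNEL = {
    ("go", "go"): 0.9, ("from", "from"): 0.9, ("chicago", "chicago"): 0.9,
    ("to", "to"): 0.9, ("toledo", "to"): 0.05,
}

def log_score(source, observed):
    """Log of P[w_1..w_n] * P[s_1..s_n | w_1..w_n] under the
    bigram LM (eq. 2) and the word-for-word channel model (eq. 3)."""
    total, prev = 0.0, "<s>"
    for w, s in zip(source, observed):
        total += math.log(BIGRAM.get((prev, w), 1e-6))  # language model term
        total += math.log(CHANNEL.get((w, s), 1e-6))    # channel model term
        prev = w
    return total
```

With such a scorer, any two equal-length candidate source sequences can be compared for a given SR output; the search in the next section is what enumerates the candidates.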
We also require a method for searching among possible source utterances w_{1,n} for the most likely correction of the given word sequence, i.e., the one that yields the greatest value of P[w_{1,n}] · P[s_{1,m} | w_{1,n}]. We use a Viterbi beam-search for this purpose (cf. [5, 11]).
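Under the one-for-one channel assumption, such a search can be sketched as a per-word beam search over confusable alternatives. The function name, candidate-generation tables, and beam width below are hypothetical, intended only to make the search concrete:

```python
import math

def beam_search_correct(observed, channel, bigram, beam_width=3):
    """Viterbi-style beam search for the source sequence maximizing
    P[w] * P[s|w] under a bigram LM and a one-for-one channel model.
    `channel[s]` maps an observed word to {candidate_source: P[s|source]};
    `bigram[(prev, w)]` gives P[w | prev]. Both tables are placeholders."""
    beams = [(0.0, ["<s>"])]  # each entry: (log probability, partial source)
    for s in observed:
        candidates = channel.get(s, {s: 1.0})  # unseen words pass through
        scored = []
        for logp, words in beams:
            for w, p_channel in candidates.items():
                p_lm = bigram.get((words[-1], w), 1e-6)
                scored.append((logp + math.log(p_lm) + math.log(p_channel),
                               words + [w]))
        # Beam pruning: keep only the best few partial hypotheses
        scored.sort(key=lambda t: t[0], reverse=True)
        beams = scored[:beam_width]
    best_logp, best_words = beams[0]
    return best_words[1:]  # drop the <s> marker
```

For example, with toy tables in which RATE is confusable with RIGHT and THAT with THE, the search recovers "RIGHT SEND THE" from the observed "RATE SEND THAT" of the first example above.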
2.2. Enhancements to the Models

To improve the language model, we use higher-order n-grams, thereby assuming that each word in w_{1,n} is dependent on its n-1 predecessors. We also use back-off n-gram models for combating the problem of sparse training data [9].

[Figure 1. Recovering Word Sequences Corrupted in a Noisy Channel.]

For the channel model, we relax the constraint that replacement errors be aligned on a word-by-word basis, since not all recognition errors consist of simple replacement of one word by another. Some errors appear as the break-up of one word into shorter words. Other errors involve the erroneous concatenation of two or more words to make a longer word. We will use the following utterance from the TRAINS-95 dialogues as an example.

REF: GO FROM CHICAGO TO TOLEDO
HYP: GO FROM CHICAGO TO TO LEAVE AT
Following Brown et al., we refer to a picture such as Figure 2 as an alignment. We use an alignment to indicate the source words in the REF sequence for each of the words in the HYP sequence.

[Figure 2. Alignment of a Hypothesis and the Reference Transcription.]

For alignments, we use the following notation: we write the post-channel transcription (s_{1,m}) followed by the pre-channel transcription (w_{1,n}), separated by a vertical bar and enclosed in parentheses. We also refer to the number of post-channel words produced by a pre-channel word in a particular alignment as the fertility of that pre-channel word. Following each of the pre-channel words, we provide its fertility in the current alignment in parentheses. Alignments are easily computed using a dynamic programming algorithm for word sequence alignment. Returning to our example, we have the alignment:

(GO FROM CHICAGO TO TO LEAVE AT | GO(1) FROM(1) CHICAGO(1) TO(1) TOLEDO(3))
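The dynamic-programming alignment mentioned above can be sketched as a standard minimum-edit-distance alignment over words. This is an illustrative implementation, not the authors' code:

```python
def align(ref, hyp):
    """Align two word sequences by minimum edit distance (dynamic
    programming), as used to compute REF/HYP alignments. Returns a list
    of (ref_word_or_None, hyp_word_or_None) pairs: a pair with None on
    one side marks an insertion or deletion."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace to recover one optimal alignment
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1   # deletion
        else:
            pairs.append((None, hyp[j - 1])); j -= 1   # insertion
    return pairs[::-1]
```

On the example above, the five REF words align to the seven HYP words with one substitution and two insertions around TOLEDO.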
To augment our channel model, we require a fertility model P[k | w] that indicates how likely each word w in the pre-channel vocabulary will have a particular fertility k.⁴ When a word's fertility k is an integer value between two and five, it indicates that the pre-channel word resulted in multiple post-channel words. When a word's fertility is one, then the word accounts for exactly one post-channel word. When a word's fertility is a fraction 1/n (for 2 ≤ n ≤ 5), then the word and n-1 neighboring words have grouped together to result in a single post-channel word. We call this situation fractional fertility. For example, a word with k = 1/3 indicates the situation in which this word and two neighboring source words contribute to one word in the hypothesis; i.e., each word accounts for one-third of the post-channel word. When a word's fertility is a fraction m/n (for 2 ≤ m ≠ n ≤ 5), then the word and n-1 neighboring pre-channel words have grouped together to result in m post-channel words. The latter case can be used to handle arbitrary segmentation errors. For example, a word with k = 3/2 indicates that this word and a neighboring source word contribute to three words in the hypothesis; thus, we can imagine each word accounting for three-halves of the post-channel words. A concrete example of this alignment is (TO LEAVE DOING | TOLEDO(3/2) IN(3/2)).

⁴ Values higher than five are ignored, since they are very rare.
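Given alignments in which each pre-channel word is paired with the post-channel words it produced, the integer-fertility distribution P[k | w] can be tabulated as in this sketch. The function name and the plain relative-frequency estimate are assumptions for illustration; fractional fertilities are omitted:

```python
from collections import defaultdict

def fertility_model(alignments):
    """Estimate the fertility distribution P[k | w] from a list of
    alignments, each given as (source_word, [post-channel words]) pairs.
    Fertilities above five are ignored, following the paper's footnote.
    A minimal sketch: many-to-one (fractional) fertilities are not handled."""
    counts = defaultdict(lambda: defaultdict(int))
    for alignment in alignments:
        for w, produced in alignment:
            k = len(produced)
            if 1 <= k <= 5:
                counts[w][k] += 1
    model = {}
    for w, ks in counts.items():
        total = sum(ks.values())
        model[w] = {k: c / total for k, c in ks.items()}
    return model
```

For instance, if TOLEDO is seen once producing three words (TO LEAVE AT) and once producing itself, the model gives P[3 | TOLEDO] = P[1 | TOLEDO] = 0.5.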
To understand how fertility models are used, we need to extend the basic search algorithm. As before, the algorithm searches for an optimal source utterance w_{1,n}, modulo the beam pruning. This extended search builds possible sequences one word at a time using s_{1,m} for guidance as before. Each word in s_{1,m} is exploded (or collapsed with neighbors) using all possible combinations. The hypotheses are scored according to 1. the LM and 2. the channel model for one-for-one replacements or the fertility model for other kinds of replacements. As before, dynamic programming on partial source sentences and beam pruning will make the search efficient.

Observe that the fertility model scores only the number of words used to replace a particular word. It actually relies on the language model to score the contents of the replacement. This is motivated by the related approach of Brown et al., who appear to have taken this direction in order to avoid the problems of gathering statistics from hopelessly sparse data.
3. EXPERIMENTAL RESULTS

The post-processor has been implemented to use the simple one-for-one channel model and a back-off bigram language model. The channel model incorporating fertility is work in progress. The language model was trained on hand-transcribed utterances from the TRAINS-95 dialogues. The channel model was constructed by automatically aligning the output of Sphinx-II (having fixed language and acoustic models) with the hand transcriptions and by tabulating substitutions.
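The tabulation step described above can be sketched as follows. The function name and plain relative-frequency estimation are assumptions for illustration; the paper does not specify its smoothing of these counts:

```python
from collections import defaultdict

def train_channel_model(aligned_pairs):
    """Build the one-for-one channel model P[s | w] by tabulating
    substitutions from aligned (reference, hypothesis) word pairs, as in
    the alignment of recognizer output with hand transcriptions.
    An illustrative, unsmoothed sketch."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref_word, hyp_word in aligned_pairs:
        counts[ref_word][hyp_word] += 1
    channel = {}
    for w, subs in counts.items():
        total = sum(subs.values())
        # Relative frequency of each observed word given the intended word
        channel[w] = {s: c / total for s, c in subs.items()}
    return channel
```

The resulting table plugs directly into the first-approximation channel model of Section 2.1: a word never observed as a misrecognition simply has no entry and is forwarded unchanged.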
To test the post-processor, an independent set of utterances was held out for evaluation. The cross-validated performance of Sphinx-II alone and in tandem with the post-processor is depicted in Figure 3. Sphinx-II's class-based language model was trained only on data from the ATIS spoken language corpora. Also illustrated are the amounts of training data required by the post-processor to make a particular contribution to word recognition accuracy. This validates the claim that the post-processor can make a significant impact in tuning the SR if the SR cannot be modified as we have discussed. Also, equivalent amounts of training data can be used with comparable impact in the post-processor as in the language model of the SR. Furthermore, preliminary results indicate that if the language model of the SR can indeed be modified, then the post-processor can still significantly improve word recognition accuracy. Hence the post-processor is in neither case redundant.

[Figure 3. Influence of the post-processor with additional training data. Post-processor performance plotted against the number of TRAINS-95 words in the training set (0 to 12,000).]
4. FUTURE DIRECTIONS

We have presented models and methods for overcoming speech recognition errors. We have also provided evidence for the claim that modern speech recognition engines can be used successfully as black boxes for robustly interpreting utterances in a dialogue with a human.

Open issues include whether word lattices will provide better opportunities over simple word sequences for post-processor correction. For the word-lattice configuration, the post-processor must be modified to process the alternatives in the lattice. One point to consider here is the width of the lattice (i.e., the number of alternatives at a given point in the utterance). This factor can implicitly reflect the confidence of the SR in its hypotheses and may be useful as a parameter in the correction process.

In addition to the purely statistical mechanisms for recovering pre-channel word sequences outlined above, other cues may augment the search. For example, syllables and vowel nuclei may be usable for aligning pre-channel and post-channel words and phrases. Such alignments may be useful for further constraining the search algorithms and yielding better corrections.
REFERENCES

[1] J. F. Allen, G. Ferguson, B. Miller, and E. Ringger. Spoken Dialogue and Interactive Planning. In Proceedings of the ARPA SLST Workshop, San Mateo, California, January 1995. ARPA, Morgan Kaufmann.
[2] L. R. Bahl, F. Jelinek, and R. Mercer. A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190, March 1983.
[3] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79-85, June 1990.
[4] Y. Chow and R. Schwartz. The N-best Algorithm: An Efficient Procedure for Finding Top N Sentence Hypotheses. In Proceedings of the Second DARPA Workshop on Speech and Natural Language, pages 199-202, San Mateo, California, October 1989. DARPA, Morgan Kaufmann.
[5] G. D. Forney. The Viterbi Algorithm. In Proceedings of the IEEE, volume 61, pages 266-278. IEEE, 1973.
[6] P. Heeman and J. F. Allen. The TRAINS 93 Dialogues. TRAINS Technical Note 94-2, Department of Computer Science, University of Rochester, Rochester, NY, 14627, March 1995.
[7] X. D. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, K. F. Lee, and R. Rosenfeld. The Sphinx-II Speech Recognition System: An Overview. Computer, Speech and Language, 1993.
[8] F. Jelinek. Self-organized Language Modeling for Speech Recognition. Reprinted in [13]: 450-506, 1990.
[9] S. M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. In IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 400-401. IEEE, March 1987.
[10] K.-F. Lee. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic, Boston, London, 1989.
[11] B. Lowerre and R. Reddy. The Harpy Speech Understanding System. In Trends in Speech Recognition. Speech Science Publications, Apple Valley, Minnesota, 1986. Reprinted in [13]: 576-586.
[12] M. Rayner, D. Carter, V. Digalakis, and P. Price. Combining Knowledge Sources to Reorder N-best Speech Hypothesis Lists. In Proceedings ARPA Human Language Technology Workshop, pages 212-217. ARPA, March 1994.
[13] A. Waibel and K.-F. Lee, editors. Readings in Speech Recognition. Morgan Kaufmann, San Mateo, 1990.
[14] S. J. Young and P. C. Woodland. HTK: Hidden Markov Model Toolkit. Entropic Research Laboratory, Washington, D.C., 1993.