712 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 5, SEPTEMBER 2005
SpeechFind: Advances in Spoken Document Retrieval
for a National Gallery of the Spoken Word
John H. L. Hansen, Senior Member, IEEE, Rongqing Huang, Student Member, IEEE, Bowen Zhou, Member, IEEE,
Michael Seadle, J. R. Deller, Jr., Fellow, IEEE, Aparna R. Gurijala, Mikko Kurimo, and
Pongtep Angkititrakul, Member, IEEE
Abstract—Advances in formulating spoken document retrieval for a new National Gallery of the Spoken Word (NGSW) are addressed. NGSW is the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings from the 20th century. After presenting an overview of the audio stream content of the NGSW, with sample audio files from U.S. Presidents from 1893 to the present, an overall system diagram is proposed with a discussion of critical tasks associated with effective audio information retrieval. These include advanced audio segmentation, speech recognition model adaptation for acoustic background noise and speaker variability, and information retrieval using natural language processing for text query requests that include document and query expansion. For segmentation, a new evaluation criterion entitled fused error score (FES) is proposed, followed by application of the CompSeg segmentation scheme on DARPA Hub4 Broadcast News (30.5% relative improvement in FES) and NGSW data. Transcript generation is demonstrated for a six-decade portion of the NGSW corpus. Novel model adaptation using structure maximum likelihood eigenspace mapping shows a relative 21.7% improvement. Issues regarding copyright assessment and metadata construction are also addressed for the purposes of a sustainable audio collection of this magnitude. Advanced parameter-embedded watermarking is proposed with evaluations showing robustness to correlated noise attacks. Our experimental online system entitled "SpeechFind" is presented, which allows for audio retrieval from a portion of the NGSW corpus. Finally, a number of research challenges such as language modeling and lexicon for changing time periods, speaker trait and identification tracking, as well as new directions, are discussed in order to address the overall task of robust phrase searching in unrestricted audio corpora.

Manuscript received July 1, 2004; revised April 4, 2005. This work was supported by the National Science Foundation (NSF) Cooperative Agreement IIS-9817485. Any opinions, findings, and conclusions expressed are those of the authors and do not necessarily reflect the views of the NSF. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mazin Gilbert.

J. H. L. Hansen and R. Huang were with the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Boulder, CO 80302 USA. They are now with the Center for Robust Speech Systems, University of Texas at Dallas, Richardson, TX 75083 USA (e-mail: John.Hansen@utdallas.edu).

B. Zhou was with the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Boulder, CO 80302 USA. He is now with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA.

M. Seadle is with the Main Library, Michigan State University, East Lansing, MI 48824 USA.

J. R. Deller, Jr. and A. R. Gurijala are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 USA.

M. Kurimo is with the Neural Networks Research Center, Helsinki University of Technology, Helsinki, Finland.

P. Angkititrakul was with the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Boulder, CO 80302 USA. He is now with Eliza Corp., Beverly, MA 01915 USA.

Digital Object Identifier 10.1109/TSA.2005.852088
Index Terms—Accent classification, broadcast news, document expansion, environmental sniffing, fidelity, fused error score, information retrieval, language modeling, model adaptation, query expansion, robust speech recognition, robustness, security, speech segmentation, spoken document retrieval, watermarking.
I. INTRODUCTION

The problem of reliable speech recognition for spoken
document/information retrieval is a challenging problem
when data is recorded across different media, equipment, and
time periods. In this paper, we address a number of issues
associated with audio stream phrase recognition, copyright/wa-
termarking, and audio content delivery for a new National
Gallery of the Spoken Word (NGSW) [1]. This is the first
large-scale repository of its kind, consisting of speeches, news
broadcasts, and recordings that are of significant historical con-
tent. The U.S. National Science Foundation recently established
an initiative to provide better transition of library services to
digital format. As part of this Phase-II Digital Libraries Initia-
tive, researchers from Michigan State University (MSU) and
the University of Colorado at Boulder (CU) have teamed to
establish a fully searchable, online WWW database of spoken
word collections that span the 20th century [5]. The database
draws primarily from holdings of MSU’s Vincent Voice Li-
brary (VVL) that include more than 60 000 hr of recordings
(from Thomas Edison’s first cylinder disk recordings to famous
speeches such as man's first steps on the moon ("One Small
Step for Man"), to American presidents over the past 100 years).
In this partnership, MSU digitizes and houses the collection,
as well as cataloging, organizing, and providing meta-tagging
information. A networked client-server configuration has been
established between MSU and CU to provide automatic tran-
script generation for seamless audio content delivery. MSU
is also responsible for several engineering challenges such as
digital watermarking and effective compression strategies [6],
[7]. The Robust Speech Processing Group—Center for Spoken
Language Research (RSPG-CSLR) (CU) is responsible for
developing robust automatic speech recognition for transcript
generation and a prototype audio/metadata/transcript-based user
search engine, called SpeechFind [2].
In the field of robust speech recognition, a variety of
challenging problems persist, such as reliable speech recognition
across wireless communications channels, recognition of
speech across changing speaker conditions (emotion and stress
[25]–[27], accent [28], [29]), or recognition of speech from
unknown or changing acoustic environments. The ability to
achieve effective performance in changing speaker conditions
for large vocabulary continuous speech recognition (LVCSR)
remains a challenge, as demonstrated in recent DARPA evalua-
tions focused on Broadcast News (BN) versus previous results
from the Wall Street Journal (WSJ) corpus. Although the
problem of audio stream search is relatively new, it is related
to a number of previous research problems. Systems developed
for streaming video search based on audio [30] or closed-cap-
tioning can be effective but often assume either an associated
text stream or a clean audio stream. Information retrieval
via audio and audio mining have recently produced several
commercial approaches [32], [33]; however, these methods
generally focus on relatively clean single-speaker recording
conditions. Alternative methods have considered ways to
time-compress or modify speech in order to give human lis-
teners the ability to more quickly skim through recorded audio
data [34]. In general, keyword spotting systems can be used
for topic or gisting applications (here, gisting refers to systems
that identify the main topic, or gist, of the audio material).
However, for phrase search,
the system must be able to recover from errors in both the
user-requested text sequence and rank-ordered detected phrase sites
within the stream. Phrase search focuses more on locating a
single requested occurrence, whereas keyword/topic spotting
systems assume a number of possible searched outcomes. Great
strides have also been made in LVCSR for spoken document
retrieval for BN in English [31], [35]–[39], German [40], [41],
Italian [42], Korean [43], Japanese [44]–[47], Mandarin/Chinese
[48]–[52], [57], Finnish [104], Portuguese [53], Arabic
[54], and French [55]. The American English BN corpus reflects
a wider range of acoustic environments than many large
vocabulary corpora (e.g., WSJ, TIMIT). However, the recognition
of speech in BN reflects a homogeneous data corpus (i.e.,
recordings from TV and radio news broadcasts from the 1990s,
organized into seven classes from F0: clean, to FX: low fidelity
with cross-talk). One natural solution to audio stream search is
to perform forced transcription for the entire dataset and simply
search the synchronized text stream. Whereas this may be a
manageable task for BN (consisting of about 100 hr), the initial
offering for NGSW will be 5000 hr (with a potential of 60 000
total hr), and it will not be possible to achieve accurate forced
transcription since text data will generally not be available.
Other studies have also considered web-based spoken docu-
ment retrieval (SDR) [3], [4], [56]. Transcript generation of
broadcast news can also be conducted in an effort to obtain near
real-time closed-captioning [58]. Instead of generating exact
transcripts, some studies have considered summarization and
topic indexing [59]–[61] or, more specifically, topic detection
and tracking [64]; others have considered lattice-based search
[101]. Some of these ideas are related to speaker clustering [62],
[63], which is needed to improve acoustic model adaptation
for BN transcription generation. Language model adaptation
[65] and multiple/alternative language modeling [66] have also
been considered for SDR. Finally, cross- and multilingual-based
studies have also been performed for SDR [67], [68]. Advances
represented by the cited BN and SDR studies notwithstanding,
the NGSW database involves a level of complexity in terms of
the range and extent of acoustic distortion, speaker variability,
and audio quality that has not been approached in existing
research. Probably the only corpus-based study that comes
close to NGSW is one focused on Holocaust Survivors [69],
consisting of a broad range of speakers in structured two-person
interview formats.
In this paper, we introduce SpeechFind: an experimental on-
line spoken document retrieval system for the NGSW. In Sec-
tion II, we discuss the structure of the audio materials contained
in the VVL including time periods, recording conditions, audio
format, and acoustic conditions. Section III provides a brief
discussion of copyright issues for NGSW. Section IV presents an
overview of the SpeechFind system including transcript generation
and text-based search. Next, Section V addresses transcript
generation based on i) unsupervised segmentation, ii) model
adaptation, iii) LVCSR, and iv) text-based information retrieval.
Section VI revisits copyright issues, with a treatment of digital
watermarking strategies. Section VII considers additional audio
stream tagging and language model concepts for next-genera-
tion SDR. Finally, Section VIII summarizes the main contribu-
tions and areas for future research.
II. AUDIO CORPUS STRUCTURE OF NGSW
Spoken document retrieval focuses on employing text-based
search strategies from transcripts of audio materials. The tran-
scripts, in turn, have reverse index timing information that al-
lows audio segments to be returned for user access.
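To make the reverse-index idea concrete, the sketch below shows one minimal way a timed term index could be organized so that a text query resolves back to audio regions. It is purely illustrative: the actual SpeechFind index format is not described at this point in the paper, and all names and timings below are hypothetical.

```python
from collections import defaultdict

# Hypothetical reverse index: each term maps to the places it was
# recognized, so a text query resolves back to audio regions.
# (Illustrative only; not the actual SpeechFind schema.)
index = defaultdict(list)

def add_transcript(doc_id, words):
    """words: list of (term, start_sec, end_sec) tuples from ASR output."""
    for term, start, end in words:
        index[term.lower()].append((doc_id, start, end))

def lookup(term):
    """Return (doc_id, start_sec, end_sec) hits for a query term."""
    return index.get(term.lower(), [])

# Toy usage with made-up timings:
add_transcript("clinton_1999_sotu",
               [("tonight", 12.4, 12.9), ("economy", 305.1, 305.7)])
print(lookup("economy"))  # -> [('clinton_1999_sotu', 305.1, 305.7)]
```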
Whereas automatic speech recognition (ASR) technology has advanced
significantly, the ability to perform ASR for SDR presents some
unique challenges. These include i) a diverse range of audio
recording conditions, ii) the ability to search output text
materials with variable levels of recognition (i.e., word-error-rate:
WER) performance, and iii) decisions on what material/content
should be extracted for transcript knowledge to be used for
SDR (e.g., text content, speaker identification or tracking,
environmental sniffing [93], [94], etc.). For some audio streams
such as voice-mail, which normally contain only one speaker,
or two-way telephone conversations with two speakers, tran-
scription using ASR technology is possible since the task pri-
marily focuses on detecting silence/speech activity and then en-
gaging the recognizer appropriate for that speaker. However,
audio streams from NGSW encompass one of the widest ranges
of audio materials available today. Fig. 1 presents an overview
of the types of audio files and recording structure seen in the
audio. The types of audio include the following:
Monologs: single speaker talking spontaneously or
reading prepared/prompted text in clean conditions;
Two-Way Conversations: telephone conversations be-
tween two subjects that are spontaneous and could con-
tain periods with both talking;
Speeches: audio data where a person (e.g., politician)
is speaking to an audience; primarily one talker, but
background audience noise could be present, and room
echo or noise is possible; typically read/prepared text;

Fig. 1. Structure of i) NGSW audio recordings: speakers, microphone(s), recording media; and ii) segmentation and classification, speech recognition, and
transcript generation.
Interviews/Debates: audio streams where a person is
being interviewed for TV or radio. Debates
could include a moderator and/or various audience
participation (e.g., questions, applause, interruptions,
etc.); typically two speakers with spontaneous speech
in question/answer format;
Radio/TV News Broadcasts: includes traditional news
anchor with periods of both prompted read speech, talk
radio, spontaneous speech, background music, call-in
speakers, commercials, other background audio content
(e.g., office noise such as typewriter, etc.). Audio
content would come from TV or radio studio settings
(e.g., public radio such as NPR, or the "60 Minutes" TV
show);
Field News Broadcasts: audio content coming from
news reporters in the field (e.g., emergency or war loca-
tions, city streets, etc.); contains a wide range of back-
ground noise content of unpredictable origin. Commu-
nication channels also impact frequency content of the
audio;
Recording Media/Transmission: audio properties
can be transformed based on the type of recording
equipment used (e.g., microphones, Edison cylinder
disks, reel-to-reel tape, cassette tape, DAT, CD, etc.)
or transmission (e.g., AM, FM, voice compression
methods: CELP, MELP, ADPCM, etc.);
Meetings/Hearings: public formal inquiries (Water-
gate hearings, U.S. Supreme Court, etc.);
Debates: presidential, formal and informal
(Nixon–Kennedy, Clinton–Dole, etc.);
Historical Recordings: NASA: "walk on the moon,"
Nixon: "I'm not a crook," M. L. King: "I have a
dream," etc.
Therefore, NGSW audio content includes a diverse range of
audio formats, recording media, and time periods, with changing
names, places, topics, and choice of vocabulary. The
following issues arise for transcript generation for SDR: Do we
transcribe commercials? Do we transcribe background acoustic
noise/events? Do we identify speakers with the text? Do we
identify from where the speakers are speaking (i.e., the environ-
ment/location)? How do we deal with errors in ASR (i.e., dirty
transcripts)? Since automatic transcription for such a diverse
range of audio materials will lead to significant variability in
WER, SDR employing text-based search of such transcripts
will be an important research issue to consider. For our initial
system, we focus on transcript generation of individual speech
and disable transcription production for music/commercials.
To illustrate the range of NGSW recording conditions, three
example spectrograms are shown in Fig. 2. The recordings are
(a) Thomas Edison, "my work as an electrician" [talking about
contributions of 19th century scientists; original Edison cylinder
disk recording, 1908], (b) Thomas Watson, "as Bell was about
to speak into the new instrument" [talking about the first telephone
message from A. G. Bell on March 10, 1876; recorded
in 1926], and (c) President Bill Clinton, "tonight I stand before
you" [State of the Union Address on economic expansion, Jan.
tions present in the speech corpus. Some of these include severe
bandwidth restrictions (e.g., Edison style cylinder disks), poor
audio from scratchy, used, or aging recording media, differences
in microphone type and placement, reverberation for speeches
from public figures, recordings from telephone, radio, or TV
broadcasts, background noise including audience and multiple
speakers or interviewers, a wide range of speaking styles and
accents, etc.
As another example, we show in Fig. 3 a summary of U.S.
Presidential speeches, consisting mostly of state-of-the-union
or campaign speeches from 1893 to the present. For each presi-
dential speech, we employed the NIST speech-to-noise ratio es-
timation scheme (STNR) to identify the mean speech and noise
decibel values. As we see from this figure, the resulting digitized
speech levels are typically near 80 dB, whereas background
noise levels can vary significantly (42–78 dB). We obtained the
STNR values for each presidential speech, which ranged from
4 to 37 dB. Clearly, the estimated STNR only has meaning
if frequency content is consistent, but as we see in this figure, the
estimated frequency bandwidth for early Edison cylinder disks
is about 1–2.5 kHz, whereas recordings of today are closer to
7 kHz, with AM/FM radio bandwidths in the 5–10 kHz range
(note that while the audio format is 44.1-kHz, 16-bit data, transcript
generation uses a sample rate of 16 kHz; therefore, our
maximum bandwidth from these recordings for speech content
would have been 8 kHz). Recordings for Wilson and Hoover
were extremely noisy, with background audience and echo distortion,
as well as poor, scratchy recording equipment. In addition,
vocabulary selection varies significantly over the 110-year

Fig. 2. Example audio stream (8 kHz) spectrograms from NGSW. (a) Thomas Edison, recorded in 1908. (b) Thomas Watson, recorded in 1926. (c) President
William J. Clinton, recorded in 1999.
Fig. 3. Summary of presidential speeches (typically state-of-the-union addresses) from 1893 to present. Shown is each president, mean noise signal level (in
decibels) (top bars in each pair), mean speech signal level (in decibels) (bottom bars in each pair); approximate frequency bandwidth (BW) of each recording, with
an estimated speech-to-noise ratio (STNR) varying from 4.5–37.25 dB.
period. Clearly, the ability to achieve reliable phrase recogni-
tion search for such data is an unparalleled challenge in speech
recognition.
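As a rough illustration of how such speech and noise levels can be estimated, the sketch below derives them from frame energies, assuming the loudest frames are dominated by speech and the quietest by background noise. This is a simplified stand-in for the NIST STNR estimation scheme, not the NIST algorithm itself; the function name and the 10% quantile choice are assumptions made here for illustration.

```python
import numpy as np

def estimate_stnr(signal, fs, frame_ms=20):
    """Crude speech-to-noise estimate from frame energies.

    Assumes high-energy frames are speech and low-energy frames are
    noise (a stand-in for the NIST STNR method, not the real thing).
    """
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    # Frame power in dB, with a floor to avoid log(0) on pure silence.
    power_db = 10 * np.log10(np.mean(frames.astype(float) ** 2, axis=1) + 1e-12)
    power_db.sort()
    k = max(1, len(power_db) // 10)
    noise_db = power_db[:k].mean()    # quietest 10% of frames
    speech_db = power_db[-k:].mean()  # loudest 10% of frames
    return speech_db, noise_db, speech_db - noise_db

# Toy usage on synthetic data (low-level noise plus a louder "speech" burst):
rng = np.random.default_rng(0)
x = rng.normal(0, 0.01, 16000 * 5)
x[16000:32000] += rng.normal(0, 0.3, 16000)
print(estimate_stnr(x, fs=16000))
```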
III. COPYRIGHT ISSUES IN NGSW
When considering distribution of audio material via the
WWW, one primary logistics issue concerns copyright own-
ership. Research on watermarking digital sound is integral to
the creation of the NGSW. Most sound recordings have some
form of copyright protection under existing law. The U.S. Copyright
Law (Title 17 of the U.S. Code) explicitly protects sound
recordings made since 1978. Some famous speeches have been
heavily litigated. An example is Martin Luther King's "I Have
a Dream" speech [20].
Many rights holders are willing to make their sound record-
ings available for educational purposes, but they often require
some form of technological protection to prevent legitimate
educational copies from being used for unauthorized commer-
cial purposes. The 1998 Digital Millennium Copyright Act
(DMCA) introduced penalties for circumventing technological
protections. Many in the academic community object to these
penalties because they create a contradiction in U.S. law: many

Fig. 4. Overview of SpeechFind system architecture (http://SpeechFind.colorado.edu).
Fig. 5. (a) Automatic transcript generation (SDR). (b) Statistical information retrieval (SIR).
legal fair uses of technologically protected works can be exer-
cised only through illegal circumvention. Audio watermarking
is a desirable technological protection mechanism because it
does not abrogate fair use rights.
There is evidence that the courts consider audio watermarks
to be a legitimate form of copyright protection. The Napster
music file-sharing case, for example, mentions both the lack of
watermarking on MP3 files and the intention to include it in the
future [19]. Watermarking is therefore not a preventative.
Prevention is attractive to those who put significant capital
toward the creation of audio works and who fear the
loss of investment and future prots. However, prevention
is fundamentally inconsistent with most US copyright law,
which instead emphasizes mechanisms for redress once an in-
fringement has occurred. Watermarking facilitates redress and
represents a copyright protection technology that universities
can use without being inconsistent with their interest in and
commitment to sharing knowledge. Further treatment of copy-
right issues and fair use can be found in [7] and [21]–[24]. In
Section VI, we consider advances made in digital watermarking
for the NGSW project. For the present experimental online
SDR system, digital watermarking is employed to both protect
ownership as well as help ensure integrity of the audio content.
IV. SPEECHFIND SYSTEM OVERVIEW
Here, we present an overview of the SpeechFind system (see
Fig. 4) and describe several key modules. The system is con-
structed in two phases: i) enrollment and ii) query and retrieval.
In the enrollment phase, large audio sets are submitted for
audio segmentation, transcription generation, and metadata
construction (EAD: extended archive descriptor). Once this
phase is completed, the audio material is available through the
online audio search engine (i.e., query and retrieval phase).
The system includes the following modules: an audio spider and
transcoder, spoken document transcriber, rich transcription
database, and an online publicly accessible search engine. As
shown in the figure, the audio spider and transcoder are respon-
sible for automatically fetching available audio archives from
a range of available servers and transcoding the heterogeneous
incoming audio files into uniform 16-kHz, 16-bit linear PCM
raw audio data (note that in general, the transcoding process
is done offline prior to being available for user retrieval). In
addition, for those audio documents with metadata labels, this
module also parses the metadata and extracts relevant informa-
tion into a rich transcript database for guiding information
retrieval.
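As one concrete way to realize this transcoding step, the snippet below converts an arbitrary incoming file to the uniform 16-kHz, 16-bit linear PCM target using the ffmpeg command-line tool. The paper does not name the tool the audio spider actually used, so ffmpeg and the file names here are assumptions for illustration only.

```python
import subprocess

def transcode_to_pcm16k(src_path, dst_path):
    """Convert any input audio to 16-kHz, 16-bit mono linear PCM WAV.

    Uses ffmpeg (an assumption; the paper does not name the tool
    used by the audio spider for this step).
    """
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", src_path,        # heterogeneous incoming audio file
         "-ar", "16000",        # resample to 16 kHz
         "-ac", "1",            # mix down to a single channel
         "-sample_fmt", "s16",  # 16-bit signed linear PCM
         dst_path],
        check=True,
    )

# Hypothetical usage on a fetched archive file:
transcode_to_pcm16k("incoming/edison_1908.mp3", "pcm/edison_1908.wav")
```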
The spoken document transcriber includes two components,
namely, the audio segmenter and transcriber. The audio seg-
menter partitions audio data into manageable small segments by
detecting speaker, channel, and environmental change points.
The transcriber decodes every speech segment into text. If
human transcripts are available for any of the audio documents,
the segmenter is still applied to detect speaker, channel, and
environmental changes in a guided manner, with the decoder
being reduced to a forced aligner for each speech segment to
tag timing information for spoken words.
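The segmentation algorithm itself (CompSeg) is detailed later in the paper (Section V). Purely as an illustration of acoustic change-point detection of the kind the segmenter performs, the sketch below implements the classic single-change Bayesian information criterion (BIC) test over a window of feature vectors. It is a generic textbook formulation under simplifying assumptions (full-covariance Gaussians, one candidate change point), not the paper's CompSeg scheme.

```python
import numpy as np

def bic_change_point(X, penalty=1.0):
    """Single change-point test via the Bayesian information criterion.

    X: (n_frames, n_dims) acoustic feature matrix (e.g., MFCCs).
    Returns (best_frame, delta_bic); a positive delta_bic suggests a
    speaker/channel/environment change at best_frame. Generic textbook
    BIC, not the paper's CompSeg algorithm.
    """
    n, d = X.shape
    logdet = lambda S: np.linalg.slogdet(S + 1e-6 * np.eye(d))[1]
    full = 0.5 * n * logdet(np.cov(X.T))
    # Complexity penalty for the extra mean vector and covariance matrix.
    p = penalty * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    best_i, best_delta = -1, -np.inf
    for i in range(d + 1, n - d - 1):  # keep both halves estimable
        split = (0.5 * i * logdet(np.cov(X[:i].T))
                 + 0.5 * (n - i) * logdet(np.cov(X[i:].T)))
        delta = full - split - p
        if delta > best_delta:
            best_i, best_delta = i, delta
    return best_i, best_delta

# Toy usage: two segments with different statistics.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (200, 13)),
               rng.normal(2, 1.5, (200, 13))])
print(bic_change_point(X))  # change point should land near frame 200
```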
Fig. 5(a) shows that for the proposed SpeechFind system,
transcript generation is first performed, which requires reliable
acoustic and language models that are appropriate for the type
of audio stream and time period. After transcript generation,
Fig. 5(b) shows that three associated files are linked together,
namely i) the audio
