712 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 5, SEPTEMBER 2005
SpeechFind: Advances in Spoken Document Retrieval
for a National Gallery of the Spoken Word
John H. L. Hansen, Senior Member, IEEE, Rongqing Huang, Student Member, IEEE, Bowen Zhou, Member, IEEE,
Michael Seadle, J. R. Deller, Jr., Fellow, IEEE, Aparna R. Gurijala, Mikko Kurimo, and
Pongtep Angkititrakul, Member, IEEE
Abstract—Advances in formulating spoken document retrieval for a new National Gallery of the Spoken Word (NGSW) are addressed. NGSW is the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings from the 20th century. After presenting an overview of the audio stream content of the NGSW, with sample audio files from U.S. Presidents from 1893 to the present, an overall system diagram is proposed with a discussion of critical tasks associated with effective audio information retrieval. These include advanced audio segmentation, speech recognition model adaptation for acoustic background noise and speaker variability, and information retrieval using natural language processing for text query requests that include document and query expansion. For segmentation, a new evaluation criterion entitled fused error score (FES) is proposed, followed by application of the CompSeg segmentation scheme on DARPA Hub4 Broadcast News (30.5% relative improvement in FES) and NGSW data. Transcript generation is demonstrated for a six-decade portion of the NGSW corpus. Novel model adaptation using structure maximum likelihood eigenspace mapping shows a relative 21.7% improvement. Issues regarding copyright assessment and metadata construction are also addressed for the purposes of a sustainable audio collection of this magnitude. Advanced parameter-embedded watermarking is proposed with evaluations showing robustness to correlated noise attacks. Our experimental online system entitled "SpeechFind" is presented, which allows for audio retrieval from a portion of the NGSW corpus. Finally, a number of research challenges such as language modeling and lexicon for changing time periods, speaker trait and identification tracking, as well as new directions, are discussed in order to address the overall task of robust phrase searching in unrestricted audio corpora.

Manuscript received July 1, 2004; revised April 4, 2005. This work was supported by the National Science Foundation (NSF) Cooperative Agreement IIS-9817485. Any opinions, findings, and conclusions expressed are those of the authors and do not necessarily reflect the views of the NSF. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mazin Gilbert.

J. H. L. Hansen and R. Huang were with the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Boulder, CO 80302 USA. They are now with the Center for Robust Speech Systems, University of Texas at Dallas, Richardson, TX 75083 USA (e-mail: John.Hansen@utdallas.edu).

B. Zhou was with the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Boulder, CO 80302 USA. He is now with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA.

M. Seadle is with the Main Library, Michigan State University, East Lansing, MI 48824 USA.

J. R. Deller, Jr. and A. R. Gurijala are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 USA.

M. Kurimo is with the Neural Networks Research Center, Helsinki University of Technology, Helsinki, Finland.

P. Angkititrakul was with the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado Boulder, Boulder, CO 80302 USA. He is now with Eliza Corp., Beverly, MA 01915 USA.

Digital Object Identifier 10.1109/TSA.2005.852088
Index Terms—Accent classification, broadcast news, document expansion, environmental sniffing, fidelity, fused error score, information retrieval, language modeling, model adaptation, query expansion, robust speech recognition, robustness, security, speech segmentation, spoken document retrieval, watermarking.
I. INTRODUCTION

The problem of reliable speech recognition for spoken
document/information retrieval is a challenging problem
when data is recorded across different media, equipment, and
time periods. In this paper, we address a number of issues
associated with audio stream phrase recognition, copyright/wa-
termarking, and audio content delivery for a new National
Gallery of the Spoken Word (NGSW) [1]. This is the first
large-scale repository of its kind, consisting of speeches, news
broadcasts, and recordings that are of significant historical con-
tent. The U.S. National Science Foundation recently established
an initiative to provide better transition of library services to
digital format. As part of this Phase-II Digital Libraries Initia-
tive, researchers from Michigan State University (MSU) and
the University of Colorado at Boulder (CU) have teamed to
establish a fully searchable, online WWW database of spoken
word collections that span the 20th century [5]. The database
draws primarily from holdings of MSU’s Vincent Voice Li-
brary (VVL) that include more than 60 000 hr of recordings
(from Thomas Edison’s first cylinder disk recordings to famous
speeches such as man's first steps on the moon ("One Small
Step for Man"), to American presidents over the past 100 years).
In this partnership, MSU digitizes and houses the collection,
as well as cataloging, organizing, and providing meta-tagging
information. A networked client-server configuration has been
established between MSU and CU to provide automatic tran-
script generation for seamless audio content delivery. MSU
is also responsible for several engineering challenges such as
digital watermarking and effective compression strategies [6],
[7]. The Robust Speech Processing Group—Center for Spoken
Language Research (RSPG-CSLR) (CU) is responsible for
developing robust automatic speech recognition for transcript
generation and a prototype audio/metadata/transcript-based user
search engine, called SpeechFind [2].
In the field of robust speech recognition, a variety of
challenging problems persist, such as reliable speech recognition
across wireless communications channels, recognition of
speech across changing speaker conditions (emotion and stress
[25]–[27], accent [28], [29]), or recognition of speech from
unknown or changing acoustic environments. The ability to
achieve effective performance in changing speaker conditions
for large vocabulary continuous speech recognition (LVCSR)
remains a challenge, as demonstrated in recent DARPA evalua-
tions focused on Broadcast News (BN) versus previous results
from the Wall Street Journal (WSJ) corpus. Although the
problem of audio stream search is relatively new, it is related
to a number of previous research problems. Systems developed
for streaming video search based on audio [30] or closed-cap-
tioning can be effective but often assume either an associated
text stream or a clean audio stream. Information retrieval
via audio and audio mining have recently produced several
commercial approaches [32], [33]; however, these methods
generally focus on relatively clean single-speaker recording
conditions. Alternative methods have considered ways to
time-compress or modify speech in order to give human lis-
teners the ability to more quickly skim through recorded audio
data [34]. In general, keyword spotting systems can be used
for topic or gisting applications (here, gisting refers to systems
that identify the main topic, or gist, of the audio material).
However, for phrase search,
the system must be able to recover from errors in both the
user-requested text sequence and rank-ordered detected phrase sites
within the stream. Phrase search focuses more on locating a
single requested occurrence, whereas keyword/topic spotting
systems assume a number of possible searched outcomes. Great
strides have also been made in LVCSR for spoken document
retrieval for BN in English [31], [35]–[39], German [40], [41],
Italian [42], Korean [43], Japanese [44]–[47], Mandarin/Chinese
[48]–[52], [57], Finnish [104], Portuguese [53], Arabic
[54], and French [55]. The American English BN corpus reflects
a wider range of acoustic environments than many large
vocabulary corpora (e.g., WSJ, TIMIT). However, the recognition
of speech in BN reflects a homogeneous data corpus (i.e.,
recordings from TV and radio news broadcasts from the 1990s,
organized into seven classes from F0: clean, to FX: low fidelity
with cross-talk). One natural solution to audio stream search is
to perform forced transcription for the entire dataset and simply
search the synchronized text stream. Whereas this may be a
manageable task for BN (consisting of about 100 hr), the initial
offering for NGSW will be 5000 hr (with a potential of 60 000
total hr), and it will not be possible to achieve accurate forced
transcription since text data will generally not be available.
Other studies have also considered web-based spoken docu-
ment retrieval (SDR) [3], [4], [56]. Transcript generation of
broadcast news can also be conducted in an effort to obtain near
real-time closed-captioning [58]. Instead of generating exact
transcripts, some studies have considered summarization and
topic indexing [59]–[61] or, more specifically, topic detection
and tracking [64]; others have considered lattice-based search
[101]. Some of these ideas are related to speaker clustering [62],
[63], which is needed to improve acoustic model adaptation
for BN transcription generation. Language model adaptation
[65] and multiple/alternative language modeling [66] have also
been considered for SDR. Finally, cross- and multilingual-based
studies have also been performed for SDR [67], [68]. Advances
represented by the cited BN and SDR studies notwithstanding,
the NGSW database involves a level of complexity in terms of
the range and extent of acoustic distortion, speaker variability,
and audio quality that has not been approached in existing
research. Probably the only corpus-based study that comes
close to NGSW is one focused on Holocaust Survivors [69],
consisting of a broad range of speakers in structured two-person
interview formats.
In this paper, we introduce SpeechFind: an experimental on-
line spoken document retrieval system for the NGSW. In Sec-
tion II, we discuss the structure of the audio materials contained
in the VVL including time periods, recording conditions, audio
format, and acoustic conditions. Section III provides a brief
discussion of copyright issues for NGSW. Section IV presents an
overview of the SpeechFind system including transcript generation
and text-based search. Next, Section V addresses transcript
generation based on i) unsupervised segmentation, ii) model
adaptation, iii) LVCSR, and iv) text-based information retrieval.
Section VI revisits copyright issues, with a treatment of digital
watermarking strategies. Section VII considers additional audio
stream tagging and language model concepts for next-genera-
tion SDR. Finally, Section VIII summarizes the main contribu-
tions and areas for future research.
II. AUDIO CORPUS STRUCTURE OF NGSW
Spoken document retrieval focuses on employing text-based
search strategies from transcripts of audio materials. The tran-
scripts, in turn, have reverse index timing information that al-
lows audio segments to be returned for user access.
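To make the reverse-index idea concrete, the sketch below shows one minimal way a timed term index could be organized so that a text query resolves back to audio regions. It is purely illustrative: the actual SpeechFind index format is not described at this point in the paper, and all names and timings below are hypothetical.

```python
from collections import defaultdict

# Hypothetical reverse index: each term maps to the places it was
# recognized, so a text query resolves back to audio regions.
# (Illustrative only; not the actual SpeechFind schema.)
index = defaultdict(list)

def add_transcript(doc_id, words):
    """words: list of (term, start_sec, end_sec) tuples from ASR output."""
    for term, start, end in words:
        index[term.lower()].append((doc_id, start, end))

def lookup(term):
    """Return (doc_id, start_sec, end_sec) hits for a query term."""
    return index.get(term.lower(), [])

# Toy usage with made-up timings:
add_transcript("clinton_1999_sotu",
               [("tonight", 12.4, 12.9), ("economy", 305.1, 305.7)])
print(lookup("economy"))  # -> [('clinton_1999_sotu', 305.1, 305.7)]
```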
Whereas automatic speech recognition (ASR) technology has advanced
significantly, the ability to perform ASR for SDR presents some
unique challenges. These include i) a diverse range of audio
recording conditions, ii) the ability to search output text
materials with variable levels of recognition (i.e., word-error-rate:
WER) performance, and iii) decisions on what material/content
should be extracted for transcript knowledge to be used for
SDR (e.g., text content, speaker identification or tracking,
environmental sniffing [93], [94], etc.). For some audio streams
such as voice-mail, which normally contain only one speaker,
or two-way telephone conversations with two speakers, tran-
scription using ASR technology is possible since the task pri-
marily focuses on detecting silence/speech activity and then en-
gaging the recognizer appropriate for that speaker. However,
audio streams from NGSW encompass one of the widest ranges
of audio materials available today. Fig. 1 presents an overview
of the types of audio files and recording structure seen in the
audio. The types of audio include the following:
Monologs: single speaker talking spontaneously or
reading prepared/prompted text in clean conditions;
Two-Way Conversations: telephone conversations be-
tween two subjects that are spontaneous and could con-
tain periods with both talking;
Speeches: audio data where a person (e.g., politician)
is speaking to an audience; primarily one talker, but
background audience noise could be present, and room
echo or noise is possible; typically read/prepared text;

Fig. 1. Structure of i) NGSW audio recordings: speakers, microphone(s), recording media; and ii) segmentation and classification, speech recognition, and
transcript generation.
Interviews/Debates: audio streams where a person is
being interviewed for TV or radio. Debates
could include a moderator and/or various audience
participation (e.g., questions, applause, interruptions,
etc.); typically two speakers with spontaneous speech
in question/answer format;
Radio/TV News Broadcasts: includes traditional news
anchor with periods of both prompted read speech, talk
radio, spontaneous speech, background music, call-in
speakers, commercials, other background audio content
(e.g., office noise such as typewriter, etc.). Audio
content would come from TV or radio studio settings
(e.g., public radio such as NPR, or the "60 Minutes" TV
show);
Field News Broadcasts: audio content coming from
news reporters in the field (e.g., emergency or war loca-
tions, city streets, etc.); contains a wide range of back-
ground noise content of unpredictable origin. Commu-
nication channels also impact frequency content of the
audio;
Recording Media/Transmission: audio properties
can be transformed based on the type of recording
equipment used (e.g., microphones, Edison cylinder
disks, reel-to-reel tape, cassette tape, DAT, CD, etc.)
or transmission (e.g., AM, FM, voice compression
methods: CELP, MELP, ADPCM, etc.);
Meetings/Hearings: public formal inquiries (Water-
gate hearings, U.S. Supreme Court, etc.);
Debates: presidential, formal and informal
(Nixon–Kennedy, Clinton–Dole, etc.);
Historical Recordings: NASA: "walk on the moon,"
Nixon: "I'm not a crook," M. L. King: "I have a
dream," etc.
Therefore, NGSW audio content includes a diverse range of
audio formats, recording media, and time periods, with changing
names, places, topics, and choice of vocabulary. The
following issues arise for transcript generation for SDR: Do we
transcribe commercials? Do we transcribe background acoustic
noise/events? Do we identify speakers with the text? Do we
identify from where the speakers are speaking (i.e., the environ-
ment/location)? How do we deal with errors in ASR (i.e., dirty
transcripts)? Since automatic transcription for such a diverse
range of audio materials will lead to significant variability in
WER, SDR employing text-based search of such transcripts
will be an important research issue to consider. For our initial
system, we focus on transcript generation of individual speech
and disable transcription production for music/commercials.
To illustrate the range of NGSW recording conditions, three
example spectrograms are shown in Fig. 2. The recordings are
(a) Thomas Edison, "my work as an electrician" [talking about
contributions of 19th century scientists; original Edison cylinder
disk recording, 1908], (b) Thomas Watson, "as Bell was about
to speak into the new instrument" [talking about the first telephone
message from A. G. Bell on March 10, 1876; recorded
in 1926], and (c) President Bill Clinton, "tonight I stand before
you" [State of the Union Address on economic expansion, Jan.
tions present in the speech corpus. Some of these include severe
bandwidth restrictions (e.g., Edison style cylinder disks), poor
audio from scratchy, used, or aging recording media, differences
in microphone type and placement, reverberation for speeches
from public figures, recordings from telephone, radio, or TV
broadcasts, background noise including audience and multiple
speakers or interviewers, a wide range of speaking styles and
accents, etc.
As another example, we show in Fig. 3 a summary of U.S.
Presidential speeches, consisting mostly of state-of-the-union
or campaign speeches from 1893 to the present. For each presi-
dential speech, we employed the NIST speech-to-noise ratio es-
timation scheme (STNR) to identify the mean speech and noise
decibel values. As we see from this figure, the resulting digitized
speech levels are typically near 80 dB, whereas background
noise levels can vary significantly (42–78 dB). We obtained the
STNR values for each presidential speech, which ranged from
4 to 37 dB. Clearly, the estimated STNR only has meaning
if frequency content is consistent, but as we see in this figure, the
estimated frequency bandwidth for early Edison cylinder disks
is about 1–2.5 kHz, whereas recordings of today are closer to
7 kHz, with AM/FM radio bandwidths in the 5–10 kHz range
(note that while the audio format is 44.1-kHz, 16-bit data, transcript
generation uses a sample rate of 16 kHz; therefore, our
maximum bandwidth from these recordings for speech content
would have been 8 kHz). Recordings for Wilson and Hoover
were extremely noisy, with background audience and echo distortion,
as well as poor, scratchy recording equipment. In addition,
vocabulary selection varies significantly over the 110-year

Fig. 2. Example audio stream (8 kHz) spectrograms from NGSW. (a) Thomas Edison, recorded in 1908. (b) Thomas Watson, recorded in 1926. (c) President
William J. Clinton, recorded in 1999.
Fig. 3. Summary of presidential speeches (typically state-of-the-union addresses) from 1893 to present. Shown is each president, mean noise signal level (in
decibels) (top bars in each pair), mean speech signal level (in decibels) (bottom bars in each pair); approximate frequency bandwidth (BW) of each recording, with
an estimated speech-to-noise ratio (STNR) varying from 4.5–37.25 dB.
period. Clearly, the ability to achieve reliable phrase recogni-
tion search for such data is an unparalleled challenge in speech
recognition.
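As a rough illustration of how such speech and noise levels can be estimated, the sketch below derives them from frame energies, assuming the loudest frames are dominated by speech and the quietest by background noise. This is a simplified stand-in for the NIST STNR estimation scheme, not the NIST algorithm itself; the function name and the 10% quantile choice are assumptions made here for illustration.

```python
import numpy as np

def estimate_stnr(signal, fs, frame_ms=20):
    """Crude speech-to-noise estimate from frame energies.

    Assumes high-energy frames are speech and low-energy frames are
    noise (a stand-in for the NIST STNR method, not the real thing).
    """
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    # Frame power in dB, with a floor to avoid log(0) on pure silence.
    power_db = 10 * np.log10(np.mean(frames.astype(float) ** 2, axis=1) + 1e-12)
    power_db.sort()
    k = max(1, len(power_db) // 10)
    noise_db = power_db[:k].mean()    # quietest 10% of frames
    speech_db = power_db[-k:].mean()  # loudest 10% of frames
    return speech_db, noise_db, speech_db - noise_db

# Toy usage on synthetic data (low-level noise plus a louder "speech" burst):
rng = np.random.default_rng(0)
x = rng.normal(0, 0.01, 16000 * 5)
x[16000:32000] += rng.normal(0, 0.3, 16000)
print(estimate_stnr(x, fs=16000))
```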
III. COPYRIGHT ISSUES IN NGSW
When considering distribution of audio material via the
WWW, one primary logistics issue concerns copyright own-
ership. Research on watermarking digital sound is integral to
the creation of the NGSW. Most sound recordings have some
form of copyright protection under existing law. The U.S. Copyright
Law (Title 17 of the U.S. Code) explicitly protects sound
recordings made since 1978. Some famous speeches have been
heavily litigated. An example is Martin Luther King's "I Have
a Dream" speech [20].
Many rights holders are willing to make their sound record-
ings available for educational purposes, but they often require
some form of technological protection to prevent legitimate
educational copies from being used for unauthorized commer-
cial purposes. The 1998 Digital Millennium Copyright Act
(DMCA) introduced penalties for circumventing technological
protections. Many in the academic community object to these
penalties because they create a contradiction in U.S. law: many

Fig. 4. Overview of SpeechFind system architecture (http://SpeechFind.colorado.edu).
Fig. 5. (a) Automatic transcript generation (SDR). (b) Statistical information retrieval (SIR).
legal fair uses of technologically protected works can be exer-
cised only through illegal circumvention. Audio watermarking
is a desirable technological protection mechanism because it
does not abrogate fair use rights.
There is evidence that the courts consider audio watermarks
to be a legitimate form of copyright protection. The Napster
music file-sharing case, for example, mentions both the lack of
watermarking on MP3 files and the intention to include it in the
future [19]. Watermarking is therefore not a preventative.
Prevention is attractive to those who put significant capital
toward the creation of audio works and who fear the
loss of investment and future prots. However, prevention
is fundamentally inconsistent with most US copyright law,
which instead emphasizes mechanisms for redress once an in-
fringement has occurred. Watermarking facilitates redress and
represents a copyright protection technology that universities
can use without being inconsistent with their interest in and
commitment to sharing knowledge. Further treatment of copy-
right issues and fair use can be found in [7] and [21]–[24]. In
Section VI, we consider advances made in digital watermarking
for the NGSW project. For the present experimental online
SDR system, digital watermarking is employed to both protect
ownership as well as help ensure integrity of the audio content.
IV. SPEECHFIND SYSTEM OVERVIEW
Here, we present an overview of the SpeechFind system (see
Fig. 4) and describe several key modules. The system is con-
structed in two phases: i) enrollment and ii) query and retrieval.
In the enrollment phase, large audio sets are submitted for
audio segmentation, transcription generation, and metadata
construction (EAD: extended archive descriptor). Once this
phase is completed, the audio material is available through the
online audio search engine (i.e., query and retrieval phase).
The system includes the following modules: an audio spider and
transcoder, spoken document transcriber, rich transcription
database, and an online publicly accessible search engine. As
shown in the figure, the audio spider and transcoder are respon-
sible for automatically fetching available audio archives from
a range of available servers and transcoding the heterogeneous
incoming audio files into uniform 16-kHz, 16-bit linear PCM
raw audio data (note that in general, the transcoding process
is done offline prior to being available for user retrieval). In
addition, for those audio documents with metadata labels, this
module also parses the metadata and extracts relevant informa-
tion into a rich transcript database for guiding information
retrieval.
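As one concrete way to realize this transcoding step, the snippet below converts an arbitrary incoming file to the uniform 16-kHz, 16-bit linear PCM target using the ffmpeg command-line tool. The paper does not name the tool the audio spider actually used, so ffmpeg and the file names here are assumptions for illustration only.

```python
import subprocess

def transcode_to_pcm16k(src_path, dst_path):
    """Convert any input audio to 16-kHz, 16-bit mono linear PCM WAV.

    Uses ffmpeg (an assumption; the paper does not name the tool
    used by the audio spider for this step).
    """
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", src_path,        # heterogeneous incoming audio file
         "-ar", "16000",        # resample to 16 kHz
         "-ac", "1",            # mix down to a single channel
         "-sample_fmt", "s16",  # 16-bit signed linear PCM
         dst_path],
        check=True,
    )

# Hypothetical usage on a fetched archive file:
transcode_to_pcm16k("incoming/edison_1908.mp3", "pcm/edison_1908.wav")
```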
The spoken document transcriber includes two components,
namely, the audio segmenter and transcriber. The audio seg-
menter partitions audio data into manageable small segments by
detecting speaker, channel, and environmental change points.
The transcriber decodes every speech segment into text. If
human transcripts are available for any of the audio documents,
the segmenter is still applied to detect speaker, channel, and
environmental changes in a guided manner, with the decoder
being reduced to a forced aligner for each speech segment to
tag timing information for spoken words.
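The segmentation algorithm itself (CompSeg) is detailed later in the paper (Section V). Purely as an illustration of acoustic change-point detection of the kind the segmenter performs, the sketch below implements the classic single-change Bayesian information criterion (BIC) test over a window of feature vectors. It is a generic textbook formulation under simplifying assumptions (full-covariance Gaussians, one candidate change point), not the paper's CompSeg scheme.

```python
import numpy as np

def bic_change_point(X, penalty=1.0):
    """Single change-point test via the Bayesian information criterion.

    X: (n_frames, n_dims) acoustic feature matrix (e.g., MFCCs).
    Returns (best_frame, delta_bic); a positive delta_bic suggests a
    speaker/channel/environment change at best_frame. Generic textbook
    BIC, not the paper's CompSeg algorithm.
    """
    n, d = X.shape
    logdet = lambda S: np.linalg.slogdet(S + 1e-6 * np.eye(d))[1]
    full = 0.5 * n * logdet(np.cov(X.T))
    # Complexity penalty for the extra mean vector and covariance matrix.
    p = penalty * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    best_i, best_delta = -1, -np.inf
    for i in range(d + 1, n - d - 1):  # keep both halves estimable
        split = (0.5 * i * logdet(np.cov(X[:i].T))
                 + 0.5 * (n - i) * logdet(np.cov(X[i:].T)))
        delta = full - split - p
        if delta > best_delta:
            best_i, best_delta = i, delta
    return best_i, best_delta

# Toy usage: two segments with different statistics.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (200, 13)),
               rng.normal(2, 1.5, (200, 13))])
print(bic_change_point(X))  # change point should land near frame 200
```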
Fig. 5(a) shows that for the proposed SpeechFind system,
transcript generation is first performed, which requires reliable
acoustic and language models that are appropriate for the type
of audio stream and time period. After transcript generation,
Fig. 5(b) shows that three associated files are linked together,
namely i) the audio
