
Speaker association with signal-level audiovisual fusion

01 Jun 2004 - IEEE Transactions on Multimedia (IEEE) - Vol. 6, Iss. 3, pp. 406-413
TL;DR: A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence, and nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains.
Abstract: Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.

Summary (2 min read)

Introduction

  • The authors are interested in facilitating untethered and casual conversational interaction, and address the problem of how to temporally segregate the speech of multiple users interacting with a system.
  • The authors approach this problem from a signal-processing perspective, and develop a statistical measure of whether two signals come from a common source.
  • The authors make no assumptions about the content of the audio signal or the visual appearance, and use a general information-theoretic approach.
  • The core of their approach is a technique for jointly modeling audio and video variation to identify cross-modal correspondences.

III. SIGNAL-LEVEL AUDIOVISUAL ASSOCIATION

  • The authors propose an independent cause model to capture the relationship between generated signals in each individual modality.
  • For their purposes, the audio measurements will be vectors of spectral measurements.
  • Estimating the mutual information between signals is, in this sense, equivalent to computing the log-likelihood ratio statistic for the hypothesis test of (1).
  • A significant issue, and what distinguishes their approach from others, is how one models the probability density terms of (2).
  • The authors then present a probabilistic model for cross-modal signal generation, and show how audiovisual correspondences can be found by identifying components with maximal mutual information.

IV. PROBABILISTIC MODELS OF AUDIOVISUAL FUSION

  • The authors consider multimodal scenes which can be modeled probabilistically with one joint audiovisual source and distinct background interference sources for each modality.
  • The authors' purpose here is to analyze under which conditions and in what sense their methodology uncovers the underlying cause of their observations, without explicitly defining the joint source or its exact relationship to the measurements.
  • In this case the authors get the graph of Fig. 1(c), and from that graph they can extract a Markov chain which contains elements related only to the joint source.
  • Of course, the authors are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [14], that the inequalities (5)-(7) hold.
  • The implication is that fusion in such a manner discovers the underlying cause of the observations; that is, the joint density of the projected features is strongly related to the joint source and in that sense captures elements of the generative model of audio and video.

V. MAXIMALLY INFORMATIVE PROJECTIONS

  • The authors now describe a method for learning maximally informative projections.
  • Following [17], the authors use a nonparametric model of joint density for which an analytic gradient of the mutual information with respect to projection parameters is available.
  • The linear projections defined by $h_v$ and $h_a$ map A/V samples to low-dimensional features.
  • Both feature functions are vector-valued ($d$-dimensional), and the support of the output is a hyper-cube of finite volume.
  • In the experiments that follow, 150 to 300 iterations were used.

A. Capacity Control

  • In [17] early results were demonstrated using this method for the video-based localization of a speaking user.
  • To improve on the method, the authors thus introduce a capacity control mechanism in the form of a prior bias to small weights.
  • This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum.
  • It is the moving edges (lips, chin, etc.) which the authors expect to convey the most information about the audio.
  • The projection coefficients related to the audio signal, $h_a$, are solved for in a similar fashion, without the initial prewhitening step.

VI. EXPERIMENTS

  • The authors' motivating scenario for this application is a group of users interacting with an anonymous handheld device or kiosk using spoken commands.
  • Fig. 2(b) shows an image of the pixel-wise standard deviations of the image sequence.
  • Figs. 2(d) and 3(d) show the resulting learned projections $h_v$ when the alternate audio sequence is used.
  • For Fig. 2 the estimate of mutual information was 0.68 relative to the maximum possible value for the correct audio sequence.
  • Fig. 5 shows the result of tracking two users speaking in turns in front of a single camera and microphone, and detecting which is most likely to be speaking based on the measured audiovisual consistency.


Speaker Association With Signal-Level Audiovisual Fusion
John W. Fisher, III, Member, IEEE, and Trevor Darrell, Member, IEEE
Abstract—Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.

Index Terms—Audiovisual correspondence, multimodal data association, mutual information.
I. INTRODUCTION

Conversational dialog systems have become practically useful in many application domains, including travel reservations, traffic information, and database access. However, most existing conversational speech systems require tethered interaction, and work primarily for a single user. Users must wear an attached microphone or speak into a telephone handset, and do so one at a time. This limits the range of use of dialog systems, since in many applications users might expect to freely approach and interact with a device. Worse, they may wish to arrive as a group, and talk among themselves while interacting with the system. To date it has been difficult for speech recognition systems to handle such conditions, and correctly recognize the utterances intended for the device. We are interested in facilitating untethered and casual conversational interaction, and address the problem of how to temporally segregate the speech of multiple users interacting with a system.
With a single modality, properly associating speech from multiple unknown speakers is quite difficult. However, if other modalities are available they can often provide disambiguating information. In particular, visual information can be valuable for deciding whether an individual user is speaking a particular utterance. We wish to solve a conversational audiovisual correspondence problem: given sets of audio and visual signals, decide which audiovisual pairs are consistent and could have come from a single speaker. We approach this problem from a signal-processing perspective, and develop a statistical measure of whether two signals come from a common source. We make no assumptions about the content of the audio signal or the visual appearance, and use a general information-theoretic approach. Our method works without learning a specific lip or language model, and is therefore robust to a range of appearances and acoustic environments.

The core of our approach is a technique for jointly modeling audio and video variation to identify cross-modal correspondences. It is driven by the simple hypothesis of whether a region of interest in an image sequence (perhaps the entire image) is associated with a separately measured audio signal. We formulate the problem within a nonparametric hypothesis testing framework, from which information theoretic quantities naturally arise as the measure of association between two high-dimensional signals. We show how this approach can detect which user is speaking when several are facing a device and distracting motion is present. This allows the segregation of users' utterances from each other's speech, and from background noise events.

Manuscript received December 1, 2002; revised November 15, 2003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jun Ohya. The authors are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: trevor@ai.mit.edu). Digital Object Identifier 10.1109/TMM.2004.827503
II. RELATED WORK

Humans routinely perform tasks in which ambiguous auditory and visual data are combined in order to support accurate perception. In contrast, automated approaches for statistical processing of multimodal data sources lag far behind. This is primarily due to the fact that few methods adequately model the complexity of the audio/visual relationship. Classical approaches to multimodal fusion at a signal processing level often either assume a statistical relationship which is too simple (e.g., jointly Gaussian) or defer fusion to the decision level, when many of the joint (and useful) properties have been lost. While such pragmatic choices may lead to simple statistical measures, they do so at the cost of modeling capacity.

An information theoretic approach motivates fusion at the measurement level without regard to specific parametric densities. The idea of using information-theoretic principles in an adaptive framework is not new (e.g., see [1] for an overview), with many approaches suggested over the last 30 years. Critical distinctions in most information theoretic approaches lie in how densities are modeled (either explicitly or implicitly), how entropy (and by extension mutual information) is approximated or estimated, and the types of mappings which are used (e.g., linear versus nonlinear). Early approaches used a Gaussian assumption, e.g., Plumbley [2], [3] and Becker [4].
There has been substantial progress on feature-level integration of speech and vision. For example, Meier et al. [5], Stork [6], and others have built visual speech reading systems that can improve speech recognition results dramatically. Our goal is not recognition, but to be able to detect and disambiguate cases where audio and video signals are coming from different sources. Hershey and Movellan [7] addressed this problem using the per-pixel correlation relative to the energy of an audio track as a measure of their dependence. An inherent assumption of this method was that the joint statistics were Gaussian. As this is a per-pixel measure, there is no straightforward way to integrate the measure over an image region for purposes of association without making simplifying assumptions which will not hold in practice (e.g., that pixels are independent of each other conditioned on the speech signal). We should note that the objective of their work was to locate the source of an audio signal in an image sequence; association is an implicit step. A more general approach was taken by Slaney and Covell [8], which looked specifically at optimizing temporal alignment between audio and video tracks using canonical correlations, which is equivalent to the maximum mutual information projection in the jointly Gaussian case. They did not address the problem of detecting whether two signals came from the same person, although their method could be adapted to do so. Nock et al. [9] consider two mutual information approaches and one HMM-based approach for assessing face and speech consistency. The mutual information approaches compare a histogram-based estimate over vector-quantized codebooks to a Gaussian estimate over feature vectors. They report that the Gaussian method gave superior results when using a cepstral representation of the audio and a discrete cosine transform representation of the video. All three methods utilize a training corpus in order to estimate a prior model; thereafter, associations and/or likelihoods are computed under the trained model. A time-delay neural network approach was suggested in [10], demonstrating location detection for a single visual appearance on a small test set. Each of [8]-[10] requires training data in order to estimate model parameters. Here, and in contrast to the previous methods, we develop a methodology for testing audio-video association in the absence of a prior model and without the requirement of training data with which to construct one.
III. SIGNAL-LEVEL AUDIOVISUAL ASSOCIATION

We propose an independent cause model to capture the relationship between generated signals in each individual modality. Using principles from information theory and nonparametric statistics we show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences. We first show how the audiovisual association problem can be formulated as a hypothesis test, giving a relationship to mutual information based association methods (see [11] for an extensive treatment). Following that we present an information theoretic analysis of a graphical model of multimodal signal generation which gives some insight into the relationship between data association and learning a generative audiovisual model.

Given an audio-video sequence, let us denote the sequence of images (or a region within each image) as $x_{v,t}$, where $t$ indicates (discrete) time. Similarly denote audio measurements as $x_{a,t}$. For our purposes, the $x_{a,t}$ will be vectors of spectral measurements. Treating the audio and video measurements as i.i.d. samples from the random variables $X_a$ and $X_v$, respectively, allows us to cast the audiovisual association problem as a simple hypothesis test:

$$H_0: p(x_a, x_v) = p(x_a)\,p(x_v) \quad \text{versus} \quad H_1: p(x_a, x_v) \neq p(x_a)\,p(x_v) \qquad (1)$$
where $H_0$ states that the measurements are statistically independent (i.e., their joint density is expressed as a product of marginal densities) and $H_1$ states that the measurements are statistically dependent (or, equivalently, associated). Perceptual grouping problems, in which there are multiple sources of both video and audio, can be stated in a similar, albeit more complicated, fashion [12], [13]. Plugging the measurements into a (normalized) log-likelihood ratio statistic, using a consistent probability density estimator for $p(x_a, x_v)$, $p(x_a)$, $p(x_v)$, and taking the expectation with respect to the joint probability density of $X_a$ and $X_v$ yields

$$E\left[\frac{1}{N}\sum_{t=1}^{N} \log \frac{\hat p(x_{a,t}, x_{v,t})}{\hat p(x_{a,t})\,\hat p(x_{v,t})}\right] = E\left[\log \frac{p(X_a, X_v)}{p(X_a)\,p(X_v)}\right] = I(X_a; X_v) \qquad (2)\text{-}(4)$$

where $I(X_a; X_v)$ is the mutual information between the random variables $X_a$ and $X_v$. Mutual information can be expressed as a combination of the differential entropy terms $h(X_a)$, $h(X_v)$, $h(X_a, X_v)$ [14]. Consequently, estimating the mutual information between signals is, in this sense, equivalent to computing the log-likelihood ratio statistic for the hypothesis test of (1). For more complex perceptual grouping hypotheses consisting of only pairwise relationships it has been shown [12], [13] that the sufficient statistics involve pairwise mutual information estimates. We elaborate on this in the empirical section. A significant issue, and what distinguishes our approach from others, is how one models the probability density terms of (2). Another important issue, which we address later, arises when direct density estimation is infeasible, as is the case when measurements are of high dimension (e.g., audio and video measurements).
Nonparametric density estimators, such as the Parzen kernel density estimator [15], are useful for capturing complex statistical dependencies between random variables. The resulting models can then be used to measure the degree of mutual information in complex phenomena [16], which we apply to audio/visual data. This technique simultaneously learns projections of images in the video sequence and projections of sequences of periodograms taken from the audio sequence. The projections are computed adaptively such that the video and audio projections have maximum mutual information (MI).

We now review our basic method for audiovisual fusion and information theoretic adaptive methods. We then present a probabilistic model for cross-modal signal generation, and show how audiovisual correspondences can be found by identifying components with maximal mutual information. In an experiment comparing the audio and video of every combination of a group of eight users, our technique was able to perfectly match the corresponding audio and video for each user. These results are based purely on the instantaneous cross-modal mutual information between the projections of the two signals, and do not rely on any prior experience or model of users' speech or appearance.
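As a concrete illustration of (1)-(4), the following is a minimal sketch (our own construction, not the authors' code) of the association statistic for two scalar feature sequences: a Gaussian Parzen estimate [15] is plugged into the sample-average log-likelihood ratio. The function names, the bandwidth, and the toy data are assumptions for illustration only.

```python
import numpy as np

def parzen_pdf(samples, query, sigma):
    """Gaussian Parzen density estimate, evaluated at the query points.
    samples: (N, d) observations; query: (M, d) evaluation points."""
    diff = query[:, None, :] - samples[None, :, :]       # (M, N, d)
    sq = np.sum(diff ** 2, axis=-1)                      # squared distances
    d = samples.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-sq / (2.0 * sigma ** 2)).sum(axis=1) / (len(samples) * norm)

def mi_association_statistic(y_a, y_v, sigma=0.25):
    """Plug-in estimate of I(Y_a; Y_v): the sample-average log-likelihood
    ratio of H1 (dependent) versus H0 (independent), as in (2)-(4)."""
    y_a = (y_a - y_a.mean()) / y_a.std()                 # standardize so a
    y_v = (y_v - y_v.mean()) / y_v.std()                 # fixed sigma is sane
    y_a, y_v = y_a.reshape(-1, 1), y_v.reshape(-1, 1)
    joint = np.hstack([y_a, y_v])
    p_joint = parzen_pdf(joint, joint, sigma)            # resubstitution
    p_a = parzen_pdf(y_a, y_a, sigma)                    # estimates; leave-
    p_v = parzen_pdf(y_v, y_v, sigma)                    # one-out would
    return np.mean(np.log(p_joint / (p_a * p_v)))        # reduce the bias

# Toy check: a shared hidden source makes two modalities dependent.
rng = np.random.default_rng(0)
b = rng.normal(size=500)                                 # common cause
audio = b + 0.5 * rng.normal(size=500)
video = np.tanh(b) + 0.5 * rng.normal(size=500)          # nonlinear relation
noise = rng.normal(size=500)                             # unrelated signal
print(mi_association_statistic(audio, video))            # clearly positive
print(mi_association_statistic(audio, noise))            # near zero
```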
IV. PROBABILISTIC MODELS OF AUDIOVISUAL FUSION
We consider multimodal scenes which can be modeled probabilistically with one joint audiovisual source and distinct background interference sources for each modality. Each observation is a combination of information from the joint source and information from the background interferer for that channel. We use a graphical model (Fig. 1) to represent this relationship. In the diagrams, $B$ represents the joint source, while $A$ and $C$ represent single-modality background interference. Recall that the test of association is formulated as a measure of dependence between the measurements $X_1$ and $X_2$ (the video and audio measurements of the previous section). By conjecturing a latent variable structure via $B$, measurement dependence is explained solely through the hidden cause $B$. Our purpose here is to analyze under which conditions and in what sense our methodology uncovers the underlying cause of our observations without explicitly defining $B$ or its exact relationship to $X_1$ and $X_2$.

Fig. 1(a) shows an independent cause model for our typical case, where $\{A, B, C\}$ are unobserved random variables representing the causes of our (high-dimensional) observations $X_1$ and $X_2$ in each modality. In general there may be more causes and more measurements, but this simple case can be used to illustrate our algorithm. An important aspect is that the measurements have common dependence on a single cause. The joint statistical model consistent with the graph of Fig. 1(a) is

$$p(x_1, x_2, a, b, c) = p(x_1 \mid a, b)\, p(x_2 \mid b, c)\, p(a)\, p(b)\, p(c).$$
Given the independent cause model, a simple application of Bayes' rule (or the equivalent graphical manipulation) yields the graph of Fig. 1(b), which is consistent with

$$p(x_1, x_2, a, b, c) = p(x_2 \mid b, c)\, p(a, b \mid x_1)\, p(x_1)\, p(c),$$

which shows that information about $X_2$ contained in $X_1$ is conveyed through the joint statistics of $A$ and $B$. The consequence being that, in general, we cannot disambiguate the influences that $A$ and $B$ have on the measurements. A similar graph is obtained by conditioning on $X_2$. Suppose, however, that decompositions of the measurements $X_1$ and $X_2$ exist such that the following joint densities can be written:

$$p(x_1 \mid a, b) = p\big(x_1^A \mid a\big)\, p\big(x_1^B \mid b\big), \qquad p(x_2 \mid b, c) = p\big(x_2^B \mid b\big)\, p\big(x_2^C \mid c\big)$$

where $x_1 = \{x_1^A, x_1^B\}$ and $x_2 = \{x_2^B, x_2^C\}$. An example for our specific application would be segmenting the video image (or filtering the audio signal). In this case we get the graph of Fig. 1(c), and from that graph we can extract the Markov chain $X_1^B \leftrightarrow B \leftrightarrow X_2^B$, which contains elements related only to $B$. Fig. 1(d) shows equivalent graphs of the extracted Markov chain. As a consequence, there is no influence due to $A$ or $C$.

Fig. 1. Graphs illustrating the various statistical models exploited by the algorithm: (a) the independent cause model, in which $X_1$ and $X_2$ are independent of each other conditioned on $\{A, B, C\}$; (b) information about $X_2$ contained in $X_1$ is conveyed through the joint statistics of $A$ and $B$; (c) the graph implied by the existence of a separating function; and (d) two equivalent Markov chains which can be extracted from the graphs if the separating functions can be found.
Of course, we are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [14], that the following inequality holds:

$$I\big(X_1^B; X_2^B\big) \le \min\big(I(X_1^B; B),\; I(B; X_2^B)\big).$$

More importantly, these inequalities hold for any functions of $X_1^B$ and $X_2^B$ (e.g., $Y_1 = f(X_1^B)$ and $Y_2 = g(X_2^B)$). That is,

$$I(Y_1; Y_2) \le I(Y_1; B) \qquad (5)$$
$$I(Y_1; Y_2) \le I(B; Y_2) \qquad (6)$$

and finally one can show (see [12]) that

$$I(Y_1; Y_2) \le I(X_1; X_2). \qquad (7)$$

The inequalities of (5) and (6) show that by maximizing the mutual information between $Y_1$ and $Y_2$ we necessarily increase the mutual information between $Y_1$ and $B$, and between $Y_2$ and $B$. The implication is that fusion in such a manner discovers the underlying cause of the observations; that is, the joint density of $(Y_1, Y_2)$ is strongly related to $B$, and in that sense captures elements of the generative model of audio and video. Note that this is the case without ever specifying the exact form of $B$ or its relationship to the measurements. Additionally, the inequality of (7) shows that by maximizing $I(Y_1; Y_2)$ we are also maximizing a lower bound on the likelihood statistic, (3), of the association hypothesis test. Finally, with an approximation we describe shortly, we can optimize this criterion without estimating the separating function directly. In the event that a perfect decomposition does not exist, it can be shown that the method will approach a good solution in the Kullback-Leibler sense. From the perspective of information theory, estimating separate projections of the audio-video measurements which have high mutual information has intuitive appeal, as such features will be predictive of each other. An additional advantage is that the form of those statistics is not subject to the strong parametric assumptions (e.g., joint Gaussianity) which we wish to avoid.
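To make the role of the hidden cause tangible, here is a small simulation of the independent cause model (our own construction, reusing mi_association_statistic from the earlier sketch): only the measurement components carried by the joint cause $B$ are mutually informative, so a projection that isolates them scores high, as the inequalities above suggest.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
A = rng.normal(size=N)   # audio-only interference
B = rng.normal(size=N)   # joint audiovisual cause
C = rng.normal(size=N)   # video-only interference

# Each measurement mixes the joint cause with its own interferer (Fig. 1(a)):
# x1 carries components driven by B and by A; x2 carries B- and C-components.
x1 = np.stack([B + 0.3 * rng.normal(size=N),   # X1^B
               A + 0.3 * rng.normal(size=N)])  # X1^A
x2 = np.stack([B + 0.3 * rng.normal(size=N),   # X2^B
               C + 0.3 * rng.normal(size=N)])  # X2^C

# A projection acting as a "separating function" picks out the B-components
# and yields dependent features; one that keeps the interferers does not.
h_sep = np.array([1.0, 0.0])
h_bad = np.array([0.0, 1.0])
print(mi_association_statistic(h_sep @ x1, h_sep @ x2))  # large
print(mi_association_statistic(h_bad @ x1, h_bad @ x2))  # near zero
```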
V. MAXIMALLY INFORMATIVE PROJECTIONS
We now describe a method for learning maximally informative projections. The method uses a technique that maximizes the mutual information between the projections of the audiovisual measurements. Following [17], we use a nonparametric model of the joint density for which an analytic gradient of the mutual information with respect to the projection parameters is available. In principle the method may be applied to any function of the measurements which is differentiable in its parameters (e.g., as shown in [17]). Here, we restrict ourselves to linear functions of the measurements, resulting in a significant computational savings at a minimal cost to the representational power. Note that while the projections are linear, the joint density is estimated nonparametrically, allowing for more complex joint dependencies than can be captured by Gaussian assumptions. We parameterize the projections as

$$y_{v,t} = h_v^T x_{v,t} \qquad (8)$$
$$y_{a,t} = h_a^T x_{a,t} \qquad (9)$$

where $x_{v,t}$ and $x_{a,t}$ are lexicographic samples of images and periodograms, respectively, from an A/V sequence. The linear projection defined by $h_v$ and $h_a$ maps A/V samples to low-dimensional features $y_{v,t}$ and $y_{a,t}$. Treating the $y_{v,t}$s and $y_{a,t}$s as samples from random variables $Y_v$ and $Y_a$, our goal is to choose $h_v$ and $h_a$ to maximize the mutual information, $I(Y_v; Y_a)$, of the derived measurements. Mutual information for continuous random variables can be expressed in several ways as a combination of differential entropy terms [14]

$$I(Y_v; Y_a) = h(Y_v) + h(Y_a) - h(Y_v, Y_a). \qquad (10)$$
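In code, the parameterization of (8) and (9) is just an inner product between a weight vector and the lexicographically scanned frame or periodogram. A minimal sketch; the array shapes and function names are our own assumptions:

```python
import numpy as np

def project_video(frames, h_v):
    """y_v = h_v^T x_v, as in (8).  frames: (T, H, W) grayscale sequence;
    h_v: (H*W,) weight vector over lexicographically scanned pixels."""
    x_v = frames.reshape(frames.shape[0], -1)   # lexicographic samples
    return x_v @ h_v                            # (T,) scalar feature series

def project_audio(periodograms, h_a):
    """y_a = h_a^T x_a, as in (9).  periodograms: (T, F) spectral frames;
    h_a: (F,) weight vector over frequency bins."""
    return periodograms @ h_a
```

With one scalar feature per time step, the joint density needed in (10) is only two-dimensional, which is what keeps the nonparametric estimate tractable.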
Mutual information indicates the amount of information that one random variable conveys on average about another. The usual difficulty of MI as a criterion for adaptation is that it is an integral function of probability densities. Furthermore, in general we are not given the densities themselves, but samples from which they must be inferred. To overcome this problem, we replace each entropy term in (10) with a second-order Taylor-series approximation as in [16], [18]:

$$h(Y) \approx -\int_{\Omega_Y} \big(\hat p_Y(y) - u(y)\big)^2\, dy + \text{const} \qquad (11)$$

$$\hat I(Y_v; Y_a) = \hat h(Y_v) + \hat h(Y_a) - \hat h(Y_v, Y_a) \qquad (12)$$

where $\Omega_{Y_v}$ is the support of one feature output, $\Omega_{Y_a}$ is the support of the other, $u(y)$ is the uniform density over that support, and $\hat p_Y(y)$ is a Parzen density [15] estimated over the projected samples. The Parzen density estimate is defined as

$$\hat p_Y(y) = \frac{1}{N} \sum_{t=1}^{N} \kappa(y - y_t;\, \sigma) \qquad (13)$$

where $\kappa$ is a Gaussian kernel (in our case) and $\sigma$ is the standard deviation. The Parzen density estimate has the capacity to capture relationships with more complex structure than typical parametric families of densities.
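The following sketch evaluates the second-order entropy approximation of (11) for one projected feature by numerically integrating the squared difference between a Parzen estimate and the uniform density over [0, 1]; the grid-based integration and the dropped constants are our own simplifications.

```python
import numpy as np

def parzen_on_grid(samples, grid, sigma):
    """1-D Gaussian Parzen estimate evaluated on a grid."""
    diff = grid[:, None] - samples[None, :]
    k = np.exp(-diff ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return k.mean(axis=1)

def entropy_approx(samples, sigma=0.05, n_grid=256):
    """Second-order entropy approximation over the support [0, 1]:
    up to constants, h(Y) ~ -integral (p_hat(y) - u(y))^2 dy, with u the
    uniform density (u(y) = 1 on [0, 1]); a simple Riemann sum suffices."""
    grid = np.linspace(0.0, 1.0, n_grid)
    p_hat = parzen_on_grid(np.asarray(samples), grid, sigma)
    return -np.mean((p_hat - 1.0) ** 2)   # mean ~ integral over unit length

rng = np.random.default_rng(2)
print(entropy_approx(rng.uniform(size=400)))              # near 0 (maximal)
print(entropy_approx(0.5 + 0.02 * rng.normal(size=400)))  # strongly negative
```

The MI approximation (12) combines three such terms, with the joint entropy term integrated over the 2-D unit square.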
Note that this is essentially an integrated squared error comparison between the density of the projections and the uniform density (which has maximum entropy over a finite region). An advantage of this particular combination of second-order entropy approximation and nonparametric density estimator is that the gradient terms (appropriately combined to approximate mutual information as in (12)) with respect to the projection coefficients can be computed exactly by evaluating a finite number of functions at a finite number of sample locations in the output space, as shown in [16], [18]. The update term for the individual entropy terms in (12) (note the negative sign on the third term) of the $i$th feature vector at iteration $k$, as a function of the value of the feature vector at iteration $k-1$, is given by update equations (14)-(16), where $u$ denotes a sample of either $Y_v$ or $Y_a$ or their concatenation depending on which term of (12) is being computed, $\Omega = \Omega_{Y_v}$, $\Omega_{Y_a}$, or $\Omega_{Y_v} \times \Omega_{Y_a}$ depending on the entropy term, and $[u]_i$ indicates the $i$th element of $u$. Both feature functions are vector-valued ($d$-dimensional) and $\Omega$ is the support of the output (i.e., a hyper-cube of finite volume). Adaptation consists of the update rule above followed by a modified least squares solution for $h_v$ and $h_a$ until a local maximum is reached. In the experiments that follow, 150 to 300 iterations were used.
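The update equations (14)-(16) come from [16], [18]; as a stand-in, the sketch below maximizes the same Parzen-based MI approximation by numerical gradient ascent on the projection weights, with rescaling so the features stay on the unit support. This is our own simplified construction, not the authors' analytic-gradient and least-squares procedure, and it is practical only for small input dimensions.

```python
import numpy as np

def squash(y):
    """Affinely rescale a feature sequence to the unit-interval support."""
    return (y - y.min()) / (y.max() - y.min() + 1e-12)

def mi_objective(h_v, h_a, x_v, x_a, sigma=0.05, n_grid=32):
    """Approximation (12): h(Y_v) + h(Y_a) - h(Y_v, Y_a), each entropy
    replaced by the negative ISE to the uniform density on its support."""
    y_v, y_a = squash(x_v @ h_v), squash(x_a @ h_a)
    grid = np.linspace(0.0, 1.0, n_grid)

    def ent1(y):                      # 1-D entropy term
        p = np.exp(-(grid[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))
        p = p.mean(axis=1) / np.sqrt(2 * np.pi * sigma ** 2)
        return -np.mean((p - 1.0) ** 2)

    def ent2(ya, yb):                 # 2-D joint term on the unit square
        gx, gy = np.meshgrid(grid, grid)
        d2 = (gx.ravel()[:, None] - ya[None, :]) ** 2 \
           + (gy.ravel()[:, None] - yb[None, :]) ** 2
        p = np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1) / (2 * np.pi * sigma ** 2)
        return -np.mean((p - 1.0) ** 2)

    return ent1(y_v) + ent1(y_a) - ent2(y_v, y_a)

def adapt(x_v, x_a, iters=200, step=0.5, eps=1e-4, seed=0):
    """Numerical-gradient ascent; the paper instead uses the exact
    analytic gradient of [16], [18] plus a least-squares solve."""
    rng = np.random.default_rng(seed)
    h_v = rng.normal(size=x_v.shape[1])
    h_a = rng.normal(size=x_a.shape[1])
    for _ in range(iters):            # cf. the 150-300 iterations reported
        for h in (h_v, h_a):
            base = mi_objective(h_v, h_a, x_v, x_a)
            grad = np.zeros_like(h)
            for i in range(len(h)):   # finite-difference gradient
                h[i] += eps
                grad[i] = (mi_objective(h_v, h_a, x_v, x_a) - base) / eps
                h[i] -= eps
            h += step * grad
    return h_v, h_a
```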

A. Capacity Control
In [17] early results were demonstrated using this method for the video-based localization of a speaking user. However, the technique lacked robustness, as the projection coefficients were under-determined. To improve on the method, we thus introduce a capacity control mechanism in the form of a prior bias toward small weights. The method of [16] requires that the projection be differentiable, which it is in this case. The specific means of capacity control that we utilize is to impose an $L_2$ penalty on the projection coefficients of $h_v$ and $h_a$. Furthermore, we impose the criterion that if we consider the projection $h_v$ as a filter, it has low output energy when convolved with images in the sequence (on average). This constraint is the same as that proposed by Mahalanobis et al. [19] for designing optimized correlators, the difference being that in their case the projection output was designed explicitly, while in our case it is derived from the MI optimization in the output space.

The adaptation criterion, which we maximize in practice, is then a combination of the approximation to MI (11) and the regularization terms:

$$J(h_v, h_a) = \hat I(Y_v; Y_a) - \lambda_1 \|h_v\|^2 - \lambda_2 \|h_a\|^2 - \lambda_3\, h_v^T \bar{R}\, h_v \qquad (17)$$

where the last term derives from the output energy constraint and $\bar{R}$ is the average autocorrelation function (taken over all images in the sequence). This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum. The scalar weighting terms $\lambda_1$, $\lambda_2$, $\lambda_3$ were set using a data-dependent heuristic for all experiments. Note that there is a straightforward probabilistic interpretation of each of the terms, where $\hat I(Y_v; Y_a)$ relates to the hypothesis test and the remaining terms represent Gaussian priors on the coefficients of the projections (but not on the resulting projections of the measurements).
Computing the criterion can be decomposed into three stages: 1) prewhiten the images once (using the average spectrum of the images), followed by iterations of 2) updating the feature values (the $y$'s) using (14), and 3) solving for the projection coefficients using least squares and the $L_2$ penalty.

The prewhitening interpretation has intuitive appeal for the images, as it accentuates edges in the input image. It is the moving edges (lips, chin, etc.) which we expect to convey the most information about the audio. Furthermore, by including a prewhitening filter as a preprocessing step one can exclude the final term of (17), which is what we do in practice. The projection coefficients related to the audio signal, $h_a$, are solved for in a similar fashion (simultaneously), without the initial prewhitening step.
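A minimal sketch of the prewhitening step described above, under our own assumptions about array shapes: each frame's spectrum is scaled by the inverse square root of the sequence's average power spectrum, which flattens the average spectrum and accentuates edges.

```python
import numpy as np

def prewhiten_frames(frames, eps=1e-8):
    """Prewhiten an image sequence with the inverse of its average power
    spectrum (cf. [19]).  frames: (T, H, W) float array."""
    spectra = np.fft.fft2(frames, axes=(-2, -1))
    avg_power = np.mean(np.abs(spectra) ** 2, axis=0)   # average over frames
    whitener = 1.0 / np.sqrt(avg_power + eps)           # inverse magnitude
    return np.real(np.fft.ifft2(spectra * whitener, axes=(-2, -1)))
```

With this preprocessing in place, the explicit output-energy term of (17) can be dropped, as noted above.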
VI. EXPERIMENTS
Our motivating scenario for this application is a group of users interacting with an anonymous handheld device or kiosk using spoken commands. Given a received audio signal, we would like to verify whether the person speaking the command is in the field of view of the camera on the device, and if so to localize which person is speaking. Simple techniques which check only for the presence of a face (or moving face) would fail when two people were looking at their individual devices and one spoke a command. Since interaction may be anonymous, we presume no prior model of the voice or appearance of users is available to perform the verification and localization.

Fig. 2. Video sequence containing one speaker and a flickering monitor: (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, $h_v$, and (d) image of $h_v$ for incorrect audio.

Fig. 3. Video sequence containing one speaker (person on left) and one person who is randomly moving their mouth/head (but not speaking): (a) one image from the sequence, (b) pixel-wise image of standard deviations taken over the entire sequence, (c) image of the learned projection, $h_v$, and (d) image of $h_v$ for incorrect audio.

In the first experiment¹ we collected audio-video data from eight subjects. In all cases, the video data was collected at 29.97 frames per second at a resolution of 360 × 240. The audio signal was collected at 48 kHz, but only 10 kHz of frequency content was used. All subjects were asked to utter a specific phrase. This typically yielded 2-2.5 s of data. Video frames were processed as is, while the audio signal was transformed to a series of periodograms.

¹A portion of these results appears in [20].
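A minimal sketch of such a periodogram front end, aligning one spectral frame with each video frame; the window length, window type, and normalization are our own choices, not specified in the excerpt.

```python
import numpy as np

def audio_periodograms(audio, fs=48000, fps=29.97, keep_hz=10000):
    """One periodogram per video frame: window the signal around each frame
    time, take |FFT|^2, and keep only the bins below keep_hz."""
    hop = int(round(fs / fps))                 # audio samples per video frame
    win = np.hanning(2 * hop)                  # window spans two frame periods
    n_frames = (len(audio) - len(win)) // hop
    n_keep = int(len(win) * keep_hz / fs)      # FFT bins below keep_hz
    out = np.empty((n_frames, n_keep))
    for t in range(n_frames):
        seg = audio[t * hop : t * hop + len(win)] * win
        out[t] = np.abs(np.fft.rfft(seg))[:n_keep] ** 2
    return out
```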

Citations
More filters
Journal ArticleDOI
TL;DR: An analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research are presented.
Abstract: Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

706 citations

Proceedings ArticleDOI
10 Oct 2008
TL;DR: This paper deals with a novel entropy approximation method for Gaussian mixture random vectors, which is based on a component-wise Taylor-series expansion of the logarithm of a Gaussian mixture and on a splitting method of Gaussian mixture components.
Abstract: For many practical probability density representations such as for the widely used Gaussian mixture densities, an analytic evaluation of the differential entropy is not possible and thus, approximate calculations are inevitable. For this purpose, the first contribution of this paper deals with a novel entropy approximation method for Gaussian mixture random vectors, which is based on a component-wise Taylor-series expansion of the logarithm of a Gaussian mixture and on a splitting method of Gaussian mixture components. The employed order of the Taylor-series expansion and the number of components used for splitting allows balancing between accuracy and computational demand. The second contribution is the determination of meaningful and efficiently to calculate lower and upper bounds of the entropy, which can be also used for approximation purposes. In addition, a refinement method for the more important upper bound is proposed in order to approach the true entropy value.

252 citations


Cites methods from "Speaker association with signal-lev..."

  • ...In [11], [12], a deterministic approximation is developed by replacing (1) with the squared integral difference between f(~) and a uniform density....


Proceedings ArticleDOI
20 Jun 2005
TL;DR: This work presents a stable and robust algorithm which grasps dynamic audio-visual events with high spatial resolution, and derives a unique solution based on canonical correlation analysis (CCA), which effectively detects pixels that are associated with the sound, while filtering out other dynamic pixels.
Abstract: People and animals fuse auditory and visual information to obtain robust perception. A particular benefit of such cross-modal analysis is the ability to localize visual events associated with sound sources. We aim to achieve this using computer-vision aided by a single microphone. Past efforts encountered problems stemming from the huge gap between the dimensions involved and the available data. This has led to solutions suffering from low spatio-temporal resolutions. We present a rigorous analysis of the fundamental problems associated with this task. Then, we present a stable and robust algorithm which overcomes past deficiencies. It grasps dynamic audio-visual events with high spatial resolution, and derives a unique solution. The algorithm effectively detects pixels that are associated with the sound, while filtering out other dynamic pixels. It is based on canonical correlation analysis (CCA), where we remove inherent ill-posedness by exploiting the typical spatial sparsity of audio-visual events. The algorithm is simple and efficient thanks to its reliance on linear programming and is free of user-defined parameters. To quantitatively assess the performance, we devise a localization criterion. The algorithm capabilities were demonstrated in experiments, where it overcame substantial visual distractions and audio noise.

198 citations


Cites background or methods from "Speaker association with signal-lev..."

  • ...It affects methods based on MI as well [13]....


  • ...Audio-visual association can also be performed by optimizing the mutual information (MI) of modal representations [13], while trading off (2)-based regularization terms....


Journal ArticleDOI
TL;DR: A new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features is proposed.
Abstract: It is well-known that early integration (also called data fusion) is effective when the modalities are correlated, and late integration (also called decision or opinion fusion) is optimal when modalities are uncorrelated. In this paper, we propose a new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features. We also propose a method for high precision synchronization of the speech and lip features using CCA prior to the proposed fusion. Experimental results show that i) the proposed fusion strategy yields the best equal error rates (EER), which are used to quantify the performance of the fusion strategy for open-set speaker identification, and ii) precise synchronization prior to fusion improves the EER; hence, the best EER is obtained when the proposed synchronization scheme is employed together with the proposed fusion strategy. We note that the proposed fusion strategy outperforms others because the features used in the late integration are truly uncorrelated, since they are output of the CCA analysis.

188 citations


Cites methods from "Speaker association with signal-lev..."

  • ...In [10], the speaker association problem is addressed via an information theoretic method, which aims to maximize the mutual information between the projections of audiovisual measurements so as to detect the parts of video, that are highly correlated with the speech signal....


Journal ArticleDOI
TL;DR: A new method able to integrate audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone is presented.
Abstract: In the context of the automated surveillance field, automatic scene analysis and understanding systems typically consider only visual information, whereas other modalities, such as audio, are typically disregarded. This paper presents a new method able to integrate audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone. Visual information is analyzed by a standard visual background/foreground (BG/FG) modelling module, enhanced with a novelty detection stage and coupled with an audio BG/FG modelling scheme. These processes permit one to detect separate audio and visual patterns representing unusual unimodal events in a scene. The integration of audio and visual data is subsequently performed by exploiting the concept of synchrony between such events. The audio-visual (AV) association is carried out online and without need for training sequences, and is actually based on the computation of a characteristic feature called audio-video concurrence matrix, allowing one to detect and segment AV events, as well as to discriminate between them. Experimental tests involving classification and clustering of events show all the potentialities of the proposed approach, also in comparison with the results obtained by employing the single modalities and without considering the synchrony issue

181 citations

References
More filters
Book
01 Jan 1991
TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.

45,034 citations

Journal ArticleDOI
TL;DR: In this paper, the problem of the estimation of a probability density function and of determining the mode of the probability function is discussed. Only estimates which are consistent and asymptotically normal are constructed.
Abstract: Given a sequence of independent identically distributed random variables with a common probability density function, the problem of the estimation of a probability density function and of determining the mode of a probability function are discussed. Only estimates which are consistent and asymptotically normal are constructed. (Author)

10,114 citations


"Speaker association with signal-lev..." refers background in this paper

  • ...Nonparametric density estimators, such as the Parzen kernel density estimator [15], are useful for capturing complex statistical dependencies between random variables....


  • ...where $\Omega_{Y_v}$ is the support of one feature output, $\Omega_{Y_a}$ is the support of the other, $u(y)$ is the uniform density over that support, and $\hat p_Y(y)$ is a Parzen density [15] estimated over the projected...


Book
01 Jan 1959

7,235 citations

Journal ArticleDOI

1,489 citations


"Speaker association with signal-lev..." refers background in this paper

  • ...We first show how the audiovisual association problem can be formulated as a hypothesis test, giving a relationship to mutual information based association methods (see [11] for an extensive treatment)....


Journal ArticleDOI
TL;DR: The synthesis of a new category of spatial filters that produces sharp output correlation peaks with controlled peak values is considered, and these filters are referred to as minimum average correlation energy filters.
Abstract: The synthesis of a new category of spatial filters that produces sharp output correlation peaks with controlled peak values is considered. The sharp nature of the correlation peak is the major feature emphasized, since it facilitates target detection. Since these filters minimize the average correlation plane energy as the first step in filter synthesis, we refer to them as minimum average correlation energy filters. Experimental laboratory results from optical implementation of the filters are also presented and discussed.

741 citations


"Speaker association with signal-lev..." refers background or methods in this paper

  • ...This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum....


  • ...[19] for designing optimized correlators the difference being that in their case the projection output was designed explicitly while in our case it is derived from the MI...


Frequently Asked Questions (6)
Q1. What have the authors contributed in "Speaker association with signal-level audiovisual fusion" ?

In this paper, a probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. 

Computing the criterion can be decomposed into three stages: 1) prewhitening the images once (using the average spectrum of the images), followed by iterations of 2) updating the feature values (the $y$'s) using (14), and 3) solving for the projection coefficients using least squares and the $L_2$ penalty.

Nonparametric statistical density models can be used to represent complex joint densities of projected signals, and to successfully estimate mutual information. 

Using principles from information theory and nonparametric statistics the authors show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences. 

The adaptation criterion, which the authors maximize in practice, is then a combination of the approximation to MI (11) and the regularization terms of (17), where the last term derives from the output energy constraint and $\bar{R}$ is the average autocorrelation function (taken over all images in the sequence).

Mutual information for continuous random variables can be expressed in several ways as a combination of differential entropy terms [14], as in (10). Mutual information indicates the amount of information that one random variable conveys on average about another.