Speaker association with signal-level audiovisual fusion
Summary (2 min read)
Introduction
- The authors are interested in facilitating untethered, casual conversational interaction, and address the problem of how to temporally segregate the speech of multiple users interacting with a system.
- The authors approach this problem from a signal-processing perspective, and develop a statistical measure of whether two signals come from a common source.
- The authors make no assumptions about the content of the audio signal or the [...]
- Manuscript received December 1, 2002; revised November 15, 2003.
- The core of their approach is a technique for jointly modeling audio and video variation to identify cross-modal correspondences.
III. SIGNAL-LEVEL AUDIOVISUAL ASSOCIATION
- The authors propose an independent cause model to capture the relationship between generated signals in each individual modality.
- For their purposes, the audio observations will be vectors of spectral measurements.
- Estimating the mutual information between signals is, in this sense, equivalent to computing the log-likelihood ratio statistic for the hypothesis test of (1).
- A significant issue, and what distinguishes their approach from others, is how one models the probability density terms of (2).
- The authors then present a probabilistic model for cross-modal signal generation, and show how audiovisual correspondences can be found by identifying components with maximal mutual information.
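The link between the hypothesis test and mutual information noted above can be written out as a short derivation. This is a sketch of the standard argument, not a reproduction of the paper's (1) and (2): the symbols $a_i$, $v_i$, $H_0$, $H_1$ and $N$ are assumed here for illustration and may differ from the paper's notation.

```latex
\begin{align*}
H_0 &: (a_i, v_i) \sim p(a)\,p(v) && \text{(independent sources)}\\
H_1 &: (a_i, v_i) \sim p(a, v)    && \text{(common source)}\\[4pt]
\frac{1}{N}\sum_{i=1}^{N} \log\frac{p(a_i, v_i)}{p(a_i)\,p(v_i)}
  &\;\xrightarrow{\;N\to\infty\;}\;
  \mathbb{E}_{p(a,v)}\!\left[\log\frac{p(a,v)}{p(a)\,p(v)}\right]
  = I(A;V) && \text{(when } H_1 \text{ holds)}
\end{align*}
```

Under this reading, a larger mutual information estimate corresponds to stronger evidence that the audio and video samples share a source.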
IV. PROBABILISTIC MODELS OF AUDIOVISUAL FUSION
- The authors consider multimodal scenes which can be modeled probabilistically with one joint audiovisual source and distinct background interference sources for each modality.
- The authors' purpose here is to analyze under which conditions, and in what sense, their methodology uncovers the underlying cause of their observations without explicitly defining that cause or its exact relationship to the audio and video measurements.
- In this case the authors get the graph of Fig. 1(c), and from that graph they can extract the Markov chain which contains elements related only to the joint source.
- Of course, the authors are still left with the formidable task of finding a decomposition, but given the decomposition it can be shown, using the data processing inequality [14], that an inequality bounding the mutual information of the decomposed components holds.
- The implication is that fusion in such a manner discovers the underlying cause of the observations; that is, the joint density of the fused components is strongly related to that cause and in that sense captures elements of the generative model of audio and video.
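The data processing inequality used in this argument can be checked numerically on a toy example. The sketch below is not the paper's model: it assumes a hypothetical discrete Markov chain X -> Y -> Z with randomly drawn conditional tables, and only illustrates that I(X;Z) cannot exceed I(X;Y).

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal over rows
    py = joint.sum(axis=0, keepdims=True)   # marginal over columns
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)

# Hypothetical Markov chain X -> Y -> Z over small discrete alphabets.
p_x = rng.dirichlet(np.ones(4))                   # p(x)
p_y_given_x = rng.dirichlet(np.ones(5), size=4)   # rows are p(y | x)
p_z_given_y = rng.dirichlet(np.ones(3), size=5)   # rows are p(z | y)

joint_xy = p_x[:, None] * p_y_given_x             # p(x, y)
joint_xz = joint_xy @ p_z_given_y                 # p(x, z) = sum_y p(x, y) p(z | y)

i_xy = mutual_information(joint_xy)
i_xz = mutual_information(joint_xz)
print(f"I(X;Y) = {i_xy:.4f} nats, I(X;Z) = {i_xz:.4f} nats")
```

Any further processing of Y can only lose information about X, which is the property the decomposition argument relies on.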
V. MAXIMALLY INFORMATIVE PROJECTIONS
- The authors now describe a method for learning maximally informative projections.
- Following [17], the authors use a nonparametric model of joint density for which an analytic gradient of the mutual information with respect to projection parameters is available.
- The linear projections map A/V samples to low-dimensional audio and video features.
- Both feature mappings are vector-valued functions of a fixed output dimension, and the support of the output is a hyper-cube.
- The experiments that follow use 150 to 300 iterations.
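To make the idea of scoring projected features by mutual information concrete, here is a minimal sketch. It is not the authors' estimator or their gradient-based learning procedure: it assumes a plain leave-one-out Parzen (Gaussian kernel) entropy estimate, synthetic data, and fixed hypothetical projection weights `w_a` and `w_v`. It compares a matched audio/video pairing against a shuffled one.

```python
import numpy as np

def parzen_entropy(samples, bandwidth=0.25):
    """Leave-one-out Parzen (Gaussian kernel) estimate of differential entropy, in nats."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    kernel = np.exp(log_norm - sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(kernel, 0.0)              # leave each sample out of its own estimate
    density = kernel.sum(axis=1) / (n - 1)
    return float(-np.mean(np.log(np.maximum(density, 1e-300))))

def parzen_mi(fa, fv, bandwidth=0.25):
    """MI estimate from entropies: h(fa) + h(fv) - h(fa, fv)."""
    joint = np.column_stack([fa, fv])
    return (parzen_entropy(fa, bandwidth) + parzen_entropy(fv, bandwidth)
            - parzen_entropy(joint, bandwidth))

def standardize(x):
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
n = 400
source = rng.standard_normal(n)                # hypothetical common cause

# Hypothetical mixing/projection weights (assumed, not learned here).
w_a = np.linspace(0.5, 1.5, 6)
w_v = np.linspace(-1.0, 1.0, 8)
audio = np.outer(source, w_a) + 0.5 * rng.standard_normal((n, 6))
video = np.outer(source, w_v) + 0.5 * rng.standard_normal((n, 8))

fa = standardize(audio @ w_a)                  # 1-D audio feature
fv = standardize(video @ w_v)                  # 1-D video feature

mi_matched = parzen_mi(fa, fv)
mi_shuffled = parzen_mi(fa, fv[rng.permutation(n)])   # breaks the temporal pairing
print(f"matched: {mi_matched:.3f} nats, shuffled: {mi_shuffled:.3f} nats")
```

In the paper the projection weights are adapted to maximize the information criterion; here they are fixed so that only the scoring step is shown.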
A. Capacity Control
- In [17] early results were demonstrated using this method for the video-based localization of a speaking user.
- To improve on the method, the authors thus introduce a capacity control mechanism in the form of a prior bias to small weights.
- This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum.
- It is the moving edges (lips, chin, etc.) which the authors expect to convey the most information about the audio.
- The projection coefficients related to the audio signal are solved in a similar fashion without the initial prewhitening step.
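The prewhitening step described above can be sketched in the frequency domain. This is one possible reading, not the authors' code: the function name `prewhiten_frames`, the small `eps` stabilizer, and the choice to divide each frame's spectrum directly by the sequence-average power spectrum are assumptions made for illustration.

```python
import numpy as np

def prewhiten_frames(frames, eps=1e-8):
    """Filter each frame by the inverse of the sequence-average power spectrum.

    frames: array of shape (num_frames, height, width).
    eps:    assumed stabilizer that avoids division by zero.
    """
    frames = np.asarray(frames, dtype=float)
    spectra = np.fft.fft2(frames, axes=(-2, -1))          # per-frame 2-D FFT
    avg_power = np.mean(np.abs(spectra) ** 2, axis=0)     # average power spectrum
    filtered = spectra / (avg_power + eps)                # apply the inverse spectrum
    # The filter is real and symmetric, so the result is real up to rounding.
    return np.real(np.fft.ifft2(filtered, axes=(-2, -1)))

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 16, 16))                # stand-in image sequence
white = prewhiten_frames(frames)
print(white.shape)
```

Because the filter depends only on the average spectrum, it needs to be computed once per sequence rather than once per iteration.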
VI. EXPERIMENTS
- The authors' motivating scenario for this application is a group of users interacting with an anonymous handheld device or kiosk using spoken commands.
- Fig. 2(b) shows an image of the pixel-wise standard deviations of the image sequence.
- Figs. 2(d) and 3(d) show the result when the alternate audio sequence is used.
- For Fig. 2 the estimate of mutual information was 0.68 relative to the maximum possible value for the correct audio sequence.
- Fig. 5 shows the result of tracking two users speaking in turns in front of a single camera and microphone, and detecting which is most likely to be speaking based on the measured audiovisual consistency.
References
45,034 citations
10,114 citations
"Speaker association with signal-lev..." refers background in this paper
...Nonparametric density estimators, such as the Parzen kernel density estimator [15], are useful for capturing complex statistical dependencies between random variables....
[...]
...where is the support of one feature output, is the support of the other, is the uniform density over that support, and is a Parzen density [15] estimated over the projected...
[...]
1,489 citations
"Speaker association with signal-lev..." refers background in this paper
...We first show how audiovisual association problem can be formulated as a hypothesis test and giving a relationship to mutual information based association methods (see [11] for an extensive treatment)....
[...]
741 citations
"Speaker association with signal-lev..." refers background or methods in this paper
...This term is more easily computed in the frequency domain (see [19]) and is equivalent to prewhitening the images using the inverse of the average power spectrum....
[...]
...[19] for designing optimized correlators the difference being that in their case the projection output was designed explicitly while in our case it is derived from the MI...
[...]
Frequently Asked Questions (6)
Q2. What is the criterion for a prewhitening filter?
Computing the video projection coefficients can be decomposed into three stages: 1) prewhiten the images once (using the average spectrum of the images), followed by iterations of 2) updating the feature values using (14), and 3) solving for the projection coefficients using least squares and the penalty.
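The third stage, least squares with a penalty, can be illustrated with a small sketch. It assumes a plain small-weight (ridge) penalty as a stand-in for the paper's regularizer; the function name `solve_projection`, the `penalty` parameter, and the synthetic data are hypothetical.

```python
import numpy as np

def solve_projection(samples, targets, penalty=1.0):
    """Penalized least squares: minimize ||samples @ h - targets||^2 + penalty * ||h||^2."""
    samples = np.asarray(samples, dtype=float)
    targets = np.asarray(targets, dtype=float)
    gram = samples.T @ samples + penalty * np.eye(samples.shape[1])
    return np.linalg.solve(gram, samples.T @ targets)

rng = np.random.default_rng(0)
samples = rng.standard_normal((200, 10))        # stand-in for (prewhitened) observations
h_true = rng.standard_normal(10)
targets = samples @ h_true + 0.1 * rng.standard_normal(200)   # stand-in feature values

h_weak = solve_projection(samples, targets, penalty=0.01)
h_strong = solve_projection(samples, targets, penalty=100.0)
print(np.linalg.norm(h_weak), np.linalg.norm(h_strong))
```

A larger penalty shrinks the coefficient norm, which is the capacity-control effect that a prior bias toward small weights is meant to provide.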
Q3. How can nonparametric statistical density models be used to represent complex joint densities of projected?
Nonparametric statistical density models can be used to represent complex joint densities of projected signals, and to successfully estimate mutual information.
Q4. How can the authors learn the relationship between audio and video?
Using principles from information theory and nonparametric statistics the authors show how an approach for learning maximally informative joint subspaces can find cross-modal correspondences.
Q5. What is the adaptation criterion for the projections?
The adaptation criterion, which the authors maximize in practice, is then a combination of the approximation to MI (11) and the regularization terms, given as (17) in the paper. The last term of (17) derives from the output energy constraint and involves the average autocorrelation function (taken over all images in the sequence).
Q6. What is the way to estimate the mutual information of continuous random variables?
Mutual information for continuous random variables can be expressed in several ways as a combination of differential entropy terms [14]; this is (10) in the paper. Mutual information indicates the amount of information that one random variable conveys on average about another.
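For reference, the standard entropy decompositions of mutual information are written out below. These are textbook identities in generic notation ($X$, $Y$, and $h$ for differential entropy are assumed symbols); the paper's (10) states the relationship in its own notation, which may differ.

```latex
\begin{align*}
I(X;Y) &= h(X) + h(Y) - h(X,Y) \\
       &= h(X) - h(X \mid Y) \\
       &= h(Y) - h(Y \mid X)
\end{align*}
```

The first form is convenient when entropies are estimated from samples, since it only needs the two marginal densities and the joint density.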