Proceedings ArticleDOI

Eigen and multimodal analysis for localizing moving sounding objects

TL;DR: This paper identifies moving objects in a video that are associated with the corresponding audio by exploiting the correlation of audio and video features using canonical correlation analysis.
Abstract: This paper identifies moving objects in a video that are associated with the corresponding audio by exploiting the correlation of audio and video features. The proposed technique correlates motion features of eigen moving objects with audio mel-frequency cepstral coefficient features using canonical correlation analysis. We propose two strategies to detect the eigen moving objects: (i) Per-frame mapped eigen moving object (PFEMO) and (ii) Temporally coherent eigen moving object (TCEMO). While PFEMO segments each frame using superpixel segmentation, TCEMO exploits supervoxel-based video segmentation to identify eigen moving objects. Qualitative (mean opinion score) and quantitative (precision, recall, area under the curve, hit ratio) analysis shows that the performance of the proposed techniques is superior to that of state-of-the-art methods.
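
As a rough illustration of the correlation step, the sketch below pairs per-object motion features with audio MFCCs through CCA and thresholds the first canonical correlation at 0.5, the value quoted in the reference excerpts further down. The motion features are synthetic placeholders, and librosa/scikit-learn stand in for whatever implementation the authors used; this is a sketch of the idea, not their pipeline.

```python
# Hypothetical sketch: correlate per-object motion features with audio MFCCs
# via CCA and keep objects whose first canonical correlation exceeds 0.5.
import numpy as np
import librosa
from sklearn.cross_decomposition import CCA

# Stand-in for the video's audio track (librosa's bundled example clip).
y, sr = librosa.load(librosa.ex("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (n_frames, 13)

def first_canonical_correlation(motion, audio):
    """First canonical correlation between a motion-feature sequence
    (n_frames, d) and the MFCC sequence (n_frames, 13)."""
    n = min(len(motion), len(audio))
    u, v = CCA(n_components=1).fit_transform(motion[:n], audio[:n])
    return abs(np.corrcoef(u[:, 0], v[:, 0])[0, 1])

# Synthetic placeholders for the motion features of candidate eigen moving
# objects; in the paper these would come from the segmented video.
rng = np.random.default_rng(0)
motion_features = {obj: rng.standard_normal((mfcc.shape[0], 6))
                   for obj in range(3)}

sounding = [obj for obj, feat in motion_features.items()
            if first_canonical_correlation(feat, mfcc) > 0.5]
print("objects passing the 0.5 threshold:", sounding)
```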
Citations
26 Jul 2012
TL;DR: Five supervoxel algorithms are studied in the context of what is considered to be a good supervoxel, namely spatiotemporal uniformity, object/region boundary detection, region compression, and parsimony, leading to conclusive evidence that the hierarchical graph-based and segmentation-by-weighted-aggregation methods perform best, and almost equally well, on nearly all the metrics.
Abstract: Supervoxel segmentation has strong potential to be incorporated into early video analysis, as superpixel segmentation has been in image analysis. However, there are many plausible supervoxel methods and little understanding as to when and where each is most appropriate. Indeed, we are not aware of a single comparative study on supervoxel segmentation. To that end, we study five supervoxel algorithms in the context of what we consider to be a good supervoxel: namely, spatiotemporal uniformity, object/region boundary detection, region compression and parsimony. For the evaluation we propose a comprehensive suite of 3D volumetric quality metrics to measure these desirable supervoxel characteristics. We use three benchmark video data sets with a variety of content types and varying amounts of human annotation. Our findings have led us to conclusive evidence that the hierarchical graph-based and segmentation by weighted aggregation methods perform best, and almost equally well, on nearly all the metrics and are the methods of choice given our proposed assumptions.
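
For concreteness, here is a minimal sketch of one such volumetric metric, 3D boundary recall, computed directly on (t, y, x) label volumes with NumPy. The study's benchmark uses a fuller metric suite, and published boundary-recall variants usually allow a small spatial tolerance around boundaries, which this exact-match version omits.

```python
# Minimal sketch of 3D boundary recall for supervoxel evaluation; the exact
# metric suite in the study is richer and typically tolerance-based.
import numpy as np

def boundary_mask(labels):
    """Mark voxels whose label differs from the next voxel along t, y, or x."""
    b = np.zeros(labels.shape, dtype=bool)
    for axis in range(3):
        d = np.diff(labels, axis=axis) != 0
        sl = [slice(None)] * 3
        sl[axis] = slice(0, -1)
        b[tuple(sl)] |= d
    return b

def boundary_recall_3d(supervoxels, ground_truth):
    """Fraction of ground-truth boundary voxels that coincide with a
    supervoxel boundary (exact match, no tolerance band)."""
    gt = boundary_mask(ground_truth)
    sv = boundary_mask(supervoxels)
    return (gt & sv).sum() / max(gt.sum(), 1)
```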

210 citations

Proceedings ArticleDOI
Feng Wang, Di Guo, Huaping Liu, Junfeng Zhou, Fuchun Sun
01 May 2019
TL;DR: A novel robotic sound-indicated visual object detection framework is established, and a two-stream weakly-supervised deep learning architecture is developed to connect the visual and audio modalities for localizing the sounding object.
Abstract: Robots are usually equipped with microphones and cameras to perceive and understand the physical world. Though visual object detection technology has achieved great success, detection in other modalities remains unsolved. In this paper, we establish a novel robotic sound-indicated visual object detection framework and develop a two-stream weakly-supervised deep learning architecture to connect the visual and audio modalities for localizing the sounding object. A dataset is constructed from AudioSet to validate the proposed method, and some promising applications are demonstrated on robotic platforms.
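
The sketch below shows, under loose assumptions, what a two-stream audio-visual localizer of this kind can look like in PyTorch: an audio embedding is compared against every spatial location of a visual feature map to produce a localization heatmap. The layer sizes, fusion choice, and class name are illustrative, not the authors' architecture.

```python
# Hypothetical PyTorch sketch of a two-stream audio-visual localizer: an audio
# embedding is matched against every location of a visual feature map to give
# a heatmap over the image. Sizes and fusion are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamLocalizer(nn.Module):
    def __init__(self, audio_dim=128, visual_channels=512, embed_dim=128):
        super().__init__()
        self.audio_fc = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.visual_proj = nn.Conv2d(visual_channels, embed_dim, kernel_size=1)

    def forward(self, audio_feat, visual_map):
        """audio_feat: (B, audio_dim); visual_map: (B, C, H, W) CNN features.
        Returns (B, H, W) cosine similarities used as a localization heatmap."""
        a = F.normalize(self.audio_fc(audio_feat), dim=1)     # (B, E)
        v = F.normalize(self.visual_proj(visual_map), dim=1)  # (B, E, H, W)
        return torch.einsum("be,behw->bhw", a, v)

# Smoke test with random tensors.
heatmap = TwoStreamLocalizer()(torch.randn(2, 128), torch.randn(2, 512, 14, 14))
print(heatmap.shape)  # torch.Size([2, 14, 14])
```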

7 citations


Cites background or methods from "Eigen and multimodal analysis for l..."

  • ...In [10], a method is proposed to associate eigen moving objects with the corresponding audio....


  • ...Motion information from visual frames is also very closely related to the audio information [9] [10]....


Journal ArticleDOI
Huaping Liu, Feng Wang, Di Guo, Xinzhu Liu, Xinyu Zhang, Fuchun Sun
TL;DR: A novel sound-induced attention framework is established for the visual object detection, and a two-stream weakly supervised deep learning architecture is developed to combine the visual and audio modalities for localizing the sounding object.
Abstract: Industrial intelligent devices are usually equipped with both microphones and cameras to perceive and understand the physical world. Though visual object detection technology has achieved great success, its combination with other sensing modalities remains unsolved. In this article, we establish a novel sound-induced attention framework for visual object detection and develop a two-stream weakly supervised deep learning architecture to combine the visual and audio modalities for localizing the sounding object. A dataset is constructed from AudioSet to validate the proposed method, and some realistic experiments are conducted to demonstrate the effectiveness of the proposed system.

2 citations


Cites background from "Eigen and multimodal analysis for l..."

  • ...In [22], eigen moving objects were proposed to be associated with the corresponding audio....


  • ...[22] indicated that audio information is quite relevant to...


Patent
16 Apr 2019
TL;DR: In this patent, a method for locating a sound source in a video, belonging to the field of cross-modal learning, is presented: a training sample video is acquired and preprocessed in a training stage, a sound-source localization neural network composed of a fully connected layer and a localization network is constructed, and the network is trained with the preprocessed training samples to obtain a trained sound-source localization network.
Abstract: The invention provides a method for locating a sound source in a video and belongs to the field of cross-modal learning. In the training stage, a training sample video is acquired and preprocessed, a sound-source localization neural network composed of a fully connected layer and a localization network is constructed, and the network is trained with the preprocessed training samples to obtain a trained sound-source localization network. In the test stage, a test video is acquired, preprocessed, and input into the trained network; a similarity score is computed and used both to synchronize the sound with the video images and to localize the sound source after synchronization, thereby handling sound-source localization in asynchronous videos. The method automatically finds the correspondence between each object in the video picture and the sound, localizes with high accuracy and precision, and has high application value.
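
A minimal sketch of the similarity-driven synchronization idea, assuming precomputed, L2-normalized audio and frame embedding sequences (all names and shapes here are hypothetical): slide one sequence against the other and keep the offset with the highest mean cosine similarity.

```python
# Hedged sketch of similarity-based audio-video synchronization; the patent's
# actual networks and similarity measure are not public, so this only shows
# the offset search over precomputed embeddings.
import numpy as np

def best_offset(audio_emb, video_emb, max_shift=30):
    """audio_emb, video_emb: (T, d) L2-normalized embeddings, one per frame.
    Returns the shift s at which audio[t + s] best matches video[t]."""
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        a = audio_emb[max(0, s):len(audio_emb) + min(0, s)]
        v = video_emb[max(0, -s):len(video_emb) + min(0, -s)]
        n = min(len(a), len(v))
        scores[s] = float((a[:n] * v[:n]).sum(axis=1).mean())
    return max(scores, key=scores.get)
```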

1 citation

References
Journal ArticleDOI
TL;DR: A new superpixel algorithm is introduced, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels; it is faster and more memory efficient than previous methods, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
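
A minimal usage sketch of SLIC via scikit-image's implementation (a common reimplementation, not the authors' code); the parameter values are illustrative.

```python
# Usage sketch: SLIC superpixels with scikit-image.
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()  # any RGB image
segments = slic(image, n_segments=200, compactness=10, start_label=1)
print(segments.max(), "superpixels")  # label map with the image's H x W shape
```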

7,849 citations


"Eigen and multimodal analysis for l..." refers methods in this paper

  • ...In the first pass, each frame is segmented into a large number of regions (superpixels) using Simple Linear Iterative Clustering (SLIC) [15]....


Book ChapterDOI
TL;DR: The concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions, as discussed by the authors; although only the correlation of the horizontal components is ordinarily discussed, the complex consisting of horizontal and vertical deviations may be even more interesting.
Abstract: Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions. Marksmen side by side firing simultaneous shots at targets, so that the deviations are in part due to independent individual errors and in part to common causes such as wind, provide a familiar introduction to the theory of correlation; but only the correlation of the horizontal components is ordinarily discussed, whereas the complex consisting of horizontal and vertical deviations may be even more interesting. The wind at two places may be compared, using both components of the velocity in each place. A fluctuating vector is thus matched at each moment with another fluctuating vector. The study of individual differences in mental and physical traits calls for a detailed study of the relations between sets of correlated variates. For example the scores on a number of mental tests may be compared with physical measurements on the same persons. The questions then arise of determining the number and nature of the independent relations of mind and body shown by these data to exist, and of extracting from the multiplicity of correlations in the system suitable characterizations of these independent relations. As another example, the inheritance of intelligence in rats might be studied by applying not one but s different mental tests to N mothers and to a daughter of each
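
A short synthetic example of canonical correlation analysis with scikit-learn, echoing Hotelling's setup of two multidimensional variates driven by a common cause (the "wind"); the data here are fabricated for illustration.

```python
# Synthetic CCA example: two multidimensional variates share a one-dimensional
# common cause, and CCA recovers a strong first canonical correlation.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 1))  # the shared cause
X = latent @ rng.standard_normal((1, 4)) + 0.5 * rng.standard_normal((500, 4))
Y = latent @ rng.standard_normal((1, 3)) + 0.5 * rng.standard_normal((500, 3))

u, v = CCA(n_components=1).fit_transform(X, Y)  # canonical variates
print(np.corrcoef(u[:, 0], v[:, 0])[0, 1])      # close to 1
```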

6,122 citations


"Eigen and multimodal analysis for l..." refers methods in this paper

  • ...CCA [21] is a method that determines the correlation between two sets of random variables of different dimension by projecting them on a common coordinate system....


  • ...A threshold of 0.5 was used to choose eigen moving objects from CCA....


  • ...Audio sources were identified by correlating Mel Frequency Cepstral Coefficients (MFCC) of the audio using CCA....


  • ...This can be attributed to two reasons (i) incorporation of eigen analysis removes the extraneous clusters and hence improves CCA....


  • ...(ii) The proposed methods, PFEMO and TCEMO, reduce the number of clusters that are correlated with the audio features when compared to [11], thus increasing the performance of CCA and ensuring higher precision and recall....


Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image, and it is shown that, although this algorithm makes greedy decisions, it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.
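
A usage sketch of this graph-based method via scikit-image's felzenszwalb implementation; the scale, sigma, and min_size values are illustrative, not from the paper.

```python
# Usage sketch: graph-based image segmentation with scikit-image.
from skimage import data
from skimage.segmentation import felzenszwalb

image = data.coffee()
labels = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
print(labels.max() + 1, "regions")
```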

5,791 citations

Journal ArticleDOI
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
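
A minimal sketch of computing mel-frequency cepstral coefficients with librosa, a standard modern implementation rather than the paper's original code; the paper itself used a set of ten coefficients computed every 6.4 ms.

```python
# Sketch: MFCC extraction with librosa on its bundled example clip.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
print(mfcc.shape)
```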

4,822 citations

Journal ArticleDOI
TL;DR: A real-time computer vision and machine learning system is described for modeling and recognizing human behaviors in a visual surveillance task, and the ability to use a priori models to accurately classify real human behaviors and interactions with no additional tuning or training is demonstrated.
Abstract: We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task. The system deals in particular with detecting when interactions between people occur and classifying the type of interaction. Examples of interesting interaction behaviors include following another person, altering one's path to meet another, and so forth. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach. We propose and compare two different state-based learning architectures, namely, HMMs and CHMMs, for modeling behaviors and interactions. Finally, a synthetic "Alife-style" training system is used to develop flexible prior models for recognizing human interactions. We demonstrate the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.
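
A hedged sketch of fitting the simpler of the two architectures, a Gaussian HMM, with hmmlearn on synthetic 2-D features; hmmlearn offers no coupled-HMM (CHMM) class, so that variant is not shown.

```python
# Sketch: Gaussian HMM on synthetic behavior features with hmmlearn.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Three synthetic "behavior" segments with different feature means.
X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in (0.0, 5.0, 10.0)])

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(X)                # one long observation sequence
states = model.predict(X)   # most likely hidden-state path (Viterbi)
print(states[:10])
```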

1,831 citations


Additional excerpts

  • ...T comprising of T frames, an eigenspace model is formed by taking X frames with X << T, that captures the stationarity across X images [14]....

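
A hedged sketch of the eigenspace idea in the excerpt above: fit PCA on X evenly spaced frames (X << T) and score frames by reconstruction error, so frames that break the learned stationarity (i.e., contain motion) stand out. The subsampling scheme and component count are assumptions, not the paper's settings.

```python
# Sketch: eigenspace model from a small subset of frames via PCA.
import numpy as np
from sklearn.decomposition import PCA

def eigenspace_model(frames, X=20, n_components=5):
    """frames: (T, H, W) grayscale video; fit PCA on X evenly spaced frames."""
    idx = np.linspace(0, len(frames) - 1, num=min(X, len(frames)), dtype=int)
    data = frames[idx].reshape(len(idx), -1)  # one flat vector per frame
    return PCA(n_components=n_components).fit(data)

def reconstruction_error(model, frame):
    """Distance of a frame from the eigenspace; large values flag motion."""
    flat = frame.reshape(1, -1)
    recon = model.inverse_transform(model.transform(flat))
    return float(np.linalg.norm(flat - recon))
```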