Proceedings ArticleDOI

Eigen and multimodal analysis for localizing moving sounding objects

TL;DR: This paper identifies moving objects in a video that are associated with the corresponding audio by exploiting the correlation of audio and video features using canonical correlation analysis.
Abstract: This paper identifies moving objects in a video that are associated with the corresponding audio by exploiting the correlation of audio and video features. The proposed technique correlates motion features of eigen moving objects with audio mel-frequency cepstral coefficient features using canonical correlation analysis. We propose two strategies to detect the eigen moving objects: (i) Per-frame mapped eigen moving object (PFEMO) and (ii) Temporally coherent eigen moving object (TCEMO). While PFEMO segments each frame using superpixel segmentation, TCEMO exploits supervoxel-based video segmentation to identify eigen moving objects. Qualitative (mean opinion score) and quantitative (precision, recall, area under the curve, hit ratio) analysis shows that the performance of the proposed techniques is superior to that of state-of-the-art methods.
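
As a rough illustration of the correlation step, the sketch below pairs per-object motion features with audio MFCCs through CCA and thresholds the first canonical correlation at 0.5, the value quoted in the reference excerpts further down. The motion features are synthetic placeholders, and librosa/scikit-learn stand in for whatever implementation the authors used; this is a sketch of the idea, not their pipeline.

```python
# Hypothetical sketch: correlate per-object motion features with audio MFCCs
# via CCA and keep objects whose first canonical correlation exceeds 0.5.
import numpy as np
import librosa
from sklearn.cross_decomposition import CCA

# Stand-in for the video's audio track (librosa's bundled example clip).
y, sr = librosa.load(librosa.ex("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (n_frames, 13)

def first_canonical_correlation(motion, audio):
    """First canonical correlation between a motion-feature sequence
    (n_frames, d) and the MFCC sequence (n_frames, 13)."""
    n = min(len(motion), len(audio))
    u, v = CCA(n_components=1).fit_transform(motion[:n], audio[:n])
    return abs(np.corrcoef(u[:, 0], v[:, 0])[0, 1])

# Synthetic placeholders for the motion features of candidate eigen moving
# objects; in the paper these would come from the segmented video.
rng = np.random.default_rng(0)
motion_features = {obj: rng.standard_normal((mfcc.shape[0], 6))
                   for obj in range(3)}

sounding = [obj for obj, feat in motion_features.items()
            if first_canonical_correlation(feat, mfcc) > 0.5]
print("objects passing the 0.5 threshold:", sounding)
```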
Citations
26 Jul 2012
TL;DR: Five supervoxel algorithms are studied in the context of what is considered to be a good supervoxel, namely spatiotemporal uniformity, object/region boundary detection, region compression, and parsimony, leading to conclusive evidence that the hierarchical graph-based and segmentation-by-weighted-aggregation methods perform best, and almost equally well, on nearly all the metrics.
Abstract: Supervoxel segmentation has strong potential to be incorporated into early video analysis, as superpixel segmentation has been in image analysis. However, there are many plausible supervoxel methods and little understanding as to when and where each is most appropriate. Indeed, we are not aware of a single comparative study on supervoxel segmentation. To that end, we study five supervoxel algorithms in the context of what we consider to be a good supervoxel: namely, spatiotemporal uniformity, object/region boundary detection, region compression and parsimony. For the evaluation we propose a comprehensive suite of 3D volumetric quality metrics to measure these desirable supervoxel characteristics. We use three benchmark video data sets with a variety of content types and varying amounts of human annotation. Our findings have led us to conclusive evidence that the hierarchical graph-based and segmentation by weighted aggregation methods perform best, and almost equally well, on nearly all the metrics and are the methods of choice given our proposed assumptions.
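
For concreteness, here is a minimal sketch of one such volumetric metric, 3D boundary recall, computed directly on (t, y, x) label volumes with NumPy. The study's benchmark uses a fuller metric suite, and published boundary-recall variants usually allow a small spatial tolerance around boundaries, which this exact-match version omits.

```python
# Minimal sketch of 3D boundary recall for supervoxel evaluation; the exact
# metric suite in the study is richer and typically tolerance-based.
import numpy as np

def boundary_mask(labels):
    """Mark voxels whose label differs from the next voxel along t, y, or x."""
    b = np.zeros(labels.shape, dtype=bool)
    for axis in range(3):
        d = np.diff(labels, axis=axis) != 0
        sl = [slice(None)] * 3
        sl[axis] = slice(0, -1)
        b[tuple(sl)] |= d
    return b

def boundary_recall_3d(supervoxels, ground_truth):
    """Fraction of ground-truth boundary voxels that coincide with a
    supervoxel boundary (exact match, no tolerance band)."""
    gt = boundary_mask(ground_truth)
    sv = boundary_mask(supervoxels)
    return (gt & sv).sum() / max(gt.sum(), 1)
```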

210 citations

Proceedings ArticleDOI
Feng Wang, Di Guo, Huaping Liu, Junfeng Zhou, Fuchun Sun
01 May 2019
TL;DR: A novel robotic sound-indicated visual object detection framework is established, and a two-stream weakly-supervised deep learning architecture is developed to connect the visual and audio modalities for localizing the sounding object.
Abstract: Robots are usually equipped with microphones and cameras to perceive and understand the physical world. Though visual object detection technology has achieved great success, detection in other modalities remains unsolved. In this paper, we establish a novel robotic sound-indicated visual object detection framework and develop a two-stream weakly-supervised deep learning architecture to connect the visual and audio modalities for localizing the sounding object. A dataset is constructed from AudioSet to validate the proposed method, and some promising applications are demonstrated on robotic platforms.
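
The sketch below shows, under loose assumptions, what a two-stream audio-visual localizer of this kind can look like in PyTorch: an audio embedding is compared against every spatial location of a visual feature map to produce a localization heatmap. The layer sizes, fusion choice, and class name are illustrative, not the authors' architecture.

```python
# Hypothetical PyTorch sketch of a two-stream audio-visual localizer: an audio
# embedding is matched against every location of a visual feature map to give
# a heatmap over the image. Sizes and fusion are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamLocalizer(nn.Module):
    def __init__(self, audio_dim=128, visual_channels=512, embed_dim=128):
        super().__init__()
        self.audio_fc = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.visual_proj = nn.Conv2d(visual_channels, embed_dim, kernel_size=1)

    def forward(self, audio_feat, visual_map):
        """audio_feat: (B, audio_dim); visual_map: (B, C, H, W) CNN features.
        Returns (B, H, W) cosine similarities used as a localization heatmap."""
        a = F.normalize(self.audio_fc(audio_feat), dim=1)     # (B, E)
        v = F.normalize(self.visual_proj(visual_map), dim=1)  # (B, E, H, W)
        return torch.einsum("be,behw->bhw", a, v)

# Smoke test with random tensors.
heatmap = TwoStreamLocalizer()(torch.randn(2, 128), torch.randn(2, 512, 14, 14))
print(heatmap.shape)  # torch.Size([2, 14, 14])
```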

7 citations


Cites background or methods from "Eigen and multimodal analysis for l..."

  • ...In [10], a method is proposed to associate eigen moving objects with the corresponding audio....


  • ...Motion information from visual frames is also very closely related to the audio information [9] [10]....


Journal ArticleDOI
Huaping Liu, Feng Wang, Di Guo, Xinzhu Liu, Xinyu Zhang, Fuchun Sun
TL;DR: A novel sound-induced attention framework is established for the visual object detection, and a two-stream weakly supervised deep learning architecture is developed to combine the visual and audio modalities for localizing the sounding object.
Abstract: Industrial intelligent devices are usually equipped with both microphones and cameras to perceive and understand the physical world. Though visual object detection technology has achieved great success, its combination with other sensing modalities remains unsolved. In this article, we establish a novel sound-induced attention framework for visual object detection and develop a two-stream weakly supervised deep learning architecture to combine the visual and audio modalities for localizing the sounding object. A dataset is constructed from AudioSet to validate the proposed method, and some realistic experiments are conducted to demonstrate the effectiveness of the proposed system.

2 citations


Cites background from "Eigen and multimodal analysis for l..."

  • ...In [22], eigen moving objects were proposed to be associated with the corresponding audio....


  • ...[22] indicated that audio information is quite relevant to...


Patent
16 Apr 2019
TL;DR: In this patent, a method for locating a sound source in a video, belonging to the field of cross-modal learning, is presented: a training sample video is acquired and preprocessed in a training stage, a sound-source localization neural network composed of a fully connected layer and a localization network is constructed, and the network is trained with the preprocessed training samples to obtain a trained sound-source localization network.
Abstract: The invention provides a method for locating a sound source in a video and belongs to the field of cross-modal learning. In the training stage, a training sample video is acquired and preprocessed, a sound-source localization neural network composed of a fully connected layer and a localization network is constructed, and the network is trained with the preprocessed training samples to obtain a trained sound-source localization network. In the test stage, a test video is acquired, preprocessed, and input into the trained network; a similarity score is computed and used both to synchronize the sound with the video images and to localize the sound source after synchronization, thereby handling sound-source localization in asynchronous videos. The method automatically finds the correspondence between each object in the video picture and the sound, localizes with high accuracy and precision, and has high application value.
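
A minimal sketch of the similarity-driven synchronization idea, assuming precomputed, L2-normalized audio and frame embedding sequences (all names and shapes here are hypothetical): slide one sequence against the other and keep the offset with the highest mean cosine similarity.

```python
# Hedged sketch of similarity-based audio-video synchronization; the patent's
# actual networks and similarity measure are not public, so this only shows
# the offset search over precomputed embeddings.
import numpy as np

def best_offset(audio_emb, video_emb, max_shift=30):
    """audio_emb, video_emb: (T, d) L2-normalized embeddings, one per frame.
    Returns the shift s at which audio[t + s] best matches video[t]."""
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        a = audio_emb[max(0, s):len(audio_emb) + min(0, s)]
        v = video_emb[max(0, -s):len(video_emb) + min(0, -s)]
        n = min(len(a), len(v))
        scores[s] = float((a[:n] * v[:n]).sum(axis=1).mean())
    return max(scores, key=scores.get)
```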

1 citation

References
Journal ArticleDOI
TL;DR: A new superpixel algorithm is introduced, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels; it is faster and more memory efficient than previous methods, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
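
A minimal usage sketch of SLIC via scikit-image's implementation (a common reimplementation, not the authors' code); the parameter values are illustrative.

```python
# Usage sketch: SLIC superpixels with scikit-image.
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()  # any RGB image
segments = slic(image, n_segments=200, compactness=10, start_label=1)
print(segments.max(), "superpixels")  # label map with the image's H x W shape
```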

7,849 citations


"Eigen and multimodal analysis for l..." refers methods in this paper

  • ...In the first pass, each frame is segmented into a large number of regions (superpixels) using Simple Linear Iterative Clustering (SLIC) [15]....


Book ChapterDOI
TL;DR: The concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions, as discussed by the authors; although only the correlation of the horizontal components is ordinarily discussed, the complex consisting of horizontal and vertical deviations may be even more interesting.
Abstract: Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions. Marksmen side by side firing simultaneous shots at targets, so that the deviations are in part due to independent individual errors and in part to common causes such as wind, provide a familiar introduction to the theory of correlation; but only the correlation of the horizontal components is ordinarily discussed, whereas the complex consisting of horizontal and vertical deviations may be even more interesting. The wind at two places may be compared, using both components of the velocity in each place. A fluctuating vector is thus matched at each moment with another fluctuating vector. The study of individual differences in mental and physical traits calls for a detailed study of the relations between sets of correlated variates. For example the scores on a number of mental tests may be compared with physical measurements on the same persons. The questions then arise of determining the number and nature of the independent relations of mind and body shown by these data to exist, and of extracting from the multiplicity of correlations in the system suitable characterizations of these independent relations. As another example, the inheritance of intelligence in rats might be studied by applying not one but s different mental tests to N mothers and to a daughter of each
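
A short synthetic example of canonical correlation analysis with scikit-learn, echoing Hotelling's setup of two multidimensional variates driven by a common cause (the "wind"); the data here are fabricated for illustration.

```python
# Synthetic CCA example: two multidimensional variates share a one-dimensional
# common cause, and CCA recovers a strong first canonical correlation.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 1))  # the shared cause
X = latent @ rng.standard_normal((1, 4)) + 0.5 * rng.standard_normal((500, 4))
Y = latent @ rng.standard_normal((1, 3)) + 0.5 * rng.standard_normal((500, 3))

u, v = CCA(n_components=1).fit_transform(X, Y)  # canonical variates
print(np.corrcoef(u[:, 0], v[:, 0])[0, 1])      # close to 1
```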

6,122 citations


"Eigen and multimodal analysis for l..." refers methods in this paper

  • ...CCA [21] is a method that determines the correlation between two sets of random variables of different dimension by projecting them on a common coordinate system....


  • ...A threshold of 0.5 was used to choose eigen moving objects from CCA....


  • ...Audio sources were identified by correlating Mel Frequency Cepstral Coefficients (MFCC) of the audio using CCA....


  • ...This can be attributed to two reasons (i) incorporation of eigen analysis removes the extraneous clusters and hence improves CCA....


  • ...(ii) The proposed methods, PFEMO and TCEMO, reduce the number of clusters that are correlated with the audio features when compared to [11], thus increasing the performance of CCA and ensuring higher precision and recall....


Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image, and it is shown that, although this algorithm makes greedy decisions, it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.
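
A usage sketch of this graph-based method via scikit-image's felzenszwalb implementation; the scale, sigma, and min_size values are illustrative, not from the paper.

```python
# Usage sketch: graph-based image segmentation with scikit-image.
from skimage import data
from skimage.segmentation import felzenszwalb

image = data.coffee()
labels = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
print(labels.max() + 1, "regions")
```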

5,791 citations

Journal ArticleDOI
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
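
A minimal sketch of computing mel-frequency cepstral coefficients with librosa, a standard modern implementation rather than the paper's original code; the paper itself used a set of ten coefficients computed every 6.4 ms.

```python
# Sketch: MFCC extraction with librosa on its bundled example clip.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
print(mfcc.shape)
```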

4,822 citations

Journal ArticleDOI
TL;DR: A real-time computer vision and machine learning system is described for modeling and recognizing human behaviors in a visual surveillance task, and the ability to use a priori models to accurately classify real human behaviors and interactions with no additional tuning or training is demonstrated.
Abstract: We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task. The system deals in particular with detecting when interactions between people occur and classifying the type of interaction. Examples of interesting interaction behaviors include following another person, altering one's path to meet another, and so forth. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach. We propose and compare two different state-based learning architectures, namely, HMMs and CHMMs, for modeling behaviors and interactions. Finally, a synthetic "Alife-style" training system is used to develop flexible prior models for recognizing human interactions. We demonstrate the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.
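
A hedged sketch of fitting the simpler of the two architectures, a Gaussian HMM, with hmmlearn on synthetic 2-D features; hmmlearn offers no coupled-HMM (CHMM) class, so that variant is not shown.

```python
# Sketch: Gaussian HMM on synthetic behavior features with hmmlearn.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Three synthetic "behavior" segments with different feature means.
X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in (0.0, 5.0, 10.0)])

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(X)                # one long observation sequence
states = model.predict(X)   # most likely hidden-state path (Viterbi)
print(states[:10])
```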

1,831 citations


Additional excerpts

  • ...T comprising of T frames, an eigenspace model is formed by taking X frames with X << T, that captures the stationarity across X images [14]....

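
A hedged sketch of the eigenspace idea in the excerpt above: fit PCA on X evenly spaced frames (X << T) and score frames by reconstruction error, so frames that break the learned stationarity (i.e., contain motion) stand out. The subsampling scheme and component count are assumptions, not the paper's settings.

```python
# Sketch: eigenspace model from a small subset of frames via PCA.
import numpy as np
from sklearn.decomposition import PCA

def eigenspace_model(frames, X=20, n_components=5):
    """frames: (T, H, W) grayscale video; fit PCA on X evenly spaced frames."""
    idx = np.linspace(0, len(frames) - 1, num=min(X, len(frames)), dtype=int)
    data = frames[idx].reshape(len(idx), -1)  # one flat vector per frame
    return PCA(n_components=n_components).fit(data)

def reconstruction_error(model, frame):
    """Distance of a frame from the eigenspace; large values flag motion."""
    flat = frame.reshape(1, -1)
    recon = model.inverse_transform(model.transform(flat))
    return float(np.linalg.norm(flat - recon))
```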