Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects

doi:10.1109/TMM.2012.2228476

Journal Article

Hamid Izadinia, +2 more

01 Feb 2013, IEEE Transactions on Multimedia, Vol. 15, Iss. 2, pp. 378-390

TLDR

A novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize the objects that are the dominant source of audio. The same framework is also applied to audio-video synchronization and used to aid interactive segmentation.

Abstract:

In this paper, we propose a novel method that exploits correlation between audio-visual dynamics of a video to segment and localize objects that are the dominant source of audio. Our approach consists of a two-step spatiotemporal segmentation mechanism that relies on velocity and acceleration of moving objects as visual features. Each frame of the video is segmented into regions based on motion and appearance cues using the QuickShift algorithm, which are then clustered over time using K-means, so as to obtain a spatiotemporal video segmentation. The video is represented by motion features computed over individual segments. The Mel-Frequency Cepstral Coefficients (MFCC) of the audio signal, and their first order derivatives are exploited to represent audio. The proposed framework assumes there is a non-trivial correlation between these audio features and the velocity and acceleration of the moving and sounding objects. The canonical correlation analysis (CCA) is utilized to identify the moving objects which are most correlated to the audio signal. In addition to moving-sounding object identification, the same framework is also exploited to solve the problem of audio-video synchronization, and is used to aid interactive segmentation. We evaluate the performance of our proposed method on challenging videos. Our experiments demonstrate significant increase in performance over the state-of-the-art both qualitatively and quantitatively, and validate the feasibility and superiority of our approach.
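
The identification step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the audio matrix is a random stand-in for per-frame MFCCs and their first-order derivatives, each segment matrix is a stand-in for per-frame velocity and acceleration features, and the names `first_canonical_correlation`, `rank_segments`, and `synthetic_example` are hypothetical. Under those assumptions, the sketch computes the largest canonical correlation between the audio features and each segment's motion features, then selects the segment with the highest score.

```python
import numpy as np


def first_canonical_correlation(X, Y):
    """Largest canonical correlation between X and Y.

    X and Y have one row per video frame and one column per feature.
    Assumes more frames than features and full column rank.
    """
    # Center each feature so correlations are measured around the mean.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(s[0], 0.0, 1.0))


def rank_segments(audio, segments):
    """Score every segment against the audio and return (best_index, scores)."""
    scores = [first_canonical_correlation(audio, seg) for seg in segments]
    return int(np.argmax(scores)), scores


def synthetic_example(seed=0, n_frames=200):
    """Build toy data: segment 0 moves with the audio, the others do not."""
    rng = np.random.default_rng(seed)
    # Stand-in for MFCC-style audio features (hypothetical dimensions).
    audio = rng.standard_normal((n_frames, 4))
    # Segment 0: motion is a noisy linear function of the audio features.
    mixing = rng.standard_normal((4, 2))
    sounding = audio @ mixing + 0.1 * rng.standard_normal((n_frames, 2))
    # Segments 1 and 2: motion unrelated to the audio.
    silent = [rng.standard_normal((n_frames, 2)) for _ in range(2)]
    return audio, [sounding] + silent


if __name__ == "__main__":
    audio, segments = synthetic_example()
    best, scores = rank_segments(audio, segments)
    print("scores per segment:", [round(s, 3) for s in scores])
    print("most audio-correlated segment:", best)
```

In this toy setup only the first segment's motion is driven by the audio, so it should receive the highest score. Real video would first require the spatiotemporal segmentation and feature extraction stages that the abstract describes.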

Citations

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

The Sound of Pixels

The Sound of Motions

Learning to Separate Object Sounds by Watching Unlabeled Video

Audio Surveillance: A Systematic Review

References

Atomic Decomposition by Basis Pursuit

Fundamentals of speech recognition

Relations Between Two Sets of Variates

Canonical Correlation Analysis: An Overview with Application to Learning Methods

Related Papers (5)

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Audio Vision: Using Audio-Visual Synchrony to Locate Sounds

The Sound of Pixels

Objects that Sound

Look, Listen and Learn