Journal Article

Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects

TLDR
A novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize the objects that are the dominant source of audio. The same framework is also applied to audio-video synchronization and used to aid interactive segmentation.
Abstract
In this paper, we propose a novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize objects that are the dominant source of audio. Our approach consists of a two-step spatiotemporal segmentation mechanism that relies on the velocity and acceleration of moving objects as visual features. Each frame of the video is segmented into regions based on motion and appearance cues using the QuickShift algorithm, and these regions are then clustered over time using K-means to obtain a spatiotemporal video segmentation. The video is represented by motion features computed over individual segments. The Mel-Frequency Cepstral Coefficients (MFCC) of the audio signal and their first-order derivatives are used to represent the audio. The proposed framework assumes there is a non-trivial correlation between these audio features and the velocity and acceleration of the moving and sounding objects. Canonical correlation analysis (CCA) is used to identify the moving objects that are most correlated with the audio signal. In addition to moving-sounding object identification, the same framework is also used to solve the problem of audio-video synchronization and to aid interactive segmentation. We evaluate the performance of our proposed method on challenging videos. Our experiments demonstrate a significant increase in performance over the state of the art, both qualitatively and quantitatively, and validate the feasibility and superiority of our approach.
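
The CCA step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses synthetic data in place of real MFCCs and segment motion, assumes `np.gradient` as the first-order derivative, and all function names (`first_canonical_correlation`, `with_derivative`, `most_correlated_segment`, `estimate_lag`) and the ridge term `reg` are hypothetical choices made for illustration. It scores each segment by its largest canonical correlation with the audio features, then reuses the same score to estimate an audio-video offset.

```python
import numpy as np


def first_canonical_correlation(X, Y, reg=1e-6):
    """Largest canonical correlation between views X (T x p) and Y (T x q)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariances (reg is an assumed small ridge term for stability).
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten both views; the singular values of the whitened cross-covariance
    # are the canonical correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    return float(np.linalg.svd(M, compute_uv=False)[0])


def with_derivative(F):
    """Stack per-frame features with their first-order temporal derivative."""
    return np.hstack([F, np.gradient(F, axis=0)])


def most_correlated_segment(segment_velocities, audio_feats):
    """Index of the segment whose velocity/acceleration best matches the audio."""
    scores = [first_canonical_correlation(with_derivative(v), audio_feats)
              for v in segment_velocities]
    return int(np.argmax(scores)), scores


def estimate_lag(motion_feats, audio_feats, max_lag):
    """Lag (in frames) such that motion[t] best matches audio[t - lag]."""
    T = motion_feats.shape[0]
    best_lag, best_score = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            m, a = motion_feats[lag:], audio_feats[:T - lag]
        else:
            m, a = motion_feats[:T + lag], audio_feats[-lag:]
        score = first_canonical_correlation(m, a)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic stand-ins (assumptions): random "MFCC-like" coefficients and
# per-segment 2-D velocities, one of which is driven by the audio.
rng = np.random.default_rng(0)
T = 200
coeffs = rng.standard_normal((T, 4))
audio_feats = with_derivative(coeffs)
W = np.array([[1.0, 0.5], [-0.3, 0.8]])
sounding = coeffs[:, :2] @ W + 0.1 * rng.standard_normal((T, 2))
silent_a = rng.standard_normal((T, 2))
silent_b = rng.standard_normal((T, 2))

best, scores = most_correlated_segment([silent_a, sounding, silent_b], audio_feats)

# Delay the sounding segment by 5 frames and recover the offset.
delayed = np.roll(sounding, 5, axis=0)
lag = estimate_lag(with_derivative(delayed), audio_feats, max_lag=10)
print(best, lag)
```

On this toy data the audio-driven segment (index 1) receives the highest score and the 5-frame delay is recovered. With real video, the velocities would come from the spatiotemporal segments and the audio features from an MFCC extractor, neither of which is modeled here.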


Citations
Proceedings Article

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

TL;DR: It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
Book Chapter

The Sound of Pixels

TL;DR: PixelPlayer learns to locate the image regions that produce sounds and to separate the input sounds into a set of components representing the sound from each pixel, which can be used to adjust the volume of individual sound sources.
Proceedings Article

The Sound of Motions

TL;DR: Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.
Book Chapter

Learning to Separate Object Sounds by Watching Unlabeled Video

TL;DR: In this paper, a deep multi-instance multi-label learning framework is proposed to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation.
Journal Article

Audio Surveillance: A Systematic Review

TL;DR: A general taxonomy, inspired by the more widespread video surveillance field, is proposed to systematically describe the methods covering background subtraction, event classification, object tracking, and situation analysis, highlighting the target applications of each described method and providing the reader with a systematic and schematic view.
References
Journal Article

Atomic Decomposition by Basis Pursuit

TL;DR: Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions.
Book

Fundamentals of speech recognition

TL;DR: This book presents a meta-modelling framework for speech recognition that automates the labor-intensive, time-consuming, and therefore expensive process of manually modeling speech.
Book Chapter

Relations Between Two Sets of Variates

TL;DR: The concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions. For example, only the correlation of the horizontal components is ordinarily discussed, whereas the complex consisting of horizontal and vertical deviations may be even more interesting.
Journal Article

Atomic Decomposition by Basis Pursuit

TL;DR: This work gives examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution, and obtains reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.
Journal Article

Canonical Correlation Analysis: An Overview with Application to Learning Methods

TL;DR: A general method is presented that uses kernel canonical correlation analysis to learn a semantic representation of web images and their associated text, and orthogonalization approaches are compared against a standard cross-representation retrieval technique known as the generalized vector space model.