Journal ArticleDOI
Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects
TLDR
A novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize the objects that are the dominant source of audio; the same framework is also applied to audio-video synchronization and used to aid interactive segmentation.
Abstract:
In this paper, we propose a novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize the objects that are the dominant source of audio. Our approach consists of a two-step spatiotemporal segmentation mechanism that relies on the velocity and acceleration of moving objects as visual features. Each frame of the video is first segmented into regions based on motion and appearance cues using the QuickShift algorithm; these regions are then clustered over time using K-means to obtain a spatiotemporal video segmentation. The video is represented by motion features computed over individual segments. The audio is represented by the Mel-Frequency Cepstral Coefficients (MFCC) of the audio signal and their first-order derivatives. The proposed framework assumes there is a non-trivial correlation between these audio features and the velocity and acceleration of the moving and sounding objects. Canonical correlation analysis (CCA) is used to identify the moving objects that are most correlated with the audio signal. In addition to moving-sounding object identification, the same framework is also used to solve the problem of audio-video synchronization and to aid interactive segmentation. We evaluate the performance of our proposed method on challenging videos. Our experiments demonstrate a significant increase in performance over the state of the art, both qualitatively and quantitatively, and validate the feasibility and superiority of our approach.
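The CCA-based identification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the QuickShift and K-means segmentation and the real MFCC extraction are replaced by synthetic per-frame features, and the names `first_canonical_correlation` and `add_derivative`, the regularizer `reg`, and the toy data are all assumptions made for illustration. The sketch scores each candidate segment by the largest canonical correlation between its motion features (velocity, acceleration) and the audio features, then picks the best-scoring segment.

```python
import numpy as np


def first_canonical_correlation(X, Y, reg=1e-6):
    """Largest canonical correlation between X (n x p) and Y (n x q)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance blocks (reg is an illustrative choice).
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # Whiten both views with Cholesky factors: M = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in descending order.
    s = np.linalg.svd(M, compute_uv=False)
    return float(min(s[0], 1.0))


def add_derivative(F):
    """Append a first-order temporal derivative to per-frame features."""
    return np.hstack([F, np.gradient(F, axis=0)])


rng = np.random.default_rng(0)
T = 200  # number of video frames (hypothetical)

# Stand-in for MFCCs and their first-order derivatives (synthetic).
audio = add_derivative(rng.standard_normal((T, 4)))

# Three hypothetical segments; only segment 1 moves with the audio.
velocities = [rng.standard_normal(T) for _ in range(3)]
velocities[1] = audio[:, 0] + 0.1 * rng.standard_normal(T)

scores = []
for v in velocities:
    # Motion features per segment: velocity and acceleration.
    motion = np.column_stack([v, np.gradient(v)])
    scores.append(first_canonical_correlation(audio, motion))

best = int(np.argmax(scores))
print("canonical correlations:", np.round(scores, 3))
print("most audio-correlated segment:", best)
```

Whitening each view before the SVD keeps the sketch dependency-free; a library routine such as scikit-learn's `CCA` could be substituted if that package is available.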
Citations
Proceedings Article
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
TL;DR: It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
Book ChapterDOI
The Sound of Pixels
Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, Antonio Torralba
TL;DR: PixelPlayer learns to locate the image regions that produce sounds and to separate the input sounds into a set of components representing the sound from each pixel, which can be used to adjust the volume of sound sources.
Proceedings ArticleDOI
The Sound of Motions
TL;DR: Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.
Book ChapterDOI
Learning to Separate Object Sounds by Watching Unlabeled Video
TL;DR: In this paper, a deep multi-instance multi-label learning framework is proposed to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation.
Journal ArticleDOI
Audio Surveillance: A Systematic Review
TL;DR: A general taxonomy, inspired by the more widespread video surveillance field, is proposed to systematically describe the methods covering background subtraction, event classification, object tracking, and situation analysis, highlighting the target applications of each described method and providing the reader with a systematic and schematic view.
References
Journal ArticleDOI
Atomic Decomposition by Basis Pursuit
TL;DR: Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions.
Book
Fundamentals of speech recognition
TL;DR: This book presents a meta-modelling framework for speech recognition that automates the very labor-intensive, and therefore time-consuming and expensive, process of manually modeling speech.
Book ChapterDOI
Relations Between Two Sets of Variates
TL;DR: The concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions. The correlation of the horizontal components is ordinarily discussed, whereas the complex consisting of horizontal and vertical deviations may be even more interesting.
Journal ArticleDOI
Atomic Decomposition by Basis Pursuit
TL;DR: This work gives examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution, and obtains reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.
Journal ArticleDOI
Canonical Correlation Analysis: An Overview with Application to Learning Methods
TL;DR: Presents a general method that uses kernel canonical correlation analysis to learn a semantic representation of web images and their associated text, and compares orthogonalization approaches against a standard cross-representation retrieval technique known as the generalized vector space model.