Proceedings ArticleDOI

Pixels that sound

TLDR
This work presents a stable and robust algorithm, based on canonical correlation analysis (CCA), that captures dynamic audio-visual events at high spatial resolution and yields a unique solution: it effectively detects pixels associated with the sound while filtering out other dynamic pixels.
Abstract
People and animals fuse auditory and visual information to obtain robust perception. A particular benefit of such cross-modal analysis is the ability to localize visual events associated with sound sources. We aim to achieve this using computer vision aided by a single microphone. Past efforts encountered problems stemming from the huge gap between the dimensions involved and the available data. This has led to solutions suffering from low spatio-temporal resolutions. We present a rigorous analysis of the fundamental problems associated with this task. Then, we present a stable and robust algorithm which overcomes past deficiencies. It grasps dynamic audio-visual events with high spatial resolution, and derives a unique solution. The algorithm effectively detects pixels that are associated with the sound, while filtering out other dynamic pixels. It is based on canonical correlation analysis (CCA), where we remove inherent ill-posedness by exploiting the typical spatial sparsity of audio-visual events. The algorithm is simple and efficient thanks to its reliance on linear programming and is free of user-defined parameters. To quantitatively assess the performance, we devise a localization criterion. The algorithm's capabilities were demonstrated in experiments, where it overcame substantial visual distractions and audio noise.
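To make the correlation step concrete, here is a minimal Python sketch. It is not the authors' implementation: it uses plain ridge-regularized CCA, whereas the paper removes the ill-posedness through a sparsity prior solved by linear programming. The audio and visual feature choices, dimensions, and regularization constant are illustrative assumptions; the magnitude of the visual canonical weights is read as a per-pixel map of association with the sound.

```python
# Minimal sketch of the CCA step only, with ridge regularization standing in
# for the paper's sparsity-based regularization (an assumption for illustration).
import numpy as np

def cca_first_pair(X, Y, reg=1e-3):
    """First canonical weight vectors for X (T x p) and Y (T x q)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / len(Xc) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / len(Yc) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / len(Xc)
    # Whiten both sides, then take the leading singular pair of the
    # whitened cross-covariance (the standard CCA solution).
    Lx = np.linalg.cholesky(np.linalg.inv(Cxx))   # Lx @ Lx.T = inv(Cxx)
    Ly = np.linalg.cholesky(np.linalg.inv(Cyy))
    U, s, Vt = np.linalg.svd(Lx.T @ Cxy @ Ly)
    return Lx @ U[:, 0], Ly @ Vt[0, :], s[0]

# Illustrative data: T frames, p audio features (e.g. bandwise energies),
# q pixels described by their temporal intensity differences.
T, p, q = 200, 12, 32 * 32
rng = np.random.default_rng(0)
audio = rng.standard_normal((T, p))
video = rng.standard_normal((T, q))
w_audio, w_video, corr = cca_first_pair(audio, video)

# |w_video| serves as a per-pixel association map with the sound.
assoc_map = np.abs(w_video).reshape(32, 32)
print(f"first canonical correlation: {corr:.3f}")
print("strongest pixel:", np.unravel_index(assoc_map.argmax(), assoc_map.shape))
```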



Citations
Proceedings ArticleDOI

Look, Listen and Learn

TL;DR: This work identifies a valuable but so far untapped source of information contained in the video itself, namely the correspondence between the visual and the audio streams, and introduces a novel "Audio-Visual Correspondence" learning task that makes use of it.
Book ChapterDOI

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

TL;DR: In this paper, the authors argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and they propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
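As a rough illustration of how such an alignment signal can be generated (hypothetical code, not the authors' pipeline; the feature extractors and the network itself are omitted), temporally aligned audio/video windows can be labeled as positives and windows whose audio is taken from a shifted position as negatives:

```python
# Hypothetical sketch: build (video_window, audio_window, label) training triples
# where label 1 means the audio is temporally aligned with the video and
# label 0 means the audio was taken from a shifted position in the same track.
import numpy as np

def make_alignment_pairs(video_feats, audio_feats, win=16, shift=32,
                         n_pairs=1000, seed=0):
    """video_feats: (T, dv), audio_feats: (T, da), assumed frame-synchronized."""
    rng = np.random.default_rng(seed)
    T = min(len(video_feats), len(audio_feats))
    pairs = []
    for _ in range(n_pairs):
        t = rng.integers(0, T - win - shift)
        v = video_feats[t:t + win]
        if rng.random() < 0.5:                       # positive: aligned audio
            a, label = audio_feats[t:t + win], 1
        else:                                        # negative: shifted audio
            a, label = audio_feats[t + shift:t + shift + win], 0
        pairs.append((v, a, label))
    return pairs

# A classifier trained on such triples learns a fused audio-visual representation
# without any manual labels.
T, dv, da = 2000, 128, 64
pairs = make_alignment_pairs(np.random.randn(T, dv), np.random.randn(T, da))
print(len(pairs), pairs[0][2])
```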
Proceedings Article

On Deep Multi-View Representation Learning

TL;DR: This work finds an advantage for correlation-based representation learning, while the best results on most tasks are obtained with the new variant, deep canonically correlated autoencoders (DCCAE).
Book ChapterDOI

Ambient Sound Provides Supervision for Visual Learning

TL;DR: This work trains a convolutional neural network to predict a statistical summary of the sound associated with a video frame, and shows that this representation is comparable to that of other state-of-the-art unsupervised learning methods.
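As a rough illustration of this kind of prediction target (an assumed stand-in, not the paper's exact statistic), the audio around a frame can be summarized by the mean and standard deviation of coarse spectral band energies:

```python
# Illustrative sketch: compute a simple statistical summary of the sound around
# a video frame (mean and standard deviation of coarse band energies).
# The exact summary used in the paper is not reproduced here.
import numpy as np

def sound_summary(waveform, n_fft=512, hop=256, n_bands=32):
    """Return a fixed-length vector summarizing band energies of `waveform`."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(waveform) - n_fft, hop):
        frames.append(np.abs(np.fft.rfft(window * waveform[start:start + n_fft])))
    spec = np.array(frames)                          # (n_frames, n_fft // 2 + 1)
    # Pool FFT bins into coarse bands, then summarize each band over time.
    bands = np.array_split(spec, n_bands, axis=1)
    energies = np.stack([b.mean(axis=1) for b in bands], axis=1)  # (n_frames, n_bands)
    return np.concatenate([energies.mean(axis=0), energies.std(axis=0)])

# A CNN is then trained to predict (a quantized version of) this vector
# from the corresponding video frame.
target = sound_summary(np.random.randn(16000))
print(target.shape)   # (64,) = n_bands means + n_bands standard deviations
```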
Book ChapterDOI

The Sound of Pixels

TL;DR: PixelPlayer learns to locate image regions that produce sound and to separate the input audio into a set of components representing the sound from each pixel, which can then be used to adjust the volume of individual sound sources.
References
Journal ArticleDOI

A theory for multiresolution signal decomposition: the wavelet representation

TL;DR: In this paper, it is shown that the difference of information between the approximation of a signal at the resolutions 2^(j+1) and 2^j (where j is an integer) can be extracted by decomposing this signal on a wavelet orthonormal basis of L^2(R^n), the vector space of measurable, square-integrable n-dimensional functions.
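For a concrete view of this decomposition, the sketch below uses the PyWavelets library (assumed available; wavelet choice and signal are arbitrary). The detail coefficients at each level carry the information difference between successive approximations, and the signal is recovered exactly from the coarsest approximation plus all detail bands:

```python
# Sketch: multiresolution wavelet decomposition of a 1-D signal.
import numpy as np
import pywt  # PyWavelets, assumed installed

t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)

# Three-level orthonormal wavelet decomposition: [cA3, cD3, cD2, cD1].
coeffs = pywt.wavedec(signal, "db4", level=3)
approx, details = coeffs[0], coeffs[1:]
print("approximation length:", len(approx))
print("detail lengths per level:", [len(d) for d in details])

# Perfect reconstruction from the approximation plus all detail bands.
reconstructed = pywt.waverec(coeffs, "db4")
print("max reconstruction error:", np.max(np.abs(reconstructed - signal)))
```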
Book

Information Theory

Robert B. Ash
Journal ArticleDOI

Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization

TL;DR: This article obtains parallel results in a more general setting, where the dictionary D can arise from two or several bases, frames, or even less structured systems, and sketches three applications: separating linear features from planar ones in 3D data, noncooperative multiuser encoding, and identification of over-complete independent component models.
Journal ArticleDOI

For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution

TL;DR: In this article, the authors consider linear equations y = Φx, where y is a given vector in ℝ^n and Φ is an n × m matrix with n < m, and show that there exists ρ > 0 such that, for large n and for all Φ except a negligible fraction, whenever y admits a representation y = Φx0 with fewer than ρ·n nonzeros, the solution x1 of the ℓ1-minimization problem is unique and equal to x0.
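The ℓ1-minimization in question, min ||x||_1 subject to Φx = y, reduces to a linear program by splitting x into nonnegative parts, x = u - v. A small SciPy sketch on synthetic data (problem sizes and data are chosen arbitrarily for illustration):

```python
# Sketch: basis pursuit, min ||x||_1 subject to Phi @ x = y, solved as a linear
# program by splitting x = u - v with u, v >= 0. Data here are synthetic.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 30, 80, 4                      # underdetermined: n equations, m unknowns
Phi = rng.standard_normal((n, m))
x0 = np.zeros(m)
x0[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)   # sparse truth
y = Phi @ x0

# Variables z = [u; v], objective sum(u) + sum(v), constraint Phi @ u - Phi @ v = y.
c = np.ones(2 * m)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x1 = res.x[:m] - res.x[m:]

# With a sufficiently sparse x0, the LP solution typically recovers it exactly.
print("recovered the sparse solution:", np.allclose(x1, x0, atol=1e-6))
```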
Journal ArticleDOI

Kernel independent component analysis

TL;DR: A class of algorithms for independent component analysis is presented that uses contrast functions based on canonical correlations in a reproducing kernel Hilbert space; these algorithms are shown to outperform many presently known algorithms.
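A minimal sketch of the kind of contrast involved: the first regularized kernel canonical correlation between two variables under an RBF kernel. The ICA search over demixing matrices is omitted, and the kernel, bandwidth, and regularization constant are illustrative assumptions rather than the paper's choices:

```python
# Sketch: first canonical correlation between two variables in an RBF-kernel
# feature space (regularized kernel CCA). Kernel ICA minimizes contrasts built
# from such correlations over candidate demixing matrices (not done here).
import numpy as np

def centered_rbf_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_first_correlation(x, y, sigma=1.0, reg=1e-1):
    n = len(x)
    Kx = centered_rbf_gram(x, sigma)
    Ky = centered_rbf_gram(y, sigma)
    # Regularized KCCA: largest singular value of
    # inv(Kx + reg*I) @ Kx @ Ky @ inv(Ky + reg*I).
    Mx = np.linalg.solve(Kx + reg * np.eye(n), Kx)
    My = np.linalg.solve(Ky + reg * np.eye(n), Ky)
    return np.linalg.svd(Mx @ My.T, compute_uv=False)[0]

rng = np.random.default_rng(0)
s = rng.standard_normal(300)
independent = kcca_first_correlation(s, rng.standard_normal(300))
dependent = kcca_first_correlation(s, s ** 2 + 0.1 * rng.standard_normal(300))
# Dependent pairs typically yield a noticeably larger kernel correlation.
print(f"independent: {independent:.3f}  dependent: {dependent:.3f}")
```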