Proceedings ArticleDOI

Pixels that sound

TLDR
This work presents a stable and robust algorithm, based on canonical correlation analysis (CCA), that captures dynamic audio-visual events at high spatial resolution and yields a unique solution: it effectively detects pixels associated with the sound while filtering out other dynamic pixels.
Abstract
People and animals fuse auditory and visual information to obtain robust perception. A particular benefit of such cross-modal analysis is the ability to localize visual events associated with sound sources. We aim to achieve this using computer vision aided by a single microphone. Past efforts encountered problems stemming from the huge gap between the dimensions involved and the available data. This has led to solutions suffering from low spatio-temporal resolutions. We present a rigorous analysis of the fundamental problems associated with this task. Then, we present a stable and robust algorithm which overcomes past deficiencies. It grasps dynamic audio-visual events with high spatial resolution, and derives a unique solution. The algorithm effectively detects pixels that are associated with the sound, while filtering out other dynamic pixels. It is based on canonical correlation analysis (CCA), where we remove inherent ill-posedness by exploiting the typical spatial sparsity of audio-visual events. The algorithm is simple and efficient thanks to its reliance on linear programming and is free of user-defined parameters. To quantitatively assess the performance, we devise a localization criterion. The algorithm's capabilities were demonstrated in experiments, where it overcame substantial visual distractions and audio noise.
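To make the correlation step concrete, here is a minimal Python sketch. It is not the authors' implementation: it uses plain ridge-regularized CCA, whereas the paper removes the ill-posedness through a sparsity prior solved by linear programming. The audio and visual feature choices, dimensions, and regularization constant are illustrative assumptions; the magnitude of the visual canonical weights is read as a per-pixel map of association with the sound.

```python
# Minimal sketch of the CCA step only, with ridge regularization standing in
# for the paper's sparsity-based regularization (an assumption for illustration).
import numpy as np

def cca_first_pair(X, Y, reg=1e-3):
    """First canonical weight vectors for X (T x p) and Y (T x q)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / len(Xc) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / len(Yc) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / len(Xc)
    # Whiten both sides, then take the leading singular pair of the
    # whitened cross-covariance (the standard CCA solution).
    Lx = np.linalg.cholesky(np.linalg.inv(Cxx))   # Lx @ Lx.T = inv(Cxx)
    Ly = np.linalg.cholesky(np.linalg.inv(Cyy))
    U, s, Vt = np.linalg.svd(Lx.T @ Cxy @ Ly)
    return Lx @ U[:, 0], Ly @ Vt[0, :], s[0]

# Illustrative data: T frames, p audio features (e.g. bandwise energies),
# q pixels described by their temporal intensity differences.
T, p, q = 200, 12, 32 * 32
rng = np.random.default_rng(0)
audio = rng.standard_normal((T, p))
video = rng.standard_normal((T, q))
w_audio, w_video, corr = cca_first_pair(audio, video)

# |w_video| serves as a per-pixel association map with the sound.
assoc_map = np.abs(w_video).reshape(32, 32)
print(f"first canonical correlation: {corr:.3f}")
print("strongest pixel:", np.unravel_index(assoc_map.argmax(), assoc_map.shape))
```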



Citations
Proceedings ArticleDOI

Look, Listen and Learn

TL;DR: This work identifies a valuable but so far untapped source of information contained in the video itself, namely the correspondence between the visual and the audio streams, and introduces a novel "Audio-Visual Correspondence" learning task that makes use of it.
Book ChapterDOI

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

TL;DR: In this paper, the authors argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and they propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
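As a rough illustration of how such an alignment signal can be generated (hypothetical code, not the authors' pipeline; the feature extractors and the network itself are omitted), temporally aligned audio/video windows can be labeled as positives and windows whose audio is taken from a shifted position as negatives:

```python
# Hypothetical sketch: build (video_window, audio_window, label) training triples
# where label 1 means the audio is temporally aligned with the video and
# label 0 means the audio was taken from a shifted position in the same track.
import numpy as np

def make_alignment_pairs(video_feats, audio_feats, win=16, shift=32,
                         n_pairs=1000, seed=0):
    """video_feats: (T, dv), audio_feats: (T, da), assumed frame-synchronized."""
    rng = np.random.default_rng(seed)
    T = min(len(video_feats), len(audio_feats))
    pairs = []
    for _ in range(n_pairs):
        t = rng.integers(0, T - win - shift)
        v = video_feats[t:t + win]
        if rng.random() < 0.5:                       # positive: aligned audio
            a, label = audio_feats[t:t + win], 1
        else:                                        # negative: shifted audio
            a, label = audio_feats[t + shift:t + shift + win], 0
        pairs.append((v, a, label))
    return pairs

# A classifier trained on such triples learns a fused audio-visual representation
# without any manual labels.
T, dv, da = 2000, 128, 64
pairs = make_alignment_pairs(np.random.randn(T, dv), np.random.randn(T, da))
print(len(pairs), pairs[0][2])
```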
Proceedings Article

On Deep Multi-View Representation Learning

TL;DR: This work finds an advantage for correlation-based representation learning, while the best results on most tasks are obtained with the new variant, deep canonically correlated autoencoders (DCCAE).
Book ChapterDOI

Ambient Sound Provides Supervision for Visual Learning

TL;DR: This work trains a convolutional neural network to predict a statistical summary of the sound associated with a video frame, and shows that this representation is comparable to that of other state-of-the-art unsupervised learning methods.
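As a rough illustration of this kind of prediction target (an assumed stand-in, not the paper's exact statistic), the audio around a frame can be summarized by the mean and standard deviation of coarse spectral band energies:

```python
# Illustrative sketch: compute a simple statistical summary of the sound around
# a video frame (mean and standard deviation of coarse band energies).
# The exact summary used in the paper is not reproduced here.
import numpy as np

def sound_summary(waveform, n_fft=512, hop=256, n_bands=32):
    """Return a fixed-length vector summarizing band energies of `waveform`."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(waveform) - n_fft, hop):
        frames.append(np.abs(np.fft.rfft(window * waveform[start:start + n_fft])))
    spec = np.array(frames)                          # (n_frames, n_fft // 2 + 1)
    # Pool FFT bins into coarse bands, then summarize each band over time.
    bands = np.array_split(spec, n_bands, axis=1)
    energies = np.stack([b.mean(axis=1) for b in bands], axis=1)  # (n_frames, n_bands)
    return np.concatenate([energies.mean(axis=0), energies.std(axis=0)])

# A CNN is then trained to predict (a quantized version of) this vector
# from the corresponding video frame.
target = sound_summary(np.random.randn(16000))
print(target.shape)   # (64,) = n_bands means + n_bands standard deviations
```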
Book ChapterDOI

The Sound of Pixels

TL;DR: PixelPlayer learns to locate image regions that produce sound and to separate the input audio into a set of components representing the sound from each pixel, which can then be used to adjust the volume of individual sound sources.
References
Journal ArticleDOI

A theory for multiresolution signal decomposition: the wavelet representation

TL;DR: In this paper, it is shown that the difference of information between the approximation of a signal at the resolutions 2^(j+1) and 2^j (where j is an integer) can be extracted by decomposing this signal on a wavelet orthonormal basis of L^2(R^n), the vector space of measurable, square-integrable n-dimensional functions.
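For a concrete view of this decomposition, the sketch below uses the PyWavelets library (assumed available; wavelet choice and signal are arbitrary). The detail coefficients at each level carry the information difference between successive approximations, and the signal is recovered exactly from the coarsest approximation plus all detail bands:

```python
# Sketch: multiresolution wavelet decomposition of a 1-D signal.
import numpy as np
import pywt  # PyWavelets, assumed installed

t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)

# Three-level orthonormal wavelet decomposition: [cA3, cD3, cD2, cD1].
coeffs = pywt.wavedec(signal, "db4", level=3)
approx, details = coeffs[0], coeffs[1:]
print("approximation length:", len(approx))
print("detail lengths per level:", [len(d) for d in details])

# Perfect reconstruction from the approximation plus all detail bands.
reconstructed = pywt.waverec(coeffs, "db4")
print("max reconstruction error:", np.max(np.abs(reconstructed - signal)))
```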
Book

Information Theory

Robert B. Ash
Journal ArticleDOI

Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization

TL;DR: This article obtains parallel results in a more general setting, where the dictionary D can arise from two or several bases, frames, or even less structured systems, and sketches three applications: separating linear features from planar ones in 3D data, noncooperative multiuser encoding, and identification of over-complete independent component models.
Journal ArticleDOI

For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution

TL;DR: In this article, the authors consider linear equations y = Φx, where y is a given vector in ℝ^n and Φ is an n × m matrix with n < m, and show that there exists ρ > 0 such that, for large n and for all Φ except a negligible fraction, whenever y admits a representation y = Φx0 with fewer than ρ·n nonzeros, the solution x1 of the ℓ1-minimization problem is unique and equal to x0.
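The ℓ1-minimization in question, min ||x||_1 subject to Φx = y, reduces to a linear program by splitting x into nonnegative parts, x = u - v. A small SciPy sketch on synthetic data (problem sizes and data are chosen arbitrarily for illustration):

```python
# Sketch: basis pursuit, min ||x||_1 subject to Phi @ x = y, solved as a linear
# program by splitting x = u - v with u, v >= 0. Data here are synthetic.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 30, 80, 4                      # underdetermined: n equations, m unknowns
Phi = rng.standard_normal((n, m))
x0 = np.zeros(m)
x0[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)   # sparse truth
y = Phi @ x0

# Variables z = [u; v], objective sum(u) + sum(v), constraint Phi @ u - Phi @ v = y.
c = np.ones(2 * m)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x1 = res.x[:m] - res.x[m:]

# With a sufficiently sparse x0, the LP solution typically recovers it exactly.
print("recovered the sparse solution:", np.allclose(x1, x0, atol=1e-6))
```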
Journal ArticleDOI

Kernel independent component analysis

TL;DR: A class of algorithms for independent component analysis is presented that uses contrast functions based on canonical correlations in a reproducing kernel Hilbert space; these algorithms are shown to outperform many presently known algorithms.
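A minimal sketch of the kind of contrast involved: the first regularized kernel canonical correlation between two variables under an RBF kernel. The ICA search over demixing matrices is omitted, and the kernel, bandwidth, and regularization constant are illustrative assumptions rather than the paper's choices:

```python
# Sketch: first canonical correlation between two variables in an RBF-kernel
# feature space (regularized kernel CCA). Kernel ICA minimizes contrasts built
# from such correlations over candidate demixing matrices (not done here).
import numpy as np

def centered_rbf_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_first_correlation(x, y, sigma=1.0, reg=1e-1):
    n = len(x)
    Kx = centered_rbf_gram(x, sigma)
    Ky = centered_rbf_gram(y, sigma)
    # Regularized KCCA: largest singular value of
    # inv(Kx + reg*I) @ Kx @ Ky @ inv(Ky + reg*I).
    Mx = np.linalg.solve(Kx + reg * np.eye(n), Kx)
    My = np.linalg.solve(Ky + reg * np.eye(n), Ky)
    return np.linalg.svd(Mx @ My.T, compute_uv=False)[0]

rng = np.random.default_rng(0)
s = rng.standard_normal(300)
independent = kcca_first_correlation(s, rng.standard_normal(300))
dependent = kcca_first_correlation(s, s ** 2 + 0.1 * rng.standard_normal(300))
# Dependent pairs typically yield a noticeably larger kernel correlation.
print(f"independent: {independent:.3f}  dependent: {dependent:.3f}")
```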