Vision-based Detection of Acoustic Timed Events: a Case Study on Clarinet Note Onsets

TLDR
A vision-based approach to note onset detection is considered; preliminary experiments show that the CNN model is highly sensitive to the optimization algorithm and hyper-parameters, and that treating the problem as binary classification may prevent the joint optimization of precision and recall.
Abstract
Acoustic events often have a visual counterpart. Knowledge of visual information can aid the understanding of complex auditory scenes, even when only a stereo mixdown is available in the audio domain, e.g. identifying which musicians are playing in large musical ensembles. In this paper, we consider a vision-based approach to note onset detection. As a case study we focus on challenging, real-world clarinetist videos and carry out preliminary experiments on a 3D convolutional neural network based on multiple streams and purposely avoiding temporal pooling. We release an audiovisual dataset with 4.5 hours of clarinetist videos together with cleaned annotations which include about 36,000 onsets and the coordinates for a number of salient points and regions of interest. By performing several training trials on our dataset, we learned that the problem is challenging. We found that the CNN model is highly sensitive to the optimization algorithm and hyper-parameters, and that treating the problem as binary classification may prevent the joint optimization of precision and recall. To encourage further research, we publicly share our dataset, annotations and all models and detail which issues we came across during our preliminary experiments.
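The abstract names two key design choices, multiple streams and the absence of temporal pooling, without spelling out the architecture. The following is a minimal PyTorch sketch of that general idea only; the layer widths, the choice of two streams (e.g. mouth and hands crops), and the clip length are assumptions for illustration, not the authors' model.

```python
# Hypothetical sketch (PyTorch): multi-stream 3D CNN emitting one onset
# logit per video frame. Pooling is spatial only -- never temporal -- so
# the output keeps the input's frame-level time resolution.
import torch
import torch.nn as nn

class Stream3D(nn.Module):
    """One stream over a video crop of shape (B, C, T, H, W)."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
            nn.BatchNorm3d(width),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # halve H and W, keep T
            nn.Conv3d(width, 2 * width, kernel_size=3, padding=1),
            nn.BatchNorm3d(2 * width),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),    # collapse space, keep T
        )

    def forward(self, x):
        return self.net(x).flatten(2)              # (B, 2*width, T)

class MultiStreamOnsetNet(nn.Module):
    """Concatenates per-stream features and scores each frame."""
    def __init__(self, n_streams=2, width=32):
        super().__init__()
        self.streams = nn.ModuleList([Stream3D(width=width) for _ in range(n_streams)])
        self.head = nn.Conv1d(n_streams * 2 * width, 1, kernel_size=1)

    def forward(self, crops):                      # list of (B, 3, T, H, W)
        feats = torch.cat([s(c) for s, c in zip(self.streams, crops)], dim=1)
        return self.head(feats).squeeze(1)         # (B, T) per-frame logits

# Example: two hypothetical streams (e.g. mouth and hands regions of interest).
model = MultiStreamOnsetNet()
clips = [torch.randn(4, 3, 25, 64, 64), torch.randn(4, 3, 25, 64, 64)]
print(model(clips).shape)                          # torch.Size([4, 25])
```

A per-frame binary cross-entropy over these logits is exactly the binary-classification framing the abstract cautions about; differentiable F-measure surrogates (e.g. a "soft-F1" loss) are one commonly used way to optimize precision and recall jointly, though the paper does not prescribe one.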

Citations
Journal Article

Deep Audio-visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning development is provided, dividing the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Proceedings Article

Sight to Sound: An End-to-End Approach for Visual Piano Transcription

TL;DR: This work proposes an end-to-end deep learning framework that learns to automatically predict note onset events given a video of a person playing the piano, and finds that this approach is surprisingly effective in a variety of complex situations, particularly those in which music transcription from audio alone is impossible.
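Note-onset predictions such as those discussed above (and in the clarinet paper itself) are conventionally scored with precision, recall, and F-measure under a small tolerance window, typically ±50 ms, around each ground-truth onset. Below is a minimal sketch of that protocol using the mir_eval library; the onset times are made-up values, and neither paper mandates this exact tooling.

```python
# Sketch: standard onset-detection scoring with a +/-50 ms tolerance
# window, using mir_eval. Onset times are in seconds; the values below
# are invented for illustration.
import numpy as np
import mir_eval

reference_onsets = np.array([0.50, 1.20, 2.05, 3.40])   # ground truth
estimated_onsets = np.array([0.52, 1.30, 3.38, 4.00])   # model output

f, p, r = mir_eval.onset.f_measure(reference_onsets,
                                   estimated_onsets,
                                   window=0.05)          # 50 ms tolerance
print(f"F={f:.2f}  P={p:.2f}  R={r:.2f}")
```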
Journal Article

Audiovisual Analysis of Music Performances: Overview of an Emerging Field

TL;DR: In the physical sciences and engineering domains, music has traditionally been considered an acoustic phenomenon, and existing automated music analysis approaches predominantly focus on audio signals that represent information from the acoustic rendering of music.
Proceedings Article

Guitar Music Transcription from Silent Video

TL;DR: This work proposes a novel, physics-based method for polyphonic NT of string instruments that can overcome some limitations posed by the relatively low sampling rate of the camera, and shows that the vision-based NT method can play an important role in solving the NT problem.
Proceedings Article

Visual Music Transcription of Clarinet Video Recordings Trained with Audio-Based Labelled Data

TL;DR: This work addresses the automatic transcription of video recordings when the audio modality is missing or of insufficient quality, analyzing only the visual information; the results confirm the difficulty of visual compared with audio-based automatic transcription.
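One common way to obtain the audio-based labels mentioned above is to run a standard audio onset detector over the soundtrack and map the detected times to video frames. Here is a minimal sketch with librosa, assuming a 25 fps video; the file name, frame rate, and labelling details are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: deriving per-video-frame onset labels from the audio track with
# librosa's default onset detector. Assumes a 25 fps video; the labelling
# details of the paper above may differ.
import numpy as np
import librosa

FPS = 25                                     # assumed video frame rate

y, sr = librosa.load("performance.wav", sr=None)
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

n_frames = int(np.ceil(len(y) / sr * FPS))
labels = np.zeros(n_frames, dtype=np.int64)  # 1 = onset in this video frame
frame_idx = np.minimum((onset_times * FPS).astype(int), n_frames - 1)
labels[frame_idx] = 1
```

Each video frame then carries a binary onset label that can supervise a vision-only model such as the one sketched earlier.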