Vision-based Detection of Acoustic Timed Events: a Case Study on Clarinet Note Onsets

TLDR
A vision-based approach to note onset detection is considered; preliminary experiments show that the CNN model is highly sensitive to the optimization algorithm and hyper-parameters, and that treating the problem as binary classification may prevent the joint optimization of precision and recall.
Abstract
Acoustic events often have a visual counterpart. Knowledge of visual information can aid the understanding of complex auditory scenes, even when only a stereo mixdown is available in the audio domain, e.g. identifying which musicians are playing in large musical ensembles. In this paper, we consider a vision-based approach to note onset detection. As a case study we focus on challenging, real-world clarinetist videos and carry out preliminary experiments on a 3D convolutional neural network based on multiple streams and purposely avoiding temporal pooling. We release an audiovisual dataset with 4.5 hours of clarinetist videos together with cleaned annotations which include about 36,000 onsets and the coordinates for a number of salient points and regions of interest. By performing several training trials on our dataset, we learned that the problem is challenging. We found that the CNN model is highly sensitive to the optimization algorithm and hyper-parameters, and that treating the problem as binary classification may prevent the joint optimization of precision and recall. To encourage further research, we publicly share our dataset, annotations and all models and detail which issues we came across during our preliminary experiments.
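The abstract names two key design choices, multiple streams and the absence of temporal pooling, without spelling out the architecture. The following is a minimal PyTorch sketch of that general idea only; the layer widths, the choice of two streams (e.g. mouth and hands crops), and the clip length are assumptions for illustration, not the authors' model.

```python
# Hypothetical sketch (PyTorch): multi-stream 3D CNN emitting one onset
# logit per video frame. Pooling is spatial only -- never temporal -- so
# the output keeps the input's frame-level time resolution.
import torch
import torch.nn as nn

class Stream3D(nn.Module):
    """One stream over a video crop of shape (B, C, T, H, W)."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
            nn.BatchNorm3d(width),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # halve H and W, keep T
            nn.Conv3d(width, 2 * width, kernel_size=3, padding=1),
            nn.BatchNorm3d(2 * width),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),    # collapse space, keep T
        )

    def forward(self, x):
        return self.net(x).flatten(2)              # (B, 2*width, T)

class MultiStreamOnsetNet(nn.Module):
    """Concatenates per-stream features and scores each frame."""
    def __init__(self, n_streams=2, width=32):
        super().__init__()
        self.streams = nn.ModuleList([Stream3D(width=width) for _ in range(n_streams)])
        self.head = nn.Conv1d(n_streams * 2 * width, 1, kernel_size=1)

    def forward(self, crops):                      # list of (B, 3, T, H, W)
        feats = torch.cat([s(c) for s, c in zip(self.streams, crops)], dim=1)
        return self.head(feats).squeeze(1)         # (B, T) per-frame logits

# Example: two hypothetical streams (e.g. mouth and hands regions of interest).
model = MultiStreamOnsetNet()
clips = [torch.randn(4, 3, 25, 64, 64), torch.randn(4, 3, 25, 64, 64)]
print(model(clips).shape)                          # torch.Size([4, 25])
```

A per-frame binary cross-entropy over these logits is exactly the binary-classification framing the abstract cautions about; differentiable F-measure surrogates (e.g. a "soft-F1" loss) are one commonly used way to optimize precision and recall jointly, though the paper does not prescribe one.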

Citations
Journal Article

Deep Audio-visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning development is provided, dividing the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Proceedings Article

Sight to Sound: An End-to-End Approach for Visual Piano Transcription

TL;DR: This work proposes an end-to-end deep learning framework that learns to automatically predict note onset events given a video of a person playing the piano, and finds that this approach is surprisingly effective in a variety of complex situations, particularly those in which music transcription from audio alone is impossible.
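Note-onset predictions such as those discussed above (and in the clarinet paper itself) are conventionally scored with precision, recall, and F-measure under a small tolerance window, typically ±50 ms, around each ground-truth onset. Below is a minimal sketch of that protocol using the mir_eval library; the onset times are made-up values, and neither paper mandates this exact tooling.

```python
# Sketch: standard onset-detection scoring with a +/-50 ms tolerance
# window, using mir_eval. Onset times are in seconds; the values below
# are invented for illustration.
import numpy as np
import mir_eval

reference_onsets = np.array([0.50, 1.20, 2.05, 3.40])   # ground truth
estimated_onsets = np.array([0.52, 1.30, 3.38, 4.00])   # model output

f, p, r = mir_eval.onset.f_measure(reference_onsets,
                                   estimated_onsets,
                                   window=0.05)          # 50 ms tolerance
print(f"F={f:.2f}  P={p:.2f}  R={r:.2f}")
```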
Journal Article

Audiovisual Analysis of Music Performances: Overview of an Emerging Field

TL;DR: In the physical sciences and engineering domains, music has traditionally been considered an acoustic phenomenon, and existing automated music analysis approaches predominantly focus on audio signals that represent information from the acoustic rendering of music.
Proceedings Article

Guitar Music Transcription from Silent Video

TL;DR: This work proposes a novel, physics-based method for polyphonic NT of string instruments that can overcome some limitations posed by the relatively low sampling rate of the camera, and shows that the vision-based NT method can play an important role in solving the NT problem.
Proceedings Article

Visual Music Transcription of Clarinet Video Recordings Trained with Audio-Based Labelled Data

TL;DR: This work addresses the automatic transcription of video recordings when the audio modality is missing or of insufficient quality, analyzing only the visual information; the results confirm the difficulty of visual compared with audio-based automatic transcription.
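One common way to obtain the audio-based labels mentioned above is to run a standard audio onset detector over the soundtrack and map the detected times to video frames. Here is a minimal sketch with librosa, assuming a 25 fps video; the file name, frame rate, and labelling details are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: deriving per-video-frame onset labels from the audio track with
# librosa's default onset detector. Assumes a 25 fps video; the labelling
# details of the paper above may differ.
import numpy as np
import librosa

FPS = 25                                     # assumed video frame rate

y, sr = librosa.load("performance.wav", sr=None)
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

n_frames = int(np.ceil(len(y) / sr * FPS))
labels = np.zeros(n_frames, dtype=np.int64)  # 1 = onset in this video frame
frame_idx = np.minimum((onset_times * FPS).astype(int), n_frames - 1)
labels[frame_idx] = 1
```

Each video frame then carries a binary onset label that can supervise a vision-only model such as the one sketched earlier.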