Open Access · Journal Article · DOI

Creating A Multi-track Classical Musical Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

TL;DR
The dataset described in this paper consists of 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece it provides the musical score in MIDI format, audio recordings of the individual tracks, audio and video recordings of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions.
Abstract
We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions. We describe our methodology for the creation of the dataset, particularly highlighting our approaches for addressing the challenges involved in maintaining synchronization and expressiveness. We demonstrate the high quality of synchronization achieved with our proposed approach by comparing the dataset with existing widely-used music audio datasets. We anticipate that the dataset will be useful for the development and evaluation of existing music information retrieval (MIR) tasks, as well as for novel multi-modal tasks. We benchmark two existing MIR tasks (multi-pitch analysis and score-informed source separation) on the dataset and compare with other existing music audio datasets. Additionally, we consider two novel multi-modal MIR tasks (visually informed multi-pitch analysis and polyphonic vibrato analysis) enabled by the dataset and provide evaluation measures and baseline systems for future comparisons (from our recent work). Finally, we propose several emerging research directions that the dataset enables.
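The following is a minimal Python sketch of how one might load the per-piece assets described above (MIDI score, individual track audio, mixture audio, and frame-level/note-level annotations). The directory layout, file names, and annotation column order shown here are illustrative assumptions only, not the dataset's documented format.

# Minimal sketch of loading one piece from the dataset.
# NOTE: folder layout, file names, and annotation column order are
# assumptions for illustration; consult the dataset documentation.
import numpy as np
import soundfile as sf
import pretty_midi

piece_dir = "Piece01"  # hypothetical folder for one of the 44 pieces

# Musical score in MIDI format (assumed file name).
score = pretty_midi.PrettyMIDI(f"{piece_dir}/score.mid")

# Audio of one individual instrument track and of the assembled mixture.
track_audio, sr = sf.read(f"{piece_dir}/track1_violin.wav")
mix_audio, _ = sf.read(f"{piece_dir}/mix.wav")

# Frame-level transcription: assumed columns are time (s) and F0 (Hz).
frame_f0 = np.loadtxt(f"{piece_dir}/f0_track1_violin.txt")

# Note-level transcription: assumed columns are onset (s), F0 (Hz), duration (s).
notes = np.loadtxt(f"{piece_dir}/notes_track1_violin.txt")

print(f"score instruments: {len(score.instruments)}, "
      f"track length: {len(track_audio) / sr:.1f} s, "
      f"annotated notes: {len(notes)}")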


Citations
Book Chapter · DOI

Foley Music: Learning to Generate Music from Videos

TL;DR: Foley Music, a system that synthesizes plausible music for a silent video clip of people playing musical instruments, is introduced, together with a Graph-Transformer framework that accurately predicts MIDI event sequences in accordance with the body movements.
Journal Article · DOI

Deep Audio-visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning developments is provided, dividing current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Proceedings Article · DOI

Multi-instrument Music Synthesis with Spectrogram Diffusion

TL;DR: This work compares training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM), and finds that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics.
Proceedings Article · DOI

Temporally Guided Music-to-Body-Movement Generation

TL;DR: In this article, a neural network model is proposed to generate a virtual violinist's 3-D skeleton movements from music audio; it incorporates an encoder-decoder architecture as well as a self-attention mechanism to model the complicated dynamics of body movement sequences.