

Automatic Music Transcription:
Challenges and Future Directions
Emmanouil Benetos · Simon Dixon ·
Dimitrios Giannoulis ·
Holger Kirchhoff · Anssi Klapuri
Abstract Automatic music transcription is considered by many to be a key enabling technology in music signal processing. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse limitations of current methods and identify promising directions for future research. Current transcription methods use general purpose models which are unable to capture the rich diversity found in music signals. One way to overcome the limited performance of transcription systems is to tailor algorithms to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding audio data now available are a rich potential source of training data, via forced alignment of audio to scores, but large scale utilisation of such data has yet to be attempted. Other promising approaches include the integration of information from multiple algorithms and different musical aspects.
Keywords Music signal analysis · Music information retrieval · Automatic music transcription
Equally contributing authors.
E. Benetos
Department of Computer Science
City University London
Tel.: +44 20 7040 4154
E-mail: emmanouil.benetos.1@city.ac.uk
S. Dixon, D. Giannoulis, H. Kirchhoff
Centre for Digital Music
Queen Mary University of London
Tel.: +44 20 7882 7681
E-mail: {simon.dixon, dimitrios.giannoulis, holger.kirchhoff}@eecs.qmul.ac.uk
A. Klapuri
Ovelin and
Tampere University of Technology
E-mail: anssi.klapuri@tut.fi
E. Benetos and A. Klapuri were at the Centre for Digital Music, Queen Mary University of London.

1 Introduction
Automatic music transcription (AMT) is the process of converting an acoustic musical signal into some form of musical notation. In [24] it is defined as the process of converting an audio recording into a piano-roll notation (a two-dimensional representation of musical notes across time), while in [75] it is defined as the process of converting a recording into common music notation (i.e. a score). Even for expert musicians, transcribing polyphonic pieces of music is not a trivial task (see Chapter 1 of [75] and [77]), and while the problem of automatic pitch estimation for monophonic signals might be considered solved, the creation of an automated system able to transcribe polyphonic music without restrictions on the degree of polyphony or the instrument type still remains open.
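For readers unfamiliar with the piano-roll notation referred to above, the following minimal Python sketch illustrates it as a binary pitch/time matrix; the note list, frame size, and durations are illustrative assumptions, not data from the paper.

```python
import numpy as np

# A piano roll is a binary matrix: rows are MIDI pitches, columns are
# fixed-size time frames; a cell is 1 while the note is sounding.
def piano_roll(notes, n_pitches=128, frame_dur=0.01, total_dur=2.0):
    """notes: list of (midi_pitch, onset_sec, offset_sec) tuples."""
    n_frames = int(np.ceil(total_dur / frame_dur))
    roll = np.zeros((n_pitches, n_frames), dtype=np.uint8)
    for pitch, onset, offset in notes:
        start, end = int(onset / frame_dur), int(offset / frame_dur)
        roll[pitch, start:end] = 1
    return roll

# Three hypothetical notes of an ascending D major arpeggio.
roll = piano_roll([(50, 0.0, 0.5), (54, 0.5, 1.0), (57, 1.0, 1.5)])
print(roll.shape)  # (128, 200)
```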
The most immediate application of automatic music transcription is for allowing musicians to record the notes of an improvised performance in order to be able to reproduce it. AMT also has great value in musical styles where no score exists, e.g. music from oral traditions, jazz, pop, etc. In recent years, the problem of automatic music transcription has gained considerable research interest due to the numerous applications associated with the area, such as automatic search and annotation of musical information, interactive music systems (e.g. computer participation in live human performances, score following, and rhythm tracking), as well as musicological analysis [9,55,75]. An example of the transcription process can be seen in Figure 1.
The AMT problem can be divided into several subtasks, which include: multi-pitch detection, note onset/offset detection, loudness estimation and quantisation, instrument recognition, extraction of rhythmic information, and time quantisation. The core problem in automatic transcription is the estimation of concurrent pitches in a time frame, also called multiple-F0 or multi-pitch detection.
In this work we address challenges and future directions for automatic transcription of polyphonic Western music, expanding upon the work presented in [13]. The related problem of melody transcription, i.e. the estimation of the predominant pitch, usually performed by a solo instrument or a lead singer, is not addressed in this paper; for an overview of melody transcription approaches the reader can refer to [108]. Also, the field of content-based music information retrieval, which refers to automated processing of music for search and retrieval purposes and includes the AMT problem, is discussed in [22]. A recent state-of-the-art review of music signal analysis (which includes AMT) is given in [92], while the work by Grosche et al. [61] includes a recent state-of-the-art section on AMT systems.
2 State of the Art
2.1 Multi-pitch Detection and Note Tracking
In polyphonic music transcription, we are interested in detecting notes which might
occur concurrently and could be produced by several instrument sources. The
core problem for creating a system for polyphonic music transcription is thus
multi-pitch estimation. The vast majority of AMT systems restrict their scope to
performing multi-pitch detection and note tracking (either jointly or sequentially).

Fig. 1 An automatic music transcription example using the first bar of J.S. Bach's Prelude in D major. The top panel shows the time-domain audio signal, the middle panel shows a time-frequency representation with detected pitches superimposed, and the bottom panel shows the final score.
In [127], multi-pitch detection systems were classified according to their estimation type as either joint or iterative. The iterative estimation approach extracts the most prominent pitch in each iteration, until no additional F0s can be estimated. Generally, iterative estimation models tend to accumulate errors at each iteration step, but are computationally inexpensive. On the contrary, joint estimation methods evaluate F0 combinations, leading to more accurate estimates but with increased computational cost. Recent developments in AMT show that the vast majority of proposed approaches now fall within the 'joint' category. Thus, the classification presented in this paper organises multi-pitch detection systems according to the core techniques or models employed.
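The control-flow difference between the two estimation types can be made concrete with the following Python sketch. Here `salience`, `cancel_harmonics`, and `set_score` are hypothetical stand-ins for system-specific components; the sketch is schematic, not any published algorithm.

```python
import itertools

def iterative_estimation(spectrum, salience, cancel_harmonics, threshold):
    """Extract the most prominent pitch, cancel its partials, repeat."""
    pitches = []
    residual = spectrum.copy()
    while True:
        f0, score = salience(residual)        # most prominent candidate
        if score < threshold:                 # no additional F0s found
            break
        pitches.append(f0)
        residual = cancel_harmonics(residual, f0)  # errors may accumulate here
    return pitches

def joint_estimation(spectrum, candidates, set_score, max_polyphony=6):
    """Evaluate whole F0 combinations: costlier, but more accurate."""
    best, best_score = [], float('-inf')
    for n in range(1, max_polyphony + 1):
        for combo in itertools.combinations(candidates, n):
            s = set_score(spectrum, combo)    # score the candidate set jointly
            if s > best_score:
                best, best_score = list(combo), s
    return best
```

The nested loops in the joint variant make the cost trade-off visible: the number of candidate sets grows combinatorially with the allowed polyphony.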
2.1.1 Feature-based multi-pitch detection
Most multiple-F0 estimation and note tracking systems employ methods derived from signal processing; a specific model is not employed, and notes are detected using audio features derived from the input time-frequency representation, either in a joint or an iterative fashion. Typically, multiple-F0 estimation occurs using a pitch salience function (also called pitch strength function) or a pitch candidate set score function [74,106,127]. These feature-based techniques have produced the best results in the Music Information Retrieval Evaluation eXchange (MIREX) multi-F0 (frame-wise) and note tracking evaluations [7,91].
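As an illustration of what a pitch salience function may look like, here is a hedged numpy sketch of simple harmonic summation over a magnitude spectrum; the harmonic count, 1/h partial weighting, and F0 grid are assumptions and do not correspond to any specific MIREX submission.

```python
import numpy as np

def salience(mag_spectrum, sr, n_fft, f0_grid, n_harmonics=10):
    """Harmonic-summation pitch salience: for each candidate F0, sum
    weighted magnitudes at its first few harmonic positions."""
    hz_per_bin = sr / n_fft
    sal = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        for h in range(1, n_harmonics + 1):
            bin_idx = int(round(h * f0 / hz_per_bin))
            if bin_idx >= len(mag_spectrum):
                break
            sal[i] += mag_spectrum[bin_idx] / h  # decaying weight per partial
    return sal

# Usage (illustrative): candidate F0s from A1 to A6 on a 1 Hz grid.
# f0_grid = np.arange(55.0, 1760.0, 1.0)
# sal = salience(np.abs(np.fft.rfft(frame, 4096)), 44100, 4096, f0_grid)
```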
The best performing method in the MIREX multi-F0 and note tracking tasks for 2009-2011 was the work by Yeh [127], who proposed a joint pitch estimation algorithm based on a pitch candidate set score function. Given a set of pitch candidates, the overlapping partials are detected and smoothed according to the spectral smoothness principle, which states that the spectral envelope of a musical tone tends to be slowly varying as a function of frequency. The weighted score function for the pitch candidate set consists of four features: harmonicity, mean bandwidth, spectral centroid, and "synchronicity" (synchrony). A polyphony inference mechanism based on the increase of the score function selects the optimal pitch candidate set.
For 2012, the best performing method for the MIREX multi-F0 estimation and note tracking tasks was by Dressler [39]. As an input time-frequency representation, a multiresolution Fast Fourier Transform analysis is employed, where the magnitude for each spectral bin is multiplied by the bin's instantaneous frequency. Pitch estimation is performed by identifying spectral peaks and analysing them pairwise, resulting in peaks ranked according to harmonicity, smoothness, the appearance of intermediate peaks, and harmonic number. Finally, the system tracks tones over time using an adaptive magnitude threshold and a harmonic magnitude threshold.
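Per-bin instantaneous frequency is commonly estimated from the phase advance between successive STFT frames (a standard phase-vocoder technique). The sketch below shows that estimate and the magnitude weighting in numpy, as a simplified reading of the idea rather than Dressler's actual multiresolution analysis.

```python
import numpy as np

def instantaneous_frequency(frame_prev, frame_cur, sr, n_fft, hop):
    """Phase-vocoder estimate of per-bin instantaneous frequency (Hz)."""
    X0 = np.fft.rfft(frame_prev, n_fft)
    X1 = np.fft.rfft(frame_cur, n_fft)
    k = np.arange(len(X1))
    expected = 2 * np.pi * k * hop / n_fft           # expected phase advance
    dphi = np.angle(X1) - np.angle(X0) - expected
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi   # wrap to (-pi, pi]
    inst_hz = (2 * np.pi * k / n_fft + dphi / hop) * sr / (2 * np.pi)
    return np.abs(X1), inst_hz

# Weighting magnitudes by instantaneous frequency, as described above:
# mags, inst = instantaneous_frequency(x[:4096], x[512:4608], 44100, 4096, 512)
# weighted = mags * inst
```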
Other notable feature-based AMT systems include the work by Pertusa and Iñesta [106], who proposed a computationally inexpensive method for multi-pitch detection which computes a pitch salience function and evaluates combinations of pitch candidates using a measure of distance between a harmonic partial sequence (HPS) and a smoothed HPS. Another approach for feature-based AMT was proposed in [113], which uses genetic algorithms for estimating a transcription by mutating the solution until it matches a similarity criterion between the original signal and the synthesised transcribed signal. More recently, Grosche et al. [61] proposed an AMT method based on a mid-level representation derived from a multiresolution Fourier transform combined with an instantaneous frequency estimation. The system also combines onset detection and tuning estimation for computing frame-based estimates. Finally, Nam et al. [93] proposed a classification-based approach for piano transcription using features learned from deep belief networks [66] for computing a mid-level time-pitch representation.
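To give a flavour of the spectral smoothness measures mentioned above, the following sketch computes a toy distance between a harmonic partial sequence and a smoothed version of it; the three-point moving average and squared-error distance are assumptions, and the actual measures in [106] and [127] differ in detail.

```python
import numpy as np

def hps_smoothness_distance(mag_spectrum, f0, sr, n_fft, n_partials=10):
    """Extract the harmonic partial sequence (HPS) for candidate f0 and
    compare it with a smoothed version: a spectrally smooth envelope
    (as expected of a musical tone) yields a small distance."""
    bins = [int(round(h * f0 * n_fft / sr)) for h in range(1, n_partials + 1)]
    bins = [b for b in bins if b < len(mag_spectrum)]
    hps = mag_spectrum[bins]
    smoothed = np.convolve(hps, np.ones(3) / 3, mode='same')  # moving average
    return float(np.sum((hps - smoothed) ** 2))
```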
2.1.2 Statistical model-based multi-pitch detection
Many approaches in the literature formulate the multiple-F0 estimation problem within a statistical framework. Given an observed frame $x$ and the set $\mathcal{C}$ of all possible fundamental frequency combinations, the frame-based multiple-F0 estimation problem can then be viewed as a maximum a posteriori (MAP) estimation problem [43]:
$$\hat{C}_{\mathrm{MAP}} = \arg\max_{C \in \mathcal{C}} P(C \mid x) = \arg\max_{C \in \mathcal{C}} \frac{P(x \mid C)\, P(C)}{P(x)} \qquad (1)$$
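Read literally, Equation (1) amounts to a search over candidate F0 sets; since $P(x)$ does not depend on $C$, it can be dropped from the maximisation. The following Python sketch makes this explicit, with `likelihood` and `prior` as user-supplied stand-ins (assumptions, not the models of [43]).

```python
import numpy as np
from itertools import combinations

def map_multi_f0(x, f0_candidates, likelihood, prior, max_polyphony=4):
    """Brute-force MAP estimate over F0 combinations, following Eq. (1).
    likelihood(x, C) plays the role of P(x|C) and prior(C) of P(C);
    P(x) is constant over C and therefore omitted."""
    best_set, best_post = (), -np.inf
    for n in range(1, max_polyphony + 1):
        for C in combinations(f0_candidates, n):
            log_post = np.log(likelihood(x, C)) + np.log(prior(C))
            if log_post > best_post:
                best_set, best_post = C, log_post
    return best_set
```

In practice the statistical methods surveyed here avoid this exhaustive search with model-specific inference, but the sketch shows what the MAP criterion optimises.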
