

Automatic Music Transcription:
Challenges and Future Directions
Emmanouil Benetos · Simon Dixon ·
Dimitrios Giannoulis ·
Holger Kirchhoff · Anssi Klapuri
Abstract Automatic music transcription is considered by many to be a key enabling technology in music signal processing. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse limitations of current methods and identify promising directions for future research. Current transcription methods use general purpose models which are unable to capture the rich diversity found in music signals. One way to overcome the limited performance of transcription systems is to tailor algorithms to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding audio data now available are a rich potential source of training data, via forced alignment of audio to scores, but large scale utilisation of such data has yet to be attempted. Other promising approaches include the integration of information from multiple algorithms and different musical aspects.
Keywords Music signal analysis · Music information retrieval · Automatic music transcription
Equally contributing authors.
E. Benetos
Department of Computer Science
City University London
Tel.: +44 20 7040 4154
E-mail: emmanouil.benetos.1@city.ac.uk
S. Dixon, D. Giannoulis, H. Kirchhoff
Centre for Digital Music
Queen Mary University of London
Tel.: +44 20 7882 7681
E-mail: {simon.dixon, dimitrios.giannoulis, holger.kirchhoff}@eecs.qmul.ac.uk
A. Klapuri
Ovelin and
Tampere University of Technology
E-mail: anssi.klapuri@tut.fi
E. Benetos and A. Klapuri were at the Centre for Digital Music, Queen Mary University of London.

1 Introduction
Automatic music transcription (AMT) is the process of converting an acoustic musical signal into some form of musical notation. In [24] it is defined as the process of converting an audio recording into a piano-roll notation (a two-dimensional representation of musical notes across time), while in [75] it is defined as the process of converting a recording into common music notation (i.e. a score). Even for expert musicians, transcribing polyphonic pieces of music is not a trivial task (see Chapter 1 of [75] and [77]), and while the problem of automatic pitch estimation for monophonic signals might be considered solved, the creation of an automated system able to transcribe polyphonic music without restrictions on the degree of polyphony or the instrument type still remains open.
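For readers unfamiliar with the piano-roll notation referred to above, the following minimal Python sketch illustrates it as a binary pitch/time matrix; the note list, frame size, and durations are illustrative assumptions, not data from the paper.

```python
import numpy as np

# A piano roll is a binary matrix: rows are MIDI pitches, columns are
# fixed-size time frames; a cell is 1 while the note is sounding.
def piano_roll(notes, n_pitches=128, frame_dur=0.01, total_dur=2.0):
    """notes: list of (midi_pitch, onset_sec, offset_sec) tuples."""
    n_frames = int(np.ceil(total_dur / frame_dur))
    roll = np.zeros((n_pitches, n_frames), dtype=np.uint8)
    for pitch, onset, offset in notes:
        start, end = int(onset / frame_dur), int(offset / frame_dur)
        roll[pitch, start:end] = 1
    return roll

# Three hypothetical notes of an ascending D major arpeggio.
roll = piano_roll([(50, 0.0, 0.5), (54, 0.5, 1.0), (57, 1.0, 1.5)])
print(roll.shape)  # (128, 200)
```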
The most immediate application of automatic music transcription is for allowing musicians to record the notes of an improvised performance in order to be able to reproduce it. AMT also has great value in musical styles where no score exists, e.g. music from oral traditions, jazz, pop, etc. In recent years, the problem of automatic music transcription has gained considerable research interest due to the numerous applications associated with the area, such as automatic search and annotation of musical information, interactive music systems (e.g. computer participation in live human performances, score following, and rhythm tracking), as well as musicological analysis [9,55,75]. An example of the transcription process can be seen in Figure 1.
The AMT problem can be divided into several subtasks, which include: multi-pitch detection, note onset/offset detection, loudness estimation and quantisation, instrument recognition, extraction of rhythmic information, and time quantisation. The core problem in automatic transcription is the estimation of concurrent pitches in a time frame, also called multiple-F0 or multi-pitch detection.
In this work we address challenges and future directions for automatic transcription of polyphonic Western music, expanding upon the work presented in [13]. The related problem of melody transcription, i.e. the estimation of the predominant pitch, usually performed by a solo instrument or a lead singer, is not addressed in this paper; for an overview of melody transcription approaches the reader can refer to [108]. Also, the field of content-based music information retrieval, which refers to automated processing of music for search and retrieval purposes and includes the AMT problem, is discussed in [22]. A recent state-of-the-art review of music signal analysis (which includes AMT) is given in [92], while the work by Grosche et al. [61] includes a recent state-of-the-art section on AMT systems.
2 State of the Art
2.1 Multi-pitch Detection and Note Tracking
In polyphonic music transcription, we are interested in detecting notes which might
occur concurrently and could be produced by several instrument sources. The
core problem for creating a system for polyphonic music transcription is thus
multi-pitch estimation. The vast majority of AMT systems restrict their scope to
performing multi-pitch detection and note tracking (either jointly or sequentially).

Fig. 1 An automatic music transcription example using the first bar of J.S. Bach's Prelude in D major. The top panel shows the time-domain audio signal, the middle panel shows a time-frequency representation with detected pitches superimposed, and the bottom panel shows the final score.
In [127], multi-pitch detection systems were classified according to their estimation type as either joint or iterative. The iterative estimation approach extracts the most prominent pitch in each iteration, until no additional F0s can be estimated. Generally, iterative estimation models tend to accumulate errors at each iteration step, but are computationally inexpensive. On the contrary, joint estimation methods evaluate F0 combinations, leading to more accurate estimates but with increased computational cost. Recent developments in AMT show that the vast majority of proposed approaches now fall within the 'joint' category. Thus, the classification presented in this paper organises multi-pitch detection systems according to the core techniques or models employed.
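The control-flow difference between the two estimation types can be made concrete with the following Python sketch. Here `salience`, `cancel_harmonics`, and `set_score` are hypothetical stand-ins for system-specific components; the sketch is schematic, not any published algorithm.

```python
import itertools

def iterative_estimation(spectrum, salience, cancel_harmonics, threshold):
    """Extract the most prominent pitch, cancel its partials, repeat."""
    pitches = []
    residual = spectrum.copy()
    while True:
        f0, score = salience(residual)        # most prominent candidate
        if score < threshold:                 # no additional F0s found
            break
        pitches.append(f0)
        residual = cancel_harmonics(residual, f0)  # errors may accumulate here
    return pitches

def joint_estimation(spectrum, candidates, set_score, max_polyphony=6):
    """Evaluate whole F0 combinations: costlier, but more accurate."""
    best, best_score = [], float('-inf')
    for n in range(1, max_polyphony + 1):
        for combo in itertools.combinations(candidates, n):
            s = set_score(spectrum, combo)    # score the candidate set jointly
            if s > best_score:
                best, best_score = list(combo), s
    return best
```

The nested loops in the joint variant make the cost trade-off visible: the number of candidate sets grows combinatorially with the allowed polyphony.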
2.1.1 Feature-based multi-pitch detection
Most multiple-F0 estimation and note tracking systems employ methods derived from signal processing; a specific model is not employed, and notes are detected using audio features derived from the input time-frequency representation, either in a joint or an iterative fashion. Typically, multiple-F0 estimation occurs using a pitch salience function (also called pitch strength function) or a pitch candidate set score function [74,106,127]. These feature-based techniques have produced the best results in the Music Information Retrieval Evaluation eXchange (MIREX) multi-F0 (frame-wise) and note tracking evaluations [7,91].
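As an illustration of what a pitch salience function may look like, here is a hedged numpy sketch of simple harmonic summation over a magnitude spectrum; the harmonic count, 1/h partial weighting, and F0 grid are assumptions and do not correspond to any specific MIREX submission.

```python
import numpy as np

def salience(mag_spectrum, sr, n_fft, f0_grid, n_harmonics=10):
    """Harmonic-summation pitch salience: for each candidate F0, sum
    weighted magnitudes at its first few harmonic positions."""
    hz_per_bin = sr / n_fft
    sal = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        for h in range(1, n_harmonics + 1):
            bin_idx = int(round(h * f0 / hz_per_bin))
            if bin_idx >= len(mag_spectrum):
                break
            sal[i] += mag_spectrum[bin_idx] / h  # decaying weight per partial
    return sal

# Usage (illustrative): candidate F0s from A1 to A6 on a 1 Hz grid.
# f0_grid = np.arange(55.0, 1760.0, 1.0)
# sal = salience(np.abs(np.fft.rfft(frame, 4096)), 44100, 4096, f0_grid)
```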
The best performing method in the MIREX multi-F0 and note tracking tasks for 2009-2011 was the work by Yeh [127], who proposed a joint pitch estimation algorithm based on a pitch candidate set score function. Given a set of pitch candidates, the overlapping partials are detected and smoothed according to the spectral smoothness principle, which states that the spectral envelope of a musical tone tends to be slowly varying as a function of frequency. The weighted score function for the pitch candidate set consists of four features: harmonicity, mean bandwidth, spectral centroid, and "synchronicity" (synchrony). A polyphony inference mechanism based on the increase of the score function selects the optimal pitch candidate set.
For 2012, the best performing method for the MIREX multi-F0 estimation and note tracking tasks was by Dressler [39]. As an input time-frequency representation, a multiresolution Fast Fourier Transform analysis is employed, where the magnitude for each spectral bin is multiplied by the bin's instantaneous frequency. Pitch estimation is performed by identifying spectral peaks and analysing them pairwise, resulting in peaks ranked according to harmonicity, smoothness, the appearance of intermediate peaks, and harmonic number. Finally, the system tracks tones over time using an adaptive magnitude threshold and a harmonic magnitude threshold.
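Per-bin instantaneous frequency is commonly estimated from the phase advance between successive STFT frames (a standard phase-vocoder technique). The sketch below shows that estimate and the magnitude weighting in numpy, as a simplified reading of the idea rather than Dressler's actual multiresolution analysis.

```python
import numpy as np

def instantaneous_frequency(frame_prev, frame_cur, sr, n_fft, hop):
    """Phase-vocoder estimate of per-bin instantaneous frequency (Hz)."""
    X0 = np.fft.rfft(frame_prev, n_fft)
    X1 = np.fft.rfft(frame_cur, n_fft)
    k = np.arange(len(X1))
    expected = 2 * np.pi * k * hop / n_fft           # expected phase advance
    dphi = np.angle(X1) - np.angle(X0) - expected
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi   # wrap to (-pi, pi]
    inst_hz = (2 * np.pi * k / n_fft + dphi / hop) * sr / (2 * np.pi)
    return np.abs(X1), inst_hz

# Weighting magnitudes by instantaneous frequency, as described above:
# mags, inst = instantaneous_frequency(x[:4096], x[512:4608], 44100, 4096, 512)
# weighted = mags * inst
```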
Other notable feature-based AMT systems include the work by Pertusa and Iñesta [106], who proposed a computationally inexpensive method for multi-pitch detection which computes a pitch salience function and evaluates combinations of pitch candidates using a measure of distance between a harmonic partial sequence (HPS) and a smoothed HPS. Another approach for feature-based AMT was proposed in [113], which uses genetic algorithms for estimating a transcription by mutating the solution until it matches a similarity criterion between the original signal and the synthesised transcribed signal. More recently, Grosche et al. [61] proposed an AMT method based on a mid-level representation derived from a multiresolution Fourier transform combined with an instantaneous frequency estimation. The system also combines onset detection and tuning estimation for computing frame-based estimates. Finally, Nam et al. [93] proposed a classification-based approach for piano transcription using features learned from deep belief networks [66] for computing a mid-level time-pitch representation.
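To give a flavour of the spectral smoothness measures mentioned above, the following sketch computes a toy distance between a harmonic partial sequence and a smoothed version of it; the three-point moving average and squared-error distance are assumptions, and the actual measures in [106] and [127] differ in detail.

```python
import numpy as np

def hps_smoothness_distance(mag_spectrum, f0, sr, n_fft, n_partials=10):
    """Extract the harmonic partial sequence (HPS) for candidate f0 and
    compare it with a smoothed version: a spectrally smooth envelope
    (as expected of a musical tone) yields a small distance."""
    bins = [int(round(h * f0 * n_fft / sr)) for h in range(1, n_partials + 1)]
    bins = [b for b in bins if b < len(mag_spectrum)]
    hps = mag_spectrum[bins]
    smoothed = np.convolve(hps, np.ones(3) / 3, mode='same')  # moving average
    return float(np.sum((hps - smoothed) ** 2))
```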
2.1.2 Statistical model-based multi-pitch detection
Many approaches in the literature formulate the multiple-F0 estimation problem within a statistical framework. Given an observed frame $x$ and the set $\mathcal{C}$ of all possible fundamental frequency combinations, the frame-based multiple-F0 estimation problem can then be viewed as a maximum a posteriori (MAP) estimation problem [43]:
$$\hat{C}_{\mathrm{MAP}} = \arg\max_{C \in \mathcal{C}} P(C \mid x) = \arg\max_{C \in \mathcal{C}} \frac{P(x \mid C)\, P(C)}{P(x)} \qquad (1)$$
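Read literally, Equation (1) amounts to a search over candidate F0 sets; since $P(x)$ does not depend on $C$, it can be dropped from the maximisation. The following Python sketch makes this explicit, with `likelihood` and `prior` as user-supplied stand-ins (assumptions, not the models of [43]).

```python
import numpy as np
from itertools import combinations

def map_multi_f0(x, f0_candidates, likelihood, prior, max_polyphony=4):
    """Brute-force MAP estimate over F0 combinations, following Eq. (1).
    likelihood(x, C) plays the role of P(x|C) and prior(C) of P(C);
    P(x) is constant over C and therefore omitted."""
    best_set, best_post = (), -np.inf
    for n in range(1, max_polyphony + 1):
        for C in combinations(f0_candidates, n):
            log_post = np.log(likelihood(x, C)) + np.log(prior(C))
            if log_post > best_post:
                best_set, best_post = C, log_post
    return best_set
```

In practice the statistical methods surveyed here avoid this exhaustive search with model-specific inference, but the sketch shows what the MAP criterion optimises.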
