What is the tempo extraction algorithm?

Running the original tempo extraction algorithm of section 3.2 (global maximum of TPS) scored 35.7% and 74.4% for accuracies 1 and 2 respectively, which would have placed it between 5th and 6th place in the 2004 evaluation for accuracy 1, and between 3rd and 4th for accuracy 2.

How many of the ground-truth data sets matched the system-estimated tempo?

In order to distinguish between gross disagreements in tempo and more local errors in beat placement, the authors repeated the scoring using only the 344 of 800 (43%) of ground-truth data sets in which the system-estimated tempo matched the ground-truth tempo to within 20%.

What is the reason that the system beats scored worse than 86.6%?

One reason that this scores worse than 86.6% achieved on the 344 sequences that agreed with the system tempo is that the larger set of 747 ground-truth sequences will include more at metrical levels slower than the tatum, or fastest rate present.

What is the reward for using dynamic programming?

Although the requirement of an a priori tempo is a weakness, the reward is a particularly efficient beat-tracking system that is guaranteed to find the set of beat times that optimizes the objective function, thanks to its ability to use the well-known dynamic programming algorithm [Bellman, 1957].

Why is it not possible for any beat tracker to score close to 100% agreement with the data?

Because of the multiplicity of metrical levels reflected in the ground-truth data (as noted in section 3.2), it is not possible for any beat tracker to score close to 100% agreement with this data.

What is the effect of a larger α in the beat tracker?

A larger α leads to a tighter adherence to the ideal tempo, since it increases the weight of the ‘transition’ cost associated with non-ideal inter-beat intervals in comparison to the onset waveform.

(Open Access) Beat Tracking by Dynamic Programming (2007) | Daniel P. W. Ellis

Q: What are the contributions in "Beat tracking by dynamic programming" ?

The authors describe a beat tracking system which first estimates a global tempo, uses this tempo to construct a transition cost function, then uses dynamic programming to find the best-scoring set of beat times that reflect the tempo as well as corresponding to moments of high ‘onset strength’ in a function derived from the audio. The authors also examine the impact of the assumption of a fixed target tempo, and show that the system is typically able to track tempo changes in a range of ±10% of the target tempo.

Beat Tracking by Dynamic Programming

Daniel P.W. Ellis

LabROSA, Columbia University, New York

July 16, 2007

Abstract

Beat tracking – i.e. deriving from a music audio signal a sequence of beat instants that might correspond to when a human listener would tap his foot – involves satisfying two constraints: On the one hand, the selected instants should generally correspond to moments in the audio where a beat is indicated, for instance by the onset of a note played by one of the instruments. On the other hand, the set of beats should reflect a locally-constant inter-beat-interval, since it is this regular spacing between beat times that defines musical rhythm. These dual constraints map neatly onto the two constraints optimized in dynamic programming, the local match, and the transition cost. We describe a beat tracking system which first estimates a global tempo, uses this tempo to construct a transition cost function, then uses dynamic programming to find the best-scoring set of beat times that reflect the tempo as well as corresponding to moments of high ‘onset strength’ in a function derived from the audio. This very simple and computationally efficient procedure is shown to perform well on the MIREX-06 beat tracking training data, achieving an average beat accuracy of just under 60% on the development data. We also examine the impact of the assumption of a fixed target tempo, and show that the system is typically able to track tempo changes in a range of ±10% of the target tempo.

1 Introduction

Researchers have been building and testing systems for tracking beat times in music for several decades, ranging from the ‘foot tapping’ systems of Desain and Honing [1999], which were driven by symbolically-encoded event times, to the more recent audio-driven systems as evaluated in the MIREX-06 Audio Beat Tracking evaluation [McKinney and Moelants, 2006a]; a more complete overview is given in the lead paper in this collection [McKinney et al., 2007].

Here, we describe a system that was part of the latter evaluation, coming among the statistically-equivalent top-performers of the five systems evaluated. Our system casts beat tracking into a simple optimization framework by defining an objective function that seeks to maximize both the “onset strength” at every hypothesized beat time (where the onset strength function is derived from the music audio by some suitable mechanism), and the consistency of the inter-onset-interval with some pre-estimated constant tempo. (We note in passing that human perception of beat instants tends to smooth out inter-beat-intervals rather than adhering strictly to maxima in onset strength [Dixon et al., 2006], but this could be modeled as a subsequent, smoothing stage). Although the requirement of an a priori tempo is a weakness, the reward is a particularly efficient beat-tracking system that is guaranteed to find the set of beat times that optimizes the objective function, thanks to its ability to use the well-known dynamic programming algorithm [Bellman, 1957].

The idea of using dynamic programming for beat tracking was proposed by Laroche [2003], where an onset function was compared to a predefined envelope spanning multiple beats that incorporated expectations concerning how a particular tempo is realized in terms of strong and weak beats; dynamic programming efficiently enforced continuity in both beat spacing and tempo. Peeters [2007] developed this idea, again allowing for tempo variation and matching of envelope patterns against templates. By contrast, the current system assumes a constant tempo which allows a much simpler formulation and realization, at the cost of a more limited scope of application.

The rest of this paper is organized as follows: In section 2, we introduce the key idea of formulating beat tracking as the optimization of a recursively-calculable cost function. Section 3 describes our implementation, including details of how we derived our onset strength function from the music audio waveform. Section 4 describes the results of applying this system to MIREX-06 beat tracking evaluation data, for which human tapping data was available, and in section 5 we discuss various aspects of this system, including issues of varying tempo, and deciding whether or not any beat is present.

2 The Dynamic Programming Formulation of Beat Tracking

Let us start by assuming that we have a constant target tempo which is given in advance. The goal of a beat tracker is to generate a sequence of beat times that both correspond to perceived onsets in the audio signal and constitute a regular, rhythmic pattern in themselves. We can define a single objective function that combines both of these goals:

$$C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \alpha \sum_{i=2}^{N} F(t_i - t_{i-1}, \tau_p) \qquad (1)$$

where $\{t_i\}$ is the sequence of $N$ beat instants found by the tracker, $O(t)$ is an “onset strength envelope” derived from the audio, which is large at times that would make good choices for beats based on the local acoustic properties, $\alpha$ is a weighting to balance the importance of the two terms, and $F(\Delta t, \tau_p)$ is a function that measures the consistency between an inter-beat interval $\Delta t$ and the ideal beat spacing $\tau_p$ defined by the target tempo. For instance, we use a simple squared-error function applied to the log-ratio of actual and ideal time spacing i.e.

$$F(\Delta t, \tau) = -\left(\log \frac{\Delta t}{\tau}\right)^{2} \qquad (2)$$

which takes a maximum value of 0 when $\Delta t = \tau$, becomes increasingly negative for larger deviations, and is symmetric on a log-time axis so that $F(k\tau, \tau) = F(\tau/k, \tau)$. In what follows, we assume that time has been quantized on some suitable grid; our system used a 4 ms time step (i.e. 250 Hz sampling rate).
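As a rough illustration of equations (1) and (2), the following Python sketch evaluates the objective for a given list of beat times. The function names (`transition_score`, `objective`) and the representation of the onset envelope as a callable are assumptions made for this example only; they are not taken from the paper.

```python
import math


def transition_score(delta_t, tau):
    """Equation (2): negative squared log-ratio of actual to ideal spacing."""
    return -(math.log(delta_t / tau) ** 2)


def objective(beat_times, onset_strength, tau_p, alpha):
    """Equation (1): summed onset strength plus weighted transition scores.

    beat_times     -- increasing list of beat instants (seconds)
    onset_strength -- callable returning O(t) for a time t (hypothetical interface)
    tau_p          -- ideal inter-beat interval from the target tempo (seconds)
    alpha          -- weighting between the two terms
    """
    onset_term = sum(onset_strength(t) for t in beat_times)
    transition_term = sum(
        transition_score(curr - prev, tau_p)
        for prev, curr in zip(beat_times, beat_times[1:])
    )
    return onset_term + alpha * transition_term
```

Because the penalty depends only on the log-ratio, doubling and halving the ideal interval are penalized equally: `transition_score(1.0, 0.5)` and `transition_score(0.25, 0.5)` agree up to floating-point rounding.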

The key property of the objective function is that the best-scoring time sequence can be assembled recursively i.e. to calculate the best possible score $C^*(t)$ of all sequences that end at time $t$, we define the recursive relation:

$$C^*(t) = O(t) + \max_{\tau = 0 \ldots t} \left\{ \alpha F(t - \tau, \tau_p) + C^*(\tau) \right\} \qquad (3)$$

This equation is based on the observation that the best score for time $t$ is the local onset strength, plus the best score to the preceding beat time $\tau$ that maximizes the sum of that best score and the transition cost from that time. While calculating $C^*$, we also record the actual preceding beat time that gave the best score:

$$P^*(t) = \arg\max_{\tau = 0 \ldots t} \left\{ \alpha F(t - \tau, \tau_p) + C^*(\tau) \right\} \qquad (4)$$

In practice it is necessary only to search a limited range of $\tau$ since the rapidly-growing penalty term $F$ will make it unlikely that the best predecessor time lies far from $t - \tau_p$; we search $\tau = t - 2\tau_p \ldots t - \tau_p/2$.

To find the set of beat times that optimize the objective function for a given onset envelope, we start by calculating $C^*$ and $P^*$ for every time starting from zero. Once this is complete, we look for the largest value of $C^*$ (which will typically be within $\tau_p$ of the end of the time range); this forms the final beat instant $t_N$ – where $N$, the total number of beats, is still unknown at this point. We then ‘backtrace’ via $P^*$, finding the preceding beat time $t_{N-1} = P^*(t_N)$, and progressively working backwards until we reach the beginning of the signal; this gives us the entire optimal beat sequence $\{t_i\}^*$. Thanks to dynamic programming, we have effectively searched the entire exponentially-sized set of all possible time sequences in a linear-time operation. This was possible because, if a best-scoring beat sequence includes a time $t_i$, the beat instants chosen after $t_i$ will not influence the choice (or score contribution) of beat times prior to $t_i$, so the entire best-scoring sequence up to time $t_i$ can be calculated and fixed at time $t_i$ without having to consider any future events. By contrast, a cost function where events subsequent to $t_i$ could influence the cost contribution of earlier events would not be amenable to this optimization.

To underline its simplicity, figure 1 shows the complete working Matlab code for the core dynamic programming search, taking an onset strength envelope and target tempo period as input, and finding the set of optimal beat times. The two loops (forward calculation and backtrace) consist of only ten lines of code.
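The forward recursion of equations (3) and (4) and the backtrace can also be sketched in Python with NumPy. This is an illustrative re-implementation under stated assumptions, not the Matlab code of figure 1: the function name `track_beats`, the default `alpha=100.0`, and the handling of frames near the start of the signal (frames with no predecessor inside the search window simply keep their onset strength as score) are choices made for this example.

```python
import numpy as np


def track_beats(onset_env, period, alpha=100.0):
    """Return frame indices of the best-scoring beat sequence.

    onset_env -- 1-D onset strength envelope O(t), one value per frame
    period    -- target inter-beat interval tau_p, in frames
    alpha     -- transition weighting (illustrative default, an assumption)
    """
    onset_env = np.asarray(onset_env, dtype=float)
    n = len(onset_env)

    # Predecessor offsets covering t - 2*tau_p ... t - tau_p/2.
    offsets = np.arange(-int(round(2 * period)), -int(round(period / 2)) + 1)
    # Weighted transition score alpha * F for each offset (equation 2).
    trans = -alpha * np.log(-offsets / period) ** 2

    cum_score = onset_env.copy()   # C*(t), initialised to O(t)
    backlink = np.full(n, -1)      # P*(t); -1 marks "no predecessor"

    # Forward pass: equations (3) and (4) over the limited search window.
    for t in range(n):
        prev = t + offsets
        valid = prev >= 0
        if not valid.any():
            continue
        candidates = trans[valid] + cum_score[prev[valid]]
        best = int(np.argmax(candidates))
        cum_score[t] = onset_env[t] + candidates[best]
        backlink[t] = prev[valid][best]

    # Backtrace from the best-scoring frame to the start of the signal.
    beats = [int(np.argmax(cum_score))]
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return beats[::-1]
```

With the 4 ms grid mentioned above, a target tempo of 120 BPM corresponds to `period = 125` frames, and the returned frame indices can be converted to seconds by multiplying by 0.004. Each frame examines a fixed number of candidate predecessors, so the total work grows linearly with the length of the envelope.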

3 The Beat Tracking System

The dynamic programming search for the globally-optimal beat sequence is the heart and the main novel contribution of our system; in this section, we present the other pieces required for the complete beat-tracking system. These comprise two parts: the front-end processing to convert the input audio into the onset strength envelope, $O(t)$, and the global tempo estimation which provides the target inter-beat interval, $\tau_p$.

3.1 Onset Strength Envelope

Similar to many other onset models (e.g. Goto and Muraoka [1994], Klapuri [1999], Jehan [2005]) we calculate the onset envelope from a crude perceptual model. First the input sound is resampled to 8 kHz, then we calculate the short-term Fourier transform (STFT) magnitude (spectrogram) using 32 ms windows and 4 ms advance between frames. This is then converted to an approximate auditory representation by mapping to 40 Mel bands via a weighted summing of the spectrogram values [Ellis, 2005]. We use an auditory frequency scale in an effort to balance the perceptual importance of each frequency band. The Mel spectrogram is converted to dB, and the first-order difference along time is calculated in each band. Negative values are set to zero (half-wave rectification), then the remaining, positive differences are summed across all frequency bands. This signal is passed through a high-pass filter with a cutoff around 0.4 Hz to make it locally zero-


References

Dynamic Programming

Automatic Extraction of Tempo and Beat From Expressive Performances

Sound onset detection by applying psychoacoustic knowledge

Identifying `Cover Songs' with Chroma Features and Dynamic Programming Beat Tracking

Creating Music by Listening
