Beat Tracking by Dynamic Programming

Daniel P. W. Ellis
01 Mar 2007, Vol. 36, Iss. 1, pp. 51-60
LabROSA, Columbia University, New York
July 16, 2007
Abstract
Beat tracking, i.e. deriving from a music audio signal a sequence of beat instants that
might correspond to when a human listener would tap his foot, involves satisfying two
constraints. On the one hand, the selected instants should generally correspond to moments in the
audio where a beat is indicated, for instance by the onset of a note played by one of the instru-
ments. On the other hand, the set of beats should reflect a locally-constant inter-beat-interval,
since it is this regular spacing between beat times that defines musical rhythm. These dual
constraints map neatly onto the two constraints optimized in dynamic programming, the local
match, and the transition cost. We describe a beat tracking system which first estimates a global
tempo, uses this tempo to construct a transition cost function, then uses dynamic programming
to find the best-scoring set of beat times that reflect the tempo as well as corresponding to
moments of high ‘onset strength’ in a function derived from the audio. This very simple and
computationally efficient procedure is shown to perform well on the MIREX-06 beat track-
ing training data, achieving an average beat accuracy of just under 60% on the development
data. We also examine the impact of the assumption of a fixed target tempo, and show that the
system is typically able to track tempo changes in a range of ±10% of the target tempo.

1 Introduction
Researchers have been building and testing systems for tracking beat times in music for several
decades, ranging from the ‘foot tapping’ systems of Desain and Honing [1999], which were driven
by symbolically-encoded event times, to the more recent audio-driven systems as evaluated in the
MIREX-06 Audio Beat Tracking evaluation [McKinney and Moelants, 2006a]; a more complete
overview is given in the lead paper in this collection [McKinney et al., 2007].
Here, we describe a system that was part of the latter evaluation, coming among the statistically-
equivalent top-performers of the five systems evaluated. Our system casts beat tracking into a
simple optimization framework by defining an objective function that seeks to maximize both the
“onset strength” at every hypothesized beat time (where the onset strength function is derived from
the music audio by some suitable mechanism), and the consistency of the inter-onset-interval with
some pre-estimated constant tempo. (We note in passing that human perception of beat instants
tends to smooth out inter-beat-intervals rather than adhering strictly to maxima in onset strength
[Dixon et al., 2006], but this could be modeled as a subsequent smoothing stage.) Although the
requirement of an a priori tempo is a weakness, the reward is a particularly efficient beat-tracking
system that is guaranteed to find the set of beat times that optimizes the objective function, thanks
to its ability to use the well-known dynamic programming algorithm [Bellman, 1957].
The idea of using dynamic programming for beat tracking was proposed by Laroche [2003],
where an onset function was compared to a predefined envelope spanning multiple beats that
incorporated expectations concerning how a particular tempo is realized in terms of strong and
weak beats; dynamic programming efficiently enforced continuity in both beat spacing and tempo.
Peeters [2007] developed this idea, again allowing for tempo variation and matching of envelope
patterns against templates. By contrast, the current system assumes a constant tempo, which allows
a much simpler formulation and realization at the cost of a more limited scope of application.
The rest of this paper is organized as follows: In section 2, we introduce the key idea of
formulating beat tracking as the optimization of a recursively-calculable cost function. Section
3 describes our implementation, including details of how we derived our onset strength function
from the music audio waveform. Section 4 describes the results of applying this system to MIREX-
06 beat tracking evaluation data, for which human tapping data was available, and in section 5 we
discuss various aspects of this system, including issues of varying tempo, and deciding whether or
not any beat is present.
2 The Dynamic Programming Formulation of Beat Tracking
Let us start by assuming that we have a constant target tempo which is given in advance. The goal
of a beat tracker is to generate a sequence of beat times that both correspond to perceived onsets
in the audio signal and constitute a regular, rhythmic pattern in themselves. We
can define a single objective function that combines both of these goals:
$$C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \alpha \sum_{i=2}^{N} F(t_i - t_{i-1}, \tau_p) \qquad (1)$$
where $\{t_i\}$ is the sequence of $N$ beat instants found by the tracker, $O(t)$ is an “onset strength
envelope” derived from the audio, which is large at times that would make good choices for beats
based on the local acoustic properties, $\alpha$ is a weighting to balance the importance of the two terms,
and $F(\Delta t, \tau_p)$ is a function that measures the consistency between an inter-beat interval $\Delta t$ and
the ideal beat spacing $\tau_p$ defined by the target tempo. For instance, we use a simple squared-error
function applied to the log-ratio of actual and ideal time spacing, i.e.
$$F(\Delta t, \tau) = -\left(\log \frac{\Delta t}{\tau}\right)^2 \qquad (2)$$
which takes a maximum value of 0 when $\Delta t = \tau$, becomes increasingly negative for larger
deviations, and is symmetric on a log-time axis so that $F(k\tau, \tau) = F(\tau/k, \tau)$. In what follows, we
assume that time has been quantized on some suitable grid; our system used a 4 ms time step (i.e.
250 Hz sampling rate).
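As an illustration of equations (1) and (2), the following Python sketch evaluates the transition cost and the full objective for a candidate beat sequence. The function names `transition_cost` and `objective` are hypothetical, and the sketch assumes that beat times are integer indices on the quantized time grid and that the logarithm is the natural logarithm; it is not the system's own code.

```python
import math


def transition_cost(delta_t, tau):
    """Eq. (2): negative squared log-ratio of actual to ideal beat spacing."""
    return -(math.log(delta_t / tau) ** 2)


def objective(beat_times, onset_strength, tau_p, alpha):
    """Eq. (1): summed onset strength at the beats plus weighted transition costs.

    beat_times: increasing integer indices into onset_strength (assumed grid).
    onset_strength: sequence of O(t) values, one per time step.
    """
    local = sum(onset_strength[t] for t in beat_times)
    transitions = sum(
        transition_cost(curr - prev, tau_p)
        for prev, curr in zip(beat_times, beat_times[1:])
    )
    return local + alpha * transitions
```

Since $\log(k\tau/\tau) = \log k$ and $\log((\tau/k)/\tau) = -\log k$, both square to the same value, which is the log-time symmetry noted above: `transition_cost(20, 10)` and `transition_cost(5, 10)` agree up to rounding.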
The key property of the objective function is that the best-scoring time sequence can be
assembled recursively, i.e. to calculate the best possible score $C^*(t)$ of all sequences that end at time $t$,
we define the recursive relation:
$$C^*(t) = O(t) + \max_{\tau = 0 \ldots t} \{\alpha F(t - \tau, \tau_p) + C^*(\tau)\} \qquad (3)$$
This equation is based on the observation that the best score for time $t$ is the local onset strength,
plus the best score to the preceding beat time $\tau$ that maximizes the sum of that best score and
the transition cost from that time. While calculating $C^*$, we also record the actual preceding beat
time that gave the best score:
$$P^*(t) = \arg\max_{\tau = 0 \ldots t} \{\alpha F(t - \tau, \tau_p) + C^*(\tau)\} \qquad (4)$$
In practice it is necessary only to search a limited range of $\tau$ since the rapidly-growing penalty
term $F$ will make it unlikely that the best predecessor time lies far from $t - \tau_p$; we search
$\tau = t - 2\tau_p \ldots t - \tau_p/2$.
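The forward calculation of equations (3) and (4) over the restricted search range can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Matlab code of figure 1: the name `forward_pass`, the `-1` sentinel for "no predecessor", the integer rounding of $\tau_p/2$, and the treatment of times too early to have any predecessor in range (their score is just $O(t)$) are all choices made here for illustration.

```python
import math


def forward_pass(onset_strength, tau_p, alpha):
    """Compute C*(t) and P*(t) (eqs. 3 and 4) on a quantized time grid.

    onset_strength: sequence of O(t) values, one per time step.
    tau_p: ideal beat period in time steps (integer >= 2).
    Returns (best_score, predecessor); predecessor[t] is -1 when no
    earlier time falls inside the search window (assumed sentinel).
    """
    if tau_p < 2:
        raise ValueError("tau_p must be at least 2 time steps")
    n = len(onset_strength)
    best_score = list(onset_strength)  # C*(t); defaults to O(t) alone
    predecessor = [-1] * n             # P*(t)
    for t in range(n):
        # Restricted search range: tau = t - 2*tau_p ... t - tau_p/2.
        lo = max(0, t - 2 * tau_p)
        hi = t - tau_p // 2            # inclusive upper bound
        best_prev, best_val = -1, -math.inf
        for tau in range(lo, hi + 1):
            # alpha * F(t - tau, tau_p) + C*(tau), with F from eq. (2).
            val = -alpha * math.log((t - tau) / tau_p) ** 2 + best_score[tau]
            if val > best_val:
                best_prev, best_val = tau, val
        if best_prev >= 0:
            best_score[t] = onset_strength[t] + best_val
            predecessor[t] = best_prev
    return best_score, predecessor
```

Each time step examines at most about $1.5\,\tau_p$ candidate predecessors, so the total work grows linearly with the length of the onset envelope for a fixed tempo period.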
To find the set of beat times that optimize the objective function for a given onset envelope,
we start by calculating $C^*$ and $P^*$ for every time starting from zero. Once this is complete, we
look for the largest value of $C^*$ (which will typically be within $\tau_p$ of the end of the time range);
this forms the final beat instant $t_N$, where $N$, the total number of beats, is still unknown at
this point. We then ‘backtrace’ via $P^*$, finding the preceding beat time $t_{N-1} = P^*(t_N)$, and
progressively working backwards until we reach the beginning of the signal; this gives us the entire
optimal beat sequence $\{t_i\}^*$. Thanks to dynamic programming, we have effectively searched the
entire exponentially-sized set of all possible time sequences in a linear-time operation. This was
possible because, if a best-scoring beat sequence includes a time $t_i$, the beat instants chosen after
$t_i$ will not influence the choice (or score contribution) of beat times prior to $t_i$, so the entire
best-scoring sequence up to time $t_i$ can be calculated and fixed at time $t_i$ without having to consider any
future events. By contrast, a cost function where events subsequent to $t_i$ could influence the cost
contribution of earlier events would not be amenable to this optimization.
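The backtrace step described above can be sketched in Python as follows. The function name `backtrace` and the use of `-1` in the predecessor table to mark "no preceding beat" are assumptions made for this illustration, and the final beat is taken as the global maximum of the score table rather than being restricted to the last $\tau_p$ of the signal.

```python
def backtrace(best_score, predecessor):
    """Recover the beat sequence by following P* back from the best-scoring time.

    best_score[t] is C*(t); predecessor[t] is P*(t), with -1 marking
    "no preceding beat" (an assumed sentinel).
    Returns the beat times in increasing order.
    """
    # Final beat t_N: the time with the largest cumulative score.
    t = max(range(len(best_score)), key=best_score.__getitem__)
    beats = [t]
    # Follow the recorded predecessors until the start of the sequence.
    while predecessor[t] >= 0:
        t = predecessor[t]
        beats.append(t)
    beats.reverse()
    return beats


# Toy tables: the best score is at index 5, reached via 1 -> 3 -> 5.
toy_score = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.5]
toy_pred = [-1, -1, 0, 1, 1, 3, 3]
toy_beats = backtrace(toy_score, toy_pred)
```

The number of beats $N$ is only known once the loop terminates, which matches the observation that $N$ is still unknown when $t_N$ is chosen.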
To underline its simplicity, figure 1 shows the complete working Matlab code for the core dynamic
programming search, taking an onset strength envelope and target tempo period as input, and
finding the set of optimal beat times. The two loops (forward calculation and backtrace) consist of
only ten lines of code.
3 The Beat Tracking System
The dynamic programming search for the globally-optimal beat sequence is the heart and the main
novel contribution of our system; in this section, we present the other pieces required for the
complete beat-tracking system. These comprise two parts: the front-end processing to convert the
input audio into the onset strength envelope, O(t), and the global tempo estimation which provides
the target inter-beat interval, $\tau_p$.
3.1 Onset Strength Envelope
Similar to many other onset models (e.g. Goto and Muraoka [1994], Klapuri [1999], Jehan [2005])
we calculate the onset envelope from a crude perceptual model. First the input sound is resampled
to 8 kHz, then we calculate the short-term Fourier transform (STFT) magnitude (spectrogram) us-
ing 32 ms windows and 4 ms advance between frames. This is then converted to an approximate
auditory representation by mapping to 40 Mel bands via a weighted summing of the spectrogram
values [Ellis, 2005]. We use an auditory frequency scale in an effort to balance the perceptual
importance of each frequency band. The Mel spectrogram is converted to dB, and the first-order
difference along time is calculated in each band. Negative values are set to zero (half-wave rec-
tification), then the remaining, positive differences are summed across all frequency bands. This
signal is passed through a high-pass filter with a cutoff around 0.4 Hz to make it locally zero-
References
Bellman [1957]. Dynamic Programming. Book.
Automatic Extraction of Tempo and Beat From Expressive Performances. Journal article.
Klapuri [1999]. Sound onset detection by applying psychoacoustic knowledge. Proceedings article.
Identifying `Cover Songs' with Chroma Features and Dynamic Programming Beat Tracking. Proceedings article.
Jehan [2005]. Creating Music by Listening.