
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 5, SEPTEMBER 2005 1035

A Tutorial on Onset Detection in Music Signals

Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris Duxbury, Mike Davies, and

Mark B. Sandler, Senior Member, IEEE

Abstract—Note onset detection and localization is useful in a

number of analysis and indexing techniques for musical signals.

The usual way to detect onsets is to look for “transient” regions in

the signal, a notion that leads to many deﬁnitions: a sudden burst

of energy, a change in the short-time spectrum of the signal or in

the statistical properties, etc. The goal of this paper is to review,

categorize, and compare some of the most commonly used tech-

niques for onset detection, and to present possible enhancements.

We discuss methods based on the use of explicitly predeﬁned signal

features: the signal’s amplitude envelope, spectral magnitudes and

phases, time-frequency representations; and methods based on

probabilistic signal models: model-based change point detection,

surprise signals, etc. Using a choice of test cases, we provide

some guidelines for choosing the appropriate method for a given

application.

Index Terms—Attack transients, audio, note segmentation, novelty detection.

I. INTRODUCTION

A. Background and Motivation

MUSIC is to a great extent an event-based phenomenon for

both performer and listener. We nod our heads or tap our

feet to the rhythm of a piece; the performer’s attention is focused

on each successive note. Even in non note-based music, there

are transitions as musical timbre and tone color evolve. Without

change, there can be no musical meaning.

The automatic detection of events in audio signals gives new

possibilities in a number of music applications including con-

tent delivery, compression, indexing and retrieval. Accurate re-

trieval depends on the use of appropriate features to compare

and identify pieces of music. Given the importance of musical

events, it is clear that identifying and characterizing these events

is an important aspect of this process. Equally, as compres-

sion standards advance and the drive for improving quality at

low bit-rates continues, so does accurate event detection be-

come a basic requirement: disjoint audio segments with homo-

geneous statistical properties, delimited by transitions or events,

can be compressed more successfully in isolation than they can in combination with dissimilar regions. Finally, accurate segmentation allows a large number of standard audio editing algorithms and effects (e.g., time-stretching, pitch-shifting) to be more signal-adaptive.

Manuscript received August 6, 2003; revised July 21, 2004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gerald Schuller.

J. P. Bello, S. Abdallah, M. Davies, and M. B. Sandler are with the Centre for Digital Music, Department of Electronic Engineering, Queen Mary, University of London, London E1 4NS, U.K. (e-mail: juan.bello-correa@elec.qmul.ac.uk; samer.abdallah@elec.qmul.ac.uk; mike.davies@elec.qmul.ac.uk; mark.sandler@elec.qmul.ac.uk).

L. Daudet is with the Laboratoire d’Acoustique Musicale, Université Pierre et Marie Curie (Paris 6), 75015 Paris, France (e-mail: daudet@lam.jussieu.fr).

C. Duxbury is with the Centre for Digital Music, Department of Electronic Engineering, Queen Mary, University of London, London E1 4NS, U.K., and also with WaveCrest Communications Ltd. (e-mail: christopher.duxbury@elec.qmul.ac.uk).

Digital Object Identifier 10.1109/TSA.2005.851998

Fig. 1. “Attack,” “transient,” “decay,” and “onset” in the ideal case of a single note.

There have been many different approaches for onset detec-

tion. The goal of this paper is to give an overview of the most

commonly used techniques, with a special emphasis on the ones

that have been employed in the authors’ different applications.

For the sake of coherence, the discussion will be focused on

the more speciﬁc problem of note onset detection in musical

signals, although we believe that the discussed methods can be

useful for various different tasks (e.g., transient modeling or lo-

calization) and different classes of signals (e.g., environmental

sounds, speech).

B. Deﬁnitions: Transients vs. Onsets vs. Attacks

A central issue here is to make a clear distinction between the

related concepts of transients, onsets and attacks. The reason

for making these distinctions clear is that different applications

have different needs. The similarities and differences between

these key concepts are important to consider; it is similarly im-

portant to categorize all related approaches. Fig. 1 shows, in the

simple case of an isolated note, how one could differentiate these

notions.

• The attack of the note is the time interval during which

the amplitude envelope increases.


• The concept of transient is more difficult to describe precisely. As a preliminary informal definition, transients are

short intervals during which the signal evolves quickly

in some nontrivial or relatively unpredictable way. In the

case of acoustic instruments, the transient often corre-

sponds to the period during which the excitation (e.g., a

hammer strike) is applied and then damped, leaving only

the slow decay at the resonance frequencies of the body.

Central to this time duration problem is the issue of the

useful time resolution: we will assume that the human ear

cannot distinguish between two transients less than 10

ms apart [1]. Note that the release or offset of a sustained

sound can also be considered a transient period.

• The onset of the note is a single instant chosen to mark

the temporally extended transient. In most cases, it will

coincide with the start of the transient, or the earliest

time at which the transient can be reliably detected.

C. General Scheme of Onset Detection Algorithms

In the more realistic case of a possibly noisy polyphonic

signal, where multiple sound objects may be present at a given

time, the above distinctions become less precise. It is generally

not possible to detect onsets directly without ﬁrst quantifying

the time-varying “transientness” of the signal.

Audio signals are both additive (musical objects in polyphonic music superimpose and do not conceal each other) and

oscillatory. Therefore, it is not possible to look for changes

simply by differentiating the original signal in the time domain;

this has to be done on an intermediate signal that reﬂects, in

a simpliﬁed form, the local structure of the original. In this

paper, we refer to such a signal as a detection function; in the

literature, the term novelty function is sometimes used instead

[2].

Fig. 2 illustrates the procedure employed in the majority

of onset detection algorithms: from the original audio signal,

which can be pre-processed to improve the performance of

subsequent stages, a detection function is derived at a lower

sampling rate, to which a peak-picking algorithm is applied

to locate the onsets. Whereas peak-picking algorithms are

well documented in the literature, the diversity of existing

approaches for the construction of the detection function makes

the comparison between onset detection algorithms difﬁcult for

audio engineers and researchers.
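The three stages described above (optional preprocessing, reduction to a detection function, peak-picking) can be sketched as follows. This is only an illustrative skeleton, not the scheme of any cited method: the function names, the frame and hop sizes, the normalisation, and the fixed threshold are all assumptions, and the reduction stage uses a plain half-wave rectified energy difference as a stand-in.

```python
import numpy as np

def detection_function(x, frame=512, hop=256):
    """Reduction: half-wave rectified first difference of frame-wise energy."""
    n_frames = 1 + (len(x) - frame) // hop
    energy = np.array([np.mean(x[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    diff = np.diff(energy, prepend=energy[0])
    return np.maximum(diff, 0.0)  # keep increases only

def pick_peaks(d, threshold):
    """Peak-picking: local maxima of d that exceed a fixed threshold."""
    peaks = []
    for n in range(1, len(d) - 1):
        if d[n] > threshold and d[n] > d[n - 1] and d[n] >= d[n + 1]:
            peaks.append(n)
    return peaks

def detect_onsets(x, fs, frame=512, hop=256, threshold=0.1):
    """Return onset times in seconds (frame start times, an assumed convention)."""
    d = detection_function(x, frame, hop)
    d = d / (d.max() + 1e-12)  # normalise so the threshold is relative
    return [n * hop / fs for n in pick_peaks(d, threshold)]
```

Note that the detection function is sampled once per hop, i.e., at a much lower rate than the audio, which is what makes simple peak-picking practical.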

D. Outline of the Paper

The outline of this paper follows the ﬂowchart in Fig. 2. In

Section II, we review a number of preprocessing techniques that

can be employed to enhance the performance of some of the de-

tection methods. Section III presents a representative cross-sec-

tion of algorithms for the construction of the detection function.

In Section IV, we describe some basic peak-picking algorithms;

this allows the comparative study of the performance of a se-

lection of note onset detection methods given in Section V. We

ﬁnish our discussion in Section VI with a review of our ﬁnd-

ings and some thoughts on the future development of these al-

gorithms and their applications.

Fig. 2. Flowchart of a standard onset detection algorithm.

II. PREPROCESSING

The concept of preprocessing implies the transformation of

the original signal in order to accentuate or attenuate various

aspects of the signal according to their relevance to the task in

hand. It is an optional step that derives its relevance from the

process or processes to be subsequently performed.

There are a number of different treatments that can be ap-

plied to a musical signal in order to facilitate the task of onset

detection. However, we will focus only on two processes that

are consistently mentioned in the literature, and that appear to

be of particular relevance to onset detection schemes, especially

when simple reduction methods are implemented: separating

the signal into multiple frequency bands, and transient/steady-

state separation.

A. Multiple Bands

Several onset detection studies have found it useful to in-

dependently analyze information across different frequency

bands. In some cases this preprocessing is needed to satisfy

the needs of speciﬁc applications that require detection in in-

dividual sub-bands to complement global estimates; in others,

such an approach can be justiﬁed as a way of increasing the

robustness of a given onset detection method.

As examples of the ﬁrst scenario, two beat tracking systems

make use of ﬁlter banks to analyze transients across frequencies.


Goto [3] slices the spectrogram into spectrum strips and recog-

nizes onsets by detecting sudden changes in energy. These are

used in a multiple-agent architecture to detect rhythmic patterns.

Scheirer [4] implements a six-band ﬁlter bank, using sixth-order

elliptic ﬁlters, and psychoacoustically inspired processing to

produce onset trains. These are fed into comb-ﬁlter resonators

in order to estimate the tempo of the signal.

The second case is illustrated by models such as the percep-

tual onset detector introduced by Klapuri [5]. In this implemen-

tation, a ﬁlter bank divides the signal into eight nonoverlapping

bands. In each band, onset times and intensities are detected and

ﬁnally combined. The ﬁlter-bank model is used as an approxi-

mation to the mechanics of the human cochlea.

Another example is the method proposed by Duxbury et al.

[6], which uses a constant-Q conjugate quadrature filter bank to

separate the signal into ﬁve subbands. It goes a step further by

proposing a hybrid scheme that considers energy changes in

high-frequency bands and spectral changes in lower bands. By

implementing a multiple-band scheme, the approach effectively

avoids the constraints imposed by the use of a single reduction

method, while having different time resolutions for different fre-

quency bands.

B. Transient/Steady-State Separation

The process of transient/steady-state separation is usually as-

sociated with the modeling of music signals, which is beyond

the scope of this paper. However, there is a ﬁne line between

modeling and detection, and indeed, some modeling schemes

directed at representing transients may hold promise for onset

detection. Below, we brieﬂy describe several methods that pro-

duce modiﬁed signals (residuals, transient signals) that can be,

or have been, used for the purpose of onset detection.

Sinusoidal models, such as “additive synthesis” [7], represent

an audio signal as a sum of sinusoids with slowly varying pa-

rameters. Amongst these methods, spectral modeling synthesis

(SMS) [8] explicitly considers the residual of the synthesis method as a Gaussian white noise filtered with a slowly varying

low-order ﬁlter. Levine [9] calculates the residual between the

original signal and a multiresolution SMS model. Signiﬁcant in-

creases in the energy of the residual show a mismatch between

the model and the original, thus effectively marking onsets. An

extension of SMS, transient modeling synthesis, is presented

in [10]. Transient signals are analyzed by a sinusoidal anal-

ysis/synthesis similar to SMS on the discrete cosine transform

of the residual, hence in a pseudo-temporal domain. In [11], the

whole scheme, including tonal and transient extraction, is generalized into a single matching pursuit formulation.

An alternative approach for the segregation of sinusoids from

transient/noise components is proposed by Settel and Lippe [12]

and later reﬁned by Duxbury et al. [13]. It is based on the phase-

vocoder principle of instantaneous frequency (see Section III-

A.3) that allows the classiﬁcation of individual frequency bins

of a spectrogram according to the predictability of their phase

components.

The residual signal results from the subtraction of the modeled signal from

the original waveform. When sinusoidal or harmonic modeling is used, then the

residual is assumed to contain most of the impulse-like, noisy components of

the original signal—e.g., transients.

Other schemes for the separation of tonal from nontonal com-

ponents make use of lapped orthogonal transforms, such as the

modiﬁed discrete cosine transform (MDCT), ﬁrst introduced by

Princen and Bradley [14]. These algorithms, originally designed

for compression [15], [16], make use of the relative sparsity of

MDCT representations of most musical signals: a few large co-

efﬁcients account for most of the signal’s energy. Actually, since

the MDCT atoms are very tone-like (they are cosine functions

slowly modulated in time by a smooth window), the part of the

signal represented by the large MDCT atoms, according to a

given threshold, can be interpreted as the tonal part of the signal

[10], [17]. Transients and noise can be obtained by removing

those large MDCT atoms.

III. REDUCTION

In the context of onset detection, the concept of reduction

refers to the process of transforming the audio signal into a

highly subsampled detection function which manifests the oc-

currence of transients in the original signal. This is the key

process in a wide class of onset detection schemes and will

therefore be the focus of most of our review.

We will broadly divide reduction methods in two groups:

methods based on the use of explicitly predeﬁned signal fea-

tures, and methods based on probabilistic signal models.

A. Reduction Based on Signal Features

1) Temporal Features: When observing the temporal evo-

lution of simple musical signals, it is noticeable that the oc-

currence of an onset is usually accompanied by an increase of

the signal’s amplitude. Early methods of onset detection capi-

talized on this by using a detection function which follows the

amplitude envelope of the signal [18]. Such an “envelope fol-

lower” can be easily constructed by rectifying and smoothing

(i.e., low-pass filtering) the signal

$$E_0(n) = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} |x(n+m)|\, w(m) \qquad (1)$$

where $w(m)$ is an $N$-point window or smoothing kernel, centered at $m = 0$. This yields satisfactory results for certain applications where strong percussive transients exist against a quiet background. A variation on this is to follow the local energy, rather than the amplitude, by squaring, instead of rectifying, each sample

$$E(n) = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} [x(n+m)]^2\, w(m) \qquad (2)$$

Despite the smoothing, this reduced signal in its raw form is

not usually suitable for reliable onset detection by peak picking.

A further reﬁnement, included in a number of standard onset

detection algorithms, is to work with the time derivative of the

energy (or rather the ﬁrst difference for discrete-time signals) so

that sudden rises in energy are transformed into narrow peaks in

the derivative. The energy and its derivative are commonly used

in combination with preprocessing, both with ﬁlter-banks [3]

and transient/steady-state separation [9], [19].
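The envelope follower, the local energy, and the first difference of the energy can be sketched as below. This is a minimal illustration under assumptions: the Hann window, the default window length, and the function names are choices made here, and the output is computed at every sample rather than subsampled.

```python
import numpy as np

def envelope(x, N=1024):
    """Amplitude envelope: rectify, then smooth with an N-point window."""
    w = np.hanning(N)  # symmetric window, so convolution equals correlation
    return np.convolve(np.abs(x), w, mode="same") / N

def local_energy(x, N=1024):
    """Local energy: square each sample instead of rectifying it."""
    w = np.hanning(N)
    return np.convolve(x ** 2, w, mode="same") / N

def energy_difference(x, N=1024):
    """First difference of the local energy; sudden rises become narrow peaks."""
    e = local_energy(x, N)
    return np.diff(e, prepend=e[0])
```

For an ideal step in amplitude, the first difference of the smoothed energy traces out the shape of the window, so its maximum sits at the step and its width is set by `N`.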


Another refinement takes its cue from psychoacoustics: empirical evidence [20] indicates that loudness is perceived logarithmically. This means that changes in loudness are judged relative to the overall loudness, since, for a continuous time signal, $\frac{d}{dt}\log E = \frac{dE/dt}{E}$. Hence, computing the first-difference of $\log E(n)$ roughly simulates the ear’s perception of

loudness. An application of this technique to multiple bands [5]

showed a signiﬁcant reduction in the tendency for amplitude

modulation to cause the detection of spurious onsets.
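The effect of taking the difference of the logarithm can be checked numerically: the same relative jump in energy produces the same value regardless of the overall level. The sketch below assumes an energy sequence is already available; the function name and the small floor `eps` (added to avoid taking the logarithm of zero) are assumptions.

```python
import numpy as np

def log_energy_difference(energy, eps=1e-10):
    """First difference of log E(n); eps is an assumed floor to avoid log(0)."""
    log_e = np.log(np.asarray(energy, dtype=float) + eps)
    return np.diff(log_e, prepend=log_e[0])

quiet = np.array([1.0, 1.0, 2.0, 2.0])
loud = 100.0 * quiet  # same doubling, much higher overall level
d_quiet = log_energy_difference(quiet)
d_loud = log_energy_difference(loud)
```

Both `d_quiet` and `d_loud` have a single nonzero entry of about log 2 at the doubling, whereas a plain first difference would give 1 and 100 respectively.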

2) Spectral Features: A number of techniques have been

proposed that use the spectral structure of the signal to produce

more reliable detection functions. While reducing the need for

preprocessing (e.g., removal of the tonal part), these methods

are also successful in a number of scenarios, including onset

detection in polyphonic signals with multiple instruments.

Let us consider the short-time Fourier transform (STFT) of the signal $x(n)$

$$X_k(n) = \sum_{m=-N/2}^{N/2-1} x(nh+m)\, w(m)\, e^{-2j\pi mk/N} \qquad (3)$$

where $w(m)$ is again an $N$-point window, and $h$ is the hop size, or time shift, between adjacent windows.

In the spectral domain, energy increases linked to transients

tend to appear as a broadband event. Since the energy of the

signal is usually concentrated at low frequencies, changes due

to transients are more noticeable at high frequencies [21]. To

emphasize this, the spectrum can be weighted preferentially toward high frequencies before summing to obtain a weighted energy measure

$$\tilde{E}(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} W_k\, |X_k(n)|^2 \qquad (4)$$

where $W_k$ is the frequency dependent weighting. By Parseval’s theorem, if $W_k = 1$ for all $k$, $\tilde{E}(n)$ is simply equivalent to the local energy as previously defined. Note also that a choice of $W_k = k^2$ would give the local energy of the derivative of the signal.

Masri [22] proposes a high frequency content (HFC) function with $W_k = |k|$, linearly weighting each bin’s contribution in

proportion to its frequency. The HFC function produces sharp

peaks during attack transients and is notably successful when

faced with percussive onsets, where transients are well modeled

as bursts of white noise.
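A frequency-weighted energy of this kind can be sketched with a plain framed FFT. The code below is an illustration under assumptions: the Hann window, the default frame and hop sizes, and the function names are chosen here, and frames start at sample `n*h` rather than being centred on it, which changes the phase of each bin but not the magnitudes used in this measure.

```python
import numpy as np

def stft(x, N=1024, h=512):
    """Windowed frames of length N with hop h; returns an array frames x bins."""
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // h
    frames = np.stack([x[n * h:n * h + N] * w for n in range(n_frames)])
    return np.fft.fft(frames, axis=1)

def weighted_energy(X, W):
    """(1/N) * sum over bins of W_k |X_k(n)|^2, one value per frame."""
    N = X.shape[1]
    return (np.abs(X) ** 2 @ W) / N

def hfc(X):
    """High frequency content: weight each bin by |k|, k the signed bin index."""
    N = X.shape[1]
    k = np.fft.fftfreq(N, d=1.0 / N)  # signed indices 0, 1, ..., -2, -1
    return weighted_energy(X, np.abs(k))
```

With `W` set to all ones, `weighted_energy` reduces to the energy of each windowed frame, which is the Parseval relation mentioned above; with `hfc`, a high-pitched tone scores higher than a low-pitched tone of equal amplitude.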

These spectrally weighted measures are based on the instanta-

neous short-term spectrum of the signal, thus omitting any ex-

plicit consideration of its temporal evolution. Alternatively, a

number of other approaches do consider these changes, using

variations in spectral content between analysis frames in order

to generate a more informative detection function.

Rodet and Jaillet [21] propose a method where the frequency

bands of a sequence of STFTs are analyzed independently

using a piece-wise linear approximation to the magnitude profile $|X_k(n)|$ over a short temporal window of frames beginning at a fixed frame index. The parameters of

these approximations are used to generate a set of band-wise

detection functions, later combined to produce ﬁnal onset re-

sults. Detection results are robust for high-frequencies, showing

consistency with Masri’s HFC approach.

A more general approach based on changes in the spectrum

is to formulate the detection function as a “distance” between

successive short-term Fourier spectra, treating them as points in an $N$-dimensional space. Depending on the metric chosen to calculate this distance, different spectral difference, or spectral flux, detection functions can be constructed: Masri [22] uses the $L_1$-norm of the difference between magnitude spectra, whereas Duxbury [6] uses the $L_2$-norm on the rectified difference

$$SD(n) = \sum_{k=-N/2}^{N/2-1} \left\{ H\big(|X_k(n)| - |X_k(n-1)|\big) \right\}^2 \qquad (5)$$

where $H(x) = (x + |x|)/2$, i.e., zero for negative arguments.

The rectiﬁcation has the effect of counting only those frequen-

cies where there is an increase in energy, and is intended to em-

phasize onsets rather than offsets.
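The rectified spectral difference can be sketched in a few lines, assuming the STFT is already available as an array with one row per frame and one column per bin. The function names and the convention of returning zero for the first frame are assumptions made for this illustration.

```python
import numpy as np

def half_wave(x):
    """H(x) = (x + |x|) / 2, i.e., zero for negative arguments."""
    return (x + np.abs(x)) / 2.0

def spectral_difference(X):
    """Sum over bins of the squared, rectified frame-to-frame magnitude change."""
    mag = np.abs(X)
    diff = np.diff(mag, axis=0, prepend=mag[:1])  # first frame compared to itself
    return np.sum(half_wave(diff) ** 2, axis=1)

# Toy magnitudes: bin 0 rises at frame 1, then everything falls at frame 2.
X = np.array([[1.0, 1.0],
              [3.0, 0.0],
              [0.0, 0.0]])
sd = spectral_difference(X)
```

In the toy example only the rise of 2 in bin 0 contributes (giving 4 at frame 1); the decreases at frames 1 and 2 are discarded by the rectification, which is how the measure favours onsets over offsets.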

A related form of spectral difference is introduced by Foote

[2] to obtain a measure of “audio novelty”.

A similarity matrix

is calculated using the correlation between STFT feature vectors

(power spectra). The matrix is then correlated with a “checker-

board” kernel to detect the edges between areas of high and low

similarity. The resulting function shows sharp peaks at the times

of these changes, and is effectively an onset detection function

when kernels of small width are used.
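A simplified version of this similarity-matrix approach is sketched below. It is not a faithful reproduction of [2]: cosine similarity between feature vectors and an untapered checkerboard kernel are assumptions made here for brevity, as are the function names and the kernel half-width `L`.

```python
import numpy as np

def checkerboard(L):
    """2L x 2L kernel: +1 on the two diagonal blocks, -1 on the off-diagonal ones."""
    s = np.concatenate([-np.ones(L), np.ones(L)])
    return np.outer(s, s)

def novelty(features, L=4):
    """Slide a checkerboard kernel along the diagonal of the similarity matrix."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    S = f @ f.T  # cosine similarity between all pairs of frames (assumed measure)
    K = checkerboard(L)
    n = len(S)
    out = np.zeros(n)
    for i in range(L, n - L + 1):
        out[i] = np.sum(S[i - L:i + L, i - L:i + L] * K)
    return out
```

The kernel rewards high similarity within the frames before and within the frames after position `i`, and penalises similarity across the two groups, so the output peaks where the signal changes character. A small `L` responds to note-level changes, while a large `L` responds to changes between longer sections.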

3) Spectral Features Using Phase: All the mentioned

methods have in common their use of the magnitude of the

spectrum as their only source of information. However, recent

approaches also make use of the phase spectra to further their

analyses of the behavior of onsets. This is relevant since much

of the temporal structure of a signal is encoded in the phase

spectrum.

Let us define $\varphi_k(n)$ the $2\pi$-unwrapped phase of a given STFT coefficient $X_k(n)$. For a steady state sinusoid, the phase $\varphi_k(n)$, as well as the phase in the previous window $\varphi_k(n-1)$, are used to calculate a value for the instantaneous frequency, an estimate of the actual frequency of the $k$th STFT component within this window, as [23]

$$f_k(n) = \left( \frac{\varphi_k(n) - \varphi_k(n-1)}{2\pi h} \right) f_s \qquad (6)$$

where $h$ is the hop size between windows and $f_s$ is the sampling frequency.

It is expected that, for a locally stationary sinusoid, the in-

stantaneous frequency should be approximately constant over

adjacent windows. Thus, according to (6), this is equivalent to

the phase increment from window to window remaining approximately constant (cf. Fig. 3)

$$\varphi_k(n) - \varphi_k(n-1) \simeq \varphi_k(n-1) - \varphi_k(n-2) \qquad (7)$$

The term novelty function is common to the literature in machine learning

and communication theory, and is widely used for video segmentation. In the

context of onset detection, our notion of the detection function can be seen also

as a novelty function, in that it tries to measure the extent to which an event is

unusual given a series of observations in the past.


Fig. 3. Phase diagram showing instantaneous frequencies as phase derivative

over adjacent frames. For a stationary sinusoid this should stay constant (dotted

line).

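Equivalently, the phase deviation can be defined as the second difference of the phase

$$\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \simeq 0 \qquad (8)$$

During a transient region, the instantaneous frequency is not usually well defined, and hence $\Delta\varphi_k(n)$ will tend to be large.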

This is illustrated in Fig. 3.

In [24], Bello proposes a method that analyzes the instan-

taneous distribution (in the sense of a probability distribution

or histogram) of phase deviations across the frequency domain.

During the steady-state part of a sound, deviations tend to zero,

thus the distribution is strongly peaked around this value. During

attack transients, $\Delta\varphi_k(n)$ values increase, widening and flattening the distribution. In [24], this behavior is quantified by calculating the inter-quartile range and the kurtosis of the distribution. In [25], a simpler measure of the spread of the distribution is calculated as

$$\zeta_p(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} |\Delta\varphi_k(n)| \qquad (9)$$

i.e., the mean absolute phase deviation. The method, although

showing some improvement for complex signals, is susceptible

to phase distortion and to noise introduced by the phases of com-

ponents with no signiﬁcant energy.
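The mean absolute phase deviation can be sketched as follows, given an STFT array with one row per frame. One implementation choice here is an assumption: instead of unwrapping the phase explicitly, the second difference is computed on wrapped phases and then mapped back to its principal value, which gives the same result whenever the true deviation is smaller than $\pi$ in magnitude. Function names are illustrative.

```python
import numpy as np

def princarg(phi):
    """Map phase values to the principal interval [-pi, pi)."""
    return np.mod(phi + np.pi, 2 * np.pi) - np.pi

def phase_deviation(X):
    """Second difference of the phase per bin, for frames n >= 2."""
    phi = np.angle(X)
    d2 = phi[2:] - 2 * phi[1:-1] + phi[:-2]
    return princarg(d2)

def mean_abs_phase_deviation(X):
    """Mean over bins of the absolute phase deviation, one value per frame."""
    return np.mean(np.abs(phase_deviation(X)), axis=1)
```

Because every bin contributes equally, bins that carry almost no energy add what is effectively random phase to the average, which is the noise sensitivity noted above. Weighting each bin by its magnitude is one possible mitigation, though it is not part of this sketch.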

As an alternative to the sole use of magnitude or phase in-

formation, [26] introduces an approach that works with Fourier

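coefficients in the complex domain. The stationarity of the $k$th spectral bin is quantified by calculating the Euclidean distance $\Gamma_k(n)$ between the observed $X_k(n)$ and that predicted by the previous frames, $\hat{X}_k(n)$

$$\Gamma_k(n) = \left\{ \left[\Re\big(\hat{X}_k(n)\big) - \Re\big(X_k(n)\big)\right]^2 + \left[\Im\big(\hat{X}_k(n)\big) - \Im\big(X_k(n)\big)\right]^2 \right\}^{1/2} \qquad (10)$$

These distances are summed across the frequency-domain to generate an onset detection function

$$\eta(n) = \sum_{k=-N/2}^{N/2-1} \Gamma_k(n) \qquad (11)$$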

See [27] for an application of this technique to multiple

bands. Other preprocessing, such as the removal of the tonal

part, may introduce distortions to the phase information and thus

adversely affect the performance of subsequent phase-based

onset detection methods.
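A complex-domain detection function of this kind can be sketched as below. The text above does not spell out how the prediction is formed, so the predictor used here is an assumption: the magnitude is taken from the previous frame and the phase is extrapolated linearly from the two previous frames, in line with the constant phase increment expected of a stationary sinusoid. The function name is illustrative.

```python
import numpy as np

def complex_domain(X):
    """Sum over bins of |X_hat_k(n) - X_k(n)| for frames n >= 2."""
    mag = np.abs(X)
    phi = np.angle(X)
    # Assumed predictor: previous magnitude, linearly extrapolated phase.
    X_hat = mag[1:-1] * np.exp(1j * (2 * phi[1:-1] - phi[:-2]))
    gamma = np.abs(X_hat - X[2:])  # Euclidean distance in the complex plane
    return gamma.sum(axis=1)       # summed across frequency
```

Unlike a measure based on magnitude alone or phase alone, this distance grows when either the amplitude or the phase departs from the stationary prediction, and bins with little energy contribute little because the distance scales with magnitude.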

4) Time-Frequency and Time-Scale Analysis: An alternative

to the analysis of the temporal envelope of the signal and of

Fourier spectral coefﬁcients, is the use of time-scale or time-

frequency representations (TFR).

In [28] a novelty function is calculated by measuring the

dissimilarity between feature vectors corresponding to a dis-

cretized Cohen’s class TFR, in this case the result of convolving

the Wigner-Ville TFR of the function with a Gaussian kernel.

Note that the method could be also seen as a spectral difference

approach, given that by choosing an appropriate kernel, the rep-

resentation becomes equivalent to the spectrogram of the signal.

In [29], an approach for transient detection is described based

on a simple dyadic wavelet decomposition of the residual signal.

This transform, using the Haar wavelet, was chosen for its sim-

plicity and its good time localization at small scales. The scheme

takes advantage of the correlations across scales of the coef-

ﬁcients: large wavelet coefﬁcients, related to transients in the

signal, are not evenly spread within the dyadic plane but rather

form “structures”. Indeed, if a given coefﬁcient has a large am-

plitude, there is a high probability that the coefﬁcients with the

same time localization at smaller scales also have large ampli-

tudes, therefore forming dyadic trees of signiﬁcant coefﬁcients.

The significance of full-size branches of coefficients, from the largest to the smallest scale, can be quantified by a regularity modulus, which is a local measure of the regularity of the signal

$$K(i) = \left( \sum_{(j,k) \in B_i} |d_{j,k}|^s \right)^{1/s} \qquad (12)$$

where the $d_{j,k}$ are the wavelet coefficients, $B_i$ is the full branch leading to a given small-scale coefficient $d_{1,i}$ (i.e., the set of coefficients at larger scale and same time localization), and $s$ a free parameter used to emphasize certain scales ($s = 2$ is often used in practice). Since increases of $K(i)$ are related to the existence of large, transient-like coefficients in the branch $B_i$, the regularity modulus can effectively act as an onset detection function.
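The idea of summing coefficient magnitudes along a dyadic branch can be sketched with a hand-written Haar decomposition. This is a simplified illustration rather than the implementation of [29]: the orthonormal Haar normalisation, the number of levels, the default exponent `s`, the inclusion of the finest-scale coefficient itself in the branch, and the function names are all assumptions, and the sketch is applied directly to an input array rather than to a residual signal.

```python
import numpy as np

def haar_details(x, levels):
    """Orthonormal Haar DWT; returns detail coefficients, finest scale first."""
    a = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        a = a[:len(a) // 2 * 2]  # drop a trailing odd sample, if any
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return details

def regularity_modulus(x, levels=4, s=2.0):
    """s-norm of the coefficients on the branch above each finest-scale index i."""
    d = haar_details(x, levels)
    n = len(d[-1]) * 2 ** (levels - 1)  # finest-scale indices with a full branch
    i = np.arange(n)
    total = np.zeros(n)
    for j, dj in enumerate(d):          # j = 0 is the finest scale
        total += np.abs(dj[i >> j]) ** s  # coefficient at scale j above index i
    return total ** (1.0 / s)
```

Each index `i` of the output refers to a finest-scale coefficient, which spans two input samples. A discontinuity produces large coefficients at every scale above its location, so the branch passing through it accumulates the largest value.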

B. Reduction Based on Probability Models

Statistical methods for onset detection are based on the as-

sumption that the signal can be described by some probability

model. A system can then be constructed that makes proba-

bilistic inferences about the likely times of abrupt changes in

the signal, given the available observations. The success of this

approach depends on the closeness of ﬁt between the assumed

model, i.e., the probability distribution described by the model,

and the “true” distribution of the data, and may be quantiﬁed

using likelihood measures or Bayesian model selection criteria.

1) Model-Based Change Point Detection Methods: A well-

known approach is based on the sequential probability ratio test

[30]. It presupposes that the signal samples $x(n)$ are generated


