IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 5, SEPTEMBER 2005 1035
A Tutorial on Onset Detection in Music Signals
Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris Duxbury, Mike Davies, and
Mark B. Sandler, Senior Member, IEEE
Abstract—Note onset detection and localization is useful in a
number of analysis and indexing techniques for musical signals.
The usual way to detect onsets is to look for “transient” regions in
the signal, a notion that leads to many definitions: a sudden burst
of energy, a change in the short-time spectrum of the signal or in
the statistical properties, etc. The goal of this paper is to review,
categorize, and compare some of the most commonly used tech-
niques for onset detection, and to present possible enhancements.
We discuss methods based on the use of explicitly predefined signal
features: the signal’s amplitude envelope, spectral magnitudes and
phases, time-frequency representations; and methods based on
probabilistic signal models: model-based change point detection,
surprise signals, etc. Using a choice of test cases, we provide
some guidelines for choosing the appropriate method for a given
application.
Index Terms—Attack transients, audio, note segmentation, nov-
elty detection.
I. INTRODUCTION
A. Background and Motivation
MUSIC is to a great extent an event-based phenomenon for
both performer and listener. We nod our heads or tap our
feet to the rhythm of a piece; the performer’s attention is focused
on each successive note. Even in non note-based music, there
are transitions as musical timbre and tone color evolve. Without
change, there can be no musical meaning.
The automatic detection of events in audio signals gives new
possibilities in a number of music applications including con-
tent delivery, compression, indexing and retrieval. Accurate re-
trieval depends on the use of appropriate features to compare
and identify pieces of music. Given the importance of musical
events, it is clear that identifying and characterizing these events
is an important aspect of this process. Equally, as compres-
sion standards advance and the drive for improving quality at
low bit-rates continues, so does accurate event detection be-
come a basic requirement: disjoint audio segments with homo-
geneous statistical properties, delimited by transitions or events,
can be compressed more successfully in isolation than they can
Manuscript received August 6, 2003; revised July 21, 2004. The associate ed-
itor coordinating the review of this manuscript and approving it for publication
was Dr. Gerald Schuller.
J. P. Bello, S. Abdallah, M. Davies, and M. B. Sandler are with the Centre for
Digital Music, Department of Electronic Engineering, Queen Mary, University
of London, London E1 4NS, U.K. (e-mail: juan.bello-correa@elec.qmul.ac.uk;
samer.abdallah@elec.qmul.ac.uk; mike.davies@elec.qmul.ac.uk; mark.san-
dler@elec.qmul.ac.uk).
L. Daudet is with the Laboratoire d’Acoustique Musicale, Université Pierre
et Marie Curie (Paris 6), 75015 Paris, France (e-mail: daudet@lam.jussieu.fr).
C. Duxbury is with the Centre for Digital Music, Department of Elec-
tronic Engineering, Queen Mary, University of London, London E1 4NS,
U.K., and also with WaveCrest Communications Ltd. (e-mail: christo-
pher.duxbury@elec.qmul.ac.uk).
Digital Object Identifier 10.1109/TSA.2005.851998
Fig. 1. “Attack,” “transient,” “decay,” and “onset” in the ideal case of a single
note.
in combination with dissimilar regions. Finally, accurate seg-
mentation allows a large number of standard audio editing al-
gorithms and effects (e.g., time-stretching, pitch-shifting) to be
more signal-adaptive.
There have been many different approaches for onset detec-
tion. The goal of this paper is to give an overview of the most
commonly used techniques, with a special emphasis on the ones
that have been employed in the authors’ different applications.
For the sake of coherence, the discussion will be focused on
the more specific problem of note onset detection in musical
signals, although we believe that the discussed methods can be
useful for various different tasks (e.g., transient modeling or lo-
calization) and different classes of signals (e.g., environmental
sounds, speech).
B. Definitions: Transients vs. Onsets vs. Attacks
A central issue here is to make a clear distinction between the
related concepts of transients, onsets and attacks. The reason
for making these distinctions clear is that different applications
have different needs. The similarities and differences between
these key concepts are important to consider; it is similarly im-
portant to categorize all related approaches. Fig. 1 shows, in the
simple case of an isolated note, how one could differentiate these
notions.
The attack of the note is the time interval during which
the amplitude envelope increases.

The concept of transient is more difficult to describe
precisely. As a preliminary informal definition, transients are
short intervals during which the signal evolves quickly
in some nontrivial or relatively unpredictable way. In the
case of acoustic instruments, the transient often corre-
sponds to the period during which the excitation (e.g., a
hammer strike) is applied and then damped, leaving only
the slow decay at the resonance frequencies of the body.
Central to this time duration problem is the issue of the
useful time resolution: we will assume that the human ear
cannot distinguish between two transients less than 10
ms apart [1]. Note that the release or offset of a sustained
sound can also be considered a transient period.
The onset of the note is a single instant chosen to mark
the temporally extended transient. In most cases, it will
coincide with the start of the transient, or the earliest
time at which the transient can be reliably detected.
C. General Scheme of Onset Detection Algorithms
In the more realistic case of a possibly noisy polyphonic
signal, where multiple sound objects may be present at a given
time, the above distinctions become less precise. It is generally
not possible to detect onsets directly without first quantifying
the time-varying transientness of the signal.
Audio signals are both additive (musical objects in poly-
phonic music superimpose and not conceal each other) and
oscillatory. Therefore, it is not possible to look for changes
simply by differentiating the original signal in the time domain;
this has to be done on an intermediate signal that reflects, in
a simplified form, the local structure of the original. In this
paper, we refer to such a signal as a detection function; in the
literature, the term novelty function is sometimes used instead
[2].
Fig. 2 illustrates the procedure employed in the majority
of onset detection algorithms: from the original audio signal,
which can be pre-processed to improve the performance of
subsequent stages, a detection function is derived at a lower
sampling rate, to which a peak-picking algorithm is applied
to locate the onsets. Whereas peak-picking algorithms are
well documented in the literature, the diversity of existing
approaches for the construction of the detection function makes
the comparison between onset detection algorithms difficult for
audio engineers and researchers.
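In code, the flowchart reduces to three small stages. The sketch below uses short-time energy as a stand-in for the reduction stage of Section III and a naive fixed-threshold local-maximum picker for the peak-picking stage of Section IV; all function names, frame sizes, and the threshold are our own illustrative choices, not a prescribed algorithm.

```python
import numpy as np

def detection_function(x, frame=512, hop=256):
    """Toy reduction stage: short-time energy of the signal.
    A stand-in for any of the reduction methods surveyed in Section III."""
    n_frames = 1 + (len(x) - frame) // hop
    return np.array([np.sum(x[i*hop:i*hop+frame]**2) for i in range(n_frames)])

def pick_peaks(df, threshold):
    """Toy peak picker: local maxima above a fixed threshold."""
    return [i for i in range(1, len(df) - 1)
            if df[i] > threshold and df[i] >= df[i-1] and df[i] > df[i+1]]

# A percussive "note": silence, then an exponentially decaying burst.
sr = 8000
x = np.zeros(sr)
t = np.arange(sr // 2) / sr
x[sr//2:] = np.exp(-20*t) * np.sin(2*np.pi*440*t)

df = detection_function(x)
onsets = pick_peaks(df, threshold=0.5 * df.max())   # onset frames (hop = 256)
```

On this clean synthetic note the picker returns a single peak close to the true onset; real material needs the adaptive thresholding discussed in Section IV.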
D. Outline of the Paper
The outline of this paper follows the flowchart in Fig. 2. In
Section II, we review a number of preprocessing techniques that
can be employed to enhance the performance of some of the de-
tection methods. Section III presents a representative cross-sec-
tion of algorithms for the construction of the detection function.
In Section IV, we describe some basic peak-picking algorithms;
this allows the comparative study of the performance of a se-
lection of note onset detection methods given in Section V. We
finish our discussion in Section VI with a review of our findings
and some thoughts on the future development of these
algorithms and their applications.
Fig. 2. Flowchart of a standard onset detection algorithm.
II. PREPROCESSING
The concept of preprocessing implies the transformation of
the original signal in order to accentuate or attenuate various
aspects of the signal according to their relevance to the task in
hand. It is an optional step that derives its relevance from the
process or processes to be subsequently performed.
There are a number of different treatments that can be ap-
plied to a musical signal in order to facilitate the task of onset
detection. However, we will focus only on two processes that
are consistently mentioned in the literature, and that appear to
be of particular relevance to onset detection schemes, especially
when simple reduction methods are implemented: separating
the signal into multiple frequency bands, and transient/steady-
state separation.
A. Multiple Bands
Several onset detection studies have found it useful to in-
dependently analyze information across different frequency
bands. In some cases this preprocessing is needed to satisfy
the needs of specific applications that require detection in
individual sub-bands to complement global estimates; in others,
such an approach can be justified as a way of increasing the
robustness of a given onset detection method.
As examples of the first scenario, two beat tracking systems
make use of filter banks to analyze transients across frequencies.

Goto [3] slices the spectrogram into spectrum strips and recog-
nizes onsets by detecting sudden changes in energy. These are
used in a multiple-agent architecture to detect rhythmic patterns.
Scheirer [4] implements a six-band filter bank, using sixth-order
elliptic filters, and psychoacoustically inspired processing to
produce onset trains. These are fed into comb-filter resonators
in order to estimate the tempo of the signal.
The second case is illustrated by models such as the percep-
tual onset detector introduced by Klapuri [5]. In this implemen-
tation, a filter bank divides the signal into eight nonoverlapping
bands. In each band, onset times and intensities are detected and
finally combined. The filter-bank model is used as an approxi-
mation to the mechanics of the human cochlea.
Another example is the method proposed by Duxbury et al.
[6], that uses a constant-Q conjugate quadrature filter bank to
separate the signal into five subbands. It goes a step further by
proposing a hybrid scheme that considers energy changes in
high-frequency bands and spectral changes in lower bands. By
implementing a multiple-band scheme, the approach effectively
avoids the constraints imposed by the use of a single reduction
method, while having different time resolutions for different fre-
quency bands.
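A minimal band-wise scheme in the spirit of these methods can be sketched as follows. The band edges, STFT parameters, and the simple per-band energy-rise measure are illustrative assumptions, not the filter banks actually used in [3]–[6].

```python
import numpy as np

def stft_power(x, frame=512, hop=256):
    """Power spectrogram via a Hanning-windowed short-time Fourier transform."""
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    return np.array([np.abs(np.fft.rfft(w * x[i*hop:i*hop+frame]))**2
                     for i in range(n)])

def multiband_df(x, sr, edges=(0, 500, 1000, 2000, 4000), frame=512, hop=256):
    """Sum, over hypothetical sub-bands, the positive energy change per frame."""
    X = stft_power(x, frame, hop)
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    df = np.zeros(len(X))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = X[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
        rise = np.diff(band, prepend=band[0])
        df += np.maximum(rise, 0.0)          # count only energy increases
    return df

# A 440 Hz note starting at 0.5 s produces a detection-function peak there.
sr = 8000
t = np.arange(sr) / sr
x = np.where(t >= 0.5, np.sin(2*np.pi*440*t), 0.0)
df = multiband_df(x, sr)
```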
B. Transient/Steady-State Separation
The process of transient/steady-state separation is usually as-
sociated with the modeling of music signals, which is beyond
the scope of this paper. However, there is a fine line between
modeling and detection, and indeed, some modeling schemes
directed at representing transients may hold promise for onset
detection. Below, we briefly describe several methods that pro-
duce modified signals (residuals, transient signals) that can be,
or have been, used for the purpose of onset detection.
Sinusoidal models, such as additive synthesis [7], represent
an audio signal as a sum of sinusoids with slowly varying pa-
rameters. Amongst these methods, spectral modeling synthesis
(SMS) [8] explicitly considers the residual¹ of the synthesis
method as a Gaussian white noise filtered with a slowly varying
low-order filter. Levine [9] calculates the residual between the
original signal and a multiresolution SMS model. Significant
increases in the energy of the residual show a mismatch between
the model and the original, thus effectively marking onsets. An
extension of SMS, transient modeling synthesis, is presented
in [10]. Transient signals are analyzed by a sinusoidal anal-
ysis/synthesis similar to SMS on the discrete cosine transform
of the residual, hence in a pseudo-temporal domain. In [11], the
whole scheme, including tonal and transient extraction, is gen-
eralized into a single matching pursuit formulation.
An alternative approach for the segregation of sinusoids from
transient/noise components is proposed by Settel and Lippe [12]
and later refined by Duxbury et al. [13]. It is based on the phase-
vocoder principle of instantaneous frequency (see Section III-
A.3) that allows the classification of individual frequency bins
of a spectrogram according to the predictability of their phase
components.
¹The residual signal results from the subtraction of the modeled signal from
the original waveform. When sinusoidal or harmonic modeling is used, then the
residual is assumed to contain most of the impulse-like, noisy components of
the original signal, e.g., transients.
Other schemes for the separation of tonal from nontonal com-
ponents make use of lapped orthogonal transforms, such as the
modified discrete cosine transform (MDCT), first introduced by
Princen and Bradley [14]. These algorithms, originally designed
for compression [15], [16], make use of the relative sparsity of
MDCT representations of most musical signals: a few large
coefficients account for most of the signal's energy. Actually, since
the MDCT atoms are very tone-like (they are cosine functions
slowly modulated in time by a smooth window), the part of the
signal represented by the large MDCT atoms, according to a
given threshold, can be interpreted as the tonal part of the signal
[10], [17]. Transients and noise can be obtained by removing
those large MDCT atoms.
III. REDUCTION
In the context of onset detection, the concept of reduction
refers to the process of transforming the audio signal into a
highly subsampled detection function which manifests the oc-
currence of transients in the original signal. This is the key
process in a wide class of onset detection schemes and will
therefore be the focus of most of our review.
We will broadly divide reduction methods into two groups:
methods based on the use of explicitly predefined signal fea-
tures, and methods based on probabilistic signal models.
A. Reduction Based on Signal Features
1) Temporal Features: When observing the temporal evo-
lution of simple musical signals, it is noticeable that the oc-
currence of an onset is usually accompanied by an increase of
the signal's amplitude. Early methods of onset detection capi-
talized on this by using a detection function which follows the
amplitude envelope of the signal [18]. Such an envelope fol-
lower can be easily constructed by rectifying and smoothing
(i.e., low-pass filtering) the signal:

E_0(n) = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} |x(n+m)|\, w(m)    (1)

where w(m) is an N-point window or smoothing kernel, centered
at m = 0. This yields satisfactory results for certain applications
where strong percussive transients exist against a quiet
background. A variation on this is to follow the local energy,
rather than the amplitude, by squaring, instead of rectifying,
each sample:

E(n) = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} [x(n+m)]^2\, w(m)    (2)
Despite the smoothing, this reduced signal in its raw form is
not usually suitable for reliable onset detection by peak picking.
A further refinement, included in a number of standard onset
detection algorithms, is to work with the time derivative of the
energy (or rather the first difference for discrete-time signals) so
that sudden rises in energy are transformed into narrow peaks in
the derivative. The energy and its derivative are commonly used
in combination with preprocessing, both with filter-banks [3]
and transient/steady-state separation [9], [19].
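Equations (1) and (2), together with the first-difference refinement, can be sketched directly; the window type and lengths below are arbitrary choices.

```python
import numpy as np

def envelope(x, N=256):
    """Eq. (1): rectify, then smooth with an N-point window."""
    w = np.hanning(N)
    return np.convolve(np.abs(x), w / w.sum(), mode='same')

def local_energy(x, N=256):
    """Eq. (2): square each sample instead of rectifying."""
    w = np.hanning(N)
    return np.convolve(x**2, w / w.sum(), mode='same')

# The first difference turns a sudden energy rise into a narrow peak.
sr = 8000
x = np.zeros(sr // 2)
t = np.arange(sr // 4) / sr
x[sr//4:] = np.sin(2*np.pi*440*t) * np.exp(-10*t)   # note onset at sample 2000
d = np.diff(local_energy(x))
```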

Another refinement takes its cue from psychoacoustics:
empirical evidence [20] indicates that loudness is perceived
logarithmically. This means that changes in loudness are judged
relative to the overall loudness, since, for a continuous-time signal,
\frac{d}{dt}\log E(t) = \frac{E'(t)}{E(t)}. Hence, computing the
first difference of \log E(n) roughly simulates the ear's perception
of loudness. An application of this technique to multiple bands [5]
showed a significant reduction in the tendency for amplitude
modulation to cause the detection of spurious onsets.
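A small numerical check of this argument, with illustrative frame sizes and amplitudes: two onsets whose energies jump by the same factor (here 100x) at very different absolute levels produce first-difference-of-log peaks of comparable height.

```python
import numpy as np

def log_energy_diff(x, frame=512, hop=256, eps=1e-10):
    """Positive first difference of log-energy: since d(log E)/dt = E'/E,
    equal *relative* energy jumps give equal peaks, whatever the level."""
    n = 1 + (len(x) - frame) // hop
    E = np.array([np.mean(x[i*hop:i*hop+frame]**2) for i in range(n)]) + eps
    return np.maximum(np.diff(np.log(E)), 0.0)

# Quiet note (amp 0.01), louder note (0.1), loud note (1.0): each boundary
# is a 100x energy jump, so the two log-difference peaks are comparable.
sr = 8000
t = np.arange(12000) / sr
amp = np.where(t < 0.5, 0.01, np.where(t < 1.0, 0.1, 1.0))
x = amp * np.sin(2*np.pi*440*t)
df = log_energy_diff(x)
```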
2) Spectral Features: A number of techniques have been
proposed that use the spectral structure of the signal to produce
more reliable detection functions. While reducing the need for
preprocessing (e.g., removal of the tonal part), these methods
are also successful in a number of scenarios, including onset
detection in polyphonic signals with multiple instruments.
Let us consider the short-time Fourier transform (STFT) of
the signal:

X_k(n) = \sum_{m=-N/2}^{N/2-1} x(nh+m)\, w(m)\, e^{-2\pi jmk/N},
\quad k = -N/2, \dots, N/2 - 1    (3)

where w(m) is again an N-point window, and h is the hop size,
or time shift, between adjacent windows.
In the spectral domain, energy increases linked to transients
tend to appear as a broadband event. Since the energy of the
signal is usually concentrated at low frequencies, changes due
to transients are more noticeable at high frequencies [21]. To
emphasize this, the spectrum can be weighted preferentially to-
ward high frequencies before summing to obtain a weighted
energy measure:

\tilde{E}(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} W_k\, |X_k(n)|^2    (4)

where W_k is the frequency-dependent weighting. By Parseval's
theorem, if W_k = 1 for all k, \tilde{E}(n) is simply equivalent to the local
energy as previously defined. Note also that a choice of W_k = k^2
would give the local energy of the derivative of the signal.
Masri [22] proposes a high frequency content (HFC) function
with W_k = |k|, linearly weighting each bin's contribution in
proportion to its frequency. The HFC function produces sharp
peaks during attack transients and is notably successful when
faced with percussive onsets, where transients are well modeled
as bursts of white noise.
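The HFC measure is essentially a one-line weighting of the short-time spectrum. In this sketch (window and frame parameters assumed), a broadband burst added to a sustained low tone produces a clear HFC peak at the burst.

```python
import numpy as np

def hfc(x, frame=512, hop=256):
    """High frequency content: Eq. (4) with W_k = |k|, i.e., each bin's
    energy weighted in proportion to its frequency (after Masri [22])."""
    w = np.hanning(frame)
    k = np.arange(frame // 2 + 1)
    n = 1 + (len(x) - frame) // hop
    return np.array([np.sum(k * np.abs(np.fft.rfft(w * x[i*hop:i*hop+frame]))**2)
                     for i in range(n)])

# Sustained low tone plus a short broadband (noise) burst at 0.5 s.
sr = 8000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
x = np.sin(2*np.pi*220*t)
x[4000:4256] += rng.normal(0.0, 1.0, 256)   # percussive-like transient
h = hfc(x)
```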
These spectrally weighted measures are based on the instanta-
neous short-term spectrum of the signal, thus omitting any ex-
plicit consideration of its temporal evolution. Alternatively, a
number of other approaches do consider these changes, using
variations in spectral content between analysis frames in order
to generate a more informative detection function.
Rodet and Jaillet [21] propose a method where the frequency
bands of a sequence of STFTs are analyzed independently
using a piecewise linear approximation to the magnitude
profile |X_k(n)| for n \in [t, t+T], where T is a short
temporal window, and t is a fixed value. The parameters of
these approximations are used to generate a set of band-wise
detection functions, later combined to produce nal onset re-
sults. Detection results are robust for high-frequencies, showing
consistency with Masris HFC approach.
A more general approach based on changes in the spectrum
is to formulate the detection function as a distance between
successive short-term Fourier spectra, treating them as points
in an N-dimensional space. Depending on the metric chosen to
calculate this distance, different spectral difference, or spectral
flux, detection functions can be constructed: Masri [22] uses the
L2-norm of the difference between magnitude spectra, whereas
Duxbury [6] uses the L1-norm on the rectified difference

SD(n) = \sum_{k=-N/2}^{N/2-1} H\left(|X_k(n)| - |X_k(n-1)|\right)    (5)

where H(x) = \frac{x + |x|}{2}, i.e., zero for negative arguments.
The rectication has the effect of counting only those frequen-
cies where there is an increase in energy, and is intended to em-
phasize onsets rather than offsets.
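A sketch of the rectified spectral-difference function (the L1 form; window and hop parameters are illustrative):

```python
import numpy as np

def spectral_flux(x, frame=512, hop=256):
    """Half-wave rectified magnitude difference H(x) = (x + |x|)/2,
    summed over bins, in the spirit of Eq. (5)."""
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    X = np.array([np.abs(np.fft.rfft(w * x[i*hop:i*hop+frame]))
                  for i in range(n)])
    d = np.diff(X, axis=0)
    H = (d + np.abs(d)) / 2            # keep only increases in energy
    return np.concatenate([[0.0], H.sum(axis=1)])

# A note starting at 0.5 s and decaying away: the onset yields a large
# flux peak, while the quiet ending barely registers (offsets are ignored).
sr = 8000
t = np.arange(sr) / sr
x = np.where((t >= 0.5) & (t < 0.8),
             np.sin(2*np.pi*440*t) * np.exp(-5*(t - 0.5)), 0.0)
sf = spectral_flux(x)
```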
A related form of spectral difference is introduced by Foote
[2] to obtain a measure of audio novelty.² A similarity matrix
is calculated using the correlation between STFT feature vectors
(power spectra). The matrix is then correlated with a checker-
board kernel to detect the edges between areas of high and low
similarity. The resulting function shows sharp peaks at the times
of these changes, and is effectively an onset detection function
when kernels of small width are used.
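Foote's scheme can be sketched with a cosine-similarity matrix and a small checkerboard kernel; the kernel half-width L and the STFT parameters below are our assumptions.

```python
import numpy as np

def foote_novelty(x, frame=512, hop=256, L=4):
    """Correlate a 2L x 2L checkerboard kernel along the main diagonal of
    a cosine-similarity matrix of power spectra (after Foote [2])."""
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    F = np.array([np.abs(np.fft.rfft(w * x[i*hop:i*hop+frame]))**2
                  for i in range(n)])
    F /= np.linalg.norm(F, axis=1, keepdims=True) + 1e-12
    S = F @ F.T                                        # similarity matrix
    kern = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((L, L)))
    nov = np.zeros(n)
    for i in range(L, n - L):
        nov[i] = np.sum(kern * S[i-L:i+L, i-L:i+L])    # edge detector
    return nov

# A change of spectral content (440 Hz -> 880 Hz) at 0.5 s marks an event.
sr = 8000
t = np.arange(sr) / sr
x = np.where(t < 0.5, np.sin(2*np.pi*440*t), np.sin(2*np.pi*880*t))
nov = foote_novelty(x)
```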
3) Spectral Features Using Phase: All the mentioned
methods have in common their use of the magnitude of the
spectrum as their only source of information. However, recent
approaches also make use of the phase spectra to further their
analyses of the behavior of onsets. This is relevant since much
of the temporal structure of a signal is encoded in the phase
spectrum.
Let us define \varphi_k(n) as the 2\pi-unwrapped phase of a given STFT
coefficient X_k(n). For a steady-state sinusoid, the phase \varphi_k(n),
as well as the phase in the previous window \varphi_k(n-1), are used
to calculate a value for the instantaneous frequency, an estimate
of the actual frequency of the kth STFT component within this
window, as [23]

f_k(n) = \frac{f_s}{2\pi h}\left[\varphi_k(n) - \varphi_k(n-1)\right]    (6)

where h is the hop size between windows and f_s is the sampling
frequency.
It is expected that, for a locally stationary sinusoid, the in-
stantaneous frequency should be approximately constant over
adjacent windows. Thus, according to (6), this is equivalent to
the phase increment from window to window remaining approx-
imately constant (cf. Fig. 3):

\varphi_k(n) - \varphi_k(n-1) \approx \varphi_k(n-1) - \varphi_k(n-2)    (7)
²The term novelty function is common to the literature in machine learning
and communication theory, and is widely used for video segmentation. In the
context of onset detection, our notion of the detection function can also be seen
as a novelty function, in that it tries to measure the extent to which an event is
unusual given a series of observations in the past.

Fig. 3. Phase diagram showing instantaneous frequencies as phase derivative
over adjacent frames. For a stationary sinusoid this should stay constant (dotted
line).
Equivalently, the phase deviation \Delta\varphi_k(n) can be defined as
the second difference of the phase:

\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \approx 0    (8)

During a transient region, the instantaneous frequency is not
usually well defined, and hence \Delta\varphi_k(n) will tend to be large.
This is illustrated in Fig. 3.
In [24], Bello proposes a method that analyzes the instan-
taneous distribution (in the sense of a probability distribution
or histogram) of phase deviations across the frequency domain.
During the steady-state part of a sound, deviations tend to zero,
thus the distribution is strongly peaked around this value. During
attack transients, \Delta\varphi_k(n) values increase, widening and
flattening the distribution. In [24], this behavior is quantified by
calculating the inter-quartile range and the kurtosis of the
distribution. In [25], a simpler measure of the spread of the
distribution is calculated as

\eta(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} \left|\Delta\varphi_k(n)\right|    (9)

i.e., the mean absolute phase deviation. The method, although
showing some improvement for complex signals, is susceptible
to phase distortion and to noise introduced by the phases of
components with no significant energy.
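The mean absolute phase deviation can be sketched as follows. Note, per the caveat just mentioned, that on real recordings the phases of low-energy bins contribute noise; in this clean synthetic example the silent region has exactly zero phase, so the contrast is idealized.

```python
import numpy as np

def phase_deviation_df(x, frame=512, hop=256):
    """Mean absolute second difference of the STFT phase per bin,
    wrapped back to (-pi, pi], in the spirit of Eq. (9)."""
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    phi = np.array([np.angle(np.fft.rfft(w * x[i*hop:i*hop+frame]))
                    for i in range(n)])
    d2 = phi[2:] - 2*phi[1:-1] + phi[:-2]
    d2 = np.angle(np.exp(1j * d2))        # principal value of the deviation
    return np.concatenate([[0.0, 0.0], np.abs(d2).mean(axis=1)])

# Silence, then a steady 440 Hz tone: the deviation bursts at the onset.
sr = 8000
x = np.zeros(sr)
t = np.arange(sr // 2) / sr
x[sr//2:] = np.sin(2*np.pi*440*t)
pd = phase_deviation_df(x)
```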
As an alternative to the sole use of magnitude or phase in-
formation, [26] introduces an approach that works with Fourier
coefficients in the complex domain. The stationarity of the kth
spectral bin is quantified by calculating the Euclidean distance
between the observed X_k(n) and a prediction \hat{X}_k(n) built
from the previous frames

\Gamma_k(n) = \left| \hat{X}_k(n) - X_k(n) \right|    (10)

These distances are summed across the frequency-domain to
generate an onset detection function

\eta(n) = \sum_{k=-N/2}^{N/2-1} \Gamma_k(n)    (11)
See [27] for an application of this technique to multiple
bands. Other preprocessing, such as the removal of the tonal
part, may introduce distortions to the phase information and thus
adversely affect the performance of subsequent phase-based
onset detection methods.
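A sketch of the complex-domain function of (10)–(11). The specific prediction rule, rotating the previous frame's coefficient by the previous phase increment, is our assumption of the usual phase-vocoder extrapolation; parameters are illustrative.

```python
import numpy as np

def complex_domain_df(x, frame=512, hop=256):
    """Sum of |predicted - observed| complex STFT coefficients per frame:
    each bin is predicted with the previous magnitude and a constant
    window-to-window phase increment (cf. Eqs. (10)-(11))."""
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    X = np.array([np.fft.rfft(w * x[i*hop:i*hop+frame]) for i in range(n)])
    phi = np.angle(X)
    df = np.zeros(n)
    for i in range(2, n):
        pred = np.abs(X[i-1]) * np.exp(1j * (2*phi[i-1] - phi[i-2]))
        df[i] = np.sum(np.abs(pred - X[i]))
    return df

# Silence, then a steady tone: large prediction error only at the onset.
sr = 8000
x = np.zeros(sr)
t = np.arange(sr // 2) / sr
x[sr//2:] = np.sin(2*np.pi*440*t)
cd = complex_domain_df(x)
```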
4) Time-Frequency and Time-Scale Analysis: An alternative
to the analysis of the temporal envelope of the signal and of
Fourier spectral coefficients, is the use of time-scale or time-
frequency representations (TFR).
In [28] a novelty function is calculated by measuring the
dissimilarity between feature vectors corresponding to a dis-
cretized Cohen's class TFR, in this case the result of convolving
the Wigner-Ville TFR of the function with a Gaussian kernel.
Note that the method could be also seen as a spectral difference
approach, given that by choosing an appropriate kernel, the rep-
resentation becomes equivalent to the spectrogram of the signal.
In [29], an approach for transient detection is described based
on a simple dyadic wavelet decomposition of the residual signal.
This transform, using the Haar wavelet, was chosen for its sim-
plicity and its good time localization at small scales. The scheme
takes advantage of the correlations across scales of the
coefficients: large wavelet coefficients, related to transients in the
signal, are not evenly spread within the dyadic plane but rather
form structures. Indeed, if a given coefficient has a large
amplitude, there is a high probability that the coefficients with the
same time localization at smaller scales also have large
amplitudes, therefore forming dyadic trees of significant coefficients.
The significance of full-size branches of coefficients, from the
largest to the smallest scale, can be quantified by a regularity
modulus, which is a local measure of the regularity of the signal

K(n) = \sum_{(j,m)\,\in\,B(n)} \left| d_{j,m} \right|^{\lambda}    (12)

where the d_{j,m} are the wavelet coefficients, B(n) is the full branch
leading to a given small-scale coefficient (i.e., the set of
coefficients at larger scale and same time localization), and \lambda is
a free parameter used to emphasize certain scales (\lambda = 2 is
often used in practice). Since increases of K(n) are related to the
existence of large, transient-like coefficients in the branch B(n),
the regularity modulus can effectively act as an onset detection
function.
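Under one plausible reading of (12), a branch-wise sum of coefficient magnitudes raised to the power \lambda, the regularity modulus can be sketched as follows; the plain Haar transform, the branch indexing, and all parameter values are our assumptions, not the exact scheme of [29].

```python
import numpy as np

def haar_dwt(x, levels):
    """Plain dyadic Haar transform; returns detail coefficients per scale,
    finest scale first (details[j] has len(x) / 2**(j+1) coefficients)."""
    details = []
    a = x.astype(float)
    for _ in range(levels):
        d = (a[0::2] - a[1::2]) / np.sqrt(2)
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
        details.append(d)
    return details

def regularity_modulus(x, levels=5, lam=2):
    """For each finest-scale slot n, sum |d|**lam over the dyadic branch
    B(n) of coefficients above it (a hedged reading of Eq. (12))."""
    details = haar_dwt(x, levels)
    n0 = len(details[0])
    K = np.zeros(n0)
    for i in range(n0):
        for j, d in enumerate(details):
            K[i] += np.abs(d[i >> j]) ** lam   # ancestor at scale j
    return K

# A step discontinuity: only the branch over the edge carries energy,
# so K peaks at the finest-scale slot containing the step.
x = np.zeros(512)
x[251:] = 1.0
K = regularity_modulus(x)
```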
B. Reduction Based on Probability Models
Statistical methods for onset detection are based on the as-
sumption that the signal can be described by some probability
model. A system can then be constructed that makes proba-
bilistic inferences about the likely times of abrupt changes in
the signal, given the available observations. The success of this
approach depends on the closeness of fit between the assumed
model, i.e., the probability distribution described by the model,
and the true distribution of the data, and may be quantied
using likelihood measures or Bayesian model selection criteria.
1) Model-Based Change Point Detection Methods: A well-
known approach is based on the sequential probability ratio test
[30]. It presupposes that the signal samples
are generated

Citations
More filters
Journal ArticleDOI
TL;DR: Les approches utilisees pour the recherche d'information musicale (RIM) sont multidisciplinaires : bibliotheconomie and science de l'information, musicologie, theorie musicale, ingenierie du son, informatique, droit et commerce...
Abstract: Les approches utilisees pour la recherche d'information musicale (RIM) sont multidisciplinaires : bibliotheconomie et science de l'information, musicologie, theorie musicale, ingenierie du son, informatique, droit et commerce... L'article vise a identifier et a expliquer la problematique de de la RIM alors qu'elle devient une discipline a part entiere, les influences historiques, l'etat-de-l'art de la recherche et les solutions potentielles. L'information musicale est multifacette - ton, temporalite, harmonie, timbre, edition, texte et bibliographie - , l'acces a chacune de ces facettes constituant un defi pour la recherche et le developpement. Mais la RIM represente egalement un defi multirepresentationnel, multiculturel, multiexperience et multidisciplinaire. Les systemes de RIM deploient differents degres d'exhaustivite representationnelle. Ils relevent generalement de deux types : les systemes analytiques ou de production, et les systemes de localisation. En conclusion, l'A. mentionne quelques ateliers et symposiums recents ainsi que les principaux projets de recherche concernant la RIM.

372 citations

Journal ArticleDOI
01 Dec 2013
TL;DR: Limits of current transcription methods are analyzed and promising directions for future research are identified, including the integration of information from multiple algorithms and different musical aspects.
Abstract: Automatic music transcription is considered by many to be a key enabling technology in music signal processing. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse limitations of current methods and identify promising directions for future research. Current transcription methods use general purpose models which are unable to capture the rich diversity found in music signals. One way to overcome the limited performance of transcription systems is to tailor algorithms to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding audio data now available are a rich potential source of training data, via forced alignment of audio to scores, but large scale utilisation of such data has yet to be attempted. Other promising approaches include the integration of information from multiple algorithms and different musical aspects.

298 citations

Journal ArticleDOI
TL;DR: An algorithm that predicts musical genre and artist from an audio waveform using the ensemble learner ADABOOST and evidence collected from a variety of popular features and classifiers that the technique of classifying features aggregated over segments of audio is better than classifying either entire songs or individual short-timescale features.
Abstract: We present an algorithm that predicts musical genre and artist from an audio waveform. Our method uses the ensemble learner ADABOOST to select from a set of audio features that have been extracted from segmented audio and then aggregated. Our classifier proved to be the most effective method for genre classification at the recent MIREX 2005 international contests in music information extraction, and the second-best method for recognizing artists. This paper describes our method in detail, from feature extraction to song classification, and presents an evaluation of our method on three genre databases and two artist-recognition databases. Furthermore, we present evidence collected from a variety of popular features and classifiers that the technique of classifying features aggregated over segments of audio is better than classifying either entire songs or individual short-timescale features.

296 citations


Cites background from "A tutorial on onset detection in mu..."

  • ...A survey by Aucouturier and Pachet (2003) describes a number of popular features for music similarity and classification, and research continues (e.g. Bello et al. (2005), Pampalk et al. (2005))....

    [...]

Proceedings ArticleDOI
01 Jun 2016
TL;DR: This paper presents an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick, using a recurrent neural network to predict sound features from videos and then producing a waveform from these features with an example-based synthesis procedure.
Abstract: Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.

284 citations

Proceedings ArticleDOI
27 Oct 2006
TL;DR: Initial experiments show that the algorithm can successfully detect harmonic changes such as chord boundaries in polyphonic audio recordings.
Abstract: We propose a novel method for detecting changes in the harmonic content of musical audio signals. Our method uses a new model for Equal Tempered Pitch Class Space. This model maps 12-bin chroma vectors to the interior space of a 6-D polytope; pitch classes are mapped onto the vertices of this polytope. Close harmonic relations such as fifths and thirds appear as small Euclidean distances. We calculate the Euclidean distance between analysis frames n+1 and n-1 to develop a harmonic change measure for frame n. A peak in the detection function denotes a transition from one harmonically stable region to another. Initial experiments show that the algorithm can successfully detect harmonic changes such as chord boundaries in polyphonic audio recordings.
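The harmonic change measure described in this abstract can be sketched as below: chroma vectors are projected into a 6-D space and the distance between frames n+1 and n-1 gives the detection value at frame n. The projection used here is one common Tonnetz-style formulation with assumed radii; the paper's exact polytope mapping may differ.

```python
import numpy as np

def tonal_centroid_matrix(r5=1.0, r_m3=1.0, r_M3=0.5):
    """6-D projection of the 12 pitch classes (assumed radii)."""
    l = np.arange(12)
    return np.array([
        r5   * np.sin(l * 7 * np.pi / 6),   # circle of fifths
        r5   * np.cos(l * 7 * np.pi / 6),
        r_m3 * np.sin(l * 3 * np.pi / 2),   # minor thirds
        r_m3 * np.cos(l * 3 * np.pi / 2),
        r_M3 * np.sin(l * 2 * np.pi / 3),   # major thirds
        r_M3 * np.cos(l * 2 * np.pi / 3),
    ])                                      # shape (6, 12)

def harmonic_change(chroma):
    """chroma: (n_frames, 12) -> detection values for frames 1..n-2."""
    T = tonal_centroid_matrix()
    # Normalize each chroma frame, then project into the 6-D space.
    c = chroma / np.maximum(chroma.sum(axis=1, keepdims=True), 1e-9)
    centroids = c @ T.T                     # (n_frames, 6)
    # Distance between frames n+1 and n-1 as the measure for frame n.
    return np.linalg.norm(centroids[2:] - centroids[:-2], axis=1)

# A C major triad followed by an F major triad: the change frame peaks.
C = np.zeros(12); C[[0, 4, 7]] = 1          # C E G
F = np.zeros(12); F[[5, 9, 0]] = 1          # F A C
chroma = np.array([C, C, C, F, F, F])
print(harmonic_change(chroma))
```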

266 citations


Cites methods from "A tutorial on onset detection in mu..."

  • ...The three performance measures used here are Precision (P), the ratio of hits to detected changes; Recall (R), the ratio of hits to transcribed changes; and the F-measure (F), which combines the two (see equation 5) [1]....

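The three measures quoted in the excerpt above reduce to a few lines of arithmetic; the counts below are made-up numbers for illustration.

```python
# Precision, recall, and F-measure over detected vs. ground-truth changes.
def precision_recall_f(hits, detected, transcribed):
    """hits: correct detections; detected: total detections made;
    transcribed: total annotated (ground-truth) changes."""
    p = hits / detected if detected else 0.0
    r = hits / transcribed if transcribed else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

print(precision_recall_f(hits=8, detected=10, transcribed=16))
# (0.8, 0.5, 0.6153846153846154)
```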
References
Journal ArticleDOI
TL;DR: A unified framework for the design and the performance analysis of the algorithms for solving change detection problems and links with the analytical redundancy approach to fault detection in linear systems are established.
Abstract: This book is downloadable from http://www.irisa.fr/sisthem/kniga/. Many monitoring problems can be stated as the problem of detecting a change in the parameters of a static or dynamic stochastic system. The main goal of this book is to describe a unified framework for the design and the performance analysis of the algorithms for solving these change detection problems. The book also contains the key mathematical background necessary for this purpose. Finally, links with the analytical redundancy approach to fault detection in linear systems are established. We call abrupt change any change in the parameters of the system that occurs either instantaneously or at least very fast with respect to the sampling period of the measurements. Abrupt changes by no means refer to changes with large magnitude; on the contrary, in most applications the main problem is to detect small changes. Moreover, in some applications, the early warning of small - and not necessarily fast - changes is of crucial interest in order to avoid the economic or even catastrophic consequences that can result from an accumulation of such small changes. For example, small faults arising in the sensors of a navigation system can result, through the underlying integration, in serious errors in the estimated position of the plane. Another example is the early warning of small deviations from the normal operating conditions of an industrial process. The early detection of slight changes in the state of the process allows one to plan more adequately the periods during which the process should be inspected and possibly repaired, and thus to reduce operating costs.

3,830 citations


"A tutorial on onset detection in mu..." refers methods in this paper

  • ...The algorithms described in [30] are concerned with detecting this change of sign....

  • ...1) Model-Based Change Point Detection Methods: A well-known approach is based on the sequential probability ratio test [30]....

Book
01 Jun 1990
TL;DR: Auditory Scene Analysis as discussed by the authors addresses the problem of hearing complex auditory environments, using a series of creative analogies to describe the process required of the human auditory system as it analyzes mixtures of sounds to recover descriptions of individual sounds.
Abstract: Auditory Scene Analysis addresses the problem of hearing complex auditory environments, using a series of creative analogies to describe the process required of the human auditory system as it analyzes mixtures of sounds to recover descriptions of individual sounds. In a unified and comprehensive way, Bregman establishes a theoretical framework that integrates his findings with an unusually wide range of previous research in psychoacoustics, speech perception, music theory and composition, and computer modeling.

2,968 citations

Journal ArticleDOI
TL;DR: In this paper, the authors provide an account of current trends in auditory research on a level not too technical for the novice, by relating psychological and perceptual aspects of sound to the underlying physiological mechanisms of hearing in a way that the material can be used as a text to accompany an advanced undergraduate or graduate level course in auditory perception.
Abstract: The author's stated general approach is to relate the psychological and perceptual aspects of sound to the underlying physiological mechanisms of hearing in a way that the material can be used as a text to accompany an advanced undergraduate- or graduate-level course in auditory perception. The attempt is to provide an account of current trends in auditory research on a level not too technical for the novice. Psychoacoustic studies on humans and physiological studies on animals serve as the primary bases for subject matter presentation, and many practical applications are offered. Among the chapters are the following: the nature of sound and the structure of the auditory system; loudness, adaptation, and fatigue; frequency analysis, masking, and critical bands; pitch perception and auditory pattern perception; space perception; and speech perception. Within these chapter headings special attention is given to a number of topics, including signal detection theory, monaural and binaural hearing,

1,956 citations


"A tutorial on onset detection in mu..." refers background in this paper

  • ...Given the importance of musical events, it is clear that identifying and characterizing these events is an important aspect of this process....

Journal ArticleDOI
TL;DR: A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves, which forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and midrate speech coding.
Abstract: A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. These parameters are estimated from the short-time Fourier transform using a simple peak-picking algorithm. Rapid changes in the highly resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. For a given frequency track a cubic function is used to unwrap and interpolate the phase such that the phase track is maximally smooth. This phase function is applied to a sine-wave generator, which is amplitude modulated and added to the other sine waves to give the final speech output. The resulting synthetic waveform preserves the general waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech as well as the noise are maintained. In addition, it was found that the representation was sufficiently general that high-quality reproduction was obtained for a larger class of inputs including: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. Finally, the analysis/synthesis system forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and midrate speech coding [8], [9].
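The core analysis/synthesis idea summarized in this abstract can be sketched in a toy form: pick peaks in one short-time spectrum, then resynthesize the frame as a sum of sinusoids with the estimated amplitudes, frequencies, and phases. No frame-to-frame tracking or cubic phase interpolation is attempted here; frequencies are chosen to fall exactly on FFT bins so the sketch stays simple.

```python
import numpy as np

sr, n = 8000, 1024
t = np.arange(n) / sr
# Two sinusoids whose frequencies land exactly on FFT bins 64 and 160.
x = 0.8 * np.sin(2 * np.pi * 500 * t) + 0.4 * np.sin(2 * np.pi * 1250 * t)

spec = np.fft.rfft(x)
mag = np.abs(spec)

# Simple peak picking: local maxima above a magnitude threshold.
peaks = [k for k in range(1, len(mag) - 1)
         if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
         and mag[k] > 0.1 * mag.max()]

# Resynthesize the frame from the picked components.
y = np.zeros(n)
for k in peaks:
    amp = 2 * mag[k] / n                 # amplitude of a real sinusoid
    phase = np.angle(spec[k])
    y += amp * np.cos(2 * np.pi * k * np.arange(n) / n + phase)

print(len(peaks), float(np.max(np.abs(x - y))))  # small reconstruction error
```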

1,659 citations


"A tutorial on onset detection in mu..." refers methods in this paper

  • ...However, we will focus only on two processes that are consistently mentioned in the literature, and that appear to be of particular relevance to onset detection schemes, especially when simple reduction methods are implemented: separating the signal into multiple frequency bands, and…...

Journal Article
TL;DR: In this paper, a loudness model for steady sounds is described having the following stages: 1) a fixed filter representing transfer through the outer ear; 2) a fixed filter representing transfer through the middle ear; 3) calculation of an excitation pattern from the physical spectrum; 4) transformation of the excitation pattern to a specific loudness pattern; 5) determination of the area under the specific loudness pattern, which gives overall loudness for a given ear; and 6) summation of loudness across ears.
Abstract: A loudness model for steady sounds is described having the following stages: 1) a fixed filter representing transfer through the outer ear; 2) a fixed filter representing transfer through the middle ear; 3) calculation of an excitation pattern from the physical spectrum; 4) transformation of the excitation pattern to a specific loudness pattern; 5) determination of the area under the specific loudness pattern, which gives overall loudness for a given ear; and 6) summation of loudness across ears. The model differs from earlier models in the following areas: 1) the assumed transfer function for the outer and middle ear; 2) the way that excitation patterns are calculated; 3) the way that specific loudness is related to excitation for sounds in quiet and in noise; and (4) the way that binaural loudness is calculated from monaural loudness. The model is based on the assumption that sounds at absolute threshold have a small but finite loudness. This loudness is constant regardless of frequency and spectral content. It is also assumed that a sound at masked threshold has the same loudness as a sound at absolute threshold. The model accounts well for recent measures of equal-loudness contours, which differ from earlier measures because of improved control over bias effects. The model correctly predicts the relation between monaural and binaural threshold and loudness. It also correctly accounts for the threshold and loudness of complex sounds as a function of bandwidth.

793 citations


"A tutorial on onset detection in mu..." refers methods in this paper

  • ...By implementing a multiple-band scheme, the approach effectively avoids the constraints imposed by the use of a single reduction method, while having different time resolutions for different frequency bands....

Frequently Asked Questions (9)
Q1. What is the function of the regularity modulus?

Since increases of the regularity modulus are related to the existence of large, transient-like coefficients in the corresponding branch of the wavelet decomposition, the regularity modulus can effectively act as an onset detection function. 
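The idea can be sketched as below: large transient-like wavelet coefficients along a branch of the decomposition signal an onset, so summing coefficient magnitudes across scales yields a simple regularity-style detection function. This sketch uses a plain Haar transform and is only loosely inspired by the paper's exact measure.

```python
import numpy as np

def haar_detail(x):
    """One level of the Haar wavelet transform: (approx, detail)."""
    x = x[:len(x) // 2 * 2]
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def transient_measure(x, levels=4):
    """Sum |detail coefficients| across scales, upsampled to signal rate."""
    out = np.zeros(len(x))
    approx = np.asarray(x, dtype=float)
    for lev in range(1, levels + 1):
        approx, detail = haar_detail(approx)
        # Each level-lev coefficient covers 2**lev input samples.
        out[:len(detail) * 2**lev] += np.repeat(np.abs(detail), 2**lev)
    return out

# Toy signal: a click in the middle of low-level noise.
rng = np.random.default_rng(0)
x = rng.normal(0, 0.01, 1024)
x[512] += 1.0
m = transient_measure(x)
print(int(np.argmax(m)))                 # index at/near the click
```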

The goal of this paper is to review, categorize, and compare some of the most commonly used techniques for onset detection, and to present possible enhancements. The authors discuss methods based on the use of explicitly predefined signal features: the signal's amplitude envelope, spectral magnitudes and phases, time-frequency representations; and methods based on probabilistic signal models: model-based change point detection, surprise signals, etc. Using a choice of test cases, the authors provide some guidelines for choosing the appropriate method for a given application. 

For nontrivial sounds, onset detection schemes benefit from using richer representations of the signal (e.g., a time-frequency representation). 

Under each model, the expectation is given by (15). If the signal initially follows one model and switches to the other at some unknown time, then the short-time average of the log-likelihood ratio will change sign. 
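This sign-change behavior can be sketched numerically: compute the per-sample log-likelihood ratio between two assumed Gaussian models and watch its short-time average flip sign at the switch. The variances and window length below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
s0, s1 = 0.1, 1.0                        # assumed std devs of the two models
x = np.concatenate([rng.normal(0, s0, 500),   # signal follows model 0 ...
                    rng.normal(0, s1, 500)])  # ... then switches to model 1

def gauss_loglik(x, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)

# Per-sample log-likelihood ratio: log p1(x) - log p0(x).
llr = gauss_loglik(x, s1) - gauss_loglik(x, s0)

win = 50                                 # short-time averaging window
avg = np.convolve(llr, np.ones(win) / win, mode='valid')

# Before the change the average is negative (model 0 fits better);
# after it, positive. The sign flip localizes the change point.
print(avg[:200].mean() < 0, avg[-200:].mean() > 0)
```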

The scheme takes advantage of the correlations across scales of the coefficients: large wavelet coefficients, related to transients in the signal, are not evenly spread within the dyadic plane but rather form “structures”. 

Fig. 2 illustrates the procedure employed in the majority of onset detection algorithms: from the original audio signal, which can be pre-processed to improve the performance of subsequent stages, a detection function is derived at a lower sampling rate, to which a peak-picking algorithm is applied to locate the onsets. 
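The pipeline of Fig. 2 can be sketched compactly: derive a detection function from the audio at a lower rate, then apply peak picking. The detection function here is a simple half-wave-rectified log-energy difference; the frame size, hop size, and threshold are illustrative choices, not values from the paper.

```python
import numpy as np

def detection_function(x, frame=512, hop=256):
    """Half-wave-rectified difference of log frame energies."""
    energies = [np.sum(x[i:i + frame] ** 2) + 1e-12
                for i in range(0, len(x) - frame, hop)]
    d = np.diff(np.log(energies))
    return np.maximum(d, 0.0)            # keep only energy increases

def pick_peaks(d, threshold):
    """Local maxima above a fixed threshold."""
    return [n for n in range(1, len(d) - 1)
            if d[n] > threshold and d[n] >= d[n - 1] and d[n] > d[n + 1]]

# Toy signal: silence, then a decaying 440 Hz burst -> one onset.
sr = 8000
t = np.arange(sr) / sr
x = np.concatenate([np.zeros(sr // 2),
                    np.sin(2 * np.pi * 440 * t) * np.exp(-5 * t)])
d = detection_function(x)
onsets = pick_peaks(d, threshold=1.0)
print(onsets)                            # frame indices near the burst start
```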

An alternative to the analysis of the temporal envelope of the signal and of Fourier spectral coefficients is the use of time-scale or time-frequency representations (TFR). 

A more general approach based on changes in the spectrum is to formulate the detection function as a "distance" between successive short-term Fourier spectra, treating them as points in an N-dimensional space. 
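This spectral-distance idea can be sketched as follows: treat successive short-time magnitude spectra as points in a high-dimensional space and use the L2 distance between them (a spectral-flux-like measure) as the detection function. Frame and hop sizes are illustrative choices.

```python
import numpy as np

def spectral_distance_df(x, frame=512, hop=256):
    """L2 distance between successive short-time magnitude spectra."""
    window = np.hanning(frame)
    frames = [x[i:i + frame] * window
              for i in range(0, len(x) - frame, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))   # each row: one spectrum
    return np.linalg.norm(np.diff(mags, axis=0), axis=1)

# Toy signal: a 200 Hz tone that jumps to 1500 Hz halfway through.
sr = 8000
t = np.arange(sr) / sr
x = np.concatenate([np.sin(2 * np.pi * 200 * t),
                    np.sin(2 * np.pi * 1500 * t)])
d = spectral_distance_df(x)
print(int(np.argmax(d)))                 # frame index near the change
```

Because the two tones have the same energy, a pure energy-based detection function would miss this change, while the spectral distance peaks at the transition.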

All peak-picking parameters (e.g., the filter's cutoff frequency) were held constant, except for the threshold, which was varied to trace out the performance curve.