IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 5, SEPTEMBER 2005 1035
A Tutorial on Onset Detection in Music Signals
Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris Duxbury, Mike Davies, and
Mark B. Sandler, Senior Member, IEEE
Abstract—Note onset detection and localization is useful in a
number of analysis and indexing techniques for musical signals.
The usual way to detect onsets is to look for “transient” regions in
the signal, a notion that leads to many definitions: a sudden burst
of energy, a change in the short-time spectrum of the signal or in
the statistical properties, etc. The goal of this paper is to review,
categorize, and compare some of the most commonly used tech-
niques for onset detection, and to present possible enhancements.
We discuss methods based on the use of explicitly predefined signal
features: the signal’s amplitude envelope, spectral magnitudes and
phases, time-frequency representations; and methods based on
probabilistic signal models: model-based change point detection,
surprise signals, etc. Using a choice of test cases, we provide
some guidelines for choosing the appropriate method for a given
application.
Index Terms—Attack transients, audio, note segmentation, novelty
detection.
I. INTRODUCTION
A. Background and Motivation
MUSIC is to a great extent an event-based phenomenon for
both performer and listener. We nod our heads or tap our
feet to the rhythm of a piece; the performer’s attention is focused
on each successive note. Even in non note-based music, there
are transitions as musical timbre and tone color evolve. Without
change, there can be no musical meaning.
The automatic detection of events in audio signals gives new
possibilities in a number of music applications including con-
tent delivery, compression, indexing and retrieval. Accurate re-
trieval depends on the use of appropriate features to compare
and identify pieces of music. Given the importance of musical
events, it is clear that identifying and characterizing these events
is an important aspect of this process. Equally, as compres-
sion standards advance and the drive for improving quality at
low bit-rates continues, so does accurate event detection be-
come a basic requirement: disjoint audio segments with homo-
geneous statistical properties, delimited by transitions or events,
can be compressed more successfully in isolation than they can
in combination with dissimilar regions. Finally, accurate
segmentation allows a large number of standard audio editing
algorithms and effects (e.g., time-stretching, pitch-shifting) to be
more signal-adaptive.
Manuscript received August 6, 2003; revised July 21, 2004. The associate
editor coordinating the review of this manuscript and approving it for
publication was Dr. Gerald Schuller.
J. P. Bello, S. Abdallah, M. Davies, and M. B. Sandler are with the Centre for
Digital Music, Department of Electronic Engineering, Queen Mary, University
of London, London E1 4NS, U.K. (e-mail: juan.bello-correa@elec.qmul.ac.uk;
samer.abdallah@elec.qmul.ac.uk; mike.davies@elec.qmul.ac.uk;
mark.sandler@elec.qmul.ac.uk).
L. Daudet is with the Laboratoire d’Acoustique Musicale, Université Pierre
et Marie Curie (Paris 6), 75015 Paris, France (e-mail: daudet@lam.jussieu.fr).
C. Duxbury is with the Centre for Digital Music, Department of Electronic
Engineering, Queen Mary, University of London, London E1 4NS, U.K., and
also with WaveCrest Communications Ltd. (e-mail:
christopher.duxbury@elec.qmul.ac.uk).
Digital Object Identifier 10.1109/TSA.2005.851998
Fig. 1. “Attack,” “transient,” “decay,” and “onset” in the ideal case of a
single note.
There have been many different approaches for onset detec-
tion. The goal of this paper is to give an overview of the most
commonly used techniques, with a special emphasis on the ones
that have been employed in the authors’ different applications.
For the sake of coherence, the discussion will be focused on
the more specific problem of note onset detection in musical
signals, although we believe that the discussed methods can be
useful for various different tasks (e.g., transient modeling or lo-
calization) and different classes of signals (e.g., environmental
sounds, speech).
B. Definitions: Transients vs. Onsets vs. Attacks
A central issue here is to make a clear distinction between the
related concepts of transients, onsets and attacks. The reason
for making these distinctions clear is that different applications
have different needs. The similarities and differences between
these key concepts are important to consider; it is similarly im-
portant to categorize all related approaches. Fig. 1 shows, in the
simple case of an isolated note, how one could differentiate these
notions.
The attack of the note is the time interval during which
the amplitude envelope increases.
The concept of transient is more difficult to describe precisely.
As a preliminary informal definition, transients are
short intervals during which the signal evolves quickly
in some nontrivial or relatively unpredictable way. In the
case of acoustic instruments, the transient often corre-
sponds to the period during which the excitation (e.g., a
hammer strike) is applied and then damped, leaving only
the slow decay at the resonance frequencies of the body.
Central to this time duration problem is the issue of the
useful time resolution: we will assume that the human ear
cannot distinguish between two transients less than 10
ms apart [1]. Note that the release or offset of a sustained
sound can also be considered a transient period.
The onset of the note is a single instant chosen to mark
the temporally extended transient. In most cases, it will
coincide with the start of the transient, or the earliest
time at which the transient can be reliably detected.
C. General Scheme of Onset Detection Algorithms
In the more realistic case of a possibly noisy polyphonic
signal, where multiple sound objects may be present at a given
time, the above distinctions become less precise. It is generally
not possible to detect onsets directly without first quantifying
the time-varying transientness of the signal.
Audio signals are both additive (musical objects in polyphonic
music superimpose rather than conceal each other) and
oscillatory. Therefore, it is not possible to look for changes
simply by differentiating the original signal in the time domain;
this has to be done on an intermediate signal that reflects, in
a simplified form, the local structure of the original. In this
paper, we refer to such a signal as a detection function; in the
literature, the term novelty function is sometimes used instead
[2].
Fig. 2 illustrates the procedure employed in the majority
of onset detection algorithms: from the original audio signal,
which can be pre-processed to improve the performance of
subsequent stages, a detection function is derived at a lower
sampling rate, to which a peak-picking algorithm is applied
to locate the onsets. Whereas peak-picking algorithms are
well documented in the literature, the diversity of existing
approaches for the construction of the detection function makes
the comparison between onset detection algorithms difficult for
audio engineers and researchers.
D. Outline of the Paper
The outline of this paper follows the flowchart in Fig. 2. In
Section II, we review a number of preprocessing techniques that
can be employed to enhance the performance of some of the de-
tection methods. Section III presents a representative cross-sec-
tion of algorithms for the construction of the detection function.
In Section IV, we describe some basic peak-picking algorithms;
this allows the comparative study of the performance of a
selection of note onset detection methods given in Section V. We
finish our discussion in Section VI with a review of our findings
and some thoughts on the future development of these
algorithms and their applications.
Fig. 2. Flowchart of a standard onset detection algorithm.
II. PREPROCESSING
The concept of preprocessing implies the transformation of
the original signal in order to accentuate or attenuate various
aspects of the signal according to their relevance to the task in
hand. It is an optional step that derives its relevance from the
process or processes to be subsequently performed.
There are a number of different treatments that can be ap-
plied to a musical signal in order to facilitate the task of onset
detection. However, we will focus only on two processes that
are consistently mentioned in the literature, and that appear to
be of particular relevance to onset detection schemes, especially
when simple reduction methods are implemented: separating
the signal into multiple frequency bands, and transient/steady-
state separation.
A. Multiple Bands
Several onset detection studies have found it useful to in-
dependently analyze information across different frequency
bands. In some cases this preprocessing is needed to satisfy
the needs of specific applications that require detection in
individual sub-bands to complement global estimates; in others,
such an approach can be justified as a way of increasing the
robustness of a given onset detection method.
As examples of the first scenario, two beat tracking systems
make use of filter banks to analyze transients across frequencies.
Goto [3] slices the spectrogram into spectrum strips and recog-
nizes onsets by detecting sudden changes in energy. These are
used in a multiple-agent architecture to detect rhythmic patterns.
Scheirer [4] implements a six-band filter bank, using sixth-order
elliptic filters, and psychoacoustically inspired processing to
produce onset trains. These are fed into comb-filter resonators
in order to estimate the tempo of the signal.
The second case is illustrated by models such as the perceptual
onset detector introduced by Klapuri [5]. In this implementation,
a filter bank divides the signal into eight nonoverlapping
bands. In each band, onset times and intensities are detected and
finally combined. The filter-bank model is used as an approximation
to the mechanics of the human cochlea.
Another example is the method proposed by Duxbury et al.
[6], which uses a constant-Q conjugate quadrature filter bank to
separate the signal into five subbands. It goes a step further by
proposing a hybrid scheme that considers energy changes in
high-frequency bands and spectral changes in lower bands. By
implementing a multiple-band scheme, the approach effectively
avoids the constraints imposed by the use of a single reduction
method, while having different time resolutions for different fre-
quency bands.
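As a rough illustration of band-wise analysis, the following Python sketch sums a power spectrogram into contiguous frequency strips and keeps only the positive energy changes in each strip. It is not the implementation of any of the systems cited above: the equal-width strips, the number of bands, and the names `band_energies` and `band_wise_rises` are assumptions made for the example.

```python
import numpy as np

def band_energies(power, n_bands=5):
    """Sum a (frames, bins) power spectrogram into contiguous strips.

    Equal-width strips are an assumption of this sketch; the cited
    systems use their own (e.g., constant-Q or auditory) band layouts.
    """
    edges = np.linspace(0, power.shape[1], n_bands + 1).astype(int)
    strips = [power[:, a:b].sum(axis=1) for a, b in zip(edges[:-1], edges[1:])]
    return np.stack(strips, axis=1)

def band_wise_rises(power, n_bands=5):
    """Positive frame-to-frame energy change in every band."""
    e = band_energies(power, n_bands)
    return np.maximum(np.diff(e, axis=0, prepend=e[:1]), 0.0)

# Toy spectrogram: flat energy, then a jump confined to the top two bins.
power = np.ones((4, 10))
power[2:, 8:] = 5.0
rises = band_wise_rises(power)
```

In this toy case only the highest band reports a rise, at the frame where the jump occurs, which is the kind of sub-band evidence that a global energy measure would dilute.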
B. Transient/Steady-State Separation
The process of transient/steady-state separation is usually
associated with the modeling of music signals, which is beyond
the scope of this paper. However, there is a fine line between
modeling and detection, and indeed, some modeling schemes
directed at representing transients may hold promise for onset
detection. Below, we briefly describe several methods that
produce modified signals (residuals, transient signals) that can be,
or have been, used for the purpose of onset detection.
Sinusoidal models, such as additive synthesis [7], represent
an audio signal as a sum of sinusoids with slowly varying
parameters. Amongst these methods, spectral modeling synthesis
(SMS) [8] explicitly considers the residual¹ of the synthesis
method as a Gaussian white noise filtered with a slowly varying
low-order filter. Levine [9] calculates the residual between the
original signal and a multiresolution SMS model. Significant
increases in the energy of the residual show a mismatch between
the model and the original, thus effectively marking onsets. An
extension of SMS, transient modeling synthesis, is presented
in [10]. Transient signals are analyzed by a sinusoidal
analysis/synthesis similar to SMS on the discrete cosine transform
of the residual, hence in a pseudo-temporal domain. In [11], the
whole scheme, including tonal and transient extraction, is
generalized into a single matching pursuit formulation.
An alternative approach for the segregation of sinusoids from
transient/noise components is proposed by Settel and Lippe [12]
and later refined by Duxbury et al. [13]. It is based on the
phase-vocoder principle of instantaneous frequency (see Section
III-A.3) that allows the classification of individual frequency bins
of a spectrogram according to the predictability of their phase
components.
¹The residual signal results from the subtraction of the modeled signal from
the original waveform. When sinusoidal or harmonic modeling is used, the
residual is assumed to contain most of the impulse-like, noisy components of
the original signal, e.g., transients.
Other schemes for the separation of tonal from nontonal
components make use of lapped orthogonal transforms, such as the
modified discrete cosine transform (MDCT), first introduced by
Princen and Bradley [14]. These algorithms, originally designed
for compression [15], [16], make use of the relative sparsity of
MDCT representations of most musical signals: a few large
coefficients account for most of the signal’s energy. Actually, since
the MDCT atoms are very tone-like (they are cosine functions
slowly modulated in time by a smooth window), the part of the
signal represented by the large MDCT atoms, according to a
given threshold, can be interpreted as the tonal part of the signal
[10], [17]. Transients and noise can be obtained by removing
those large MDCT atoms.
III. REDUCTION
In the context of onset detection, the concept of reduction
refers to the process of transforming the audio signal into a
highly subsampled detection function which manifests the oc-
currence of transients in the original signal. This is the key
process in a wide class of onset detection schemes and will
therefore be the focus of most of our review.
We will broadly divide reduction methods into two groups:
methods based on the use of explicitly predefined signal
features, and methods based on probabilistic signal models.
A. Reduction Based on Signal Features
1) Temporal Features: When observing the temporal
evolution of simple musical signals, it is noticeable that the
occurrence of an onset is usually accompanied by an increase of
the signal’s amplitude. Early methods of onset detection
capitalized on this by using a detection function which follows the
amplitude envelope of the signal [18]. Such an envelope
follower can be easily constructed by rectifying and smoothing
(i.e., low-pass filtering) the signal
$$E_0(n) = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} |x(n+m)|\, w(m) \tag{1}$$
where $w(m)$ is an $N$-point window or smoothing kernel,
centered at $m = 0$. This yields satisfactory results for certain
applications where strong percussive transients exist against a quiet
background. A variation on this is to follow the local energy,
rather than the amplitude, by squaring, instead of rectifying,
each sample
$$E(n) = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} [x(n+m)]^2\, w(m) \tag{2}$$
Despite the smoothing, this reduced signal in its raw form is
not usually suitable for reliable onset detection by peak picking.
A further refinement, included in a number of standard onset
detection algorithms, is to work with the time derivative of the
energy (or rather the first difference for discrete-time signals) so
that sudden rises in energy are transformed into narrow peaks in
the derivative. The energy and its derivative are commonly used
in combination with preprocessing, both with filter-banks [3]
and transient/steady-state separation [9], [19].
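The envelope follower, the local energy, and the first difference of the energy can be sketched in a few lines of Python. The Hann smoothing kernel, the window length `N`, the hop of 512 samples used for subsampling, and the toy test signal are assumptions of this sketch rather than values prescribed in the text.

```python
import numpy as np

def envelope_follower(x, N=1024, squared=False):
    """Smooth |x| (amplitude envelope) or x**2 (local energy) with an
    N-point kernel centred on each sample. The Hann kernel is an
    assumed choice; any smoothing window could be used instead."""
    w = np.hanning(N)
    rectified = x ** 2 if squared else np.abs(x)
    return np.convolve(rectified, w, mode="same") / N

def energy_derivative(x, N=1024, hop=512):
    """First difference of the local energy after subsampling by `hop`."""
    energy = envelope_follower(x, N, squared=True)[::hop]
    return np.diff(energy, prepend=energy[0])

# Toy signal: silence, then a decaying 440 Hz tone starting at 0.5 s.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * (t - 0.5)) * np.exp(-4 * (t - 0.5))
x = np.where(t >= 0.5, tone, 0.0)

d = energy_derivative(x)
onset_time = np.argmax(d) * 512 / fs  # time of the largest energy rise
```

For this isolated note the largest value of the difference falls within one hop of the true onset at 0.5 s; the smoothed energy itself has a much broader maximum, which is why the derivative is preferred for peak picking.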

Another refinement takes its cue from psychoacoustics:
empirical evidence [20] indicates that loudness is perceived
logarithmically. This means that changes in loudness are judged
relative to the overall loudness, since, for a continuous time signal,
$\frac{d(\log E)}{dt} = \frac{dE/dt}{E}$. Hence, computing the first
difference of $\log(E(n))$ roughly simulates the ear’s perception of
loudness. An application of this technique to multiple bands [5]
showed a significant reduction in the tendency for amplitude
modulation to cause the detection of spurious onsets.
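A small numerical sketch makes the effect of the logarithm concrete: two energy steps with the same ratio but very different absolute sizes produce the same value in the first difference of the log-energy. The `floor` guard against taking the logarithm of zero and the example values are assumptions of the sketch.

```python
import numpy as np

def log_energy_difference(energy, floor=1e-10):
    """First difference of log(E(n)).

    `floor` is an assumed guard that keeps the logarithm finite
    when the energy is exactly zero.
    """
    log_e = np.log(np.maximum(energy, floor))
    return np.diff(log_e, prepend=log_e[0])

# Two steps with the same ratio (x2): one quiet, one loud.
energy = np.array([1.0, 1.0, 2.0, 2.0, 100.0, 100.0, 200.0, 200.0])
linear = np.diff(energy, prepend=energy[0])   # absolute change
relative = log_energy_difference(energy)      # relative change
```

Here `linear` reports the loud step as one hundred times larger than the quiet one, whereas `relative` gives both the value log 2, in line with the idea that changes are judged relative to the current level.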
2) Spectral Features: A number of techniques have been
proposed that use the spectral structure of the signal to produce
more reliable detection functions. While reducing the need for
preprocessing (e.g., removal of the tonal part), these methods
are also successful in a number of scenarios, including onset
detection in polyphonic signals with multiple instruments.
Let us consider the short-time Fourier transform (STFT) of
the signal $x(n)$
$$X_k(n) = \sum_{m=-N/2}^{N/2-1} x(nh+m)\, w(m)\, e^{-2j\pi mk/N} \tag{3}$$
where $w(m)$ is again an $N$-point window, and $h$ is the hop size,
or time shift, between adjacent windows.
In the spectral domain, energy increases linked to transients
tend to appear as a broadband event. Since the energy of the
signal is usually concentrated at low frequencies, changes due
to transients are more noticeable at high frequencies [21]. To
emphasize this, the spectrum can be weighted preferentially
toward high frequencies before summing to obtain a weighted
energy measure
$$\tilde{E}(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} W_k\, |X_k(n)|^2 \tag{4}$$
where $W_k$ is the frequency dependent weighting. By Parseval’s
theorem, if $W_k = 1$ for all $k$, $\tilde{E}(n)$ is simply equivalent
to the local energy as previously defined. Note also that a choice of
$W_k = |k|^2$ would give the local energy of the derivative of the signal.
Masri [22] proposes a high frequency content (HFC) function
with $W_k = |k|$, linearly weighting each bin’s contribution in
proportion to its frequency. The HFC function produces sharp
peaks during attack transients and is notably successful when
faced with percussive onsets, where transients are well modeled
as bursts of white noise.
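The weighted energy measure and the HFC function can be sketched as follows. The framing convention (frames starting at multiples of the hop rather than centred on them), the Hann window, the parameter values, and the toy signal are assumptions of this sketch; only the weighting of each bin by the magnitude of its index follows the description above.

```python
import numpy as np

def stft(x, N=1024, hop=512):
    """Hann-windowed frames of N samples, one FFT per frame (rows = frames)."""
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[i * hop : i * hop + N] * w for i in range(n_frames)])
    return np.fft.fft(frames, axis=1)

def weighted_energy(X, weights):
    """(1/N) * sum over bins of W_k |X_k(n)|^2, for every frame n."""
    N = X.shape[1]
    return (np.abs(X) ** 2 * weights).sum(axis=1) / N

def hfc(X):
    """High frequency content: W_k = |k|, with k the signed bin index."""
    N = X.shape[1]
    k = np.fft.fftfreq(N, d=1.0 / N)  # 0, 1, ..., N/2-1, -N/2, ..., -1
    return weighted_energy(X, np.abs(k))

# Toy signal: a steady 200 Hz tone with a short white-noise burst on top.
fs = 8000
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
x[4096:4608] += rng.standard_normal(512)

X = stft(x)
peak_frame = int(np.argmax(hfc(X)))  # frame with the most high-frequency energy
```

With unit weights, `weighted_energy` reduces to the energy of the windowed frame, as Parseval's theorem requires. With the linear weighting, the noise burst dominates the low-frequency tone even though the two have comparable energy.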
These spectrally weighted measures are based on the instanta-
neous short-term spectrum of the signal, thus omitting any ex-
plicit consideration of its temporal evolution. Alternatively, a
number of other approaches do consider these changes, using
variations in spectral content between analysis frames in order
to generate a more informative detection function.
Rodet and Jaillet [21] propose a method where the frequency
bands of a sequence of STFTs are analyzed independently
using a piece-wise linear approximation to the magnitude
profile $|X_k(n)|$ over a short temporal window of frames $n$,
with the bin index $k$ held fixed. The parameters of
these approximations are used to generate a set of band-wise
detection functions, later combined to produce final onset
results. Detection results are robust for high frequencies, showing
consistency with Masri’s HFC approach.
A more general approach based on changes in the spectrum
is to formulate the detection function as a “distance” between
successive short-term Fourier spectra, treating them as points
in an $N$-dimensional space. Depending on the metric chosen to
calculate this distance, different spectral difference, or spectral
flux, detection functions can be constructed: Masri [22] uses the
$L_1$-norm of the difference between magnitude spectra, whereas
Duxbury [6] uses the $L_2$-norm on the rectified difference
$$SD(n) = \sum_{k=-N/2}^{N/2-1} \left\{ H\big(|X_k(n)| - |X_k(n-1)|\big) \right\}^2 \tag{5}$$
where $H(x) = (x + |x|)/2$, i.e., zero for negative arguments.
The rectification has the effect of counting only those
frequencies where there is an increase in energy, and is intended to
emphasize onsets rather than offsets.
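The two variants of the spectral difference can be written as one short function operating on a matrix of magnitude spectra. The function and parameter names are assumptions of this sketch, and the value returned for the first frame (which has no predecessor) is set to zero by convention.

```python
import numpy as np

def half_wave(x):
    """H(x) = (x + |x|) / 2: zero for negative arguments."""
    return (x + np.abs(x)) / 2

def spectral_difference(mag, rectify=True, order=2):
    """Distance between successive magnitude spectra.

    mag has shape (frames, bins). order=2 with rectify=True sums the
    squared positive differences; order=1 with rectify=False sums the
    absolute differences.
    """
    diff = np.diff(mag, axis=0)
    if rectify:
        diff = half_wave(diff)
    sd = (np.abs(diff) ** order).sum(axis=1)
    return np.concatenate([[0.0], sd])  # frame 0 has no predecessor

# Toy magnitudes: energy appears at frame 1 and disappears at frame 3.
mag = np.array([[0.0, 0.0], [1.0, 2.0], [1.0, 2.0], [0.0, 0.0]])
rectified_l2 = spectral_difference(mag)
plain_l1 = spectral_difference(mag, rectify=False, order=1)
```

The rectified version responds only to the appearance of energy at frame 1, while the unrectified version also responds to its disappearance at frame 3, which illustrates why rectification favors onsets over offsets.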
A related form of spectral difference is introduced by Foote
[2] to obtain a measure of audio novelty.² A similarity matrix
is calculated using the correlation between STFT feature vectors
(power spectra). The matrix is then correlated with a
checkerboard kernel to detect the edges between areas of high and low
similarity. The resulting function shows sharp peaks at the times
of these changes, and is effectively an onset detection function
when kernels of small width are used.
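The similarity-matrix idea can be sketched as below. The cosine similarity between feature vectors, the unweighted (untapered) checkerboard kernel, the zero padding at the edges, and all names are assumptions of this sketch and are not taken from [2].

```python
import numpy as np

def checkerboard_kernel(half_width):
    """+1 on the two diagonal blocks, -1 on the two off-diagonal blocks."""
    sign = np.concatenate([-np.ones(half_width), np.ones(half_width)])
    return np.outer(sign, sign)

def novelty(features, half_width=4):
    """Slide a checkerboard kernel along the diagonal of a self-similarity
    matrix. features has shape (frames, dims)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    S = unit @ unit.T                       # cosine similarity (assumed measure)
    kernel = checkerboard_kernel(half_width)
    L = half_width
    padded = np.pad(S, L, mode="constant")  # zeros outside the matrix
    out = np.empty(len(S))
    for n in range(len(S)):
        # Window covers frames n-L .. n+L-1, so the edge sits just before n.
        out[n] = np.sum(padded[n : n + 2 * L, n : n + 2 * L] * kernel)
    return out

# Toy features: ten frames of one spectrum followed by ten of another.
features = np.vstack([np.tile([1.0, 0.0], (10, 1)), np.tile([0.0, 1.0], (10, 1))])
nov = novelty(features)
```

Inside a homogeneous region the positive and negative blocks cancel, so the output is zero; it peaks where the kernel straddles the change between the two regions.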
3) Spectral Features Using Phase: All the methods mentioned
so far have in common their use of the magnitude of the
spectrum as their only source of information. However, recent
approaches also make use of the phase spectra to further their
analyses of the behavior of onsets. This is relevant since much
of the temporal structure of a signal is encoded in the phase
spectrum.
Let us define $\varphi_k(n)$ as the $2\pi$-unwrapped phase of a given
STFT coefficient $X_k(n)$. For a steady state sinusoid, the phase
$\varphi_k(n)$, as well as the phase in the previous window
$\varphi_k(n-1)$, are used to calculate a value for the instantaneous
frequency, an estimate of the actual frequency of the $k$th STFT
component within this window, as [23]
$$f_k(n) = \left( \frac{\varphi_k(n) - \varphi_k(n-1)}{2\pi h} \right) f_s \tag{6}$$
where $h$ is the hop size between windows and $f_s$ is the sampling
frequency.
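The instantaneous frequency estimate can be tried on a single sinusoid. Because `np.angle` returns wrapped phases, the sketch below unwraps the phase increment by comparing it with the advance expected at the centre frequency of each bin, a common phase-vocoder device that is assumed here rather than taken from the text. The frame positions, window, and test frequency are also assumptions.

```python
import numpy as np

def princarg(phase):
    """Map phase values to the principal range [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def instantaneous_frequency(X_prev, X_curr, hop, fs):
    """Per-bin frequency estimate (Hz) from two consecutive STFT frames."""
    N = len(X_curr)
    k = np.arange(N)
    expected = 2 * np.pi * hop * k / N              # advance at the bin centre
    measured = np.angle(X_curr) - np.angle(X_prev)  # wrapped increment
    unwrapped = expected + princarg(measured - expected)
    return unwrapped * fs / (2 * np.pi * hop)

# A 1003 Hz sinusoid lies between bins 128 and 129 (bin spacing 7.8125 Hz).
fs, N, hop, f0 = 8000, 1024, 256, 1003.0
x = np.sin(2 * np.pi * f0 * np.arange(2 * N) / fs)
w = np.hanning(N)
X0 = np.fft.fft(x[:N] * w)
X1 = np.fft.fft(x[hop : hop + N] * w)

peak = int(np.argmax(np.abs(X1[: N // 2])))
f_est = instantaneous_frequency(X0, X1, hop, fs)[peak]
```

The estimate at the peak bin recovers the frequency of the sinusoid far more precisely than the bin spacing alone would allow, which is what makes the phase increment a useful indicator of stationarity.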
It is expected that, for a locally stationary sinusoid, the
instantaneous frequency should be approximately constant over
adjacent windows. Thus, according to (6), this is equivalent to
the phase increment from window to window remaining
approximately constant (cf. Fig. 3)
$$\varphi_k(n) - \varphi_k(n-1) \simeq \varphi_k(n-1) - \varphi_k(n-2) \tag{7}$$
Equivalently, the phase deviation can be defined as the second
difference of the phase
$$\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \simeq 0 \tag{8}$$
During a transient region, the instantaneous frequency is not
usually well defined, and hence $\Delta\varphi_k(n)$ will tend to be large.
This is illustrated in Fig. 3.
²The term novelty function is common to the literature in machine learning
and communication theory, and is widely used for video segmentation. In the
context of onset detection, our notion of the detection function can be seen also
as a novelty function, in that it tries to measure the extent to which an event is
unusual given a series of observations in the past.
Fig. 3. Phase diagram showing instantaneous frequencies as phase derivative
over adjacent frames. For a stationary sinusoid this should stay constant (dotted
line).
In [24], Bello proposes a method that analyzes the
instantaneous distribution (in the sense of a probability distribution
or histogram) of phase deviations across the frequency domain.
During the steady-state part of a sound, deviations tend to zero,
thus the distribution is strongly peaked around this value. During
attack transients, $\Delta\varphi_k(n)$ values increase, widening and
flattening the distribution. In [24], this behavior is quantified by
calculating the inter-quartile range and the kurtosis of the
distribution. In [25], a simpler measure of the spread of the
distribution is calculated as
$$\zeta_p(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} |\Delta\varphi_k(n)| \tag{9}$$
i.e., the mean absolute phase deviation. The method, although
showing some improvement for complex signals, is susceptible
to phase distortion and to noise introduced by the phases of
components with no significant energy.
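The mean absolute phase deviation can be sketched directly from a complex STFT matrix. Mapping the second difference back to the principal range, instead of explicitly unwrapping the phase first, is an assumed shortcut of this sketch; the synthetic test spectra are also assumptions.

```python
import numpy as np

def princarg(phase):
    """Map phase values to the principal range [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def mean_abs_phase_deviation(X):
    """Mean over bins of |second difference of the phase|, per frame.

    X is a complex STFT of shape (frames, bins). The first two frames
    have no second difference and are set to zero by convention.
    """
    phi = np.angle(X)
    d2 = princarg(phi[2:] - 2 * phi[1:-1] + phi[:-2])
    return np.concatenate([[0.0, 0.0], np.abs(d2).mean(axis=1)])

# Synthetic spectra: every bin advances by a constant phase increment.
frames, bins = 6, 8
n = np.arange(frames)[:, None]
k = np.arange(bins)[None, :]
X = np.exp(1j * (0.3 * k * n + k))
steady = mean_abs_phase_deviation(X)

# Same spectra with a 1 rad phase jump applied to frame 3 only.
X_jump = X.copy()
X_jump[3] *= np.exp(1j * 1.0)
jumped = mean_abs_phase_deviation(X_jump)
```

For the steady spectra the measure stays at zero; the isolated phase jump produces values of 1, 2, and 1 rad over the three frames whose second difference involves the perturbed frame. Every bin contributes equally regardless of its magnitude, which is the source of the noise sensitivity noted above.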
As an alternative to the sole use of magnitude or phase
information, [26] introduces an approach that works with Fourier
coefficients in the complex domain. The stationarity of the $k$th
spectral bin is quantified by calculating the Euclidean distance
$\Gamma_k(n)$ between the observed $X_k(n)$ and that predicted by the
previous frames, $\hat{X}_k(n)$
$$\Gamma_k(n) = \left\{ \left[\Re(\hat{X}_k(n)) - \Re(X_k(n))\right]^2 + \left[\Im(\hat{X}_k(n)) - \Im(X_k(n))\right]^2 \right\}^{1/2} \tag{10}$$
These distances are summed across the frequency domain to
generate an onset detection function
$$\eta(n) = \sum_{k} \Gamma_k(n) \tag{11}$$
See [27] for an application of this technique to multiple
bands. Other preprocessing, such as the removal of the tonal
part, may introduce distortions to the phase information and thus
adversely affect the performance of subsequent phase-based
onset detection methods.
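A minimal sketch of the complex-domain measure follows. The text does not spell out how the predicted coefficient is formed, so the predictor used here is an assumption: the magnitude of the previous frame combined with a phase extrapolated under a constant phase increment. The synthetic spectra and names are likewise assumptions.

```python
import numpy as np

def complex_domain(X):
    """Sum over bins of the distance between observed and predicted
    STFT coefficients, per frame. X has shape (frames, bins)."""
    mag, phi = np.abs(X), np.angle(X)
    # Assumed predictor: previous magnitude, phase advanced by the
    # previous frame-to-frame increment (2*phi(n-1) - phi(n-2)).
    predicted = mag[1:-1] * np.exp(1j * (2 * phi[1:-1] - phi[:-2]))
    gamma = np.abs(X[2:] - predicted)  # Euclidean distance in the complex plane
    return np.concatenate([[0.0, 0.0], gamma.sum(axis=1)])

# Stationary spectra: unit magnitude, constant phase increment per bin.
frames, bins = 6, 8
n = np.arange(frames)[:, None]
k = np.arange(bins)[None, :]
X = np.exp(1j * (0.3 * k * n + k))
eta_steady = complex_domain(X)

# Same phases, but the magnitude doubles from frame 3 onwards.
X_loud = X.copy()
X_loud[3:] *= 2.0
eta_loud = complex_domain(X_loud)
```

The stationary spectra give a detection function of zero. The magnitude step is flagged at frame 3 only, even though the phase is perfectly predictable there, which shows how the complex-domain distance combines magnitude and phase information in one measure.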
4) Time-Frequency and Time-Scale Analysis: An alternative
to the analysis of the temporal envelope of the signal and of
Fourier spectral coefficients, is the use of time-scale or
time-frequency representations (TFR).
In [28] a novelty function is calculated by measuring the
dissimilarity between feature vectors corresponding to a
discretized Cohen’s class TFR, in this case the result of convolving
the Wigner-Ville TFR of the function with a Gaussian kernel.
Note that the method could be also seen as a spectral difference
approach, given that by choosing an appropriate kernel, the
representation becomes equivalent to the spectrogram of the signal.
In [29], an approach for transient detection is described based
on a simple dyadic wavelet decomposition of the residual signal.
This transform, using the Haar wavelet, was chosen for its
simplicity and its good time localization at small scales. The scheme
takes advantage of the correlations across scales of the
coefficients: large wavelet coefficients, related to transients in the
signal, are not evenly spread within the dyadic plane but rather
form “structures”. Indeed, if a given coefficient has a large
amplitude, there is a high probability that the coefficients with the
same time localization at smaller scales also have large
amplitudes, therefore forming dyadic trees of significant coefficients.
The significance of full-size branches of coefficients, from the
largest to the smallest scale, can be quantified by a regularity
modulus, which is a local measure of the regularity of the signal
$$K_s(i) = \sum_{(j,k) \in B_i} 2^{js}\, |d_{j,k}| \tag{12}$$
where the $d_{j,k}$ are the wavelet coefficients, $B_i$ is the full branch
leading to a given small-scale coefficient $d_{1,i}$ (i.e., the set of
coefficients at larger scale and same time localization), and
$s$ a free parameter used to emphasize certain scales ($s = 0$ is
often used in practice). Since increases of $K_s$ are related to the
existence of large, transient-like coefficients in the branch $B_i$,
the regularity modulus can effectively act as an onset detection
function.
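The branch structure can be illustrated with a plain Haar decomposition. The sketch assumes that the regularity modulus is a scale-weighted sum of coefficient magnitudes along each branch, with weight 2 raised to the power of the scale index times a free parameter `s`; the orthonormal Haar normalization, the number of levels, and all names are further assumptions, and the residual signal of [29] is replaced by a toy impulse.

```python
import numpy as np

def haar_details(x, levels):
    """Haar detail coefficients, finest scale first (orthonormal scaling)."""
    approx = np.asarray(x, dtype=float)
    if len(approx) % (2 ** levels):
        raise ValueError("signal length must be a multiple of 2**levels")
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    return details

def regularity_modulus(x, levels=4, s=0.0):
    """Sum of 2**(j*s) * |d_{j,k}| along the branch above each
    finest-scale coefficient (assumed form of the modulus)."""
    details = haar_details(x, levels)
    K = np.zeros(len(details[0]))
    for j, d in enumerate(details, start=1):
        # A coefficient at scale j is the ancestor of 2**(j-1)
        # finest-scale coefficients with the same time localization.
        K += 2.0 ** (j * s) * np.repeat(np.abs(d), 2 ** (j - 1))
    return K

# Toy signal: a single impulse at sample 33 of a 64-sample frame.
x = np.zeros(64)
x[33] = 1.0
K = regularity_modulus(x)
peak_index = int(np.argmax(K))  # finest-scale index; sample position is 2 * index
```

The impulse produces a large coefficient at every scale above its position, so the whole branch contributes and the modulus peaks at the finest-scale index covering samples 32 and 33.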
B. Reduction Based on Probability Models
Statistical methods for onset detection are based on the
assumption that the signal can be described by some probability
model. A system can then be constructed that makes
probabilistic inferences about the likely times of abrupt changes in
the signal, given the available observations. The success of this
approach depends on the closeness of fit between the assumed
model, i.e., the probability distribution described by the model,
and the true distribution of the data, and may be quantified
using likelihood measures or Bayesian model selection criteria.
1) Model-Based Change Point Detection Methods: A well-
known approach is based on the sequential probability ratio test
[30]. It presupposes that the signal samples
are generated
