Journal ArticleDOI

Content-based classification, search, and retrieval of audio

01 Sep 1996-IEEE MultiMedia (IEEE)-Vol. 3, Iss: 3, pp 27-36
TL;DR: The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features, which lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features.
Abstract: Many audio and multimedia applications would benefit from the ability to classify and search for audio based on its characteristics. The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features. This lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features, or by selecting or entering reference sounds and asking the engine to retrieve similar or dissimilar sounds.

Summary (4 min read)

Previous research

  • Sounds are traditionally described by their pitch, loudness, duration, and timbre.
  • The effort to discover the components of timbre underlies much of the previous psychoacoustic research that is relevant to content-based audio retrieval.
  • Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope.
  • These algorithms were tuned to specific musical constructs and were not appropriate for all sounds.
  • Neural-net approaches to audio indexing, although somewhat successful, pose several problems from the authors' point of view.

Analysis and retrieval engine

  • Here the authors present a general paradigm and specific techniques for analyzing audio signals in a way that facilitates content-based retrieval.
  • This is analogous to a fuzzy text search and can be implemented using correlation techniques.
  • At the next level, the query might involve acoustic features that can be directly measured and perceptual properties of the sound.
  • Some of these properties may even have different meanings for different users.
  • Since the authors cannot know the complete list of aural properties that users might wish to specify, it is impossible to guarantee that their choice of acoustical features will meet these constraints.

Acoustical features

  • The authors can currently analyze the following aspects of sound: loudness, pitch, brightness, bandwidth, and harmonicity.
  • For each frame of the short-time Fourier analysis, the frequencies and amplitudes of the spectral peaks are measured, and an approximate greatest common divisor algorithm is used to calculate an estimate of the pitch.
  • A perfect young human ear can hear frequencies in the 20-Hz to 20-kHz range.
  • The trajectory in time is computed during the analysis but not stored as such in the database.
  • The feature vector thus consists of the duration plus the parameters just mentioned (average, variance, and autocorrelation) for each of the aspects of sound given above.

Training the system

  • The user can ask for sounds in a certain range of pitch or brightness. However, it is also possible to train the system by example.
  • When the user supplies a set of example sounds for training, the mean vector μ and the covariance matrix R for the a vectors in each class are calculated.
  • In practice, one can ignore the off-diagonal elements of R if the feature vector elements are reasonably independent of each other.
  • This simplification can yield significant savings in computation time.

Classifying sounds

  • When a new sound needs to be classified, a distance measure is calculated from the new sound's a vector and the model above.
  • Again, the off-diagonal elements of R can be ignored for faster computation.
  • If the class models some timbral aspect of the sounds, the duration and average pitch of the sounds can usually be ignored.
  • This value can be interpreted as "how much" of the defining property for the class the new sound has.

Retrieving sounds

  • It is now possible to select, sort, or classify sounds from the database using the distance measure.
  • Retrieve all the sounds that are less "scratchy" than a given sound.
  • This technique has the advantage that it can be implemented on top of the very efficient index-based search algorithms in existing commercial databases.
  • If the database has M0 sounds total, the authors first ask for all the sounds in a hyper-rectangle centered around the mean μ with volume V such that V/V0 = M/M0.
  • For small databases, it is easiest to compute the distance measure(s) for all the sounds in the database and then to choose the sounds that match the desired result.
  • Note that the above discussion is a simplification of their current algorithm, which asks for bigger volumes to begin with to correct for two factors.

Quality measures

  • If the dimensions of R are similar to the dimensions of the database, this class would not be useful as a discriminator, since all the sounds would fall into it.
  • Similarly, the system can detect other irregularities in the training set, such as outliers or bimodality.
  • From this, the user can see if a particular feature is too important or not important enough.
  • If all the sounds in the training set happen to have a very similar duration, the classification process will rank this feature highly, even though it may be irrelevant.
  • Similarly, the system can report to the user the components of the computed distance measure.

Segmentation

  • The discussion above deals with the case where each sound is a single gestalt.
  • Recordings that contain many different events need to be segmented before using the features above.
  • Segmentation is accomplished by applying the acoustic analyses discussed to the signal and looking for transitions (sudden changes in the measured features).
  • The transitions define segments of the signal, which can then be treated like individual sounds.
  • A recording of a concert could be scanned automatically for applause sounds to determine the boundaries between musical pieces.

Performance

  • The authors have used the above algorithms at Muscle Fish on a test sound database that contains about 400 sound files.
  • These sound files were culled from various sound effects and musical instrument sample libraries.
  • These classes were then used to reorder the sounds in the database by their likelihood of membership in the class.
  • See the "Class Model" sidebar on p. 32 for details on how their system computed this model.
  • One of the touchtone recordings that was left out of the training set also has a high likelihood, but notice that the other one, as well as one of those included in the training set, returned very low likelihoods.

Applications

  • The examples in this section will show the power this capability can bring to a user working in these areas.
  • For this example, the female-spoken phrase "tear gas" was used.
  • Figure 5 shows the record used in their sound browser, described in the next section.
  • Any user of the database can form an audio class by presenting a set of sounds to the classification algorithm of the last section.
  • This class can be private to the user or made available to all database users.

An audio database browser

  • The authors present a front-end database application named SoundFisher that lets the user search for sounds using queries that can be content based.
  • The bottom portion of the Query window consists of a list of sounds in the training set.
  • This likelihood is used as a multiplier against the likelihood computed from the similarity calculation or other parts of the query that yield fuzzy results.
  • The Back and Forward commands allow navigation along this path.
  • One of the fields available for constructing query components is "query," meaning "saved query."

Audio editors

  • Current audio editors operate directly on the samples of the audio waveform.
  • A more useful editor would include knowledge of the audio content.
  • Using the techniques presented in this article, a variety of sound classes appropriate for the particular application domain could be developed.
  • Editing a concert recording would be aided by classes for audience applause, solo instruments, loud and soft ensemble playing, and other typical sound features of concerts.
  • During the editing process, all the types of queries presented in the preceding sections could be used to navigate through the recording.

Surveillance

  • The application of content-based retrieval in surveillance is identical to that of the audio editor except that the identification and classification would be done in real time.
  • Many offices are already equipped with computers that have builtin audio input devices.
  • These could be used to listen for the sounds of people, glass breaking, and so on.
  • There are also a number of police jurisdictions using microphones and video cameras to continuously survey areas having a high incidence of criminal activity or a low tolerance of such activity.
  • Again, such surveillance could be made more efficient and easier to monitor with the ability to detect sounds associated with criminal activity.

Automatic segmentation of audio and video

  • In large archives of raw audio and video, it is useful to have some automatic indexing and segmentation of the raw recordings.
  • There has been quite a bit of work on the video side of the segmentation problem using scene changes and camera movement.
  • The raw trajectories are segmented by amplitude and pitch and converted into musical score information in the form of MIDI data.
  • This product assumes musical instrument recordings, so pitch is very important.
  • You could treat these segments as individual sounds that can then be analyzed for their statistical features, as the authors have described above.

Additional analytic features

  • An analysis engine for content-based audio classification and retrieval works by analyzing the acoustic features of the audio and reducing these to a few statistical values.
  • More analyses could be added to handle specific problem domains.
  • The authors' current set of acoustic features is targeted toward short or single-gestalt sounds.
  • The Audio-to-MIDI system referenced above could be used to do matching of musical melodies.

Source separation

  • In their current system, simultaneously sounding sources are treated as a single ensemble.
  • Approaches to separating simultaneous sounds typically involve either Gestalt psychology or non-perceptual signal-processing techniques.
  • For musical applications, polyphonic pitch-tracking has been studied for many years, but might well be an intractable problem in the general case.

Sound synthesis

  • Sound synthesis could assist a user in making content-based queries to an audio database.
  • When the user was unsure what values to use, the synthesis feature would create sound prototypes that matched the current set of values as they were manipulated.
  • The authors' examples show the efficacy and useful fuzzy nature of the search.
  • The results of searches are sometimes surprising in that they cross semantic boundaries, but aurally the results are reasonable.
  • Further implementation and testing of the system will reveal whether the chosen acoustical features are sufficient or excessive for usefully analyzing and classifying most sounds.



Content-Based Classification, Search, and Retrieval of Audio
Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton
Muscle Fish
The rapid increase in speed and capacity of computers and networks has allowed the inclusion of audio as a data type in many modern computer applications. However, the audio is usually treated as an opaque collection of bytes with only the most primitive fields attached: name, file format, sampling rate, and so on. Users accustomed to searching, scanning, and retrieving text data can be frustrated by the inability to look inside the audio objects.

Multimedia databases or file systems, for example, can easily have thousands of audio recordings. These could be anything from a library of sound effects to the soundtrack portion of a news footage archive. Such libraries are often poorly indexed or named to begin with. Even if a previous user has assigned keywords or indices to the data, these are often highly subjective and may be useless to another person. Searching for a particular sound or class of sound (such as applause, music, or the speech of a particular speaker) can be a daunting task.

How might people want to access sounds? We believe there are several useful methods, all of which we have attempted to incorporate into our system.
• Simile: saying one sound is like another sound or a group of sounds in terms of some characteristics. For example, "like the sound of a herd of elephants." A simpler example would be to say that it belongs to the class of speech sounds or the class of applause sounds, where the system has previously been trained on other sounds in this class.

• Acoustical/perceptual features: describing the sounds in terms of commonly understood physical characteristics such as brightness, pitch, and loudness.

• Subjective features: describing the sounds using personal descriptive language. This requires training the system (in our case, by example) to understand the meaning of these descriptive terms. For example, a user might be looking for a "shimmering" sound.

• Onomatopoeia: making a sound similar in some quality to the sound you are looking for. For example, the user could make a buzzing sound to find bees or electrical hum.

In a retrieval application, all of the above could be used in combination with traditional keyword and text queries.
To accomplish any of the above methods, we first reduce the sound to a small set of parameters using various analysis techniques. Second, we use statistical techniques over the parameter space to accomplish the classification and retrieval.
Previous research
Sounds are traditionally described by their pitch, loudness, duration, and timbre. The first three of these psychological percepts are well understood and can be accurately modeled by measurable acoustic features. Timbre, on the other hand, is an ill-defined attribute that encompasses all the distinctive qualities of a sound other than its pitch, loudness, and duration. The effort to discover the components of timbre underlies much of the previous psychoacoustic research that is relevant to content-based audio retrieval.[1]

Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope. The attack portions of a tone are often essential for identifying the timbre. Timbres with similar spectral energy distributions (as measured by the centroid of the spectrum) tend to be judged as perceptually similar. However, research has shown that the time-varying spectrum of a single musical instrument tone cannot generally be treated as a fingerprint identifying the instrument, because there is too much variation across the instrument's range of pitches and across its range of dynamic levels.

Various researchers have discussed or prototyped algorithms capable of extracting audio structure from a sound.[2] The goal was to allow queries such as "find the first occurrence of the note G-sharp." These algorithms were tuned to specific musical constructs and were not appropriate for all sounds.

Other researchers have focused on indexing audio databases using neural nets.[3] Although they have had some success with their method, there are several problems from our point of view. For example, while the neural nets report similarities between sounds, it is very hard to look inside a net after it is trained or while it is in operation to determine how well the training worked or what aspects of the sounds are similar to each other. This makes it difficult for the user to specify which features of the sound are important and which to ignore.
Analysis and retrieval engine
Here we present a general paradigm and specific techniques for analyzing audio signals in a way that facilitates content-based retrieval. Content-based retrieval of audio can mean a variety of things. At the lowest level, a user could retrieve a sound by specifying the exact numbers in an excerpt of the sound's sampled data. This is analogous to an exact text search and is just as simple to implement in the audio domain.

At the next higher level of abstraction, the retrieval would match any sound containing the given excerpt, regardless of the data's sample rate, quantization, compression, and so on. This is analogous to a fuzzy text search and can be implemented using correlation techniques. At the next level, the query might involve acoustic features that can be directly measured and perceptual (subjective) properties of the sound.[4,5] Above this, one can ask for speech content or musical content.

It is the sound level (acoustic and perceptual properties) with which we are most concerned here. Some of the aural (perceptual) properties of a sound, such as pitch, loudness, and brightness, correspond closely to measurable features of the audio signal, making it logical to provide fields for these properties in the audio database record. However, other aural properties ("scratchiness," for instance) are more indirectly related to easily measured acoustical features of the sound. Some of these properties may even have different meanings for different users.

We first measure a variety of acoustical features of each sound. This set of N features is represented as an N-vector. In text databases, the resolution of queries typically requires matching and comparing strings. In an audio database, we would like to match and compare the aural properties as described above. For example, we would like to ask for all the sounds similar to a given sound or that have more or less of a given property. To guarantee that this is possible, sounds that differ in the aural property should map to different regions of the N-space. If this were not satisfied, the database could not distinguish between sounds with different values for this property. Note that this approach is similar to the feature-vector approach currently used in content-based retrieval of images, although the actual features used are very different.[6]

Since we cannot know the complete list of aural properties that users might wish to specify, it is impossible to guarantee that our choice of acoustical features will meet these constraints. However, we can make sure that we meet these constraints for many useful aural properties.
Acoustical features
We can currently analyze the following aspects of sound: loudness, pitch, brightness, bandwidth, and harmonicity.

Loudness is approximated by the signal's root-mean-square (RMS) level in decibels, which is calculated by taking a series of windowed frames of the sound and computing the square root of the sum of the squares of the windowed sample values. (This method does not account for the frequency response of the human ear; if desired, the necessary equalization can be added by applying the Fletcher-Munson equal-loudness contours.) The human ear can hear over a 120-decibel range. Our software produces estimates over a 100-decibel range from 16-bit audio recordings.

Pitch is estimated by taking a series of short-time Fourier spectra. For each of these frames, the frequencies and amplitudes of the peaks are measured and an approximate greatest common divisor algorithm is used to calculate an estimate of the pitch. We store the pitch as a log frequency. The pitch algorithm also returns a pitch confidence value that can be used to weight the pitch in later calculations. A perfect young human ear can hear frequencies in the 20-Hz to 20-kHz range. Our software can measure pitches in the range of 50 Hz to about 10 kHz.

Brightness is computed as the centroid of the short-time Fourier magnitude spectra, again stored as a log frequency. It is a measure of the higher frequency content of the signal. As an example, putting your hand over your mouth as you speak reduces the brightness of the speech sound as well as the loudness. This feature varies over the same range as the pitch, although it can't be less than the pitch estimate at any given instant.

Bandwidth is computed as the magnitude-weighted average of the differences between the spectral components and the centroid. As examples, a single sine wave has a bandwidth of zero and ideal white noise has an infinite bandwidth.

Harmonicity distinguishes between harmonic spectra (such as vowels and most musical sounds), inharmonic spectra (such as metallic sounds), and noise (spectra that vary randomly in frequency and time). It is computed by measuring the deviation of the sound's line spectrum from a perfectly harmonic spectrum. This is currently an optional feature and is not used in the examples that follow. It is normalized to lie in a range from zero to one.

All of these aspects of sound vary over time. The trajectory in time is computed during the analysis but not stored as such in the database. However, for each of these trajectories, several features are computed and stored. These include the average value, the variance of the value over the trajectory, and the autocorrelation of the trajectory at a small lag. Autocorrelation is a measure of the smoothness of the trajectory. It can distinguish between a pitch glissando and a wildly varying pitch (for example), which the simple variance measure cannot.

The average, variance, and autocorrelation computations are weighted by the amplitude trajectory to emphasize the perceptually important sections of the sound. In addition to the above features, the duration of the sound is stored. The feature vector thus consists of the duration plus the parameters just mentioned (average, variance, and autocorrelation) for each of the aspects of sound given above. Figure 1 shows a plot of the raw trajectories of loudness, brightness, bandwidth, and pitch for a recording of male laughter.

After the statistical analyses, the resulting analysis record (shown in Table 1) contains the computed values. These numbers are the only information used in the content-based classification and retrieval of these sounds. It is possible to see some of the essential characteristics of the sound. Most notably, we see the rapidly time-varying nature of the laughter.
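To make the analysis concrete, here is a minimal Python/NumPy sketch of the per-frame measurements and the trajectory statistics described above. It covers loudness, brightness, and bandwidth only; the pitch and harmonicity estimators are omitted, the frame and hop sizes are assumed values, brightness is kept in Hz rather than log frequency, and the autocorrelation is unweighted for simplicity. This illustrates the paradigm and is not the authors' implementation.

```python
import numpy as np

def frame_features(x, sr, frame=1024, hop=512):
    """Per-frame loudness (dB RMS), brightness, and bandwidth (Hz) of signal x."""
    win = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    loud, bright, bw = [], [], []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * win
        rms = np.sqrt(np.mean(seg ** 2))
        loud.append(20 * np.log10(rms + 1e-10))      # RMS level in decibels
        mag = np.abs(np.fft.rfft(seg))
        total = mag.sum() + 1e-10
        centroid = (freqs * mag).sum() / total       # spectral centroid (brightness)
        bright.append(centroid)
        # Magnitude-weighted average deviation from the centroid (bandwidth).
        bw.append((np.abs(freqs - centroid) * mag).sum() / total)
    return np.array(loud), np.array(bright), np.array(bw)

def trajectory_stats(traj, weights, lag=1):
    """Amplitude-weighted mean and variance, plus autocorrelation at a small lag."""
    w = weights / (weights.sum() + 1e-10)
    mean = np.sum(w * traj)
    var = np.sum(w * (traj - mean) ** 2)
    ac = np.corrcoef(traj[:-lag], traj[lag:])[0, 1]  # smoothness of the trajectory
    return mean, var, ac

def feature_vector(x, sr):
    """Duration plus (mean, variance, autocorrelation) for each analyzed aspect."""
    loud, bright, bw = frame_features(x, sr)
    amp = 10 ** (loud / 20)                          # linear amplitude weights
    stats = [trajectory_stats(t, amp) for t in (loud, bright, bw)]
    duration = len(x) / sr
    return np.array([duration] + [v for s in stats for v in s])
```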
Training the system
It is possible to specify a sound directly by submitting constraints on the values of the N-vector described above directly to the system. For example, the user can ask for sounds in a certain range of pitch or brightness. However, it is also possible to train the system by example. In this case, the user selects examples of sounds that demonstrate the property the user wishes to train, such as "scratchiness."

For each sound entered into the database, the N-vector, which we represent as a, is computed. When the user supplies a set of example sounds for training, the mean vector μ and the covariance matrix R for the a vectors in each class are calculated. The mean and covariance are given by

μ = (1/M) Σ_m a_m    and    R = (1/M) Σ_m (a_m - μ)(a_m - μ)^T

where M is the number of sounds in the summation. In practice, one can ignore the off-diagonal elements of R if the feature vector elements are reasonably independent of each other. This simplification can yield significant savings in computation time. The mean and covariance together become the system's model of the perceptual property being trained by the user.

Figure 1. Male laughter: raw trajectories of loudness, brightness, bandwidth, and pitch.
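A minimal sketch of this training step, assuming the example feature vectors are stacked as rows of an array; train_class and its diagonal flag are illustrative names, not the authors' API.

```python
import numpy as np

def train_class(examples, diagonal=True):
    """Fit a class model (mean vector mu, covariance R) from example feature vectors.

    examples: array of shape (M, N), one N-dimensional feature vector per sound.
    """
    A = np.asarray(examples, dtype=float)
    mu = A.mean(axis=0)                       # mean vector μ
    centered = A - mu
    R = centered.T @ centered / len(A)        # covariance matrix R (1/M normalization)
    if diagonal:
        R = np.diag(np.diag(R))               # ignore off-diagonal elements for speed
    return mu, R
```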
Classifying sounds
When a new sound needs to be classified, a distance measure is calculated from the new sound's a vector and the model above. We use a weighted L2 distance:

D = ((a - μ)^T R^-1 (a - μ))^(1/2)

Again, the off-diagonal elements of R can be ignored for faster computation. Also, simpler measures such as an L1 or Manhattan distance can be used. The distance is compared to a threshold to determine whether the sound is "in" or "out" of the class. If there are several mutually exclusive classes, the sound is placed in the class to which it is closest, that is, for which it has the smallest value of D.

If it is known a priori that some acoustic features are unimportant for the class, these can be ignored or given a lower weight in the computation of D. For example, if the class models some timbral aspect of the sounds, the duration and average pitch of the sounds can usually be ignored.

We also define a likelihood value L based on the normal distribution and given by

L = exp(-D^2/2)

This value can be interpreted as "how much" of the defining property for the class the new sound has.
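The distance and likelihood computations translate directly into code; the in/out threshold value below is an assumed placeholder, since the paper leaves the threshold to the application.

```python
import numpy as np

def classify(a, mu, R, threshold=2.0):
    """Distance D and likelihood L of feature vector a under a class model (mu, R)."""
    diff = np.asarray(a, dtype=float) - mu
    # With a diagonal R, the inverse reduces to elementwise reciprocals.
    D = float(np.sqrt(diff @ np.linalg.inv(R) @ diff))  # weighted L2 distance
    L = float(np.exp(-D ** 2 / 2))                      # normal-based likelihood
    return D, L, D <= threshold                         # in the class if within threshold
```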
Retrieving sounds
It is now possible to select, sort, or classify sounds from the database using the distance measure. Some example queries are

• Retrieve all the "scratchy" sounds. That is, retrieve all the sounds that have a high likelihood of being in the "scratchy" class.

• Retrieve the top 20 "scratchy" sounds.

• Retrieve all the sounds that are less "scratchy" than a given sound.

• Sort the given set of sounds by how "scratchy" they are.

• Classify a given set of sounds into the following set of classes.

For small databases, it is easiest to compute the distance measure(s) for all the sounds in the database and then to choose the sounds that match the desired result. For large databases, this can be too expensive. To speed up the search, we index (sort) the sounds in the database by all the features. This lets us retrieve any desired hyper-rectangle of sounds in the database by requesting all sounds whose feature values fall in a set of desired ranges. Requesting such hyper-rectangles allows a much more efficient search. This technique has the advantage that it can be implemented on top of the very efficient index-based search algorithms in existing commercial databases.

As an example, consider a query to retrieve the top M sounds in a class. If the database has M0 sounds total, we first ask for all the sounds in a hyper-rectangle centered around the mean μ with volume V such that V/V0 = M/M0, where V0 is the volume of the feature space occupied by the database. If this request does not return enough sounds of the class, we increase the volume and try again.

Note that the above discussion is a simplification of our current algorithm, which asks for bigger volumes to begin with to correct for two factors. First, for our distance measure, we really want a hypersphere of volume V, which means we want the hyper-rectangle that circumscribes this sphere. Second, the distribution of sounds in the feature space is not perfectly regular. If we assume some reasonable distribution of the sounds in the database, we can easily compute how much larger V has to be to achieve some desired confidence level that the search will succeed.
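The sketch below illustrates this two-stage search under simplifying assumptions: the prefilter is written as a linear scan standing in for the database's indexed range query, the initial box half-widths and growth factor are arbitrary illustrative values, and the query is simply retried with a larger box until enough candidates survive.

```python
import numpy as np

def top_m_sounds(db, mu, R, m, grow=1.5):
    """Return the m sounds closest to a class model (mu, R).

    db: list of (name, feature_vector) pairs. In a real system the range
    query would run against per-feature database indexes, not a Python list.
    """
    R_inv = np.linalg.inv(R)
    dist = lambda a: float(np.sqrt((a - mu) @ R_inv @ (a - mu)))  # weighted L2
    half_width = 2.0 * np.sqrt(np.diag(R))    # initial hyper-rectangle half-widths
    while True:
        # Prefilter: keep sounds whose feature values fall inside the box.
        hits = [(name, np.asarray(a, dtype=float)) for name, a in db
                if np.all(np.abs(np.asarray(a, dtype=float) - mu) <= half_width)]
        if len(hits) >= m or len(hits) == len(db):
            break
        half_width *= grow                     # too few hits: enlarge and try again
    return sorted(hits, key=lambda item: dist(item[1]))[:m]
```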
Quality measures
The magnitude of the covariance matrix R is a measure of the compactness of the class. This can be reported to the user as a quality measure of the classification. For example, if the dimensions of R are similar to the dimensions of the database, this class would not be useful as a discriminator, since all the sounds would fall into it. Similarly, the system can detect other irregularities in the training set, such as outliers or bimodality.

The size of the covariance matrix in each dimension is a measure of the particular dimension's importance to the class. From this, the user can see if a particular feature is too important or not important enough. For example, if all the sounds in the training set happen to have a very similar duration, the classification process will rank this feature highly, even though it may be irrelevant. If this is the case, the user can tell the system to ignore duration or weight it differently, or the user can try to improve the training set.

Similarly, the system can report to the user the components of the computed distance measure. Again, this is an indication to the user of possible problems in the class description.

Note that all of these measures would be difficult to derive from a non-statistical model such as a neural network.
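One plausible way to compute such diagnostics, assuming each feature's class variance is compared with its variance over the whole database; the flagging thresholds are arbitrary illustrative values, not taken from the paper.

```python
import numpy as np

def class_quality(R_class, db_features):
    """Report per-feature compactness of a class relative to the whole database."""
    class_var = np.diag(R_class)
    db_var = np.asarray(db_features, dtype=float).var(axis=0)
    ratio = class_var / (db_var + 1e-10)   # near 1.0: feature doesn't discriminate
    for i, r in enumerate(ratio):
        if r > 0.8:
            print(f"feature {i}: not useful as a discriminator (ratio {r:.2f})")
        elif r < 0.01:
            print(f"feature {i}: may dominate the class, e.g. incidentally "
                  f"similar durations (ratio {r:.2f})")
    return ratio
```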
Segmentation
The discussion above deals with the case where each sound is a single gestalt. Some examples of this would be single short sounds, such as a door slam, or longer sounds of uniform texture, such as a recording of rain on cement. Recordings that contain many different events need to be segmented before using the features above.

Segmentation is accomplished by applying the acoustic analyses discussed to the signal and looking for transitions (sudden changes in the measured features). The transitions define segments of the signal, which can then be treated like individual sounds. For example, a recording of a concert could be scanned automatically for applause sounds to determine the boundaries between musical pieces. Similarly, after training the system to recognize a certain speaker, a recording could be segmented and scanned for all the sections where that speaker was talking.
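A transition detector in this spirit might flag frames where any feature trajectory jumps by much more than its typical frame-to-frame change; normalizing by the standard deviation and the threshold of 3 are assumptions, since the paper does not give a detection rule. Stacking the loudness, brightness, and bandwidth trajectories from the earlier frame_features sketch column-wise gives a suitable input.

```python
import numpy as np

def find_transitions(trajectories, z_thresh=3.0):
    """Indices of frames where any feature trajectory changes abruptly.

    trajectories: array of shape (n_frames, n_features).
    """
    T = np.asarray(trajectories, dtype=float)
    deltas = np.abs(np.diff(T, axis=0))                 # frame-to-frame changes
    z = deltas / (deltas.std(axis=0) + 1e-10)           # normalize per feature
    return np.where((z > z_thresh).any(axis=1))[0] + 1  # frames with sudden jumps
```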
Performance
We have used the above algorithms at Muscle Fish on a test sound database that contains about 400 sound files. These sound files were culled from various sound effects and musical instrument sample libraries. A wide variety of sounds are represented, from animals, machines, musical instruments, speech, and nature. The sounds vary in duration from less than a second to about 15 seconds.

A number of classes were made by running the classification algorithm on some perceptually similar sets of sounds. These classes were then used to reorder the sounds in the database by their likelihood of membership in the class. The following discussion shows the results of this process for several sound sets. These examples illustrate the character of the process and the fuzzy nature of the retrieval. (For more information, and to duplicate these examples, see the "Interactive Web Demo" sidebar.)

Example 1: Laughter. For this example, all the recordings of laughter except two were used in creating the class. Figure 2 shows a plot of the class membership likelihood values (the Y-axis) for all of the sound files in the test database. Each vertical strip along the X-axis is a user-defined category (the directory in which the sound resides). See the "Class Model" sidebar on p. 32 for details on how our system computed this model.

The highest returned likelihoods are for the laughing sounds, including the two that were not included in the original training set, as well as one of the animal recordings. This animal recording is of a chicken coop and has strong similarities in sound to the laughter recordings, consisting of a number of strong sound bursts.
Figure 2. Laughter classification: likelihood values for each sound file in the test database, grouped by category (Animals, Bells, Crowds, k2000, Laughter, Telephone, Water, McGill instrument samples: alto trombone, cello bowed, oboe, percussion, tubular bells, violin bowed, violin pizzicato; Speech/female, Speech/male); sounds not in the training set are marked.

Example 2: Female speech. Our test database contains a number of very short recordings of a female speaker.

Citations
Journal ArticleDOI
TL;DR: The automatic classification of audio signals into an hierarchy of musical genres is explored and three feature sets for representing timbral texture, rhythmic content and pitch content are proposed.
Abstract: Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members. These characteristics typically are related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into an hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content and pitch content are proposed. The performance and relative importance of the proposed features is investigated by training statistical pattern recognition classifiers using real-world audio collections. Both whole file and real-time frame-based classification schemes are described. Using the proposed feature sets, classification of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.

2,668 citations

Proceedings ArticleDOI
05 Mar 2017
TL;DR: The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Abstract: Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

2,204 citations


Cites methods from "Content-based classification, searc..."

  • ...Automatic systems for audio event classification go back to the Muscle Fish content-based sound effects retrieval system [14]....


Patent
06 Jun 1995
TL;DR: An adaptive interface for a programmable system, for predicting a desired user function, based on user history, as well as machine internal status and context, is presented for confirmation by the user, and the predictive mechanism is updated based on this feedback as mentioned in this paper.
Abstract: An adaptive interface for a programmable system, for predicting a desired user function, based on user history, as well as machine internal status and context. The apparatus receives an input from the user and other data. A predicted input is presented for confirmation by the user, and the predictive mechanism is updated based on this feedback. Also provided is a pattern recognition system for a multimedia device, wherein a user input is matched to a video stream on a conceptual basis, allowing inexact programming of a multimedia device. The system analyzes a data stream for correspondence with a data pattern for processing and storage. The data stream is subjected to adaptive pattern recognition to extract features of interest to provide a highly compressed representation which may be efficiently processed to determine correspondence. Applications of the interface and system include a VCR, medical device, vehicle control system, audio device, environmental control system, securities trading terminal, and smart house. The system optionally includes an actuator for effecting the environment of operation, allowing closed-loop feedback operation and automated learning.

1,976 citations

Proceedings Article
25 Aug 1997
TL;DR: The results demonstrate that the Mtree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in case of growing files.
Abstract: A new access method, called M-tree, is proposed to organize and search large data sets from a generic “metric space”, i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. We detail algorithms for insertion of objects and split management, which keep the M-tree always balanced - several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbors) queries are also described. Results from extensive experimentation with a prototype system are reported, considering as the performance criteria the number of page I/O’s and the number of distance computations. The results demonstrate that the Mtree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in case of growing files.

1,792 citations

Patent
01 Feb 1999
TL;DR: An adaptive interface for a programmable system, for predicting a desired user function, based on user history, as well as machine internal status and context, is presented for confirmation by the user, and the predictive mechanism is updated based on this feedback as mentioned in this paper.
Abstract: An adaptive interface for a programmable system, for predicting a desired user function, based on user history, as well as machine internal status and context. The apparatus receives an input from the user and other data. A predicted input is presented for confirmation by the user, and the predictive mechanism is updated based on this feedback. Also provided is a pattern recognition system for a multimedia device, wherein a user input is matched to a video stream on a conceptual basis, allowing inexact programming of a multimedia device. The system analyzes a data stream for correspondence with a data pattern for processing and storage. The data stream is subjected to adaptive pattern recognition to extract features of interest to provide a highly compressed representation that may be efficiently processed to determine correspondence. Applications of the interface and system include a video cassette recorder (VCR), medical device, vehicle control system, audio device, environmental control system, securities trading terminal, and smart house. The system optionally includes an actuator for effecting the environment of operation, allowing closed-loop feedback operation and automated learning.

1,182 citations

References
Journal ArticleDOI
TL;DR: A twin-comparison approach has been developed to solve the problem of detecting transitions implemented by special effects, and a motion analysis algorithm is applied to determine whether an actual transition has occurred.
Abstract: Partitioning a video source into meaningful segments is an important step for video indexing. We present a comprehensive study of a partitioning system that detects segment boundaries. The system is based on a set of difference metrics and it measures the content changes between video frames. A twin-comparison approach has been developed to solve the problem of detecting transitions implemented by special effects. To eliminate the false interpretation of camera movements as transitions, a motion analysis algorithm is applied to determine whether an actual transition has occurred. A technique for determining the threshold for a difference metric and a multi-pass approach to improve the computation speed and accuracy have also been developed.

1,360 citations

Book
31 Oct 1995
TL;DR: This chapter discusses image and Video Indexing and Retrieval techniques for Multimedia Compression, and some of the techniques used in this chapter were developed in the second part of this book.
Abstract: Part I: Introduction to Multimedia. 1. Basic Concepts. 2. Multimedia Networking and Synchronization. 3. Overview of Multimedia Applications. References. Part II: Multimedia Compression Techniques and Standards. 4. Introduction to Multimedia Compression. 5. JPEG Algorithm for Full-Color Still Image Compression. 6. PX64 Compression Algorithm for Video Telecommunications. 7. MPEG Compression for Motion-Intensive Applications. 8. Other Multimedia Compression Techniques. 9. Implementations of Compression Algorithms. 10. Applications of Compression Systems. References. Part III: Image and Video Indexing and Retrieval Techniques. 11. Content-Based Image Retrieval. 12. Content-Based Video Indexing and Retrieval. 13. Video Processing Using Compressed Data. 14. A Case Study in Video Parsing: Television News. References. Index.

224 citations

Book ChapterDOI
01 Apr 1993

203 citations

Journal ArticleDOI
TL;DR: One of the main problems in sound synthesis is that the composer's idea or concept of a sound does not necessarily correspond directly to the physical parameters of synthesis algorithms.
Abstract: One of the main problems in sound synthesis is that the composer's idea or concept of a sound does not necessarily correspond directly to the physical parameters of synthesis algorithms. In regard to FM synthesis, Ackermann (1991) mentions that the transition from idea to synthesis requires "patience, skill and a little bit of luck." Even computer-based sound-generating systems like Cmusic do not have a user interface that allows the intuitive mapping from sound idea to sound-generating method in a musically satisfactory

120 citations

Book
01 Jan 1976
Aspects of Tone Sensation: A Psychophysical Study.

86 citations

Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

Wold et al. present an audio analysis, search, and classification engine that reduces sounds to perceptual and acoustical features, letting users search or retrieve sounds by feature values, by previously trained classes, or by similarity to reference sounds.