Proceedings ArticleDOI

Content Fingerprinting Using Wavelets

Shumeet Baluja1, Michele Covell1
01 Jan 2006-pp 198-207
TL;DR: Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched; the authors explicitly measure the tradeoffs between performance, memory usage, and computation.
Abstract: In this paper, we introduce Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation.

Summary (3 min read)

1 Introduction

  • Audio fingerprinting provides the ability to link short, unlabeled snippets of audio content to corresponding data about that content.
  • There is an immense number of applications for audio fingerprinting.
  • In content-management systems, it can help follow the use of music and other audio material.
  • The query and stored version of a song may be aurally similar while having distinct bit representations.
  • Numerous difficulties exist when moving to techniques that do not require exact bit-level matches.

2 Previous Work

  • Many audio-fingerprinting techniques use low-level features that attempt to compactly describe the audio signal without assigning higher-level meaning to the features.
  • Each sub-fingerprint is a 32-bit vector indicating whether the difference between successive BFCC bands increases or decreases in consecutive frames (see the sketch after this list).
  • Ke used the same basic architecture as [8], but introduced a learning approach into the feature-selection process.
  • The features are more complex than in [8, 10], and they also summarize longer segments of audio than in those studies.
  • OPCA selects a set of directions for modeling the subspace that maximizes the signal variance while minimizing the noise power.
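
As a concrete illustration of this class of low-level features, here is a minimal sketch of an [8]-style sign-difference sub-fingerprint. It assumes a precomputed (frames × 33) matrix of band energies; the band count and all names are illustrative, not the original implementation.

```python
import numpy as np

def sign_difference_subfingerprints(energies: np.ndarray) -> np.ndarray:
    """Sub-fingerprints in the style of [8] (illustrative sketch only).

    energies: (n_frames, 33) matrix of per-frame band energies.
    Returns an (n_frames - 1, 32) bit array: bit (n, m) is 1 iff the
    difference between bands m and m+1 increased from frame n to n+1.
    """
    band_diff = energies[:, :-1] - energies[:, 1:]      # (n_frames, 32)
    frame_diff = band_diff[1:, :] - band_diff[:-1, :]   # (n_frames-1, 32)
    return (frame_diff > 0).astype(np.uint8)
```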

3 System Overview

  • The authors' system builds on the insight from [10] that computer-vision techniques can be a powerful method for analyzing audio data.
  • For each image, a wavelet signature is computed: a truncated, quantized version of the image's wavelet decomposition.
  • The authors can repeat the above procedure multiple times, each time with a new permutation of bit positions.
  • After these steps, each spectral image (or equivalent-length audio segment) is represented by a series of p 8-bit integers, the sub-fingerprint; a sketch of this pipeline follows this list.
  • Instead of exhaustive search, the authors use a technique termed Locality-Sensitive Hashing (LSH) [5].
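
The following is a minimal sketch of the signature pipeline described in this list, under stated assumptions: a hand-rolled Haar transform that requires power-of-two image dimensions, and a binarization that keeps only the positive retained wavelets (the paper quantizes the sign of each retained coefficient; the details here are simplifications, and all names are illustrative).

```python
import numpy as np

def haar2d(img: np.ndarray) -> np.ndarray:
    """Tiny 2D Haar decomposition; assumes power-of-two dimensions."""
    out = img.astype(float).copy()
    h, w = out.shape
    while h > 1 and w > 1:
        block = out[:h, :w]
        # Row pass: averages in the left half, differences in the right.
        out[:h, :w] = np.hstack([(block[:, 0::2] + block[:, 1::2]) / np.sqrt(2),
                                 (block[:, 0::2] - block[:, 1::2]) / np.sqrt(2)])
        block = out[:h, :w]
        # Column pass: averages on top, differences below.
        out[:h, :w] = np.vstack([(block[0::2, :] + block[1::2, :]) / np.sqrt(2),
                                 (block[0::2, :] - block[1::2, :]) / np.sqrt(2)])
        h //= 2
        w //= 2
    return out

def wavelet_signature(img: np.ndarray, top_t: int) -> np.ndarray:
    """Truncated, quantized signature: keep only the signs of the top_t
    largest-magnitude wavelet coefficients; zero everything else."""
    flat = haar2d(img).ravel()
    sig = np.zeros(flat.size, dtype=np.int8)
    keep = np.argsort(np.abs(flat))[-top_t:]
    sig[keep] = np.sign(flat[keep])
    return sig

def minhash_subfingerprint(sig: np.ndarray, perms: np.ndarray) -> np.ndarray:
    """p-byte Min-Hash sub-fingerprint of the sparse signature.

    perms: (p, sig.size) array of random permutations of bit positions.
    For each permutation, record the position of the first set bit,
    truncated to fit in one byte."""
    bits = sig > 0   # simplification: treat positive signs as the set bits
    sub = np.empty(perms.shape[0], dtype=np.uint8)
    for i, perm in enumerate(perms):
        hits = np.flatnonzero(bits[perm])
        sub[i] = min(hits[0], 255) if hits.size else 255
    return sub

# Illustrative usage with made-up sizes (p = 100 permutations):
rng = np.random.default_rng(0)
spectral_image = rng.random((64, 128))
perms = np.stack([rng.permutation(spectral_image.size) for _ in range(100)])
fp = minhash_subfingerprint(wavelet_signature(spectral_image, top_t=200), perms)
```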

3.1 Retrieval

  • The first difference in the retrieval process, in comparison to the database-generation process, is that the song is divided into randomly overlapping segments rather than uniformly overlapping segments.
  • Steps 6-8 describe an efficient mechanism for finding matches in the database and measuring their distances from the query; they are the subject of the next section.
  • LSH also supports flexible constraints on which candidates from the individual component hash tables will be examined further as part of the final list of candidate matches.
  • Under this scheme, each component hash table votes for the sub-fingerprints that were retrieved using its p/l bytes; a sketch of the table construction and voting follows this list.
  • Given the audio spectra of a song, extract spectral images of 11.6·w ms duration, with random spacing averaging d ms apart.
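
A sketch of this LSH layer under the byte-band reading above: each p-byte sub-fingerprint is split into l bands of p/l bytes, each band keys its own hash table, and the tables vote at query time. Function names and the vote threshold are illustrative, not the authors' API.

```python
from collections import defaultdict

def build_lsh_tables(subfingerprints: dict, l: int) -> list:
    """subfingerprints: id -> p-byte sub-fingerprint (a bytes object).
    Each of the l tables is keyed on one p/l-byte band of the fingerprint
    and stores the ids whose fingerprints share that band."""
    p = len(next(iter(subfingerprints.values())))
    band = p // l
    tables = [defaultdict(list) for _ in range(l)]
    for fid, sub in subfingerprints.items():
        for t in range(l):
            tables[t][sub[t * band:(t + 1) * band]].append(fid)
    return tables

def lsh_candidates(tables: list, query: bytes, min_votes: int) -> dict:
    """Each component table votes for the ids sharing its band with the
    query; only ids that accumulate at least min_votes are examined
    further as candidate matches."""
    band = len(query) // len(tables)
    votes = defaultdict(int)
    for t, table in enumerate(tables):
        for fid in table.get(query[t * band:(t + 1) * band], ()):
            votes[fid] += 1
    return {fid: v for fid, v in votes.items() if v >= min_votes}
```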

3.1.1 Temporal Ordering Constraints

  • Up to this point, the authors have discussed matching sub-fingerprints from the probe into the database.
  • The authors describe the methods that they have explored to accumulate evidence across sub-fingerprints over the duration of the probe snippet.
  • Dynamic-time warping is a form of dynamic programming for imposing "tempo" constraints in mapping one sequence onto another [11] .
  • Each probe sub-fingerprint can propose multiple matches in any given song, and any of these can be the correct starting correspondence (a simplified sketch of the temporal-track scoring follows this list).
  • In the opposite direction, if fewer than 20 matches passed their previous criteria for sub-fingerprint matches, then only that smaller set is considered.
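
As a deliberately simplified illustration of these constraints, the dynamic program below chains sub-fingerprint matches so that database time never runs backwards and the implied tempo stays within a ±10% cone anchored at the chain's first pair. The paper's actual scoring includes further details (e.g., the sampling-stride slack), so treat this as a sketch, not the authors' algorithm.

```python
def best_temporal_track(matches, tempo_tol=0.10):
    """matches: list of (probe_time, db_time, score) tuples for one
    candidate song. Returns the best total score over all temporal
    tracks that avoid time inversions and stay in the tempo cone."""
    matches = sorted(matches)                  # order by probe time
    best = [s for _, _, s in matches]          # best track ending at i
    anchor = list(range(len(matches)))         # chain-start index per track
    for i, (pi, di, si) in enumerate(matches):
        for j, (pj, dj, sj) in enumerate(matches[:i]):
            if pj >= pi or dj > di:
                continue                       # no local time inversions
            ap, ad, _ = matches[anchor[j]]
            if abs((di - ad) / (pi - ap) - 1.0) > tempo_tol:
                continue                       # outside the +/-10% tempo cone
            if best[j] + si > best[i]:
                best[i] = best[j] + si
                anchor[i] = anchor[j]
    return max(best, default=0.0)
```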

4 Experiments

  • Over 50,400 different parameter combinations were tried to ensure that the authors selected the best settings and understood the tradeoffs of each parameter.
  • Since their goal at this point is to understand the general system performance, the authors did not use heuristics such as unequal protection over time (i.e., protecting song beginnings and choruses more heavily) to reduce the amount of memory usage or computation.
  • The authors used 1,000 independent probe snippets against this song database.
  • Each of these probes included one of several distortions, with each distortion receiving equal representation; the first is time-offset only.
  • Another distortion, time-scale modification, increases or decreases the tempo by 10% without changing the pitch.

4.1 Empirical Results

  • With over 50,400 parameter settings and three attributes of interest (retrieval accuracy, memory usage, computational load), there are numerous ways to report the results.
  • There was also an unequal distribution across the number of retained top wavelets: the 400 and 200 top-wavelet settings accounted for nearly two-thirds of the run-to-completion points, with the remaining third split across 50, 100, and 800 top wavelets.
  • Third, the best retrieval accuracy on the best operating curve is 97.9%, while the best retrieval accuracy over all parameter settings was only 0.2% higher on this probe set.
  • Fifth, the authors can reduce the computation by an order of magnitude with little drop in accuracy.
  • Surprisingly, the configurations that define the best operating curve at 50% accuracy use 25 hash tables, so the memory reduction is not achieved by reducing the number of hash tables.

4.2 Comparisons

  • For performance comparison, the authors use the extension to [8] that was developed by [10] .
  • Unfortunately, their system was not designed to handle large timing variations, so the authors did not include time-based degradations in the next set of tests.
  • In practice, since these are pointers, they are represented with 4 bytes.
  • The sub-fingerprint is p elements long (p being the number of permutations used in the min-hash signature).

Memory

  • There is a memory cost when using temporal constraints with dynamic programming.
  • Therefore, on a standard 2GB machine, the authors can store approximately 47,000 songs without touching disk for retrieval.
  • Doing the same analysis for Waveprint-2 (l=25 instead of 20), and keeping the Temporal_Constraint_Overhead at 10 megabytes, the authors get 0.50×10⁹ bytes of memory (a back-of-the-envelope sketch follows this list).
  • The longer the sample snippet, the more reliable the recognition is but the longer the processing takes.
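
A back-of-the-envelope sketch of this memory accounting. Only the 4-byte pointers, the l values, and the 10-megabyte temporal-constraint overhead come from the text; the per-song sub-fingerprint count, p, and the bookkeeping model are assumptions, so the numbers below only illustrate the shape of the calculation.

```python
def waveprint_memory_bytes(num_songs: int, subfps_per_song: int,
                           p: int, l: int,
                           pointer_bytes: int = 4,
                           temporal_overhead: float = 10e6) -> float:
    """Assumed model: each sub-fingerprint stores its p Min-Hash bytes
    plus one pointer entry in each of the l hash tables."""
    per_subfp = p + l * pointer_bytes
    return num_songs * subfps_per_song * per_subfp + temporal_overhead

# With made-up densities: 10,000 songs, 250 sub-fingerprints per song,
# p = 100, l = 25 -> roughly 0.51x10^9 bytes.
```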


Citations
Patent
22 Aug 2008
TL;DR: In this paper, a matching module receives an input video fingerprint representing the input video and a set of reference fingerprints representing reference videos in a reference database, and compares the reference fingerprints and input fingerprints to generate a list of candidate segments from the reference video set.
Abstract: A system and method detects matches between portions of video content. A matching module receives an input video fingerprint representing an input video and a set of reference fingerprints representing reference videos in a reference database. The matching module compares the reference fingerprints and input fingerprints to generate a list of candidate segments from the reference video set. Each candidate segment comprises a time-localized portion of a reference video that potentially matches the input video. A classifier is applied to each of the candidate segments to classify the segment as a matching segment or a non-matching segment. A result is then outputted identifying a matching portion of a reference video from the reference video set based on the segments classified as matches.

114 citations

Proceedings ArticleDOI
Shumeet Baluja1, Michele Covell1
15 Apr 2007
TL;DR: The waveprint system, a novel system for audio identification that uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched, is presented.
Abstract: In this paper, we present waveprint, a novel system for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We measure the tradeoffs between performance, memory usage, and computation through extensive experimentation. The system is more efficient in terms of memory usage and computation, while being more accurate, when compared with previous state-of-the-art systems.

112 citations

Journal ArticleDOI
TL;DR: This work proposes the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data.
Abstract: Over the last two decades, much research effort has been spent on nearest neighbor search in high-dimensional data sets. Most of the approaches published thus far have, however, only been tested on rather small collections. When large collections have been considered, high-performance environments have been used, in particular systems with a large main memory. Accessing data on disk has largely been avoided because disk operations are considered to be too slow. It has been shown, however, that using large amounts of memory is generally not an economic choice. Therefore, we propose the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data. Using a single NV-tree, the returned results have high recall but contain a number of false positives. By combining two or three NV-trees, most of those false positives can be avoided while retaining the high recall. Finally, we compare the NV-tree to locality sensitive hashing, a popular method for ε-distance search. We show that they return results of similar quality, but the NV-tree uses many fewer disk reads.

102 citations


Cites methods from "Content Fingerprinting Using Wavele..."

  • ...Instead, we have taken the approach used in [4] and filter false positives by simply counting the number of occurrences of each descriptor in the result sets from all the hash tables and ranking the result accordingly....


Patent
19 Jun 2007
TL;DR: In this paper, video fingerprints provide a compact representation of the temporal locations of discontinuities in the video that can be used to quickly and efficiently identify video content, such as shot boundaries in video frame sequence or silent points in audio stream.
Abstract: A method and system generates and compares fingerprints for videos in a video library. The video fingerprints provide a compact representation of the temporal locations of discontinuities in the video that can be used to quickly and efficiently identify video content. Discontinuities can be, for example, shot boundaries in the video frame sequence or silent points in the audio stream. Because the fingerprints are based on structural discontinuity characteristics rather than exact bit sequences, visual content of videos can be effectively compared even when there are small differences between the videos in compression factors, source resolutions, start and stop times, frame rates, and so on. Comparison of video fingerprints can be used, for example, to search for and remove copyright protected videos from a video library. Furthermore, duplicate videos can be detected and discarded in order to preserve storage space.

86 citations

Patent
09 May 2007
TL;DR: In this paper, a method and system for generating and comparing fingerprints for videos in a video library is presented, which provides a compact representation of the spatial and sequential characteristics of the video that can be used to identify video content.
Abstract: A method and system generates and compares fingerprints for videos in a video library. The video fingerprints provide a compact representation of the spatial and sequential characteristics of the video that can be used to quickly and efficiently identify video content. Because the fingerprints are based on spatial and sequential characteristics rather than exact bit sequences, visual content of videos can be effectively compared even when there are small differences between the videos in compression factors, source resolutions, start and stop times, frame rates, and so on. Comparison of video fingerprints can be used, for example, to search for and remove copyright protected videos from a video library. Further, duplicate videos can be detected and discarded in order to preserve storage space.

76 citations

References
Journal ArticleDOI
TL;DR: In this paper, a face detection framework capable of processing images extremely rapidly while achieving high detection rates is described; implemented on a conventional desktop, detection proceeds at 15 frames per second.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

13,037 citations

Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

10,592 citations


"Content Fingerprinting Using Wavele..." refers methods in this paper

  • ...The duration and frequencies are selected via the AdaBoost algorithm; they are similar to the "boxlet" features used in [16] (average intensities of rectangular sub-regions of the spectrogram image)....


  • ...The learning approach, based on AdaBoost, is often used in computer-vision applications such as face detection [16]....


Book
01 Jan 1982

5,834 citations

Proceedings Article
07 Sep 1999
TL;DR: Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and that the method gives an improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Abstract: The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

3,705 citations


"Content Fingerprinting Using Wavele..." refers methods in this paper

  • ...The motivation for using wavelets in audio-retrieval is based on their successful use in creating an image-retrieval system [9]....


Book
01 Mar 1993
TL;DR: The preface to the IEEE Edition explains the background to speech production, coding, and quality assessment and introduces the Hidden Markov Model, the Artificial Neural Network, and Speech Enhancement.
Abstract: Preface to the IEEE Edition. Preface. Acronyms and Abbreviations. SIGNAL PROCESSING BACKGROUND. Propaedeutic. SPEECH PRODUCTION AND MODELLING. Fundamentals of Speech Science. Modeling Speech Production. ANALYSIS TECHNIQUES. Short--Term Processing of Speech. Linear Prediction Analysis. Cepstral Analysis. CODING, ENHANCEMENT AND QUALITY ASSESSMENT. Speech Coding and Synthesis. Speech Enhancement. Speech Quality Assessment. RECOGNITION. The Speech Recognition Problem. Dynamic Time Warping. The Hidden Markov Model. Language Modeling. The Artificial Neural Network. Index.

2,761 citations

Frequently Asked Questions (17)
Q1. What are the contributions in "Content Fingerprinting Using Wavelets"?

In this paper, the authors introduce Waveprint, a novel method for audio identification. The authors explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation. 

Other future work includes exploring applications beyond music matching, such as using the system for matching television broadcasts. 


Randomly selecting the stride amount is important to avoid problems of unlucky alignments; if the sampling of the probe is kept constant, it may be possible to repeatedly find samples that have uniformly large offsets from the sampling used to create the database. 
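
A sketch of the random-stride sampling this answer argues for; mean_stride_ms stands in for the paper's d parameter, and the uniform stride distribution is an assumption.

```python
import random

def random_segment_starts(total_ms: float, mean_stride_ms: float) -> list:
    """Segment start times with random strides that average mean_stride_ms,
    avoiding the unlucky fixed alignments a constant stride can produce."""
    starts, t = [], 0.0
    while t < total_ms:
        starts.append(t)
        t += random.uniform(0.0, 2.0 * mean_stride_ms)  # mean stride = d
    return starts
```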

Because each byte of the sub-fingerprint is a Min-Hash signature, the authors simply look at the number of bytes (out of p) that match exactly. 
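
In code, that match score is simply a count of exactly matching bytes out of the p Min-Hash bytes (a minimal sketch; names are illustrative):

```python
import numpy as np

def subfingerprint_match_score(a: np.ndarray, b: np.ndarray) -> int:
    """Number of exactly matching Min-Hash bytes between two p-byte
    sub-fingerprints; more matching bytes implies higher similarity."""
    assert a.shape == b.shape
    return int(np.count_nonzero(a == b))
```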


It should be possible to speed up the most computationally expensive portion of the process (computing and sorting the wavelets, which accounts for approximately 90% of the cost) by a factor of ~16-32x. 




For lookup of new queries, their system processes the audio image (spaced similarly to [8]) to create the 32-bit sub-fingerprint using the learned features. 

The score of the best temporal track within each song (as described above), where sub-fingerprint matches within each temporal track must:
- not introduce local time inversions (no backtracking within the database song),
- not match a single probe sub-fingerprint to more than one database-song sub-fingerprint (the opposite is allowed, due to unequal sampling rates),
- not include probe-database sub-fingerprint pairs that, measured along the database-song axis, lie more than one database sampling stride outside the ±10% tempo cone defined by the starting probe-database sub-fingerprint pair. 

Methods to make the matching process efficient, based on Locality-Sensitive Hashing (LSH), are presented with the description of the retrieval process. 

This reduces to the sub-fingerprint match score in the case of a single-length match track and to twice that value if all the sub-fingerprints are equal strength, but it otherwise does not grow with track length changes. 


In the two graphs shown in the figure, the authors restricted the computation and memory ranges to be close to the best operating curve (for the selected accuracy) but kept all experimental results that fell within the shown range of values, even if they were not on that operating curve. 

The authors only allow each sub-fingerprint to propose approximately twenty potential matches for itself, across the full database of songs.