Proceedings ArticleDOI

Content Fingerprinting Using Wavelets

Shumeet Baluja1, Michele Covell1
01 Jan 2006-pp 198-207
TL;DR: Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched; the authors explicitly measure the tradeoffs between performance, memory usage, and computation.
Abstract: In this paper, we introduce Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation.

Summary (3 min read)

1 Introduction

  • Audio fingerprinting provides the ability to link short, unlabeled snippets of audio content to corresponding data about that content.
  • There is an immense number of applications for audio fingerprinting.
  • In content-management systems, it can help follow the use of music and other audio material.
  • The query and stored version of a song may be aurally similar while having distinct bit representations.
  • Numerous difficulties exist when moving to techniques that do not require exact bit-level matches.

2 Previous Work

  • Many audio-fingerprinting techniques use low-level features that attempt to compactly describe the audio signal without assigning higher-level meaning to the features.
  • Each sub-fingerprint is a 32-bit vector indicating whether the difference between successive BFCC bands increases or decreases in consecutive frames (see the sketch after this list).
  • Ke used the same basic architecture as [8], but introduced a learning approach into the feature-selection process.
  • The features are more complex than in [8, 10], and they also summarize longer segments of audio than in those studies.
  • OPCA selects a set of directions for modeling the subspace that maximizes the signal variance while minimizing the noise power.
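
As a concrete illustration of this class of low-level features, here is a minimal sketch of an [8]-style sign-difference sub-fingerprint. It assumes a precomputed (frames × 33) matrix of band energies; the band count and all names are illustrative, not the original implementation.

```python
import numpy as np

def sign_difference_subfingerprints(energies: np.ndarray) -> np.ndarray:
    """Sub-fingerprints in the style of [8] (illustrative sketch only).

    energies: (n_frames, 33) matrix of per-frame band energies.
    Returns an (n_frames - 1, 32) bit array: bit (n, m) is 1 iff the
    difference between bands m and m+1 increased from frame n to n+1.
    """
    band_diff = energies[:, :-1] - energies[:, 1:]      # (n_frames, 32)
    frame_diff = band_diff[1:, :] - band_diff[:-1, :]   # (n_frames-1, 32)
    return (frame_diff > 0).astype(np.uint8)
```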

3 System Overview

  • The authors' system builds on the insight from [10] that computer-vision techniques can be a powerful method for analyzing audio data.
  • For each image, a wavelet signature is computed: a truncated, quantized version of the image's wavelet decomposition.
  • The authors can repeat the above procedure multiple times, each time with a new permutation of bit positions.
  • After these steps, each spectral image (or equivalent-length audio segment) is represented by a series of p 8-bit integers, the sub-fingerprint; a sketch of this pipeline follows this list.
  • Instead of exhaustive search, the authors use a technique termed Locality-Sensitive Hashing (LSH) [5].
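
The following is a minimal sketch of the signature pipeline described in this list, under stated assumptions: a hand-rolled Haar transform that requires power-of-two image dimensions, and a binarization that keeps only the positive retained wavelets (the paper quantizes the sign of each retained coefficient; the details here are simplifications, and all names are illustrative).

```python
import numpy as np

def haar2d(img: np.ndarray) -> np.ndarray:
    """Tiny 2D Haar decomposition; assumes power-of-two dimensions."""
    out = img.astype(float).copy()
    h, w = out.shape
    while h > 1 and w > 1:
        block = out[:h, :w]
        # Row pass: averages in the left half, differences in the right.
        out[:h, :w] = np.hstack([(block[:, 0::2] + block[:, 1::2]) / np.sqrt(2),
                                 (block[:, 0::2] - block[:, 1::2]) / np.sqrt(2)])
        block = out[:h, :w]
        # Column pass: averages on top, differences below.
        out[:h, :w] = np.vstack([(block[0::2, :] + block[1::2, :]) / np.sqrt(2),
                                 (block[0::2, :] - block[1::2, :]) / np.sqrt(2)])
        h //= 2
        w //= 2
    return out

def wavelet_signature(img: np.ndarray, top_t: int) -> np.ndarray:
    """Truncated, quantized signature: keep only the signs of the top_t
    largest-magnitude wavelet coefficients; zero everything else."""
    flat = haar2d(img).ravel()
    sig = np.zeros(flat.size, dtype=np.int8)
    keep = np.argsort(np.abs(flat))[-top_t:]
    sig[keep] = np.sign(flat[keep])
    return sig

def minhash_subfingerprint(sig: np.ndarray, perms: np.ndarray) -> np.ndarray:
    """p-byte Min-Hash sub-fingerprint of the sparse signature.

    perms: (p, sig.size) array of random permutations of bit positions.
    For each permutation, record the position of the first set bit,
    truncated to fit in one byte."""
    bits = sig > 0   # simplification: treat positive signs as the set bits
    sub = np.empty(perms.shape[0], dtype=np.uint8)
    for i, perm in enumerate(perms):
        hits = np.flatnonzero(bits[perm])
        sub[i] = min(hits[0], 255) if hits.size else 255
    return sub

# Illustrative usage with made-up sizes (p = 100 permutations):
rng = np.random.default_rng(0)
spectral_image = rng.random((64, 128))
perms = np.stack([rng.permutation(spectral_image.size) for _ in range(100)])
fp = minhash_subfingerprint(wavelet_signature(spectral_image, top_t=200), perms)
```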

3.1 Retrieval

  • The first difference in the retrieval process, in comparison to the database-generation process, is that the song is divided into randomly overlapping segments rather than uniformly overlapping segments.
  • Steps 6-8 describe an efficient mechanism for finding matches in the database and measuring their distances from the query; they are the subject of the next section.
  • LSH also supports flexible constraints on which candidates from the individual component hash tables will be examined further as part of the final list of candidate matches.
  • Under this scheme, each component hash table votes for the sub-fingerprints that were retrieved using its p/l bytes; a sketch of the table construction and voting follows this list.
  • Given the audio spectra of a song, extract spectral images of 11.6·w ms duration, with random spacing averaging d ms apart.
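
A sketch of this LSH layer under the byte-band reading above: each p-byte sub-fingerprint is split into l bands of p/l bytes, each band keys its own hash table, and the tables vote at query time. Function names and the vote threshold are illustrative, not the authors' API.

```python
from collections import defaultdict

def build_lsh_tables(subfingerprints: dict, l: int) -> list:
    """subfingerprints: id -> p-byte sub-fingerprint (a bytes object).
    Each of the l tables is keyed on one p/l-byte band of the fingerprint
    and stores the ids whose fingerprints share that band."""
    p = len(next(iter(subfingerprints.values())))
    band = p // l
    tables = [defaultdict(list) for _ in range(l)]
    for fid, sub in subfingerprints.items():
        for t in range(l):
            tables[t][sub[t * band:(t + 1) * band]].append(fid)
    return tables

def lsh_candidates(tables: list, query: bytes, min_votes: int) -> dict:
    """Each component table votes for the ids sharing its band with the
    query; only ids that accumulate at least min_votes are examined
    further as candidate matches."""
    band = len(query) // len(tables)
    votes = defaultdict(int)
    for t, table in enumerate(tables):
        for fid in table.get(query[t * band:(t + 1) * band], ()):
            votes[fid] += 1
    return {fid: v for fid, v in votes.items() if v >= min_votes}
```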

3.1.1 Temporal Ordering Constraints

  • Up to this point, the authors have discussed matching sub-fingerprints from the probe into the database.
  • The authors describe the methods that they have explored to accumulate evidence across sub-fingerprints over the duration of the probe snippet.
  • Dynamic-time warping is a form of dynamic programming for imposing "tempo" constraints in mapping one sequence onto another [11] .
  • Each probe sub-fingerprint can propose multiple matches in any given song, and any of these can be the correct starting correspondence (a simplified sketch of the temporal-track scoring follows this list).
  • In the opposite direction, if fewer than 20 matches passed their previous criteria for sub-fingerprint matches, then only that smaller set is considered.
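
As a deliberately simplified illustration of these constraints, the dynamic program below chains sub-fingerprint matches so that database time never runs backwards and the implied tempo stays within a ±10% cone anchored at the chain's first pair. The paper's actual scoring includes further details (e.g., the sampling-stride slack), so treat this as a sketch, not the authors' algorithm.

```python
def best_temporal_track(matches, tempo_tol=0.10):
    """matches: list of (probe_time, db_time, score) tuples for one
    candidate song. Returns the best total score over all temporal
    tracks that avoid time inversions and stay in the tempo cone."""
    matches = sorted(matches)                  # order by probe time
    best = [s for _, _, s in matches]          # best track ending at i
    anchor = list(range(len(matches)))         # chain-start index per track
    for i, (pi, di, si) in enumerate(matches):
        for j, (pj, dj, sj) in enumerate(matches[:i]):
            if pj >= pi or dj > di:
                continue                       # no local time inversions
            ap, ad, _ = matches[anchor[j]]
            if abs((di - ad) / (pi - ap) - 1.0) > tempo_tol:
                continue                       # outside the +/-10% tempo cone
            if best[j] + si > best[i]:
                best[i] = best[j] + si
                anchor[i] = anchor[j]
    return max(best, default=0.0)
```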

4 Experiments

  • Over 50,400 different parameter combinations were tried to ensure that the authors selected the best settings and understood the tradeoffs of each parameter.
  • Since their goal at this point is to understand the general system performance, the authors did not use heuristics such as unequal protection over time (i.e., protecting song beginnings and choruses more heavily) to reduce the amount of memory usage or computation.
  • The authors used 1,000 independent probe snippets against this song database.
  • Each of these probes included one of several distortions, with each distortion receiving equal representation; the first is time-offset only.
  • Another distortion, time-scale modification, increases or decreases the tempo by 10% without changing the pitch.

4.1 Empirical Results

  • With over 50,400 parameter settings and three attributes of interest (retrieval accuracy, memory usage, computational load), there are numerous ways to report the results.
  • There was also an unequal distribution across the number of retained top wavelets: the 400 and 200 top-wavelet settings accounted for nearly two-thirds of the run-to-completion points, with the remaining third split across 50, 100, and 800 top wavelets.
  • Third, the best retrieval accuracy on the best operating curve is 97.9%, while the best retrieval accuracy over all parameter settings was only 0.2% higher on this probe set.
  • Fifth, the authors can reduce the computation by an order of magnitude with little drop in accuracy.
  • Surprisingly, the configurations that define the best operating curve at 50% accuracy use 25 hash tables, so the memory reduction is not achieved by reducing the number of hash tables.

4.2 Comparisons

  • For performance comparison, the authors use the extension to [8] that was developed by [10] .
  • Unfortunately, their system was not designed to handle large timing variations, so the authors did not include time-based degradations in the next set of tests.
  • In practice, since these are pointers, they are represented with 4 bytes.
  • The sub-fingerprint is p elements long (p being the number of permutations used in the min-hash signature).

Memory

  • There is a memory cost when using temporal constraints with dynamic programming.
  • Therefore, on a standard 2GB machine, the authors can store approximately 47,000 songs without touching disk for retrieval.
  • Doing the same analysis for Waveprint-2 (l=25 instead of 20), and keeping the Temporal_Constraint_Overhead at 10 megabytes, the authors get 0.50×10⁹ bytes of memory (a back-of-the-envelope sketch follows this list).
  • The longer the sample snippet, the more reliable the recognition is but the longer the processing takes.
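
A back-of-the-envelope sketch of this memory accounting. Only the 4-byte pointers, the l values, and the 10-megabyte temporal-constraint overhead come from the text; the per-song sub-fingerprint count, p, and the bookkeeping model are assumptions, so the numbers below only illustrate the shape of the calculation.

```python
def waveprint_memory_bytes(num_songs: int, subfps_per_song: int,
                           p: int, l: int,
                           pointer_bytes: int = 4,
                           temporal_overhead: float = 10e6) -> float:
    """Assumed model: each sub-fingerprint stores its p Min-Hash bytes
    plus one pointer entry in each of the l hash tables."""
    per_subfp = p + l * pointer_bytes
    return num_songs * subfps_per_song * per_subfp + temporal_overhead

# With made-up densities: 10,000 songs, 250 sub-fingerprints per song,
# p = 100, l = 25 -> roughly 0.51x10^9 bytes.
```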


Citations
Patent
22 Aug 2008
TL;DR: In this paper, a matching module receives an input video fingerprint representing the input video and a set of reference fingerprints representing reference videos in a reference database, and compares the reference fingerprints and input fingerprints to generate a list of candidate segments from the reference video set.
Abstract: A system and method detects matches between portions of video content. A matching module receives an input video fingerprint representing an input video and a set of reference fingerprints representing reference videos in a reference database. The matching module compares the reference fingerprints and input fingerprints to generate a list of candidate segments from the reference video set. Each candidate segment comprises a time-localized portion of a reference video that potentially matches the input video. A classifier is applied to each of the candidate segments to classify the segment as a matching segment or a non-matching segment. A result is then outputted identifying a matching portion of a reference video from the reference video set based on the segments classified as matches.

114 citations

Proceedings ArticleDOI
Shumeet Baluja1, Michele Covell1
15 Apr 2007
TL;DR: The waveprint system, a novel system for audio identification that uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched, is presented.
Abstract: In this paper, we present waveprint, a novel system for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We measure the tradeoffs between performance, memory usage, and computation through extensive experimentation. The system is more efficient in terms of memory usage and computation, while being more accurate, when compared with previous state-of-the-art systems.

112 citations

Journal ArticleDOI
TL;DR: This work proposes the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data.
Abstract: Over the last two decades, much research effort has been spent on nearest neighbor search in high-dimensional data sets. Most of the approaches published thus far have, however, only been tested on rather small collections. When large collections have been considered, high-performance environments have been used, in particular systems with a large main memory. Accessing data on disk has largely been avoided because disk operations are considered to be too slow. It has been shown, however, that using large amounts of memory is generally not an economic choice. Therefore, we propose the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data. Using a single NV-tree, the returned results have high recall but contain a number of false positives. By combining two or three NV-trees, most of those false positives can be avoided while retaining the high recall. Finally, we compare the NV-tree to locality sensitive hashing, a popular method for ε-distance search. We show that they return results of similar quality, but the NV-tree uses many fewer disk reads.

102 citations


Cites methods from "Content Fingerprinting Using Wavele..."

  • ...Instead, we have taken the approach used in [4] and filter false positives by simply counting the number of occurrences of each descriptor in the result sets from all the hash tables and ranking the result accordingly....


Patent
19 Jun 2007
TL;DR: In this paper, video fingerprints provide a compact representation of the temporal locations of discontinuities in the video that can be used to quickly and efficiently identify video content, such as shot boundaries in video frame sequence or silent points in audio stream.
Abstract: A method and system generates and compares fingerprints for videos in a video library. The video fingerprints provide a compact representation of the temporal locations of discontinuities in the video that can be used to quickly and efficiently identify video content. Discontinuities can be, for example, shot boundaries in the video frame sequence or silent points in the audio stream. Because the fingerprints are based on structural discontinuity characteristics rather than exact bit sequences, visual content of videos can be effectively compared even when there are small differences between the videos in compression factors, source resolutions, start and stop times, frame rates, and so on. Comparison of video fingerprints can be used, for example, to search for and remove copyright protected videos from a video library. Furthermore, duplicate videos can be detected and discarded in order to preserve storage space.

86 citations

Patent
09 May 2007
TL;DR: In this paper, a method and system for generating and comparing fingerprints for videos in a video library is presented, which provides a compact representation of the spatial and sequential characteristics of the video that can be used to identify video content.
Abstract: A method and system generates and compares fingerprints for videos in a video library. The video fingerprints provide a compact representation of the spatial and sequential characteristics of the video that can be used to quickly and efficiently identify video content. Because the fingerprints are based on spatial and sequential characteristics rather than exact bit sequences, visual content of videos can be effectively compared even when there are small differences between the videos in compression factors, source resolutions, start and stop times, frame rates, and so on. Comparison of video fingerprints can be used, for example, to search for and remove copyright protected videos from a video library. Further, duplicate videos can be detected and discarded in order to preserve storage space.

76 citations

References
Journal ArticleDOI
TL;DR: In this paper, a face detection framework capable of processing images extremely rapidly while achieving high detection rates is described; implemented on a conventional desktop, detection proceeds at 15 frames per second.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

13,037 citations

Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

10,592 citations


"Content Fingerprinting Using Wavele..." refers methods in this paper

  • ...The duration and frequencies are selected via the AdaBoost algorithm; they are similar to the "boxlet" features used in [16] (average intensities of rectangular sub-regions of the spectrogram image)....


  • ...The learning approach, based on AdaBoost, is often used in computer-vision applications such as face detection [16]....


Book
01 Jan 1982

5,834 citations

Proceedings Article
07 Sep 1999
TL;DR: Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and that the method gives an improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Abstract: The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

3,705 citations


"Content Fingerprinting Using Wavele..." refers methods in this paper

  • ...The motivation for using wavelets in audio-retrieval is based on their successful use in creating an image-retrieval system [9]....


Book
01 Mar 1993
TL;DR: The preface to the IEEE Edition explains the background to speech production, coding, and quality assessment and introduces the Hidden Markov Model, the Artificial Neural Network, and Speech Enhancement.
Abstract: Preface to the IEEE Edition. Preface. Acronyms and Abbreviations. SIGNAL PROCESSING BACKGROUND. Propaedeutic. SPEECH PRODUCTION AND MODELLING. Fundamentals of Speech Science. Modeling Speech Production. ANALYSIS TECHNIQUES. Short--Term Processing of Speech. Linear Prediction Analysis. Cepstral Analysis. CODING, ENHANCEMENT AND QUALITY ASSESSMENT. Speech Coding and Synthesis. Speech Enhancement. Speech Quality Assessment. RECOGNITION. The Speech Recognition Problem. Dynamic Time Warping. The Hidden Markov Model. Language Modeling. The Artificial Neural Network. Index.

2,761 citations

Frequently Asked Questions (17)
Q1. What are the contributions in "Content Fingerprinting Using Wavelets"?

In this paper, the authors introduce Waveprint, a novel method for audio identification. The authors explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation. 

Other future work includes exploring applications beyond music matching, such as using the system for matching television broadcasts. 


Randomly selecting the stride amount is important to avoid problems of unlucky alignments; if the sampling of the probe is kept constant, it may be possible to repeatedly find samples that have uniformly large offsets from the sampling used to create the database. 
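
A sketch of the random-stride sampling this answer argues for; mean_stride_ms stands in for the paper's d parameter, and the uniform stride distribution is an assumption.

```python
import random

def random_segment_starts(total_ms: float, mean_stride_ms: float) -> list:
    """Segment start times with random strides that average mean_stride_ms,
    avoiding the unlucky fixed alignments a constant stride can produce."""
    starts, t = [], 0.0
    while t < total_ms:
        starts.append(t)
        t += random.uniform(0.0, 2.0 * mean_stride_ms)  # mean stride = d
    return starts
```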

Because each byte of the sub-fingerprint is a Min-Hash signature, the authors simply look at the number of bytes (out of p) that match exactly. 
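
In code, that match score is simply a count of exactly matching bytes out of the p Min-Hash bytes (a minimal sketch; names are illustrative):

```python
import numpy as np

def subfingerprint_match_score(a: np.ndarray, b: np.ndarray) -> int:
    """Number of exactly matching Min-Hash bytes between two p-byte
    sub-fingerprints; more matching bytes implies higher similarity."""
    assert a.shape == b.shape
    return int(np.count_nonzero(a == b))
```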


It should be possible to speed up the most computationally expensive portion of the process (computing and sorting the wavelets, which accounts for approximately 90% of the cost) by a factor of ~16-32x. 




For lookup of new queries, their system processes the audio image (spaced similarly to [8]) to create the 32-bit sub-fingerprint using the learned features. 

The score of the best temporal track within each song (as described above), where sub-fingerprint matches within each temporal track must:
- not introduce local time inversions (no backtracking within the database song),
- not match a single probe sub-fingerprint to more than one database-song sub-fingerprint (the opposite is allowed, due to unequal sampling rates),
- not include probe-database sub-fingerprint pairs that, measured along the database-song axis, lie more than one database sampling stride outside the ±10% tempo cone defined by the starting probe-database sub-fingerprint pair. 

Methods to make the matching process efficient, based on Locality-Sensitive Hashing (LSH), are presented with the description of the retrieval process. 

This reduces to the sub-fingerprint match score in the case of a single-length match track and to twice that value if all the sub-fingerprints are equal strength, but it otherwise does not grow with track length changes. 


In the two graphs shown in the figure, the authors restricted the computation and memory ranges to be close to the best operating curve (for the selected accuracy) but kept all experimental results that fell within the shown range of values, even if they were not on that operating curve. 

The authors only allow each sub-fingerprint to propose approximately twenty potential matches for itself, across the full database of songs.