
Showing papers on "Feature hashing published in 2009"


Journal ArticleDOI
TL;DR: In this paper, a deep graphical model of the word-count vectors obtained from a large set of documents is proposed; the model learns to map documents to compact binary codes so that semantically similar documents receive nearby codes and can be retrieved extremely quickly.

1,266 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: It is shown how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sub-linear time similarity search guarantees for a wide class of useful similarity functions.
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be efficiently searched. However, existing methods do not apply for high-dimensional kernelized data when the underlying feature embedding for the kernel is unknown. We show how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sub-linear time similarity search guarantees for a wide class of useful similarity functions. Since a number of successful image-based kernels have unknown or incomputable embeddings, this is especially valuable for image retrieval tasks. We validate our technique on several large-scale datasets, and show that it enables accurate and fast performance for example-based object classification, feature matching, and content-based retrieval.
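A rough sketch of the construction, as I read it (simplified, not the authors' implementation; the centering of kernel evaluations at query time is omitted): each hash bit thresholds a weighted sum of kernel evaluations against a small sample of p database points, with weights w = K^(-1/2) e_S chosen so that the implicit hyperplane is approximately Gaussian in the kernel-induced feature space.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_klsh(anchors, n_bits, t=5, kernel=rbf, rng=np.random.default_rng(0)):
    """Build KLSH-style hash weights from p sampled anchor points."""
    p = len(anchors)
    K = kernel(anchors, anchors)
    K = K - K.mean(0) - K.mean(1)[:, None] + K.mean()  # center in feature space
    evals, evecs = np.linalg.eigh(K)
    K_inv_sqrt = evecs @ np.diag(np.maximum(evals, 1e-8) ** -0.5) @ evecs.T
    W = np.zeros((n_bits, p))
    for b in range(n_bits):
        S = rng.choice(p, size=t, replace=False)   # random t-subset of anchors
        e_S = np.zeros(p)
        e_S[S] = 1.0
        W[b] = K_inv_sqrt @ e_S                    # w = K^(-1/2) e_S
    return W

def hash_points(X, anchors, W, kernel=rbf):
    """Hash rows of X to binary codes using kernel evaluations only."""
    return (kernel(X, anchors) @ W.T > 0).astype(np.uint8)

anchors = np.random.randn(100, 16)                 # sample of the database
W = fit_klsh(anchors, n_bits=32)
print(hash_points(np.random.randn(5, 16), anchors, W).shape)  # (5, 32)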

975 citations


Proceedings ArticleDOI
14 Jun 2009
TL;DR: In this article, the authors provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability, and demonstrate the feasibility of this approach with experimental results for a new use case.
Abstract: Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case --- multitask learning with hundreds of thousands of tasks.
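The hashed feature map analyzed in the paper is simple to implement; below is a minimal sketch of the signed hashing trick (a toy version of mine, not the authors' code): each token is hashed into one of m buckets, and a second hash supplies a +/-1 sign so that collisions cancel in expectation.

```python
import hashlib

def hashed_features(tokens, m=2**18):
    """Map a list of string tokens to an m-dimensional sparse vector
    using the signed hashing trick: phi[h(t) mod m] += sign(t)."""
    phi = {}
    for t in tokens:
        d = hashlib.md5(t.encode("utf-8")).digest()
        idx = int.from_bytes(d[:8], "little") % m   # bucket index
        sign = 1 if d[8] % 2 == 0 else -1           # independent sign hash
        phi[idx] = phi.get(idx, 0) + sign
    return phi

# Inner products in the hashed space approximate those in the original
# bag-of-words space, with distortion vanishing as m grows.
doc_a = hashed_features("the quick brown fox".split())
doc_b = hashed_features("the lazy brown dog".split())
dot = sum(v * doc_b.get(k, 0) for k, v in doc_a.items())
print(dot)
```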

955 citations


Proceedings Article
07 Dec 2009
TL;DR: An algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings is developed.
Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques.
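Schematically (my notation, simplified from the paper), the reconstructive objective is a squared discrepancy between input-space distances and scaled Hamming distances over a set of training pairs:

```latex
\min_{W}\; \sum_{(i,j)\in\mathcal{N}}
\left( \tfrac{1}{2}\lVert x_i - x_j \rVert^2
      \;-\; \tfrac{1}{b}\, d_H\!\bigl(b(x_i),\, b(x_j)\bigr) \right)^{\!2},
\qquad
b(x) = \operatorname{sign}\!\bigl(W^{\top} \kappa(x)\bigr),
```

where b is the code length, d_H is Hamming distance, and kappa(x) is a vector of kernel evaluations against anchor points, which is what makes the method easy to kernelize.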

914 citations


Proceedings Article
07 Dec 2009
TL;DR: This paper introduces a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel between the vectors.
Abstract: This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing.
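A minimal sketch of the scheme as I understand it (parameters set for a Gaussian kernel; not the authors' code): each bit thresholds a randomly shifted cosine of a random projection, in the spirit of random Fourier features, so the expected fraction of differing bits grows with kernel distance.

```python
import numpy as np

def make_code_fn(dim, n_bits, gamma=1.0, rng=np.random.default_rng(0)):
    """Binary codes whose expected Hamming distance tracks a Gaussian
    kernel exp(-gamma/2 * ||x - y||^2): a shift-invariant kernel
    embedding, binarized with random thresholds."""
    W = rng.normal(scale=np.sqrt(gamma), size=(n_bits, dim))  # kernel spectrum
    b = rng.uniform(0, 2 * np.pi, size=n_bits)                # random phases
    t = rng.uniform(-1, 1, size=n_bits)                       # random thresholds
    def code(X):
        return (np.cos(X @ W.T + b) >= t).astype(np.uint8)
    return code

code = make_code_fn(dim=32, n_bits=256)
x = np.random.randn(32)
y = x + 0.05 * np.random.randn(32)             # a near neighbor
z = np.random.randn(32)                        # an unrelated point
print((code(x[None]) != code(y[None])).mean(),  # small Hamming fraction
      (code(x[None]) != code(z[None])).mean())  # larger Hamming fraction
```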

702 citations


Journal Article
TL;DR: This work generalizes previous work using sampling and shows a principled way to compute the kernel matrix for data streams and sparse feature spaces and gives deviation bounds from the exact kernel matrix.
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs.
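For strings, the construction amounts to hashing all n-grams into a fixed-size signed vector, so the inner product of two hashed vectors approximates an n-gram string kernel without materializing the huge exact feature space. A toy illustration (mine, not the authors' code):

```python
import hashlib

def hashed_ngram_vector(s, n=3, m=2**16):
    """Hash all character n-grams of s into m signed buckets."""
    v = {}
    for i in range(len(s) - n + 1):
        d = hashlib.md5(s[i:i + n].encode()).digest()
        idx = int.from_bytes(d[:8], "little") % m
        sign = 1 if d[8] % 2 == 0 else -1
        v[idx] = v.get(idx, 0) + sign
    return v

def hash_kernel(s1, s2, n=3, m=2**16):
    """Approximate the n-gram string kernel k(s1, s2) via hashed vectors."""
    v1, v2 = hashed_ngram_vector(s1, n, m), hashed_ngram_vector(s2, n, m)
    return sum(val * v2.get(k, 0) for k, val in v1.items())

print(hash_kernel("feature hashing", "feature washing"))  # high: shared n-grams
print(hash_kernel("feature hashing", "cuckoo tables"))    # near zero
```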

264 citations


Journal ArticleDOI
01 Dec 2009
TL;DR: An efficient data-parallel algorithm for building large hash tables of millions of elements in real time, combining a classical sparse perfect hashing approach with cuckoo hashing, which packs elements densely by allowing an element to be stored in one of multiple possible locations.
Abstract: We demonstrate an efficient data-parallel algorithm for building large hash tables of millions of elements in real-time. We consider two parallel algorithms for the construction: a classical sparse perfect hashing approach, and cuckoo hashing, which packs elements densely by allowing an element to be stored in one of multiple possible locations. Our construction is a hybrid approach that uses both algorithms. We measure the construction time, access time, and memory usage of our implementations and demonstrate real-time performance on large datasets: for 5 million key-value pairs, we construct a hash table in 35.7 ms using 1.42 times as much memory as the input data itself, and we can access all the elements in that hash table in 15.3 ms. For comparison, sorting the same data requires 36.6 ms, but accessing all the elements via binary search requires 79.5 ms. Furthermore, we show how our hashing methods can be applied to two graphics applications: 3D surface intersection for moving data and geometric hashing for image matching.
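The cuckoo half of the hybrid is easy to sketch serially (a toy CPU version under my own simplifications; the paper's construction is data-parallel on the GPU and combined with perfect hashing): inserting into an occupied slot evicts the resident item, which is then re-placed at its alternate location.

```python
class CuckooTable:
    """Toy serial cuckoo hash table with two hash functions."""
    def __init__(self, capacity=17, max_kicks=50):
        self.slots = [None] * capacity
        self.capacity, self.max_kicks = capacity, max_kicks

    def _h(self, key, i):
        return hash((i, key)) % self.capacity   # two salted hash functions

    def insert(self, key, value):
        item, j = (key, value), self._h(key, 0)
        for _ in range(self.max_kicks):
            if self.slots[j] is None:
                self.slots[j] = item
                return True
            item, self.slots[j] = self.slots[j], item   # evict the occupant
            # send the evicted item to its alternate location
            h0, h1 = self._h(item[0], 0), self._h(item[0], 1)
            j = h1 if j == h0 else h0
        return False  # kick budget exceeded; a real table would rehash/grow

    def get(self, key):
        for i in (0, 1):
            item = self.slots[self._h(key, i)]
            if item is not None and item[0] == key:
                return item[1]
        return None

table = CuckooTable()
for k in range(6):
    table.insert(f"key{k}", k)
print(table.get("key3"))  # 3 -- lookups probe at most two slots
```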

194 citations




Proceedings ArticleDOI
02 Nov 2009
TL;DR: This article proposes Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match and proposes several improvements to the basic model, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH).
Abstract: In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
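The core scoring model can be written compactly (schematic form; notation mine, based on the abstract's description of low-rank, diagonal-preserving representations): a query q and a document d, as sparse term vectors, are matched through a word-pair weight matrix W factored as low rank plus identity,

```latex
f(q, d) = q^{\top} W\, d, \qquad W = U^{\top} V + I,
```

so the score decomposes into a learned low-rank semantic match (Uq)·(Vd) plus the classical term-overlap q·d, and correlated feature hashing (CFH) further shrinks the vocabulary dimension by hashing correlated words into shared buckets.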

100 citations


Journal ArticleDOI
TL;DR: An image hashing algorithm based on compressive sensing principles is proposed, which solves both the authentication and the tampering identification problems and is robust to moderate content-preserving transformations including cropping, scaling, and rotation.
Abstract: In the last decade, the increased possibility to produce, edit, and disseminate multimedia contents has not been adequately balanced by similar advances in protecting these contents from unauthorized diffusion of forged copies. When the goal is to detect whether or not a digital content has been tampered with in order to alter its semantics, the use of multimedia hashes turns out to be an effective solution to offer proof of legitimacy and to possibly identify the introduced tampering. We propose an image hashing algorithm based on compressive sensing principles, which solves both the authentication and the tampering identification problems. The original content producer generates a hash using a small bit budget by quantizing a limited number of random projections of the authentic image. The content user receives the (possibly altered) image and uses the hash to estimate the mean square error distortion between the original and the received image. In addition, if the introduced tampering is sparse in some orthonormal basis or redundant dictionary, an approximation is given in the pixel domain. We emphasize that the hash is universal, e.g., the same hash signature can be used to detect and identify different types of tampering. At the cost of additional complexity at the decoder, the proposed algorithm is robust to moderate content-preserving transformations including cropping, scaling, and rotation. In addition, in order to keep the size of the hash small, hash encoding/decoding takes advantage of distributed source codes.
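The encoder is essentially key-driven quantized random projections, and the decoder can estimate distortion from them because Gaussian projections are near-isometric. A minimal sketch (my own simplification, omitting the distributed source coding of the hash):

```python
import numpy as np

def cs_hash(image_vec, n_proj=64, n_levels=16, seed=42):
    """Hash = coarsely quantized random projections of the image.
    The seed plays the role of a key shared by producer and user."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_proj, image_vec.size)) / np.sqrt(image_vec.size)
    y = A @ image_vec
    step = 8 * y.std() / n_levels if y.std() > 0 else 1.0
    return np.round(y / step).astype(np.int32), step

def estimate_mse(received_vec, hash_q, step, n_proj=64, seed=42):
    """Estimate MSE between original and received image from the hash:
    energy in the projection difference tracks pixel-domain energy."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_proj, received_vec.size)) / np.sqrt(received_vec.size)
    diff = A @ received_vec - hash_q * step   # approx. A @ (received - original)
    return float(diff @ diff) / n_proj

original = np.random.rand(64 * 64)
tampered = original + 0.1 * (np.random.rand(64 * 64) < 0.01)  # sparse tampering
h, step = cs_hash(original)
print(estimate_mse(tampered, h, step))  # approximates mean((tampered-original)**2)
```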

86 citations


01 Jan 2009
TL;DR: This paper delves into a recently proposed technique for collaborative spam filtering that facilitates personalization and describes how this can be used to improve the quality of spam filtering.

Abstract: This paper delves into a recently proposed technique for collaborative spam filtering [7] that facilitates personalization.
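The hashing-trick personalization referred to here is easy to sketch (my reconstruction; names and sizes are illustrative, not from the paper): every token is hashed twice into one shared weight vector, once globally and once salted with the recipient's ID, so each user gets a lightweight personalized correction without a separate per-user model.

```python
import hashlib

M = 2 ** 20                 # one shared weight vector for all users
w = [0.0] * M

def slot(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % M

def features(tokens, user_id):
    """Global feature h(token) plus personalized feature h(user || token)."""
    idx = []
    for t in tokens:
        idx.append(slot(t))                      # shared, collaborative feature
        idx.append(slot(user_id + "\x00" + t))   # user-specific correction
    return idx

def predict(tokens, user_id):
    return sum(w[i] for i in features(tokens, user_id))

def perceptron_update(tokens, user_id, label):   # label: +1 spam, -1 ham
    if label * predict(tokens, user_id) <= 0:
        for i in features(tokens, user_id):
            w[i] += label

perceptron_update("cheap meds now".split(), "alice", +1)
print(predict("cheap meds now".split(), "alice") > 0)  # True
```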

Proceedings ArticleDOI
19 Oct 2009
TL;DR: It is shown that the proposed expansion methods are complementary to each other and can collaboratively contribute up to 76.3% (average) relative improvement over the original hash-based method.
Abstract: An efficient indexing method is essential for content-based image retrieval with the exponential growth in large-scale videos and photos. Recently, hash-based methods (e.g., locality sensitive hashing - LSH) have been shown efficient for similarity search. We extend such hash-based methods for retrieving images represented by bags of (high-dimensional) feature points. Though promising, the hash-based image object search suffers from low recall rates. To boost the hash-based search quality, we propose two novel expansion strategies - intra-expansion and inter-expansion. The former expands more target feature points similar to those in the query and the latter mines those feature points that shall co-occur with the search targets but not present in the query. We further exploit variations for the proposed methods. Experimenting in two consumer-photo benchmarks, we will show that the proposed expansion methods are complementary to each other and can collaboratively contribute up to 76.3% (average) relative improvement over the original hash-based method.

Proceedings ArticleDOI
01 Dec 2009
TL;DR: This paper proposes a simple key-based secure image hashing scheme using the Radon transform and the 1-D Discrete Cosine Transform (DCT), introducing a randomization on the Radon transform to make the image hash more secure and more robust.

Abstract: Perceptual image hashing is an emerging technique which can be used in image authentication and content-based image retrieval. Recently, several image hashing schemes based on the Radon transform have been proposed for image authentication and retrieval. These schemes have no random information to strengthen the security of the image hash. In this paper, we propose a simple key-based secure image hashing scheme using the Radon transform and the 1-D Discrete Cosine Transform (DCT). In particular, we introduce a randomization on the Radon transform to make the image hash more secure and more robust. Moreover, the discriminative capability is also confirmed in our experimental results.
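A rough sketch of such a pipeline (my reconstruction; the key-driven randomization and the per-projection statistic are assumptions, and skimage/scipy stand in for whatever the authors used):

```python
import numpy as np
from scipy.fft import dct
from skimage.transform import radon

def radon_dct_hash(image, key=1234, n_angles=40, n_coeffs=20):
    """Key-based image hash sketch: project along secret random angles
    (Radon transform), summarize each projection, then keep low-frequency
    1-D DCT coefficients and binarize against their median."""
    rng = np.random.default_rng(key)                # secret key -> angle set
    angles = np.sort(rng.uniform(0, 180, n_angles))
    sinogram = radon(image, theta=angles, circle=False)  # one column per angle
    feature = sinogram.std(axis=0)                  # robust per-angle statistic
    coeffs = dct(feature, norm="ortho")[:n_coeffs]
    return (coeffs > np.median(coeffs)).astype(np.uint8)

img = np.zeros((128, 128))
img[40:90, 30:100] = 1.0
print(radon_dct_hash(img))
```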

Proceedings ArticleDOI
06 Mar 2009
TL;DR: The paper gives guidelines for choosing the hashing method and hash function best suited to a particular problem, and presents six classes of hash functions that cover most common use cases.
Abstract: The paper gives guidelines for choosing the hashing method and hash function best suited to a particular problem. After studying various problems, we identify criteria for predicting the best hash method and hash function for each. We present six classes of hash functions that cover most common use cases. The paper discusses hashing and the components involved in it, and states the need for hashing for faster data retrieval. Hashing methods are used in many applications across computer science, from spell checkers and database management systems to the symbol tables generated by loaders, assemblers, and compilers. Various forms of hashing address different problems, including dynamic hashing, cryptographic hashing, geometric hashing, robust hashing, Bloom hashing, and string hashing. We conclude by indicating which type of hash function is suitable for which kind of problem.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: The proposed EEG hashing approach presents a fundamental departure from existing methods in EEG-biometry study and suggests that hashing may open new research directions and applications in the emerging EEG-based biometry area.
Abstract: Electroencephalogram (EEG) recordings of brain waves have been shown to have unique pattern for each individual and thus have potential for biometric applications. In this paper, we propose an EEG feature extraction and hashing approach for person authentication. Multi-variate autoregressive (mAR) coefficients are extracted as features from multiple EEG channels and then hashed by using our recently proposed Fast Johnson-Lindenstrauss Transform (FJLT)-based hashing algorithm to obtain compact hash vectors. Based on the EEG hash vectors, a Naive Bayes probabilistic model is employed for person authentication. Our EEG hashing approach presents a fundamental departure from existing methods in EEG-biometry study. The promising results suggest that hashing may open new research directions and applications in the emerging EEG-based biometry area.

Proceedings ArticleDOI
TL;DR: Side-information assisted robust perceptual hashing is proposed as a solution to the identified shortcomings of existing robust hashing techniques, with an analysis based on both achievable rate and probability of error.
Abstract: In this paper, we consider some basic concepts behind the design of existing robust perceptual hashing techniques for content identification. We show the limits of robust hashing from the communication perspectives as well as propose an approach that is able to overcome these shortcomings in certain setups. The consideration is based on both achievable rate and probability of error. We use the fact that most robust hashing algorithms are based on dimensionality reduction using random projections and quantization. Therefore, we demonstrate the corresponding achievable rate and probability of error based on random projections and compare with the results for the direct domain. The effect of dimensionality reduction is studied and the corresponding approximations are provided based on the Johnson-Lindenstrauss lemma. Side-information assisted robust perceptual hashing is proposed as a solution to the above shortcomings.
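The approximations invoked here rest on the Johnson-Lindenstrauss lemma, which in one standard form reads:

```latex
\text{For any } 0 < \varepsilon < 1 \text{ and any } n \text{ points in } \mathbb{R}^d
\text{ there is a linear map } f : \mathbb{R}^d \to \mathbb{R}^k
\text{ with } k = O(\varepsilon^{-2} \log n) \text{ such that }
(1-\varepsilon)\,\lVert u - v \rVert^2
\;\le\; \lVert f(u) - f(v) \rVert^2
\;\le\; (1+\varepsilon)\,\lVert u - v \rVert^2
\text{ for all pairs } u, v.
```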

Book ChapterDOI
21 Aug 2009
TL;DR: This paper significantly advance the state of the art by proving a polylogarithmic bound on the more efficient random-walk method, where items repeatedly kick out random blocking items until a free location for an item is found.
Abstract: In this paper, we provide a polylogarithmic bound that holds with high probability on the insertion time for cuckoo hashing under the random-walk insertion method. Cuckoo hashing provides a useful methodology for building practical, high-performance hash tables. The essential idea of cuckoo hashing is to combine the power of schemes that allow multiple hash locations for an item with the power to dynamically change the location of an item among its possible locations. Previous work on the case where the number of choices is larger than two has required a breadth-first search analysis, which is both inefficient in practice and currently has only a polynomial high probability upper bound on the insertion time. Here we significantly advance the state of the art by proving a polylogarithmic bound on the more efficient random-walk method, where items repeatedly kick out random blocking items until a free location for an item is found.
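The random-walk insertion analyzed here generalizes the eviction loop to d > 2 choices; a toy serial sketch (mine, for illustration only):

```python
import random

class RandomWalkCuckoo:
    """Cuckoo table with d hash choices per key and random-walk insertion:
    on collision, kick out a uniformly random occupant among the d slots."""
    def __init__(self, capacity=101, d=3, max_kicks=500):
        self.slots = [None] * capacity
        self.capacity, self.d, self.max_kicks = capacity, d, max_kicks

    def _choices(self, key):
        return [hash((i, key)) % self.capacity for i in range(self.d)]

    def insert(self, key, value):
        item = (key, value)
        for _ in range(self.max_kicks):
            choices = self._choices(item[0])
            for j in choices:
                if self.slots[j] is None:
                    self.slots[j] = item
                    return True
            # all d slots full: evict a random blocking item and continue with it
            j = random.choice(choices)
            item, self.slots[j] = self.slots[j], item
        return False  # kick budget exceeded (w.h.p. polylog kicks suffice)

    def lookup(self, key):
        for j in self._choices(key):
            if self.slots[j] is not None and self.slots[j][0] == key:
                return self.slots[j][1]
        return None

t = RandomWalkCuckoo()
for k in range(60):          # roughly 60% load with d = 3 choices
    t.insert(f"k{k}", k)
print(t.lookup("k42"))       # 42
```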

Journal ArticleDOI
TL;DR: A novel key-dependent robust speech hashing based on a speech production model is proposed in this letter, which is highly robust to content-preserving operations while offering high accuracy of tampering localization.
Abstract: Robust hashing for multimedia authentication is an emerging research area. A novel key-dependent robust speech hashing based on a speech production model is proposed in this letter. The robust hash is calculated from line spectral frequencies (LSFs), which model the vocal tract. The correlation between LSFs is decoupled by the discrete cosine transform (DCT). A randomization scheme controlled by a secret key is applied in hash generation for random feature selection. The hash function is key-dependent and collision resistant. Meanwhile, it is highly robust to content-preserving operations and offers high accuracy of tampering localization.

Patent
Todd Adam Bachmann1
09 Jun 2009
TL;DR: In this paper, a hash value is mapped to a plurality of words to form the mnemonic, and the hash value may be mapped to word indices used to identify particular words in word lists.
Abstract: Methods and computing devices enable users to identify documents using a hash value mapped to a word mnemonic for easy recall and comparison. A hash algorithm may be applied to a document to generate a distinguishing hash value. The hash value is mapped to a plurality of words to form the mnemonic. To obtain the words, the hash value may be mapped to word indices used to identify particular words in word lists. Word lists may include a list of nouns, a list of verbs, and a list of adverbs or adjectives, so that the resulting three-word mnemonics are memorable. More word lists may be used to map hash values to four-, five-, or more-word mnemonics. The number-to-mnemonic mapping methods may be used to map large numbers, such as account numbers, telephone numbers, etc., into mnemonics which are easier for people to remember and compare.
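The number-to-mnemonic mapping is straightforward to sketch (tiny illustrative word lists of my own; a real implementation would use lists with thousands of entries so each word carries many bits):

```python
import hashlib

# Hypothetical tiny word lists; larger lists let each word encode more bits.
NOUNS = ["fox", "river", "stone", "cloud"]
VERBS = ["jumps", "sings", "melts", "drifts"]
ADJS = ["red", "quiet", "brave", "hollow"]

def mnemonic(document_bytes):
    """Map a document's hash value to a noun-verb-adjective mnemonic."""
    h = int(hashlib.sha256(document_bytes).hexdigest(), 16)
    words = []
    for word_list in (NOUNS, VERBS, ADJS):
        h, idx = divmod(h, len(word_list))  # peel off one index per list
        words.append(word_list[idx])
    return " ".join(words)

print(mnemonic(b"quarterly report v2"))  # e.g. "stone sings red"
```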

Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm outperforms the Philips Robust Hash (PRH) algorithm under various distortions.
Abstract: Audio fingerprinting techniques aim at successfully performing content-based audio identification even when the audio signals are slightly or seriously distorted. In this letter, we propose a novel audio fingerprinting technique based on multiple hashing. In order to improve the robustness of hashing, multiple hash strings are generated through the discrete cosine transform (DCT) which is applied to the temporal energy sequence in each subband. Experimental results show that the proposed algorithm outperforms the Philips Robust Hash (PRH) algorithm under various distortions.
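The hash-string generation can be sketched as follows (my simplification; framing, subband layout, and the matching stage differ in the real system): per-subband temporal energy sequences are transformed with a 1-D DCT, and sign bits of the low-frequency coefficients form one hash string per subband.

```python
import numpy as np
from scipy.fft import dct

def subband_dct_hash(signal, n_frames=32, n_bands=8, n_keep=4):
    """Sketch of multiple-hash audio fingerprinting: per-subband temporal
    energy sequence -> 1-D DCT -> sign bits, one hash string per subband."""
    frame_len = len(signal) // n_frames
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(spectra, n_bands, axis=1)
    hashes = []
    for band in bands:
        energy = band.sum(axis=1)                   # temporal energy sequence
        c = dct(energy, norm="ortho")[1:n_keep + 1]  # skip DC, keep low freqs
        hashes.append((c > 0).astype(np.uint8))      # sign bits = hash string
    return np.stack(hashes)  # shape (n_bands, n_keep), one string per subband

audio = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
audio += 0.01 * np.random.randn(8000)
print(subband_dct_hash(audio))
```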

Proceedings ArticleDOI
TL;DR: After modeling the process of hash extraction and the properties involved in this process, two different security threats are studied, namely the disclosure of the secret feature space and the tampering of the hash.
Abstract: Perceptual hashing has to deal with the constraints of robustness, accuracy and security. After modeling the process of hash extraction and the properties involved in this process, two different security threats are studied, namely the disclosure of the secret feature space and the tampering of the hash. Two different approaches for performing robust hashing are presented: Random-Based Hash (RBH), where the security is achieved using a random projection matrix, and Content-Based Hash (CBH), where the security relies on the difficulty of tampering with the hash. As for digital watermarking, different security setups are also devised: the Batch Hash Attack, the Group Hash Attack, the Unique Hash Attack and the Sensitivity Attack. A theoretical analysis of the information leakage in the context of Random-Based Hash is proposed. Finally, practical attacks are presented: (1) Minor Component Analysis is used to estimate the secret projection of Random-Based Hashes and (2) salient point tampering is used to tamper the hash of Content-Based Hash systems.

Proceedings ArticleDOI
Yu Liu1, Cho Kiho1, Hwan Sik Yun1, Jong Won Shin1, Nam Soo Kim1 
19 Apr 2009
TL;DR: A novel audio fingerprinting technique based on combining fingerprint matching results for multiple hash tables in order to improve the robustness of hashing is presented.
Abstract: Audio fingerprinting techniques should successfully perform content-based audio identification even when the audio files are slightly or seriously distorted. In this paper, we present a novel audio fingerprinting technique based on combining fingerprint matching results for multiple hash tables in order to improve the robustness of hashing. Multiple hash tables are built based on the discrete cosine transform (DCT) which is applied to the time sequence of energies in each sub-band. Experimental results show that the recognition errors are significantly reduced compared with Philips Robust Hash (PRH) [1] under various distortions.

Journal ArticleDOI
Minho Jin1, Chang D. Yoo1
TL;DR: The quantum hashing system is shown to be more robust against various distortions than the binary hashing system using the same intermediate hash values.
Abstract: In this paper, a novel multimedia identification system based on quantum hashing is considered. Many traditional systems are based on binary hash which is obtained by encoding intermediate hash extracted from multimedia content. In the system considered, the intermediate hash values extracted from a query are encoded into quantum hash values by incorporating uncertainty in the binary hash values. For this, the intermediate hash difference between the query and its true-underlying content is considered as a random process. Then, the uncertainty is represented by the probability density estimate of the intermediate hash difference. The quantum hashing system is evaluated using both audio and video databases, and with marginal increment in computational cost, the quantum hashing system is shown to be more robust against various distortions than the binary hashing system using the same intermediate hash values.

Journal ArticleDOI
TL;DR: It is shown that if the image features of an image hash function are scaled by a constant larger than one, the tradeoff between the robustness and the fragility of the image hash function does not change at all, yet the security indicated by the randomness measure increases; the randomness measure should therefore be modified to be invariant to scaling.

Abstract: How to measure the security of image hashing is still an open issue in the field of image authentication. Some works have been conducted on the security measure of image hashing. One of the most important works is the randomness measure proposed by Swaminathan, which uses differential entropy as a metric to evaluate the security of randomized image features and has been applied mainly in the security analysis of the feature extraction stage of image hashing. It is meaningful to measure the randomness of the image features over the secret-key set for the security of image hashing because the image features extracted by image hashing should be generated randomly and difficult to guess. However, as is well known, differential entropy is not invariant to scaling; thus it might not be enough to evaluate the security of randomized image features. In this paper, we show that if the image features of an image hash function are scaled by a constant that is larger than one, then the tradeoff between the robustness and the fragility of the image hash function will not change at all, but the security indicated by the randomness measure will increase. The above-mentioned fact seems to contradict the following. First, the security of image hashing, which conflicts with robustness and fragility, cannot increase freely. Secondly, a deterministic operation, such as deterministic scaling, does not change the security of image hashing in terms of the difficulty of guessing the secret key or randomized image features. Therefore, the randomness measure should be modified to be invariant to scaling at least.
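The scaling behavior driving this argument is the standard property of differential entropy under scaling: for a continuous scalar feature X and constant a (for an n-dimensional feature vector the increment is n log|a|),

```latex
h(aX) = h(X) + \log \lvert a \rvert ,
```

so multiplying the features by any a > 1 inflates an entropy-based randomness measure by log a, even though neither the robustness-fragility tradeoff nor the difficulty of guessing the secret key has changed.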

Proceedings ArticleDOI
19 Oct 2009
TL;DR: This paper presents "Progressive Hashing" (PH), a general open addressing hash-based packet processing scheme for Internet routers using the set associative memory architecture, and shows by experimenting with real IP lookup tables and synthetic packet filtering databases that PH reduces overflow compared with multiple hashing.

Abstract: As the Internet grows, both the number of rules in packet filtering databases and the number of prefixes in IP lookup tables inside the router are growing. The packet processing engine is a critical part of the Internet router as it is used to perform packet forwarding (PF) and packet classification (PC). In both applications, processing has to be at wire speed. It is common to use hash-based schemes in packet processing engines; however, the downsides of classic hashing techniques, such as overflow and worst-case memory access time, have to be dealt with. Implementing hash tables using set associative memory has the property that each bucket of a hash table can be searched in one memory cycle, outperforming conventional Ternary CAMs in terms of power and scalability. In this paper we present "Progressive Hashing" (PH), a general open addressing hash-based packet processing scheme for Internet routers using the set associative memory architecture. Our scheme is an extension of the multiple hashing scheme and is amenable to high-performance hardware implementation with low overflow and low memory access latency. We show by experimenting with real IP lookup tables and synthetic packet filtering databases that PH reduces overflow compared with multiple hashing. The proposed PH processing engine is estimated to achieve an average processing speed of 160 Gbps for the PC application and 320 Gbps for the PF application.
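The multiple-hashing baseline that PH extends is easy to sketch in software (a toy serial version of mine; the paper's engine uses set-associative memory and a progressive probing order in hardware, none of which is captured here):

```python
class MultiHashTable:
    """k hash functions over set-associative buckets: insert into the
    least-loaded candidate bucket; overflow only if all k are full."""
    def __init__(self, n_buckets=64, ways=4, k=3):
        self.buckets = [[] for _ in range(n_buckets)]
        self.n, self.ways, self.k = n_buckets, ways, k
        self.overflow = []   # what schemes like PH try to minimize

    def _candidates(self, key):
        return [hash((i, key)) % self.n for i in range(self.k)]

    def insert(self, key, value):
        cands = self._candidates(key)
        best = min(cands, key=lambda j: len(self.buckets[j]))
        if len(self.buckets[best]) < self.ways:
            self.buckets[best].append((key, value))
        else:
            self.overflow.append((key, value))

    def lookup(self, key):
        for j in self._candidates(key):
            for k2, v in self.buckets[j]:   # one bucket = one memory cycle
                if k2 == key:
                    return v
        return next((v for k2, v in self.overflow if k2 == key), None)

t = MultiHashTable()
for p in range(200):
    t.insert(f"prefix/{p}", p)
print(t.lookup("prefix/123"), len(t.overflow))
```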

Proceedings ArticleDOI
28 Sep 2009
TL;DR: The LFH algorithm is built on the p-stable distribution Locality-Sensitive Hashing scheme that projects a set of local features representing a query image to an ID histogram where the maximum bin is regarded as the recognized ID.
Abstract: In this paper, we present Local Feature Hashing (LFH), a novel approach for face recognition. Focusing on the scalability of face recognition systems, we build our LFH algorithm on the p-stable distribution Locality-Sensitive Hashing (pLSH) scheme that projects a set of local features representing a query image to an ID histogram where the maximum bin is regarded as the recognized ID. Our extensive experiments on two publicly available databases demonstrate the advantages of our LFH method, including: i) significant computational improvement over naive search; ii) hashing in high-dimensional Euclidean space without embedding; and iii) robustness to pose, facial expression, illumination and partial occlusion.
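The pLSH primitive underneath LFH is the standard p-stable scheme, quantizing a random projection into buckets of width w; a minimal sketch (not the authors' code):

```python
import numpy as np

class PStableLSH:
    """Standard 2-stable (Gaussian) LSH: h(x) = floor((a.x + b) / w).
    Nearby descriptors in Euclidean space collide with high probability."""
    def __init__(self, dim, n_tables=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_tables, dim))   # 2-stable projections
        self.b = rng.uniform(0, w, size=n_tables)   # random offsets
        self.w = w

    def hash(self, x):
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

lsh = PStableLSH(dim=128)
desc = np.random.randn(128)                # a local feature descriptor
near = desc + 0.01 * np.random.randn(128)
print(lsh.hash(desc) == lsh.hash(near))    # usually True for near neighbors
```

In LFH, each local feature of a query face votes through such hashes into an ID histogram, and the maximum bin gives the recognized identity.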

Patent
Sergey Ioffe1
29 Sep 2009
TL;DR: In this article, a robust hashing method is applied to media data (e.g., video, image, and audio data), producing a hash output that is robust with respect to at least one attribute of the media data.
Abstract: A robust hashing method is applied to media data (e.g., video, image, and/or audio data), producing a hash output that is robust with respect to at least one attribute of the media data. A histogram is generated for the media data and the histogram is hashed using a weighted hashing procedure. The histogram can be derived from a plurality of randomized versions of the media file, each randomized version of the media file altered to a random extent with respect to the attribute. The histogram can also be derived from a plurality of feature descriptors computed for the media data that are coarsely encoded with respect to the attribute. The weighted hashing procedure includes assigning a weight to components of the histogram and applying a plurality of hash functions to a number of versions of each component, the number of versions based on the assigned weight.

Proceedings ArticleDOI
28 Jun 2009
TL;DR: Two features are proposed for natural image hashing, based on the description of shapes, in terms of contours and regions, which have good robustness and discriminability and better ROC performance.
Abstract: Perceptual hashing is a solution for identification and authentication of multimedia content. The key of this technique is the extraction of proper features. In this paper, two features are proposed for natural image hashing. They are based on the description of shapes, in terms of contours and regions. The contour-based feature is formed by edge detection. The region-based feature is formed by the angular radial transform. Simulation results show that they have good robustness and discriminability. Compared to some other features, better ROC performance is achieved.

Book ChapterDOI
21 Sep 2009
TL;DR: This work introduces and investigates the new notion of corruption-localizing hashing, defined as a natural extension of collision-intractable hashing, and designs two such schemes, one starting from any collision-intractable hash function, and the other starting from any collision-intractable keyed hash function.
Abstract: Collision-intractable hashing is an important cryptographic primitive with numerous applications including efficient integrity checking for transmitted and stored data, and software. In several of these applications, it is important that in addition to detecting corruption of the data we also localize the corruptions. This motivates us to introduce and investigate the new notion of corruption-localizing hashing, defined as a natural extension of collision-intractable hashing. Our main contribution is in formally defining corruption-localizing hash schemes and designing two such schemes, one starting from any collision-intractable hash function, and the other starting from any collision-intractable keyed hash function. Both schemes have attractive efficiency properties in three important metrics: localization factor, tag length and localization running time, capturing the quality of localization, and performance in terms of storage and time complexity, respectively. The closest previous results, when modified to satisfy our formal definitions, only achieve similar properties in the case of a single corruption.

Proceedings ArticleDOI
19 Oct 2009
TL;DR: This paper proposes a hash function family based on feature vocabularies, which can be employed to build a high-dimensional index for approximate nearest neighbor search and investigates the application in building indexes for image search.
Abstract: This paper proposes a hash function family based on feature vocabularies and investigates the application in building indexes for image search. Each hash function is associated with a set of feature points, i.e. a vocabulary, and maps an input point to the ID of the nearest one in the vocabulary. The function family can be employed to build a high-dimensional index for approximate nearest neighbor search. Then we concentrate on its application in image search. Guiding rules for the construction of the vocabularies are derived, which improve the effectiveness of the approach in this context by taking advantage of the data distribution. The rules are applied to design an algorithm for vocabulary construction in practice. Experiments show promising performance of the approach and the effectiveness of the guiding rules. Comparison with the popular Euclidean locality-sensitive hashing also shows the advantage of our approach in image search.
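Each function in the family is just nearest-neighbor assignment to a small vocabulary of feature points; a minimal sketch (mine, with random vocabularies rather than the data-driven construction the paper advocates):

```python
import numpy as np

def make_vocab_hash(vocab):
    """Hash function from a feature vocabulary: map an input point to the
    ID of its nearest vocabulary point (here by Euclidean distance)."""
    def h(x):
        return int(np.argmin(((vocab - x) ** 2).sum(axis=1)))
    return h

rng = np.random.default_rng(0)
family = [make_vocab_hash(rng.normal(size=(256, 64)))  # one vocabulary each
          for _ in range(8)]                           # 8 independent functions
x = rng.normal(size=64)
print([h(x) for h in family])   # the 8-bucket signature used for indexing
```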