
Showing papers on "Feature hashing published in 2010"


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work proposes a semi-supervised hashing method that is formulated as minimizing empirical error on the labeled data while maximizing variance and independence of hash bits over the labeled and unlabeled data.
Abstract: Large scale image search has recently attracted considerable attention due to easy availability of huge amounts of data. Several hashing methods have been proposed to allow approximate but highly efficient search. Unsupervised hashing methods show good performance with metric distances but, in image search, semantic similarity is usually given in terms of labeled pairs of images. There exist supervised hashing methods that can handle such semantic similarity but they are prone to overfitting when labeled data is small or noisy. Moreover, these methods are usually very slow to train. In this work, we propose a semi-supervised hashing method that is formulated as minimizing empirical error on the labeled data while maximizing variance and independence of hash bits over the labeled and unlabeled data. The proposed method can handle both metric as well as semantic similarity. The experimental results on two large datasets (up to one million samples) demonstrate its superior performance over state-of-the-art supervised and unsupervised methods.

662 citations
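
A minimal sketch of the formulation described above, assuming the usual setup from this line of work: the empirical-error term over labeled pairs and the variance term over all data are folded into one adjusted covariance matrix whose top eigenvectors become the hash projections. The weighting eta and all names are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def semi_supervised_hash(X, pairs_sim, pairs_dis, n_bits=8, eta=1.0):
    """Sketch: projections that fit labeled pairs while keeping
    variance high over labeled + unlabeled data.

    X         : (n, d) zero-centered data matrix
    pairs_sim : (i, j) index pairs labeled 'similar'
    pairs_dis : (i, j) index pairs labeled 'dissimilar'
    """
    d = X.shape[1]
    M = np.zeros((d, d))
    # Empirical-fitness term: reward bit agreement on similar pairs,
    # penalize agreement on dissimilar pairs.
    for i, j in pairs_sim:
        M += np.outer(X[i], X[j]) + np.outer(X[j], X[i])
    for i, j in pairs_dis:
        M -= np.outer(X[i], X[j]) + np.outer(X[j], X[i])
    # Variance/independence term over all data (the regularizer).
    M += eta * X.T @ X
    vals, vecs = np.linalg.eigh(M)
    W = vecs[:, np.argsort(vals)[::-1][:n_bits]]  # top eigenvectors
    return W                                      # codes: (X @ W) > 0
```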


Proceedings Article
21 Jun 2010
TL;DR: This paper proposes a novel data-dependent projection learning method such that each hash function is designed to correct the errors made by the previous one sequentially, and shows significant performance gains over the state-of-the-art methods on two large datasets containing up to 1 million points.
Abstract: Hashing based Approximate Nearest Neighbor (ANN) search has attracted much attention due to its fast query time and drastically reduced storage. However, most of the hashing methods either use random projections or extract principal directions from the data to derive hash functions. The resulting embedding suffers from poor discrimination when compact codes are used. In this paper, we propose a novel data-dependent projection learning method such that each hash function is designed to correct the errors made by the previous one sequentially. The proposed method easily adapts to both unsupervised and semi-supervised scenarios and shows significant performance gains over the state-of-the-art methods on two large datasets containing up to 1 million points.

357 citations
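
The error-correcting idea lends itself to a boosting-style loop; here is a rough sketch under assumed conventions (the reweighting rule and names are illustrative, not the paper's exact update):

```python
import numpy as np

def sequential_projections(X, pairs, labels, n_bits=8):
    """Sketch: learn hash projections one at a time; each new
    projection is fit on a pair-weighted covariance in which pairs
    mis-hashed by earlier bits carry more weight.

    pairs  : (m, 2) int array of index pairs
    labels : (m,) array, +1 similar / -1 dissimilar
    """
    d = X.shape[1]
    w = np.ones(len(pairs))                    # per-pair weights
    W = []
    for _ in range(n_bits):
        M = np.zeros((d, d))
        for (i, j), s, wk in zip(pairs, labels, w):
            M += wk * s * (np.outer(X[i], X[j]) + np.outer(X[j], X[i]))
        p = np.linalg.eigh(M)[1][:, -1]        # top eigenvector
        W.append(p)
        # Error-correcting step: upweight pairs this bit gets wrong.
        agree = np.sign(X[pairs[:, 0]] @ p) * np.sign(X[pairs[:, 1]] @ p)
        w[agree != labels] *= 2.0
    return np.array(W).T                       # (d, n_bits)
```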


Journal ArticleDOI
TL;DR: This paper compares several families of space hashing functions in a real setup and reveals that an unstructured quantizer significantly improves the accuracy of LSH, as it closely fits the data in the feature space.

327 citations
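
The takeaway, that a quantizer learned from the data can replace structured random projections, is easy to try. A sketch assuming scikit-learn, with a point's bucket defined as its nearest-centroid id:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 64))      # stand-in feature vectors

# "Unstructured quantizer": a k-means codebook learned from the data;
# the hash of a point is simply the id of its nearest centroid.
quantizer = KMeans(n_clusters=256, n_init=4, random_state=0).fit(X)
buckets = quantizer.predict(X)

q = rng.normal(size=(1, 64))           # query
candidates = np.where(buckets == quantizer.predict(q)[0])[0]
# Rank only the points sharing the query's bucket (or probe the few
# nearest centroids for higher recall).
```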


Proceedings ArticleDOI
19 Jul 2010
TL;DR: Self-Taught Hashing (STH), as discussed by the authors, first finds the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then trains l classifiers via supervised learning to predict the l-bit code for any query document unseen before.
Abstract: The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly.

322 citations
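
The two-stage recipe is straightforward to prototype. A sketch using scikit-learn stand-ins (SpectralEmbedding for the binarised Laplacian Eigenmap, LinearSVC for the per-bit classifiers); parameters are illustrative:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import LinearSVC

def self_taught_hash(X_corpus, n_bits=16):
    # Stage 1 (unsupervised): Laplacian-Eigenmap-style embedding of
    # the corpus, binarised at the per-dimension median.
    emb = SpectralEmbedding(n_components=n_bits).fit_transform(X_corpus)
    codes = (emb > np.median(emb, axis=0)).astype(int)
    # Stage 2 (supervised): one linear SVM per bit, so a previously
    # unseen query document can be hashed out-of-sample.
    clfs = [LinearSVC().fit(X_corpus, codes[:, b]) for b in range(n_bits)]
    return codes, clfs

def hash_query(x, clfs):
    return np.array([c.predict(x.reshape(1, -1))[0] for c in clfs])
```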


Journal ArticleDOI
TL;DR: The proposed hash-based image authentication scheme uses a secret key to randomly modulate image pixels, creating a transformed feature space in which the image hash is calculated; the scheme can detect minute tampering and localize the tampered area.

185 citations
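
A minimal sketch of the key-dependent idea in the summary; the modulation and block statistics below are illustrative stand-ins, not the paper's actual transform:

```python
import numpy as np

def keyed_image_hash(img, key, block=16):
    """Sketch: a secret key modulates the pixels, then one bit per
    block is derived, so tampering is localized to flipped bits."""
    rng = np.random.default_rng(key)             # secret key seeds PRNG
    t = img.astype(float) * rng.uniform(0.5, 1.5, size=img.shape)
    h, w = t.shape
    means = [t[i:i+block, j:j+block].mean()
             for i in range(0, h - block + 1, block)
             for j in range(0, w - block + 1, block)]
    means = np.array(means)
    return (means > np.median(means)).astype(int)

# Verification recomputes the hash with the same key; bits that flip
# point at the tampered blocks.
```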


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper proposes a supervised hashing method, the LAbel-regularized Max-margin Partition (LAMP) algorithm, which generates hash functions in a weakly-supervised setting and provides a collision bound that goes beyond pairwise data interaction, based on Markov random field theory.
Abstract: The explosive growth of vision data motivates the recent studies on efficient data indexing methods such as locality-sensitive hashing (LSH). Most existing approaches perform hashing in an unsupervised way. In this paper we move one step forward and propose a supervised hashing method, i.e., the LAbel-regularized Max-margin Partition (LAMP) algorithm. The proposed method generates hash functions in a weakly-supervised setting, where a small portion of sample pairs are manually labeled as “similar” or “dissimilar”. We formulate the task as a Constrained Convex-Concave Procedure (CCCP), which can be relaxed into a series of convex sub-problems, each solvable as an efficient Quadratic Program (QP). The proposed hashing method has two further characteristics: 1) most existing LSH approaches rely on a linear feature representation, yet kernel tricks are often more natural for gauging the similarity between visual objects in vision research, corresponding to possibly infinite-dimensional Hilbert spaces; LAMP has natural support for kernel-based feature representations. 2) traditional hashing methods assume uniform data distributions, so the collision probability of two samples in hash buckets is determined only by their pairwise similarity, independent of the contextual data distribution; in contrast, we provide a collision bound that goes beyond pairwise data interaction, based on Markov random field theory. Extensive empirical evaluations are conducted on five widely-used benchmarks. It takes only several seconds to generate a new hash function, and the adopted random supporting-vector scheme makes the LAMP algorithm scalable to large-scale problems. Experimental results validate the superiority of the LAMP algorithm over state-of-the-art kernel-based hashing methods.

166 citations


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper develops a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on, and incorporates efficient techniques to further reduce time and space complexity for indexing and search.
Abstract: Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.

151 citations
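
The Nystrom ingredient can be sketched with scikit-learn: approximate the kernel feature map on a small landmark set, then binarise principal directions of the mapped data. This is a spectral-hashing-flavoured stand-in for the paper's optimisation, with illustrative parameters:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.decomposition import PCA

def kernel_hash_fit(X, n_bits=16, n_landmarks=300, gamma=0.5):
    # Nystrom: an explicit finite-dimensional feature map for the
    # kernel, built from a landmark subset -> scalable indexing.
    fmap = Nystroem(kernel="rbf", gamma=gamma,
                    n_components=n_landmarks).fit(X)
    Z = fmap.transform(X)
    pca = PCA(n_components=n_bits).fit(Z)   # low-correlation directions
    return fmap, pca

def kernel_hash(x, fmap, pca):
    z = pca.transform(fmap.transform(x.reshape(1, -1)))
    return (z > 0).astype(int).ravel()      # compact binary code
```

Any kernel accepted by the Nystroem transformer (including precomputed kernels on graphs, sequences, or sets) plugs in the same way, which is the point of the method's generality.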


Journal ArticleDOI
TL;DR: This article defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score.
Abstract: In this article we present Supervised Semantic Indexing, which defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models on all pairs of word features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, correlated feature hashing and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.

128 citations
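
Among the listed improvements, correlated feature hashing builds on the basic hashing trick, which keeps a model over (query word, document word) pairs at a fixed parameter count. A minimal sketch of that underlying trick; the bucket count and hash choice are illustrative:

```python
import hashlib

def hashed_features(tokens, n_buckets=2**18):
    """Hashing trick: map an unbounded feature space into a fixed-size
    vector without storing a vocabulary."""
    v = [0.0] * n_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        sign = 1.0 if (h >> 64) & 1 else -1.0  # signed hashing reduces bias
        v[h % n_buckets] += sign
    return v

# Pairwise features are hashed the same way, e.g. one token
# f"{q_word}|{d_word}" per (query word, document word) pair, so the
# quadratic model's weight vector stays n_buckets long.
```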


Proceedings Article
11 Jul 2010
TL;DR: This paper utilizes the norm-keeping property of p-stable functions to ensure that the collision probability of two data points reflects their non-metric distance in the original feature space, and investigates various concrete examples to validate the proposed algorithm.
Abstract: Non-metric distances are often more reasonable compared with metric ones in terms of consistency with human perceptions. However, existing locality-sensitive hashing (LSH) algorithms can only support data which are gauged with metrics. In this paper we propose a novel locality-sensitive hashing algorithm targeting such non-metric data. Data in the original feature space are embedded into an implicit reproducing kernel Kreĭn space and then hashed to obtain binary bits. Here we utilize the norm-keeping property of p-stable functions to ensure that the collision probability of two data points reflects their non-metric distance in the original feature space. We investigate various concrete examples to validate the proposed algorithm. Extensive empirical evaluations illustrate its effectiveness in terms of accuracy and retrieval speedup.

76 citations
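
For reference, the standard p-stable construction this work builds on (a sketch; with Gaussian, i.e. 2-stable, projections the collision probability decreases with Euclidean distance):

```python
import numpy as np

class PStableLSH:
    """Classic p-stable LSH: h(x) = floor((a.x + b) / w), with a drawn
    from a 2-stable (Gaussian) distribution and b uniform in [0, w)."""
    def __init__(self, dim, n_hashes=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(n_hashes, dim))  # 2-stable projections
        self.b = rng.uniform(0, w, size=n_hashes)
        self.w = w

    def hash(self, x):
        return tuple(np.floor((self.A @ x + self.b) / self.w).astype(int))

# The paper's twist is the embedding: data are first mapped into a
# reproducing kernel Krein space so that this collision probability
# tracks a non-metric distance rather than the Euclidean one.
```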


Proceedings ArticleDOI
13 Jun 2010
TL;DR: It is shown that, with hashing, the sparse representation can be recovered with high probability because hashing preserves the restricted isometry property, and a theoretical analysis of the recognition rate is presented.
Abstract: We propose a face recognition approach based on hashing. The approach yields recognition rates comparable to the random ℓ1 approach [18], which is considered the state-of-the-art, but our method is much faster: up to 150 times faster than [18] on the YaleB dataset. We show that with hashing, the sparse representation can be recovered with high probability because hashing preserves the restricted isometry property. Moreover, we present a theoretical analysis of the recognition rate of the proposed hashing approach. Experiments show a very competitive recognition rate and a significant speedup compared with the state-of-the-art.

61 citations
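
The compression step can be illustrated with a count-sketch-style hashing matrix and an off-the-shelf sparse solver. A sketch on synthetic data under assumed dimensions, not the paper's pipeline:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, m = 200, 1000, 120            # atoms, ambient dim, hashed dim
D = rng.normal(size=(d, n))         # dictionary (training faces)
x0 = np.zeros(n)
x0[[3, 42, 77]] = [1.0, -0.5, 0.8]  # sparse representation
y = D @ x0                          # observed face

# Hashing projection: each coordinate is thrown into one of m buckets
# with a random sign; such sparse matrices preserve a restricted
# isometry with high probability, so the sparse code survives.
H = np.zeros((m, d))
H[rng.integers(0, m, size=d), np.arange(d)] = rng.choice([-1.0, 1.0], size=d)

x_hat = Lasso(alpha=0.01, max_iter=10_000).fit(H @ D, H @ y).coef_
# x_hat approximates x0; classification picks the identity whose
# dictionary atoms carry the recovered mass.
```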


Proceedings ArticleDOI
01 Mar 2010
TL;DR: This article proposes a new hashing framework for tree-structured data that maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots, and empirically demonstrates the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
Abstract: In this article we propose a new hashing framework for tree-structured data. Our method maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots. By coupling our pivot multisets with the idea of minwise hashing, we realize a fixed-size signature-sketch of the tree-structured datum, yielding an effective mechanism for hashing such data. We describe several potential pivot structures, study some of their theoretical properties, and discuss their implications for tree edit distance and for perfect hashing. We then empirically demonstrate the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
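
The minwise step is generic once the pivots are extracted as hashable tokens (the tree-specific pivot extraction is omitted here); a sketch with seeded hash functions standing in for random permutations:

```python
import hashlib

def minhash_signature(pivots, n_perm=64):
    """Fixed-size sketch of a multiset of pivot tokens: for each of
    n_perm seeded hash functions, keep the minimum value seen."""
    return [min(int(hashlib.sha1(f"{seed}:{p}".encode()).hexdigest(), 16)
                for p in pivots)
            for seed in range(n_perm)]

def estimated_similarity(sig_a, sig_b):
    # The fraction of agreeing minima estimates the Jaccard
    # similarity of the pivot sets, hence of the two trees.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```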

Journal ArticleDOI
TL;DR: A measure called expected discriminability is proposed for the fragility of image hashing, and this fragility is studied theoretically based on the proposed measure.
Abstract: Fragility is one of the most important properties of authentication-oriented image hashing. However, to date, there has been little theoretical analysis on the fragility of image hashing. In this paper, we propose a measure called expected discriminability for the fragility of image hashing and study this fragility theoretically based on the proposed measure. According to our analysis, when Gray code is applied into the discrete-binary conversion stage of image hashing, the value of the expected discriminability, which is dominated by the quantization stage of image hashing, is no more than 1/2. We further evaluate the expected discriminability of the image-hashing scheme that uses adaptive quantization, which is the most popular quantization scheme in the field of image hashing. Our evaluation reveals that if deterministic adaptive quantization is applied, then the expected discriminability of the image-hashing scheme can reach the maximum value (i.e., 1/2). Finally, some experiments are conducted to validate our theoretical analysis and to compare the performance of several quantization schemes for image hashing.

Patent
25 Mar 2010
TL;DR: In this paper, a first hash function with a sliding hash window is applied to a normalized text string to generate an array of hash values; candidate anchoring points are selected by applying a first filter to the array of hash values, and the anchoring points are chosen by applying a second filter to the candidates.
Abstract: One embodiment relates to a computer-implemented process for generating document fingerprints. A document is normalized to create a normalized text string. A first hash function with a sliding hash window is applied to the normalized text string to generate an array of hash values. Candidate anchoring points are selected by applying a first filter to the array of hash values. The anchoring points are chosen by applying a second filter to the candidate anchoring points. Finally, a second hash function is applied to substrings located at the chosen anchoring points to generate hash values for use as fingerprints for the document. Other embodiments and aspects are also disclosed.
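
A sketch of the described pipeline, with a CRC standing in for the sliding-window hash and two simple modulus tests standing in for the patent's filters (all parameters illustrative):

```python
import hashlib
import zlib

def fingerprints(text, window=8, p=16, q=64):
    norm = " ".join(text.lower().split())                # normalization
    win_hash = [zlib.crc32(norm[i:i + window].encode())  # sliding window
                for i in range(len(norm) - window + 1)]
    # First filter -> candidate anchoring points.
    candidates = [i for i, h in enumerate(win_hash) if h % p == 0]
    # Second filter -> chosen anchoring points.
    anchors = [i for i in candidates if (win_hash[i] // p) % q == 0]
    # Second hash function over substrings at the chosen anchors.
    return {hashlib.sha1(norm[i:i + window].encode()).hexdigest()
            for i in anchors}

# Because anchors are chosen by content, not position, two documents
# sharing a passage yield overlapping fingerprint sets.
```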

Proceedings ArticleDOI
25 Oct 2010
TL;DR: Data-Oriented LSH is proposed to reduce memory consumption when data are non-uniformly distributed; it focuses on hash table construction, so query-directed methods can still be applied to the index for further improvement.
Abstract: Locality Sensitive Hashing (LSH) has been proposed as a scalable, high-dimensional index for approximate similarity search. Euclidean LSH is a variation of LSH that has been successfully used in many multimedia applications. However, the hash functions of basic Euclidean LSH project data points along randomly selected directions, which reduces accuracy when data are non-uniformly distributed; more hash tables are then needed to guarantee accuracy, and thus more memory is consumed. Since heavy memory cost is a significant drawback of Euclidean LSH, we propose Data-Oriented LSH to reduce memory consumption when data are non-uniformly distributed. Most existing methods are query-directed, such as multi-probe and query expansion methods; we focus on hash table construction instead, so the query-directed methods can still be applied to our index for further improvement. Experiments show that, to achieve the same accuracy, our method uses less time and less memory than the original Euclidean LSH.

Journal ArticleDOI
TL;DR: Experimental results demonstrate the effectiveness of the proposed hash function in terms of discrimination and robustness against various types of content preserving signal processing manipulations.
Abstract: In this letter, we present a new speech hash function based on the non-negative matrix factorization (NMF) of linear prediction coefficients (LPCs). First, linear prediction analysis is applied to the speech to obtain its LPCs, which represent the frequency shaping attributes of the vocal tract. Then, the NMF is performed on the LPCs to capture the speech’s local feature, which is then used for hash vector generation. Experimental results demonstrate the effectiveness of the proposed hash function in terms of discrimination and robustness against various types of content preserving signal processing manipulations.

Journal ArticleDOI
TL;DR: A comprehensive survey of image hashing is given, which presents an overview of various image hashing schemes and discusses their advantages and limitations in terms of security, robustness, and discrimination under different types of operations on the image.
Abstract: Traditional cryptographic hash functions are sensitive to even a one-bit difference in the input message. Multimedia data, however, routinely undergo compression and other signal processing operations, which makes cryptographic hashes unsuitable for multimedia authentication. Image hashing has emerged recently to capture visual essentials for robust image authentication. In this paper, we give a comprehensive survey of image hashing. We present an overview of various image hashing schemes and discuss their advantages and limitations in terms of security, robustness, and discrimination under different types of operations on the image.

Patent
04 Jun 2010
TL;DR: In this paper, the authors describe methods, systems and articles of manufacture for identifying semantic nearest neighbors in a feature space, which includes generating an affinity matrix for objects in a given feature space and training a multi-bit hash function using a greedy algorithm.
Abstract: Methods, systems and articles of manufacture for identifying semantic nearest neighbors in a feature space are described herein. A method embodiment includes generating an affinity matrix for objects in a given feature space, wherein the affinity matrix identifies the semantic similarity between each pair of objects in the feature space, training a multi-bit hash function using a greedy algorithm that increases the Hamming distance between dissimilar objects in the feature space while minimizing the Hamming distance between similar objects, and identifying semantic nearest neighbors for an object in a second feature space using the multi-bit hash function. A system embodiment includes a hash generator configured to generate the affinity matrix and train the multi-bit hash function, and a similarity determiner configured to identify semantic nearest neighbors for an object in a second feature space using the multi-bit hash function.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A novel biometric hashing scheme which is secure and robust to lighting changes is proposed in this paper, and Hamming distance is used to measure the performance of the scheme.
Abstract: Biometric hashing finds extensive applications in multimedia security systems. Biometric hashing schemes combine biometric features with random numbers for robust and secure human authentication or recognition. A novel biometric hashing scheme which is secure and robust to lighting changes is proposed in this paper. First, the local binary pattern (LBP) based histogram sequence, or vector, is employed to represent a normalized face image. Second, a pseudorandom sequence is generated using a user-specific secret seed (the hash key) and orthonormalized using the Gram-Schmidt algorithm. Third, the inner product between the histogram vector and the pseudorandom sequence is computed. The biometric hash is obtained by thresholding the inner product with a threshold calculated using Otsu's method. Our scheme is tested on face image data; Hamming distance is used to measure its performance. Experimental results show that our scheme is secure and robust to lighting changes.
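
A sketch of the scheme's last three steps under stated assumptions: the LBP histogram is precomputed, QR factorization stands in for Gram-Schmidt (it yields the same orthonormal basis), and the mean is a simple stand-in for the Otsu threshold the paper uses:

```python
import numpy as np

def biometric_hash(lbp_hist, secret_seed, n_bits=32):
    """lbp_hist: LBP histogram vector of a normalized face image,
    with dimension >= n_bits."""
    d = lbp_hist.shape[0]
    rng = np.random.default_rng(secret_seed)   # hash key seeds the PRNG
    R = rng.normal(size=(d, n_bits))
    Q, _ = np.linalg.qr(R)                     # orthonormal columns
    proj = lbp_hist @ Q                        # inner products
    return (proj > proj.mean()).astype(int)    # paper: Otsu threshold
```

Different users (different seeds) get different projection bases, so a compromised hash is revocable: issue a new seed and the stored template changes.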

Journal Article
TL;DR: A frame hash based video hash construction framework is proposed, which reduces a video hash design to an image hash design, so that the performance of the video hash can be estimated without heavy simulation.
Abstract: Perceptual hashing is a technique for content identification and authentication. In this work, a frame hash based video hash construction framework is proposed. This approach reduces a video hash design to an image hash design, so that the performance of the video hash can be estimated without heavy simulation. Target performance can be achieved by tuning the construction parameters. A frame hash algorithm and two performance metrics are proposed.

Proceedings ArticleDOI
06 Mar 2010
TL;DR: This paper demonstrates the use of a modified Locality Sensitive Hashing (mLSH) technique with Euclidean distance to build a data structure for the Defense Meteorological Satellite Program (DMSP) satellite imagery database that can be used to find similar satellite image matches in sublinear search time.
Abstract: This paper demonstrates the use of a modified Locality Sensitive Hashing (mLSH) technique with Euclidean distance to build a data structure for the Defense Meteorological Satellite Program (DMSP) satellite imagery database that can be used to find similar satellite image matches in sublinear search time. Given the texture feature vectors of the images, extracted using Gaussian central moments of wavelet edges after multi-resolution decomposition, a one-time linked-list hash table is created. A family of hash functions is drawn randomly and independently from a Gaussian distribution with zero mean and standard deviation d (the dimensionality of the image feature vectors) to create the hash table. When tested, our algorithm proved to be at least twenty times faster than the linear search algorithm. In addition, the algorithm ensures that the percentage of the entire database searched to find possible matches to any given query falls below ten percent.

Proceedings ArticleDOI
19 Jul 2010
TL;DR: Experimental results confirm that the proposed hashing method shows robustness against geometrical and topological attacks and provides a unique hash for each model and key.
Abstract: In this paper, a robust 3D mesh hashing method based on a key-dependent 3D surface feature is developed. The main objectives of the proposed hashing method are to show robustness against content-preserving attacks and to enable blind detection without any attack-specific preprocessing. To achieve these objectives, the method projects all vertices onto the shape coordinates of 3D SSD and curvedness; it then segments the shape coordinates into rectangular blocks and computes the block shape intensity using a permutation key and a random key. A hash is generated by binarizing the block shape intensity. Experimental results confirm that the proposed hashing method is robust against geometrical and topological attacks and provides a unique hash for each model and key.

Posted Content
TL;DR: It is argued that this system causes numerous problems in tracking internet criminals, and further allows even “newbies” to avoid detection, so cryptologists and computer forensics experts need to focus on this as they develop the next generation of hashing algorithms.
Abstract: In this article, I aim to show, through practical examples, that computer forensics techniques such as the use of hash values are inherently flawed in tracking illegal computer files. First, I describe the underlying theory of hashing algorithms and hash values, as well as discuss that several U.S. government agencies keep detailed file databases in order to track or detect illegal files, e.g. pirated media or child pornography. These databases include the file’s unique hash values. Then, I provide real examples of hash values using MD5 and SHA-1 hashing algorithms to show how extremely minor alterations to a computer file produce radically different hash values. While such a cryptological system is important in authenticating files and ensuring that a given file is the one sought by an internet user, I argue that this system causes numerous problems in tracking internet criminals, and further allows even “newbies” to avoid detection. In conclusion, I state that cryptologists and computer forensics experts need to focus on this as they develop the next generation of hashing algorithms.
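
The avalanche behaviour the article relies on is easy to reproduce with Python's standard hashlib:

```python
import hashlib

a = b"The quick brown fox jumps over the lazy dog"
b = b"The quick brown fox jumps over the lazy dog."  # one byte appended

for name in ("md5", "sha1"):
    algo = getattr(hashlib, name)
    print(name, algo(a).hexdigest())
    print(name, algo(b).hexdigest())
# The two digests share no visible structure: a minimal edit yields a
# radically different hash, so a database of known-file hash values
# misses any trivially altered copy.
```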


Proceedings ArticleDOI
Shijun Xiang
04 Nov 2010
TL;DR: In this paper, the authors extend the geometric-deformation invariance of the spatial-domain histogram shape to the discrete wavelet transform (DWT) domain to make this hashing scheme more flexible.
Abstract: Media hashing is a compact representation of media content, one potential application of which is content-based information retrieval. In earlier work [13], the invariance of the spatial-domain histogram shape to geometric deformations was successfully exploited for image hashing. In this work, we extend this invariance into the discrete wavelet transform (DWT) domain to make the hashing scheme more flexible. As two important aspects, the robustness and uniqueness of the proposed DWT hashing function are investigated in detail by representing the histogram shape as the relative relations in the number of low-frequency coefficients between pairs of different histogram bins. Extensive tests show that the hashing scheme performs satisfactorily under various geometric deformations and most common signal processing operations, owing to the use of the DWT-domain histogram and a Gaussian kernel filter.
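
A sketch of the DWT-domain histogram idea, assuming PyWavelets; the wavelet, bin count, and bin pairing are illustrative:

```python
import numpy as np
import pywt

def dwt_histogram_hash(img, bins=32, level=2):
    """One bit per bin pair: which bin holds more low-frequency
    coefficients (the 'relative relation' of the histogram shape)."""
    cA = pywt.wavedec2(img.astype(float), "haar", level=level)[0]
    counts, _ = np.histogram(cA.ravel(), bins=bins)
    return np.array([int(counts[2 * k] >= counts[2 * k + 1])
                     for k in range(bins // 2)])
```

A histogram discards pixel positions, so geometric deformations that move content around barely change the bin counts; that is what such bits inherit from the spatial-domain scheme.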

Proceedings ArticleDOI
22 Nov 2010
TL;DR: This paper evaluates the performance of order-preserving linear hashing in a distributed environment and shows that the proposed scheme provides better results than traditional distributed linear hashing.
Abstract: For efficient retrieval of data, the design of the data structure is important. Tree structures and hash tables are popular data structures. A hash table is a simple data structure that supports very fast retrieval for an exact-match query, but for a range query a hashing scheme must search many more data blocks than other data structures. Order-preserving linear hashing is one solution to this problem. Its hash function is a combination of a division function and a bit-reversal function. With this kind of hashing, nearby data can in many cases be stored in the same data block, as in a tree structure, so good performance for range queries can be expected. In this paper, we evaluate the performance of order-preserving linear hashing in a distributed environment. The experimental results show that the proposed scheme provides better results than traditional distributed linear hashing.
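
The hash function described, division followed by bit reversal, fits in a few lines (a sketch; the bucket count is assumed to be a power of two):

```python
def bit_reverse(x, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def op_linear_hash(key, n_bits=8, block=16):
    """Order-preserving linear hash: the division function groups
    adjacent keys into the same region (so range queries touch few
    blocks), and bit-reversing the region number makes buckets split
    in linear-hashing order as the table grows."""
    region = key // block
    return bit_reverse(region % (1 << n_bits), n_bits)

# Keys 0..15 share one region, keys 16..31 the next, and so on, so a
# range scan visits a contiguous run of regions.
```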

Journal ArticleDOI
TL;DR: A mapped perfect hashing function is proposed which maps the region of hash key combinations into a continuous integer space for phase 3 and maximizes the efficiency of the direct hashing mechanism.
Abstract: One of the most distinctive features of the DHP association-rule mining algorithm is that, at phase k-1, it counts the support of hash key combinations composed of k items and uses the counted support to prune candidate large itemsets, improving performance. Ideally each hash key combination would have a separate count variable, but memory shortage makes it impossible to allocate them. So the algorithm uses a direct hashing mechanism in which several hash key combinations conflict and are counted in the same hash bucket. But the direct hashing mechanism is not efficient, because the distribution of hash key combinations is unbalanced by characteristics arising from the mining process. This paper proposes a mapped perfect hashing function which maps the region of hash key combinations into a continuous integer space for phase 3 and maximizes the efficiency of the direct hashing mechanism. The results of a performance test on 42 test data sets show an average performance improvement for the proposed scheme.
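
A standard way to map k-item combinations into a continuous integer range with no collisions is combinatorial (colexicographic) ranking; a sketch of that generic construction for the 3-itemsets of phase 3, not necessarily the paper's exact mapping:

```python
from math import comb

def rank_combination(items):
    """Perfect hash for a k-combination of 0-based item ids: its rank
    in colexicographic order. Distinct combinations map to distinct
    integers in the contiguous range [0, C(n, k))."""
    return sum(comb(c, i + 1) for i, c in enumerate(sorted(items)))

print(rank_combination([0, 1, 2]))  # 0
print(rank_combination([0, 1, 3]))  # 1
print(rank_combination([2, 5, 7]))  # a slot of its own, no conflicts
```

With every 3-item combination owning one counter slot, no two combinations are forced to share a hash bucket, which is exactly the imbalance the abstract attributes to direct hashing.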

Proceedings ArticleDOI
23 Aug 2010
TL;DR: This paper proposes a new hash tree-based indexing structure called tertiary hash tree for indexing high-dimensional feature values and shows that the proposed index structure achieves outstanding performance.
Abstract: Dominant features for content-based image retrieval usually consist of high-dimensional values. So far, much research has been done on indexing such values for fast retrieval. Still, many existing indexing schemes suffer from performance degradation due to the curse of dimensionality. As an alternative, heuristic algorithms have been proposed that compute the result with ‘high probability’ at the cost of accuracy. In this paper, we propose a new hash-tree-based indexing structure called the tertiary hash tree for indexing high-dimensional feature values. The tertiary hash tree provides several advantages over the traditional extendible hash structure in terms of resource usage and search performance. Through extensive experiments, we show that the proposed index structure achieves outstanding performance.


Proceedings Article
23 Jul 2010
TL;DR: In this article, a new approach to hash function construction is presented which offers unique properties for textual and geometric data; the proposed hash construction has been verified on non-trivial data sets.
Abstract: Techniques based on hashing are heavily used in many applications, e.g. information retrieval, geometry processing, chemical and medical applications, and even cryptography. Traditionally, hash functions take the form h(v) = f(v) mod m, where m is a prime number and f(v) is a function over the element v, which is generally of "unlimited" dimensionality and/or of "unlimited" range of values. In this paper a new approach to hash function construction is presented which offers unique properties for textual and geometric data. Textual data have a limited range of values (the alphabet size) and "unlimited" dimensionality (the string length), while geometric data have an "unlimited" range of values (usually (-∞, ∞)) but limited dimensionality (usually 2 or 3). The construction of the hash function differs for textual and geometric data, and the proposed hash construction has been verified on non-trivial data sets.
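
The traditional form the abstract starts from, h(v) = f(v) mod m, instantiated for strings (a sketch; the base and prime modulus are the usual illustrative choices):

```python
def poly_hash(s, m=1_000_003, base=257):
    """h(v) = f(v) mod m with f a polynomial over the characters:
    a limited alphabet but 'unlimited' dimensionality (string length)."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % m
    return h

print(poly_hash("hashing"))  # bucket index in [0, m)
```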

Proceedings ArticleDOI
22 Sep 2010
TL;DR: In this paper, the authors propose the creation of hash values that keep similar data stored near to each other in a P2P network, reducing the effort to retrieve similar data.
Abstract: The increasing volume of semantic content available on the Web, generally classified by concept hierarchies or simple ontologies, makes searching and reasoning over these data a great challenge. Generally, a search in the Semantic Web may not be addressed to a specific document but to a group of data classified under the same concept. Several structures used to distribute data, e.g. P2P networks, use hash values to identify these data without maintaining the semantic value of the stored data. This paper contributes by proposing the creation of hash values that keep similar data stored near each other in a P2P network, reducing the effort needed to retrieve similar data. The proposed hash values are derived from the data classification based on ontologies, using locality sensitive hashing (LSH) functions.
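
One way to realize such semantics-preserving keys, sketched here as an assumption rather than the paper's construction, is minwise hashing over a concept's ancestor set: concepts sharing ancestors agree on key fragments with probability equal to their Jaccard similarity, so their data land close in the keyspace:

```python
import hashlib

def semantic_key(concept_path, n_perm=4, bits_per=8):
    """concept_path: e.g. ["thing", "agent", "person", "researcher"].
    The key concatenates minhash fragments of the ancestor set."""
    ancestors = {"/".join(concept_path[:i + 1])
                 for i in range(len(concept_path))}
    key = 0
    for seed in range(n_perm):
        mn = min(int(hashlib.sha1(f"{seed}:{a}".encode()).hexdigest(), 16)
                 for a in ancestors)
        key = (key << bits_per) | (mn % (1 << bits_per))
    return key

# ["thing","agent","person"] and ["thing","agent","organization"]
# each have three ancestors and share two, so their keys agree on most
# fragments and the corresponding data tend to be stored nearby.
```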