
Showing papers on "Locality-sensitive hashing" published in 2004


Proceedings ArticleDOI
08 Jun 2004
TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.
Abstract: We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the lp norm. It also yields the first known provably efficient approximate NN algorithm for the case p<1.

3,109 citations
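
As a rough illustration of the idea in the abstract above, the sketch below hashes a vector by projecting it onto a random direction whose entries come from a p-stable distribution (Gaussian for p = 2, Cauchy for p = 1), adding a random offset, and quantizing into buckets of width w. The class name `PStableHash`, the helper `collision_rate`, and the defaults for `w` and `seed` are illustrative assumptions and do not come from the listing; the case p<1 mentioned in the abstract is not covered here.

```python
import math
import random


class PStableHash:
    """One hash h(v) = floor((a . v + b) / w) with p-stable entries in a."""

    def __init__(self, dim, w=4.0, p=2, seed=0):
        rng = random.Random(seed)
        if p == 2:
            # The standard Gaussian is 2-stable.
            self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        elif p == 1:
            # The standard Cauchy is 1-stable; sample it by inverse CDF.
            self.a = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(dim)]
        else:
            raise ValueError("this sketch only covers p = 1 and p = 2")
        self.b = rng.uniform(0.0, w)  # random offset within one bucket width
        self.w = w

    def __call__(self, v):
        proj = sum(ai * vi for ai, vi in zip(self.a, v))
        return math.floor((proj + self.b) / self.w)


def collision_rate(u, v, trials=2000, w=4.0):
    """Fraction of independently seeded hashes on which u and v collide."""
    hits = 0
    for seed in range(trials):
        h = PStableHash(len(u), w=w, seed=seed)
        hits += h(u) == h(v)
    return hits / trials
```

Under these assumptions, nearby points should collide on a larger fraction of hash functions than distant points, which is the property that makes the scheme locality-sensitive.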


Proceedings Article
01 Dec 2004
TL;DR: This paper asks whether earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, can be altered to provide approximate answers to proximity queries, and introduces a new kind of metric tree that allows overlap.
Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same random-projection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels.

487 citations


Book ChapterDOI
13 Apr 2004

261 citations


Journal ArticleDOI
TL;DR: Two novel classifiers based on the locally nearest neighborhood rule, called nearest neighbor line and nearest neighbor plane, are presented for pattern classification; they incur much lower computation cost and achieve competitive performance.

116 citations


Proceedings ArticleDOI
30 Jun 2004
TL;DR: This paper proposes a geometry-invariant image hashing scheme that can be employed for content copy detection and tracing; exhaustive experimental results obtained from benchmark attacks have confirmed the performance of the proposed method.
Abstract: Due to the desired non-invasive property, non-data hiding (called media hashing here) is considered to be an alternative to achieve many applications previously accomplished with watermarking. Recently, media hashing techniques for content identification have been gradually emerging. However, none of them is really resistant against geometrical attacks. In this paper, our aim is to propose a geometry-invariant image hashing scheme, which can be employed for content copy detection and tracing. Our system is mainly composed of three components: (i) robust mesh extraction; (ii) mesh-based robust hash extraction; and (iii) hash matching for similarity measurement. Exhaustive experimental results obtained from benchmark attacks have confirmed the performance of the proposed method.

62 citations


Proceedings ArticleDOI
27 Jun 2004
TL;DR: Two weaknesses of Locality sensitive hashing are addressed when applied to the video identification problem, and two enhancements to LSH are proposed that improve the performance of LSH significantly in terms of efficiency and accuracy.
Abstract: Searching for similar video clips in a large video database, or video identification, requires finding the nearest neighbor in a high-dimensional feature space. Locality sensitive hashing, or LSH, is a well-known indexing method that allows us to efficiently find approximate nearest neighbors in such a space. In this paper, we address two weaknesses of LSH when applied to the video identification problem. We propose two enhancements to LSH, and show that our enhancements improve the performance of LSH significantly in terms of efficiency and accuracy.

30 citations


01 Jan 2004
TL;DR: The finding is that the well-known function for hashing sequences of symbols, ELFhash, is not very good in this regard, and the other two functions are better and thus recommended.
Abstract: Hashing large collections of URLs is an inevitable problem in many Web research activities. Through a large-scale experiment, three hash functions are compared in this paper. Two metrics were developed for the comparison, which are related to Web structure analysis and Web crawling, respectively. The finding is that the well-known function for hashing sequences of symbols, ELFhash, is not very good in this regard, and the other two functions are better and thus recommended.

24 citations
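
For readers unfamiliar with the function named in the abstract above, here is a minimal sketch of ELFhash as it is commonly given (the string hash associated with the UNIX ELF object file format). The helper `bucket` and its table size parameter `m` are hypothetical and only show how a URL might be mapped to a slot; the other two functions compared in the paper are not named in this listing and are not reproduced.

```python
def elf_hash(data: bytes) -> int:
    """ELFhash over a byte string, kept within 32 bits."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000        # top four bits of the 32-bit value
        if g:
            h ^= g >> 24          # fold the top nibble back into the low bits
        h &= ~g & 0xFFFFFFFF      # clear the top nibble
    return h


def bucket(url: str, m: int) -> int:
    """Hypothetical helper: map a URL to one of m table slots."""
    return elf_hash(url.encode("utf-8")) % m
```

Because the top four bits are cleared on every step, the result always fits in 28 bits.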


Proceedings ArticleDOI
20 Sep 2004
TL;DR: A novel geometric distortion-invariant image hashing scheme, which can be employed to perform copy detection and content authentication of digital images, is proposed and exhaustive experimental results obtained from benchmark attacks confirm the excellent performance of the proposed method.
Abstract: Media hashing is an alternative approach to many applications previously accomplished with watermarking. The major disadvantage of the existing media hashing technologies is their poor resistance to geometric attacks. In this paper, a novel geometric distortion-invariant image hashing scheme, which can be employed to perform copy detection and content authentication of digital images, is proposed. Our major contributions are threefold: (i) a mesh-based robust hashing function is proposed; (ii) a sophisticated hash database for error-resilient and fast matching is constructed; and (iii) the application scalability of our scheme for content copy tracing and authentication is studied. In addition, we further investigate several media hashing issues, including robustness and discrimination, error analysis, and complexity, for the proposed image hashing system. Exhaustive experimental results obtained from benchmark attacks confirm the excellent performance of the proposed method.

22 citations


Proceedings ArticleDOI
G. Swart
05 Jul 2004
TL;DR: This paper analyzes how well consistent hashing does at evenly distributing objects among the nodes in the system and extends current consistent hashing algorithms to allow for dynamic load balancing while retaining the good properties of consistent hashing.
Abstract: Consistent hashing can be used to assign objects to nodes in a distributed system. It has been used by several distributed systems including Chord, Pastry, and Tornado because of its efficient handling of node failure and repair. In this paper we analyze how well consistent hashing does at evenly distributing objects among the nodes in the system. We also extend current consistent hashing algorithms to allow for dynamic load balancing while retaining the good properties of consistent hashing. Finally we analyze our extensions using both probabilistic analysis and simulations. The algorithms derived appear to achieve much better load balancing.

18 citations
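
The basic mechanism that the abstract above builds on can be sketched as a hash ring: nodes and objects are hashed onto the same circular space, and each object is assigned to the first node position at or after its own. This is a plain consistent hashing ring, not the dynamic load-balancing extension the paper proposes; the class name `HashRing`, the use of MD5 for ring positions, and the `replicas` (virtual nodes per physical node) parameter are assumptions made for illustration.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a 64-bit position on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at several positions to smooth out the load.
        for i in range(self.replicas):
            pos = _position(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._points, pos)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First node position clockwise from the key, wrapping around at the end.
        i = bisect.bisect_right(self._points, _position(key)) % len(self._points)
        return self._owner[self._points[i]]
```

In this sketch, removing a node only reassigns the objects that node owned; every other object keeps its previous owner, which is the failure-handling property the abstract refers to.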


Proceedings ArticleDOI
24 Oct 2004
TL;DR: This work provides one possible approach to undertake the modelling of robust soft hashing, detailing the basic problems involved and shows how some prior schemes partly fit into this model.
Abstract: Soft hashing, also known as robust hashing or perceptual hashing, consists of summarising multimedia data, so as to obtain a concise representation called a hash value. There has been an increasing interest in the soft hashing problem recently. Techniques implementing soft hashing intend to mirror the behaviour of cryptographic hashing, when the information to be hashed can be subject to different kinds of distortion. Many heuristic techniques for undertaking soft hashing of images and other multimedia data have been devised. Except for some attempts, a framework giving solid guidelines to solve the problem is largely lacking. We provide one possible approach to undertake the modelling of robust soft hashing, detailing the basic problems involved. We show how some prior schemes partly fit into our model.

17 citations


Book ChapterDOI
TL;DR: A survey of existing probabilistic state space exploration methods is given, including bitstate hashing and its improvements, which were introduced in order to lower the probability of producing a wrong result while maintaining the memory and runtime efficiency.
Abstract: Several methods have been developed to validate the correctness and performance of hardware and software systems. One way to do this is to model the system and carry out a state space exploration in order to detect all possible states. In this paper, a survey of existing probabilistic state space exploration methods is given. The paper starts with a thorough review and analysis of bitstate hashing, as introduced by Holzmann. The main idea of this initial approach is the mapping of each state onto a specific bit within an array by employing a hash function. Thus a state is represented by a single bit, rather than by a full descriptor. Bitstate hashing is efficient concerning memory and runtime, but it is hampered by the nondeterministic omission of states. The resulting positive probability of producing wrong results is due to the fact that the mapping of full state descriptors onto much smaller representatives is not injective. The rest of the paper is devoted to the presentation, analysis, and comparison of improvements of bitstate hashing, which were introduced in order to lower the probability of producing a wrong result while maintaining the memory and runtime efficiency. These improvements can be mainly grouped into two categories. The approaches of the first group, the so-called multiple hashing schemes, employ multiple hash functions on either a single array or multiple arrays. The approaches of the remaining category follow the idea of hash compaction: the schemes of this category store a hash value for each detected state, rather than associating one or more bit positions with it, leading to substantial reductions of the probability of error compared to the original bitstate hashing scheme.
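
The basic bitstate idea described above can be sketched as a breadth-first exploration whose visited set is a bit array indexed by a hash of each state. The function name `bitstate_explore`, the use of CRC32 over `repr(state)` as the hash, and the `bits` default are assumptions for illustration; the multiple hashing and hash compaction refinements the survey covers are not shown.

```python
import zlib
from collections import deque


def bitstate_explore(initial, successors, bits=1 << 16):
    """Count the states reached when the visited set is one bit per hash slot."""
    table = bytearray((bits + 7) // 8)

    def seen_or_mark(state):
        # Hash the state to a single bit; CRC32 of repr() is an arbitrary choice.
        i = zlib.crc32(repr(state).encode("utf-8")) % bits
        byte, mask = i >> 3, 1 << (i & 7)
        if table[byte] & mask:
            return True          # bit already set: treated as visited
        table[byte] |= mask
        return False

    seen_or_mark(initial)
    queue = deque([initial])
    count = 1
    while queue:
        state = queue.popleft()
        for nxt in successors(state):
            # A new state whose bit is already set is silently skipped,
            # which is the omission of states the abstract describes.
            if not seen_or_mark(nxt):
                count += 1
                queue.append(nxt)
    return count
```

Since each counted state sets a previously unset bit, the count can never exceed the number of bits or the true number of reachable states; it can only fall short when two distinct states hash to the same bit.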

Book ChapterDOI
02 Oct 2004
TL;DR: A hierarchical clustering is a clustering method in which each point is regarded as a single cluster initially and then the clustering algorithm repeats connecting the nearest two clusters until only one cluster remains.
Abstract: A hierarchical clustering is a clustering method in which each point is regarded as a single cluster initially and then the clustering algorithm repeats connecting the nearest two clusters until only one cluster remains. Because the result is presented as a dendrogram, one can easily figure out the distance and the inclusion relation between clusters.
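
The procedure in the abstract above can be sketched as a naive agglomerative loop. The abstract as quoted does not say how the distance between two clusters is measured, so single linkage (the smallest distance between any two members) is assumed here; the function name `single_linkage` and the merge-record format are likewise illustrative. The loop recomputes all pairwise gaps on every pass, so it is only meant for small inputs.

```python
import math
from itertools import combinations


def single_linkage(points):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    clusters = {i: [p] for i, p in enumerate(points)}  # each point starts alone
    merges = []
    next_id = len(points)

    def gap(pair):
        # Single linkage: smallest distance between members of the two clusters.
        a, b = pair
        return min(math.dist(p, q) for p in clusters[a] for q in clusters[b])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        d = gap((a, b))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

The returned merge history is the information a dendrogram draws: which clusters were joined and at what distance.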

Journal Article
TL;DR: The finding is that the well-known function for hashing sequences of symbols, ELFhash, is not very good in this regard, and the other two functions are better and thus recommended.
Abstract: Hashing large collections of URLs is an inevitable problem in many Web research activities. Through a large-scale experiment, three hash functions are compared in this paper. Two metrics were developed for the comparison, which are related to Web structure analysis and Web crawling, respectively. The finding is that the well-known function for hashing sequences of symbols, ELFhash, is not very good in this regard, and the other two functions are better and thus recommended.

Proceedings Article
21 Apr 2004
TL;DR: This work proposes a mathematical analysis to evaluate the performance of external hashing with separate chain for two cases and provides an approach to clarify the relationship between the insertion order of keys and the position at which each key is located.
Abstract: External hashing with separate chain is a well-known method for dealing with the collision problem when hashing is employed. The performance of external hashing with separate chain depends on the data structure of the separate chain. We provide an approach to clarify the relationship between the insertion order of keys and the position at which each key is located. By introducing the probability distribution of the access frequency of each individual key in the separate chain into the analysis of search cost, we propose a mathematical analysis to evaluate the performance of external hashing with separate chain for two cases. Some experimental results obtained from the proposed formulae are also presented.
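
One simple way to make the dependence on key position concrete, under the assumption that a successful search walks the chain from its head and that each key has a known access probability, is to weight each key's position by that probability. The function name `expected_probes` and its arguments are hypothetical; this is not the paper's formula, which is not given in the listing.

```python
def expected_probes(chain, access_prob):
    """Expected comparisons for a successful search in one chain.

    chain       -- keys in the order they are stored, head first
    access_prob -- mapping from key to its probability of being looked up
    """
    return sum(access_prob[key] * (i + 1) for i, key in enumerate(chain))
```

With access probabilities 0.7, 0.2 and 0.1, storing the most frequently accessed key first gives about 1.4 expected comparisons, while storing it last gives about 2.6.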

Proceedings ArticleDOI
21 Nov 2004
TL;DR: An adaptive hashing scheme is proposed that works on dynamic key sets and still enables keys to be searched in constant time and, if the hash functions are carefully chosen, then the space requirement of the hash structure is O(n).
Abstract: Hashing is an important tool in randomized algorithms, with applications in fields as diverse as information retrieval, data mining, cryptology and parallel algorithms. However, the worst-case behavior of regular hash-based searching is O(n). Perfect hashing is a solution to this problem that offers a worst-case performance of O(1), but only for a static key set. In this paper we propose an adaptive hashing scheme that works on dynamic key sets and still enables keys to be searched in constant time. It is further established that, if the hash functions are carefully chosen, then the space requirement of the hash structure is O(n).