Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over the lifetime, 1894 publications have been published within this topic receiving 69362 citations.

...read moreread less

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Proceedings Article•

In Defense of MinHash Over SimHash

[...]

Anshumali Shrivastava, Ping Li

01 Jan 2014

TL;DR: A theoretical answer is provided (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search.

...read moreread less

Abstract: MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S 2

...read moreread less

96 citations

Proceedings Article•DOI•

LBMCH: Learning Bridging Mapping for Cross-modal Hashing

[...]

Yang Wang¹, Xuemin Lin¹, Lin Wu², Wenjie Zhang¹, Qing Zhang - Show less +1 more•Institutions (2)

University of New South Wales¹, University of Adelaide²

09 Aug 2015

TL;DR: A novel method to learning bridging mapping for cross-modal hashing, named LBMCH, is proposed to characterize the cross- modal semantic correspondence by seamlessly connecting these distinct hamming spaces with each preserving the local structure of data objects from an individual modality.

...read moreread less

Abstract: Hashing has gained considerable attention on large-scale similarity search, due to its enjoyable efficiency and low storage cost. In this paper, we study the problem of learning hash functions in the context of multi-modal data for cross-modal similarity search. Notwithstanding the progress achieved by existing methods, they essentially learn only one common hamming space, where data objects from all modalities are mapped to conduct similarity search. However, such method is unable to well characterize the flexible and discriminative local (neighborhood) structure in all modalities simultaneously, hindering them to achieve better performance. Bearing such stand-out limitation, we propose to learn heterogeneous hamming spaces with each preserving the local structure of data objects from an individual modality. Then, a novel method to learning bridging mapping for cross-modal hashing, named LBMCH, is proposed to characterize the cross-modal semantic correspondence by seamlessly connecting these distinct hamming spaces. Meanwhile, the local structure of each data object in a modality is preserved by constructing an anchor based representation, enabling LBMCH to characterize a linear complexity w.r.t the size of training set. The efficacy of LBMCH is experimentally validated against real-world cross-modal datasets.

...read moreread less

95 citations

Journal Article•DOI•

Tracking Web spam with HTML style similarities

[...]

Tanguy Urvoy¹, Emmanuel Chauveau¹, Pascal Filoche¹, Thomas Lavergne•Institutions (1)

Orange S.A.¹

03 Mar 2008-ACM Transactions on The Web

TL;DR: This work study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code and proposes a flexible algorithm to cluster a large collection of documents according to these measures.

...read moreread less

Abstract: Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories).Those pages built using the same generating method (template or script) share a common “look and feel” that is not easily detected by common text classification methods, but is more related to stylometry.In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.

...read moreread less

95 citations

Journal Article•DOI•

Robust and Flexible Discrete Hashing for Cross-Modal Similarity Search

[...]

Di Wang¹, Quan Wang¹, Xinbo Gao¹•Institutions (1)

Xidian University¹

01 Oct 2018-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: A novel hashing model is proposed to efficiently learn robust discrete binary codes, which is referred as Robust and Flexible Discrete Hashing (RFDH), which is directly learned based on discrete matrix decomposition so that the large quantization error caused by relaxation is avoided.

...read moreread less

Abstract: Multimodal hashing approaches have gained great success on large-scale cross-modal similarity search applications, due to their appealing computation and storage efficiency. However, it is still a challenge work to design binary codes to represent the original features with good performance in an unsupervised manner. We argue that there are some limitations that need to be further considered for unsupervised multimodal hashing: 1) most existing methods drop the discrete constraints to simplify the optimization, which will cause large quantization error; 2) many methods are sensitive to outliers and noises since they use $\ell _{2}$ -norm in their objective functions which can amplify the errors; and 3) the weight of each modality, which greatly influences the retrieval performance, is manually or empirically determined and may not fully fit the specific training set. The above limitations may significantly degrade the retrieval accuracy of unsupervised multimodal hashing methods. To address these problems, in this paper, a novel hashing model is proposed to efficiently learn robust discrete binary codes, which is referred as Robust and Flexible Discrete Hashing (RFDH). In the proposed RFDH model, binary codes are directly learned based on discrete matrix decomposition, so that the large quantization error caused by relaxation is avoided. Moreover, the $\ell _{2,1}$ -norm is used in the objective function to improve the robustness, such that the learned model is not sensitive to data outliers and noises. In addition, the weight of each modality is adaptively adjusted according to training data. Hence the important modality will get large weights during the hash learning procedure. Owing to above merits of RFDH, it can generate more effective hash codes. Besides, we introduce two kinds of hash function learning methods to project unseen instances into hash codes. Extensive experiments on several well-known large databases demonstrate superior performance of the proposed hash model over most state-of-the-art unsupervised multimodal hashing methods.

...read moreread less

94 citations

Posted Content•

Learning to Hash for Indexing Big Data - A Survey

[...]

Jun Wang¹, Wei Liu², Sanjiv Kumar³, Shih-Fu Chang⁴•Institutions (4)

East China Normal University¹, IBM², Google³, Columbia University⁴

17 Sep 2015-arXiv: Learning

TL;DR: A comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised, is provided and recent hashing approaches utilizing the deep learning models are summarized.

...read moreread less

Abstract: The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.

...read moreread less

93 citations

Collapse

Network Information

Performance

Metrics

2,048

Papers

77,891

Citations

No. of papers in the topic in previous years
Year	Papers
2023	43
2022	108
2021	88
2020	110
2019	104
2018	139

Locality-sensitive hashing

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics