Open Access
Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing
Narayanan Sundaram, Aizana Z. Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, Pradeep Dubey
TL;DR
Parallel LSH (PLSH), as presented in this paper, is a variant of LSH designed to be extremely efficient, capable of scaling out across multiple nodes and multiple cores, and supporting high-throughput streaming of new data.
Abstract
Finding nearest neighbors has become an important operation in databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like kd-trees do not perform well), is Locality-Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH), designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and supporting high-throughput streaming of new data. Our approach employs several novel ideas, including: a cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash table querying; an insert-optimized hash table structure and an efficient data expiration algorithm for streaming data; and a performance model that accurately estimates the performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search over a dataset of more than 1 billion tweets, with hundreds of millions of new tweets arriving per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times 8.3× faster than a basic implementation.
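The abstract's core building block, a multi-table LSH index, can be sketched with random-hyperplane hashing for cosine similarity. This is a minimal illustrative example, not the paper's PLSH implementation: the class name, parameters, and data layout here are assumptions, and PLSH adds cache-conscious layouts, multi-node scaling, duplicate elimination, and streaming inserts on top of this basic idea.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Minimal multi-table LSH sketch for cosine similarity (illustrative)."""

    def __init__(self, dim, n_tables=4, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # One set of random hyperplanes per hash table.
        self.planes = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [dict() for _ in range(n_tables)]

    def _keys(self, v):
        # Sign pattern of dot products against the hyperplanes
        # yields one bucket key per table.
        for planes, table in zip(self.planes, self.tables):
            bits = tuple(bool(b) for b in (planes @ v) > 0)
            yield table, bits

    def insert(self, vec_id, v):
        for table, key in self._keys(v):
            table.setdefault(key, []).append(vec_id)

    def query(self, v):
        # Union of candidate buckets across tables; using a set removes
        # duplicates, the step PLSH optimizes with a dedicated algorithm.
        candidates = set()
        for table, key in self._keys(v):
            candidates.update(table.get(key, []))
        return candidates
```

A query returns a candidate set that would then be re-ranked by exact distance; more tables raise recall at the cost of more candidates to deduplicate.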
Citations
Posted Content
Approximate Nearest Neighbor Search in High Dimensions
TL;DR: The nearest neighbor problem as mentioned in this paper is defined as follows: given a set of points $P$ in some metric space, build a data structure that, given any query point $q$, returns a point in $P$ that is closest to $q$ (its "nearest neighbor" in $P$), so that the nearest neighbor can be found without computing the distance from $q$ to every point.
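The baseline that this problem statement implicitly compares against is a linear scan over all points. The sketch below shows that brute-force computation for Euclidean distance; the function name is illustrative, and this is exactly the per-query cost that structures like LSH and kd-trees aim to avoid.

```python
import numpy as np

def nearest_neighbor(P, q):
    """Return the index of the point in P closest to q (Euclidean).

    P: (n, d) array of points; q: (d,) query vector.
    Cost is O(n * d) per query -- the linear scan that nearest-neighbor
    data structures are designed to beat.
    """
    dists = np.linalg.norm(P - q, axis=1)  # distance from q to each point
    return int(np.argmin(dists))
```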
Posted Content
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
TL;DR: FLASH is a similarity search system for ultra-high-dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms, leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques.
Posted Content
The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings
TL;DR: In this paper, the authors examined a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation.
Posted Content
Scalability and Total Recall with Fast CoveringLSH
Ninh Pham, Rasmus Pagh
TL;DR: Fast CoveringLSH (fcLSH), as discussed by the authors, is a fast and practical covering LSH scheme for Hamming space that avoids false negatives and always reports all near neighbors.
Posted Content
Massively-Parallel Similarity Join, Edge-Isoperimetry, and Distance Correlations on the Hypercube
Paul Beame, Cyrus Rashtchian
TL;DR: In this article, a connection between similarity search algorithms and certain graph-theoretic quantities is made, and a general method for designing one-round protocols using edge-isoperimetric shapes in similarity graphs is presented.
References
Journal ArticleDOI
Multidimensional binary search trees used for associative searching
TL;DR: The multidimensional binary search tree (or k-d tree) is developed as a data structure for storing information to be retrieved by associative searches, and is shown to be quite efficient in its storage requirements.
Proceedings ArticleDOI
Approximate nearest neighbors: towards removing the curse of dimensionality
Piotr Indyk, Rajeev Motwani
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size $n$ living in $R^d$, which require space that is only polynomial in $n$ and $d$.
MonographDOI
Direct methods for sparse matrices
TL;DR: This book aims to be suitable also for a student course, probably at MSc level; the subject is intensely practical, and the book is written with practicalities ever in mind.
Journal ArticleDOI
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Alexandr Andoni, Piotr Indyk
TL;DR: An algorithm for the $c$-approximate nearest neighbor problem in a $d$-dimensional Euclidean space, achieving query time $O(dn^{1/c^2 + o(1)})$ and space $O(dn + n^{1 + 1/c^2 + o(1)})$, which almost matches the recently obtained lower bound for hashing-based algorithms.
Proceedings ArticleDOI
Robust and fast similarity search for moving object trajectories
TL;DR: Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequence (LCSS), indicate that EDR is more robust than Euclidean distance, DTW, and ERP, and that it is on average 50% more accurate than LCSS.