On the resemblance and containment of documents
Citations
4,806 citations
Cites background from "On the resemblance and containment ..."
...In contrast to the first-stage filter, which uses multiple hash functions (Broder et al. 2000), bottom sketching uses a single hash function from which the s minimum values are retained as the sketch (Broder 1997)....
[...]
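The bottom sketching idea in the snippet above (one hash function, keep the s smallest values) can be illustrated with a minimal sketch. The function names, the use of BLAKE2b as the single hash function, and the estimator over the s smallest values of the merged sketches are assumptions made for illustration, not code from the cited works.

```python
import hashlib


def hash64(token: str) -> int:
    """Map a token to a 64-bit integer (BLAKE2b is an assumed stand-in hash)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(items, s):
    """Return the s smallest distinct hash values of the items, in sorted order."""
    return sorted({hash64(x) for x in items})[:s]


def estimate_resemblance(sketch_a, sketch_b, s):
    """Estimate Jaccard resemblance from two bottom-s sketches.

    Take the s smallest values of the merged sketches and count how many
    of them appear in both sketches.
    """
    merged_bottom = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged_bottom:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged_bottom if h in shared) / len(merged_bottom)


if __name__ == "__main__":
    doc_a = {f"shingle{i}" for i in range(0, 1000)}
    doc_b = {f"shingle{i}" for i in range(500, 1500)}
    s = 200
    # True resemblance is 500 / 1500; the estimate should be roughly near it.
    print(estimate_resemblance(bottom_sketch(doc_a, s), bottom_sketch(doc_b, s), s))
```

Only one hash evaluation per item is needed here, which is the contrast the snippet draws with filters that use multiple hash functions.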
4,478 citations
2,843 citations
Cites background or methods from "On the resemblance and containment ..."
...dex vectors of each non-unique coordinate of r. Random Indexing was shown to perform as well as LSA on a word synonym selection task (Karlgren & Sahlgren, 2001). Locality sensitive hashing (LSH) (Broder, 1997) is another technique that approximates the similarity matrix with complexity O(n2 rδ 2), where δ is a constant number of random projections, which controls the accuracy versus efficiency tradeoff. LSHi...
[...]
...s, such that two similar vectors are likely to have similar fingerprints. Definitions of LSH functions include the Min-wise independent function, which preserves the Jaccard similarity between vectors (Broder, 1997), and functions that preserve the cosine similarity between vectors (Charikar, 2002). On a word similarity task, Ravichandran et al. (2005) showed that, on average, over 80% of the top-10 similar word...
[...]
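The claim above that min-wise hashing preserves Jaccard similarity can be made concrete with a short sketch: for each of k hash functions, keep the minimum hash value over the set, and the fraction of positions where two signatures agree estimates the Jaccard similarity. The salted BLAKE2b hashes used here are an assumed, convenient approximation of a min-wise independent family, and all names are hypothetical.

```python
import hashlib


def salted_hash(token: str, seed: int) -> int:
    """64-bit hash of a token under hash function number `seed` (assumed family)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, k=128):
    """Return k minimum hash values, one per hash function; items must be non-empty."""
    items = list(items)
    return [min(salted_hash(x, seed) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima coincide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    set_a = {f"w{i}" for i in range(0, 300)}
    set_b = {f"w{i}" for i in range(100, 400)}
    # True Jaccard similarity is 200 / 400 = 0.5; the estimate varies around it.
    print(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)))
```

Increasing k reduces the variance of the estimate at the cost of more hash evaluations per item.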
2,477 citations
Cites background from "On the resemblance and containment ..."
...This new setting has resulted in an increased interest in algorithms that process the input data in restricted ways, including sampling a few data points, making only a few passes over the data, and constructing a succinct sketch of the input which can then be efficiently processed....
[...]
2,176 citations
References
6,594 citations
"On the resemblance and containment ..." refers background in this paper
...in other words in this case we can ignore the effect of multiple collisions, that is, three or more distinct elements of S having the same image under f. Furthermore, it can be argued that the size of f(S) is fairly well concentrated. By Azuma’s inequality [1] we have...
[...]
1,560 citations
"On the resemblance and containment ..." refers background in this paper
...For very large collections of documents where r is such that external storage is needed, different approaches become necessary. For further details see [5]....
[...]
...The remaining 1.5 million clusters contained 7 million documents (a mixture of exact duplicates and similar). For further details see [5]....
[...]
962 citations
821 citations
"On the resemblance and containment ..." refers methods in this paper
...Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]....
[...]
660 citations
"On the resemblance and containment ..." refers methods in this paper
...Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]....
[...]