Document Image Indexing Using Edit Distance Based Hashing

doi:10.1109/ICDAR.2011.242

Proceedings Article

Ehtesham Hassan, Santanu Chaudhury, M. Gopal

18 Sep 2011, pp. 1200-1204

TL;DR: A novel word image based document indexing scheme that combines string matching and hashing is presented and applied to two document image collections in Devanagari and Bengali script.

Abstract: We present a novel word image based document indexing scheme that combines string matching and hashing. The word image representation is defined by string codes obtained by unsupervised learning over graphical primitives. The indexing framework is defined by a distance based hashing function, which projects objects to the hash space while preserving their distances. We use edit distance based string matching both to define the hashing function and for approximate nearest neighbor based retrieval. The proposed indexing framework is applied to two document image collections in Devanagari and Bengali script.
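
The abstract does not give the form of the hashing function, so the following is only a minimal sketch of how edit distance could drive a distance based hash. It assumes each word image has already been reduced to a string code (plain letters stand in for primitive labels here), and it borrows the pivot-pair "line projection" commonly used in distance based hashing formulations; that choice, the median threshold, and all function names (`edit_distance`, `line_projection`, `build_index`, `query`) are illustrative assumptions, not the authors' method.

```python
import random
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def line_projection(x: str, p1: str, p2: str) -> float:
    """Position of x along the 'line' through pivots p1 and p2, from distances only."""
    d12 = edit_distance(p1, p2)  # > 0 because the pivots are distinct strings
    return (edit_distance(x, p1) ** 2 + d12 ** 2 - edit_distance(x, p2) ** 2) / (2 * d12)


def make_hash_bit(codes, rng):
    """Assumed design: random pivot pair plus a median threshold gives one hash bit."""
    p1, p2 = rng.sample(codes, 2)
    values = sorted(line_projection(c, p1, p2) for c in codes)
    threshold = values[len(values) // 2]
    return lambda x: int(line_projection(x, p1, p2) >= threshold)


def build_index(codes, num_bits=4, seed=0):
    """Hash every distinct code into a bucket keyed by its bit tuple."""
    rng = random.Random(seed)
    unique = sorted(set(codes))
    if len(unique) < 2:
        raise ValueError("need at least two distinct codes to choose pivots")
    bits = [make_hash_bit(unique, rng) for _ in range(num_bits)]
    table = defaultdict(list)
    for c in unique:
        table[tuple(b(c) for b in bits)].append(c)
    return bits, table


def query(code, bits, table, top_k=3):
    """Approximate nearest neighbours: look in one bucket, then rank by edit distance."""
    bucket = table.get(tuple(b(code) for b in bits), [])
    return sorted(bucket, key=lambda c: edit_distance(code, c))[:top_k]
```

A call such as `bits, table = build_index(["abca", "abcb", "xyzz"])` followed by `query("abca", bits, table)` only compares the query against codes in the matching bucket, which is where the saving over a full edit distance scan would come from. A practical system would likely use several hash tables to reduce missed neighbours.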

Citations

Book Chapter

Applications Exploiting Multimedia Semantics

Santanu Chaudhury, Anupama Mallik, Hiranmay Ghosh

16 Jul 2015

References

Proceedings Article

A density-based algorithm for discovering clusters in large spatial databases with noise

Martin Ester¹, Hans-Peter Kriegel¹, Jörg Sander¹, Xiaowei Xu¹

Ludwig Maximilian University of Munich¹

02 Aug 1996

TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.

Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

17,056 citations

Proceedings Article

A density-based algorithm for discovering clusters in large spatial databases with noise

Martin Ester¹, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu¹

Ludwig Maximilian University of Munich¹

01 Jan 1996

TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.

Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

14,297 citations

"Document Image Indexing Using Edit ..." refers methods in this paper

...In this direction, the paper presents novel application of edit distance based hashing for indexing....

Book

Introduction to Information Retrieval

Christopher D. Manning¹, Prabhakar Raghavan², Hinrich Schütze³

Stanford University¹, Google², University of Stuttgart³

01 Jan 2008

TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.

Abstract: Class-tested and coherent, this groundbreaking new textbook teaches web-era information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. Written from a computer science perspective by three leading experts in the field, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Although originally designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also create a buzz for researchers and professionals alike.

11,804 citations

Proceedings Article

Approximate nearest neighbors: towards removing the curse of dimensionality

Piotr Indyk¹, Rajeev Motwani¹

Stanford University¹

23 May 1998

TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.

Abstract: We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.

4,478 citations

Journal Article

Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality

Sariel Har-Peled, Piotr Indyk, Rajeev Motwani

16 Jul 2012, Theory of Computing

TL;DR: Two algorithms for the approximate nearest neighbor problem in high dimensional spaces for data sets of size n living in R^d are presented, achieving query times that are sub-linear in n and polynomial in d.

Abstract: We present two algorithms for the approximate nearest neighbor problem in high dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree.

1,182 citations

"Document Image Indexing Using Edit ..." refers background in this paper

...Locality Sensitive Hashing (LSH) introduced by Indyk and Motwani is state-of-the-art method for finding similar objects in large data collection [9]....
