Document Image Indexing Using Edit Distance Based Hashing

doi:10.1109/ICDAR.2011.242

Proceedings Article•DOI•

Ehtesham Hassan¹, Santanu Chaudhury¹, M. Gopal¹•Institutions (1)

18 Sep 2011-pp 1200-1204

TL;DR: A novel word image based document indexing scheme that combines string matching and hashing is presented and applied to two document image collections in Devanagari and Bengali script.

Abstract: We present a novel word image based document indexing scheme that combines string matching and hashing. The word image representation is defined by string codes obtained by unsupervised learning over graphical primitives. The indexing framework is defined by a distance based hashing function, which projects objects to the hash space while preserving their distances. We use edit distance based string matching both to define the hashing function and for approximate nearest neighbor based retrieval. The proposed indexing framework is applied to two document image collections in Devanagari and Bengali script.
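The two ingredients named in the abstract, edit distance and a distance-preserving projection, can be illustrated with a minimal sketch. It assumes word images have already been converted to string codes (plain Python strings stand in for them here), and it uses the pivot-based line projection known from FastMap and distance-based hashing; whether the paper uses exactly this projection is an assumption, and the names `edit_distance` and `line_projection` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def line_projection(x: str, p1: str, p2: str) -> float:
    """Project x onto the 'line' through pivots p1 and p2 using distances only.

    Assumption: this is the cosine-law projection used by FastMap-style
    methods, shown here for illustration rather than as the paper's formula.
    """
    d12 = edit_distance(p1, p2)
    if d12 == 0:
        raise ValueError("pivots must be distinct")
    d1 = edit_distance(x, p1)
    d2 = edit_distance(x, p2)
    return (d1 ** 2 + d12 ** 2 - d2 ** 2) / (2 * d12)
```

A pivot projects to 0 or to the pivot-to-pivot distance, and other strings receive a real value that can later be thresholded into hash bits.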

Citations

Book Chapter•DOI•

Applications Exploiting Multimedia Semantics

Santanu Chaudhury, Anupama Mallik, Hiranmay Ghosh

16 Jul 2015

References

Proceedings Article•DOI•

FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets

Christos Faloutsos¹, King-Ip Lin²•Institutions (2)

Bell Labs¹, University of Maryland, College Park²

22 May 1995

TL;DR: A fast algorithm that maps objects into points in a k-dimensional space (k is user-defined) such that dissimilarities are preserved; Multi-Dimensional Scaling (MDS), an older method from pattern recognition, is used as a yardstick.

Abstract: A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert [25]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs), to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [8]); the nearest-neighbor or best-match query, etc. However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dis-similarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for. We introduce an older method from pattern recognition, namely, Multi-Dimensional Scaling (MDS) [51]; although unsuitable for indexing, we use it as yardstick for our method. Then, we propose a much faster algorithm to solve the problem in hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS, (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances and the overall structure of the data-set.
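The mapping described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' reference implementation: the function name `fastmap`, the `pivot_iters` parameter, and the pivot-selection loop are assumptions, and only a distance function over the objects is required.

```python
import math
from typing import Callable, List, Sequence


def fastmap(objects: Sequence, dist: Callable, k: int,
            pivot_iters: int = 5) -> List[List[float]]:
    """Map objects to k-dimensional points using only pairwise distances."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i: int, j: int, axis: int) -> float:
        # Squared distance left over after removing the axes found so far.
        d2 = dist(objects[i], objects[j]) ** 2
        for a in range(axis):
            d2 -= (coords[i][a] - coords[j][a]) ** 2
        return max(d2, 0.0)  # clip small negative values from rounding

    for axis in range(k):
        # Heuristic pivot choice: alternate "farthest from the other pivot".
        a = 0
        b = 0
        for _ in range(pivot_iters):
            b = max(range(n), key=lambda i: residual_sq(a, i, axis))
            a = max(range(n), key=lambda i: residual_sq(b, i, axis))
        d_ab_sq = residual_sq(a, b, axis)
        if d_ab_sq == 0.0:
            break  # no residual distance left; remaining axes stay 0
        d_ab = math.sqrt(d_ab_sq)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][axis] = (residual_sq(a, i, axis) + d_ab_sq
                               - residual_sq(b, i, axis)) / (2 * d_ab)
    return coords
```

For points that already live in a Euclidean plane, calling `fastmap(points, math.dist, 2)` reproduces the pairwise distances up to rounding; for non-Euclidean distances such as edit distance the embedding is only approximate.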

1,124 citations

Proceedings Article•DOI•

Word image matching using dynamic time warping

Toni M. Rath¹, R. Manmatha¹•Institutions (1)

University of Massachusetts Amherst¹

18 Jun 2003

TL;DR: This work presents an algorithm for matching handwritten words in noisy historical documents that performs better and is faster than competing matching techniques and presents experimental results on two different data sets from the George Washington collection.

Abstract: Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labor and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating "interesting" clusters, an index can be built automatically. We present an algorithm for matching handwritten words in noisy historical documents. The segmented word images are preprocessed to create sets of 1-dimensional features, which are then compared using dynamic time warping. We present experimental results on two different data sets from the George Washington collection. Our experiments show that this algorithm performs better and is faster than competing matching techniques.
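The comparison step mentioned in this abstract, dynamic time warping over 1-dimensional feature sequences, can be sketched as below. This is a simplification under stated assumptions: a single feature per column, an absolute-difference local cost, and no path-length normalization or band constraint. The function name `dtw_distance` is hypothetical.

```python
from typing import Sequence


def dtw_distance(s: Sequence[float], t: Sequence[float]) -> float:
    """Dynamic time warping cost between two 1-D feature sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s[i - 1] - t[j - 1])  # assumed local cost
            cost[i][j] = local + min(cost[i - 1][j],      # s advances alone
                                     cost[i][j - 1],      # t advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because one sample may align with several samples of the other sequence, a profile that is merely stretched, such as `[1, 2, 3]` versus `[1, 2, 2, 3]`, has zero warping cost.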

626 citations

Journal Article•DOI•

Signature verification using multiple neural classifiers

Reena Bajaj¹, Santanu Chaudhury¹•Institutions (1)

Indian Institute of Technology Delhi¹

01 Jan 1997-Pattern Recognition

TL;DR: Experimental results show that combining the classifiers increases the reliability of the recognition results; this combination is the unique feature of the work.

186 citations

"Document Image Indexing Using Edit ..." refers background in this paper

...In this direction, the paper presents a novel string based word image representation....
[...]

Journal Article•DOI•

Document Image Retrieval through Word Shape Coding

Shijian Lu¹, Linlin Li², Chew Lim Tan²•Institutions (2)

Agency for Science, Technology and Research¹, National University of Singapore²

01 Nov 2008-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code.

Abstract: This paper presents a document retrieval technique that is capable of searching document images without optical character recognition (OCR). The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code. In particular, we annotate word images by using a set of topological shape features including character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved by either query keywords or a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.

111 citations

"Document Image Indexing Using Edit ..." refers background in this paper

...In recent works, Shijian et al. [4] proposed a word shape coding without requiring character segmentation....
[...]
...Large amount of research has been done in the area of word image representation, and document indexing [4][5][6][7]....
[...]

Proceedings Article•DOI•

Nearest Neighbor Retrieval Using Distance-Based Hashing

Vassilis Athitsos¹, Michalis Potamias², Panagiotis Papapetrou², George Kollios²•Institutions (2)

University of Texas at Arlington¹, Boston University²

07 Apr 2008

TL;DR: A novel formulation is presented, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.

Abstract: A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as locality sensitive hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
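The construction outlined in this abstract, binary hash functions built from distances alone and combined into a multibit hash table, might look like the sketch below. It rests on assumptions: each bit thresholds a pivot-based line projection, the interval between the lower and upper quartile of the sample projections is one possible way to make roughly half the sample hash to 1, and the names `make_hash_bit`, `hash_key`, and `build_table` are hypothetical. Only a single table is built here.

```python
import math
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def make_hash_bit(sample: Sequence, dist: Callable,
                  rng: random.Random) -> Callable:
    """Build one binary hash function from two random pivots in the sample."""
    while True:
        p1, p2 = rng.sample(list(sample), 2)
        d12 = dist(p1, p2)
        if d12 > 0:  # pivots must be distinguishable under dist
            break

    def project(x) -> float:
        # Line projection that needs only distances to the two pivots.
        return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2 * d12)

    values = sorted(project(s) for s in sample)
    t1 = values[len(values) // 4]        # assumed lower threshold
    t2 = values[(3 * len(values)) // 4]  # assumed upper threshold
    return lambda x: 1 if t1 <= project(x) <= t2 else 0


def hash_key(x, bits: List[Callable]) -> Tuple[int, ...]:
    """Concatenate the binary hash functions into one multibit key."""
    return tuple(bit(x) for bit in bits)


def build_table(data: Sequence, bits: List[Callable]) -> Dict[tuple, list]:
    """Index every object under its multibit key."""
    table: Dict[tuple, list] = defaultdict(list)
    for x in data:
        table[hash_key(x, bits)].append(x)
    return table
```

At query time, the objects stored under `hash_key(q, bits)` are the candidates, which are then ranked by the true distance to `q`. Nothing in the construction relies on vector coordinates, so `dist` could equally be an edit distance over strings.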

105 citations
