scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Document Image Indexing Using Edit Distance Based Hashing

18 Sep 2011-pp 1200-1204
TL;DR: A novel word image based document indexing scheme by combination of string matching and hashing is presented for two document image collections belonging to Devanagari and Bengali script.
Abstract: We present a novel word image based document indexing scheme by combination of string matching and hashing The word image representation is defined by string codes obtained by unsupervised learning over graphical primitives The indexing framework is defined by distance based hashing function which does the object projection to hash space by preserving their distances We have used edit distance based string matching for defining the hashing function and for approximate nearest neighbor based retrieval The application of the proposed indexing framework is presented for two document image collections belonging to Devanagari and Bengali script
Citations
More filters
References
More filters
Proceedings ArticleDOI
22 May 1995
TL;DR: A fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dis-similarities are preserved, and this method is introduced from pattern recognition, namely, Multi-Dimensional Scaling (MDS).
Abstract: A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert [25]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs), to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [8]); the nearest-neighbor or best-match query, etc.However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points.This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dis-similarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for.We introduce an older method from pattern recognition, namely, Multi-Dimensional Scaling (MDS) [51]; although unsuitable for indexing, we use it as yardstick for our method. Then, we propose a much faster algorithm to solve the problem in hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS, (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances and the overall structure of the data-set.

1,124 citations

Proceedings ArticleDOI
18 Jun 2003
TL;DR: This work presents an algorithm for matching handwritten words in noisy historical documents that performs better and is faster than competing matching techniques and presents experimental results on two different data sets from the George Washington collection.
Abstract: Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labor and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating "interesting" clusters, an index can be built automatically. We present an algorithm for matching handwritten words in noisy historical documents. The segmented word images are preprocessed to create sets of 1-dimensional features, which are then compared using dynamic time warping. We present experimental results on two different data sets from the George Washington collection. Our experiments show that this algorithm performs better and is faster than competing matching techniques.

626 citations

Journal ArticleDOI
TL;DR: Experimental results show that combination of the classifiers increases reliability of the recognition results and is the unique feature of this work.

186 citations


"Document Image Indexing Using Edit ..." refers background in this paper

  • ...In this direction, the paper presents a novel string based word image representation....

    [...]

Journal ArticleDOI
TL;DR: The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code.
Abstract: This paper presents a document retrieval technique that is capable of searching document images without optical character recognition (OCR). The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code. In particular, we annotate word images by using a set of topological shape features including character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved by either query keywords or a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.

111 citations


"Document Image Indexing Using Edit ..." refers background in this paper

  • ...In recent works, Shijian et al. [4] proposed a word shape coding without requiring character segmentation....

    [...]

  • ...Large amount of research has been done in the area of word image representation, and document indexing [4][5][6][7]....

    [...]

Proceedings ArticleDOI
07 Apr 2008
TL;DR: A novel formulation is presented, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
Abstract: A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as locality sensitive hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.

105 citations