Searching OCR'ed Text: An LDA Based Approach

doi:10.1109/ICDAR.2011.244

Proceedings Article•DOI•

Ehtesham Hassan¹, Vikram Garg¹, S. K. Mirajul Haque¹, Santanu Chaudhury¹, M. Gopal¹•Institutions (1)

Indian Institute of Technology Delhi¹

18 Sep 2011-pp 1210-1214

TL;DR: A novel document indexing framework, based on topic modeling with Latent Dirichlet Allocation (LDA), that accounts for document digitization errors during indexing to improve overall retrieval accuracy.

Abstract: Indexing and retrieval performance over a digitized document collection depends significantly on the performance of the available Optical Character Recognition (OCR). The paper presents a novel document indexing framework that accounts for document digitization errors in the indexing process to improve overall retrieval accuracy. The proposed indexing framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The experimental evaluation of the proposed framework is presented on a document collection in Devanagari script.
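
The abstract does not spell out how recognition confidence enters topic learning. One simple way to picture the idea, offered as a sketch under assumptions and not as the paper's formulation, is to let each recognized word contribute a fractional count equal to its confidence, so that doubtful readings weigh less in the document-term statistics a topic model consumes. The names `word_confidence` and `weighted_term_counts`, the `(word, symbol confidences)` input format, the product rule for combining symbol confidences, and the sample words are all hypothetical.

```python
from collections import defaultdict
from math import prod


def word_confidence(symbol_confidences):
    """Assumed rule: a word's confidence is the product of its symbol confidences."""
    return prod(symbol_confidences)


def weighted_term_counts(ocr_tokens):
    """Accumulate fractional term counts from (word, [symbol confidences]) pairs."""
    counts = defaultdict(float)
    for word, symbol_confidences in ocr_tokens:
        counts[word] += word_confidence(symbol_confidences)
    return dict(counts)


# Hypothetical OCR output for one page: the same word read twice
# (once with a doubtful symbol) plus a likely misrecognition of it.
page = [
    ("nadi", [1.0, 1.0, 0.5, 1.0]),
    ("nadi", [1.0, 1.0, 1.0, 1.0]),
    ("nabi", [1.0, 1.0, 0.25, 1.0]),
]
counts = weighted_term_counts(page)
```

With counts like these, a commonly confused reading contributes little evidence to any topic, which is the intuition behind propagating confidence; the actual propagation used in the paper may differ.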

Citations

Proceedings Article•DOI•

Content level access to digital library of India pages

Praveen Krishnan¹, Ravi Shekhar¹, C. V. Jawahar¹•Institutions (1)

International Institute of Information Technology, Hyderabad¹

16 Dec 2012

TL;DR: A search scheme which fuses noisy OCR output and holistic visual features for content level access to the DLI pages and shows that the fusion scheme improves over the individual methods in terms of mean Average Precision (mAP) and mean precision at 10 (mPrec@10).

Abstract: In this paper, we propose a framework for content level access to the scanned pages of Digital Library of India (DLI). The current Optical Character Recognition (OCR) systems are not robust and reliable enough for generating accurate text from DLI pages. We propose a search scheme which fuses noisy OCR output and holistic visual features for content level access to the DLI pages. Visual content is captured using Bag of Visual Words (BoVW) approach. We show that our fusion scheme improves over the individual methods in terms of mean Average Precision (mAP) and mean precision at 10 (mPrec@10). We exploit the fact that OCR has a high precision while BoVW has a high recall. We use a modified edit distance to improve the order of results ranked by BoVW. Experiments are carried out on large datasets of DLI pages in Hindi and Telugu languages. We validate our method on more than 10,000 pages and 4 Million words, and report a mAP of around 0.8 and mPrec@10 of more than 0.9. We show improvements over BoVW by introducing query expansion. We also demonstrate a textual query interface for the search system.
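
The two reported metrics can be made concrete with a short sketch. It uses the standard textbook definitions of precision at k and average precision, not code from the cited paper. The function names and the toy relevance lists are illustrative assumptions, and `average_precision` normalizes by the number of relevant items that appear in the ranked list, which assumes every relevant item was retrieved.

```python
def precision_at_k(ranked_relevance, k=10):
    """Fraction of the top-k results that are relevant (1 = relevant, 0 = not)."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance):
    """Mean of precision values taken at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """mAP: average of the per-query average precision values."""
    scores = [average_precision(r) for r in per_query_relevance]
    return sum(scores) / len(scores)


# Toy relevance judgments for two hypothetical queries, in ranked order.
queries = [
    [1, 0, 1, 0],
    [0, 1],
]
map_score = mean_average_precision(queries)
p_at_2 = precision_at_k(queries[0], k=2)
```

mPrec@10 is then the mean of `precision_at_k(r, k=10)` over all queries.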

10 citations

Cites methods from "Searching OCR'ed Text: An LDA Based..."

...Second Column Shows the Retrieved Results in decreasing order of Rank. across the dataset....
[...]
...To improve recall, we have also indexed word images and used query expansion as explained in Section 2.3 to formulate query histogram based on initial results of text query index....
[...]
...Second category does it mostly in the image domain by matching them in some appropriate feature space....
[...]
...Learn to improve from annotated dataset In Section 2.5, we explained the use of confusion matrix to deal with the errors given by OCR....
[...]
...Since OCRs are not an immediate feasibility (see Section 3 for quantitative performance of OCR on DLI pages) for building search engines, we naturally move towards the image based search and retrieval techniques....
[...]

Proceedings Article•DOI•

Multi-modal Information Integration for Document Retrieval

Ehtesham Hassan¹, Santanu Chaudhury¹, M. Gopal•Institutions (1)

Indian Institute of Technology Delhi¹

25 Aug 2013

TL;DR: A novel multi-modal document indexing framework for retrieval of old and degraded text documents by combining OCR'ed text and image based representation using learning is proposed.

Abstract: The paper proposes a novel multi-modal document image retrieval framework by exploiting the information of text and graphics regions. The framework applies multiple kernel learning based hashing formulation for generation of composite document indexes using different modalities. The existing multimedia management methods for imaged text documents have not addressed the requirement of old and degraded documents. In the subsequent contribution, we propose novel multi-modal document indexing framework for retrieval of old and degraded text documents by combining OCR'ed text and image based representation using learning. The evaluation of proposed concepts is demonstrated on sampled magazine cover pages, and documents of Devanagari script.

5 citations

Cites background from "Searching OCR'ed Text: An LDA Based..."

...In [22], topic model based indexing and retrieval framework is proposed for text documents....
[...]

Journal Article•DOI•

Balinese Script’s Character Reconstruction Using Linear Discriminant Analysis

Made Sudarma¹, Sri Ariyani¹, Manuh Artana¹•Institutions (1)

Udayana University¹

01 Nov 2016-Indonesian Journal of Electrical Engineering and Computer Science

TL;DR: Balinese papyrus is reconstructed through several steps: scanning it into a digital image, preprocessing to improve image quality, segmenting the Balinese characters in the image, recognizing characters using the LDA algorithm, rearranging the recognition results in accordance with the original content of the papyrus, and translating the resulting characters into Latin.

Abstract: Part of the Balinese people's history and cultural heritage is handwritten in Balinese script on palm leaves known as Balinese Papyrus (Lontar Aksara Bali). Efforts to preserve this heritage continue even as its use in public life is being abandoned. Some Balinese papyri are now beginning to rot and fade with age. Information technology can help address the problems faced in preserving Balinese papyrus. Using digital image processing techniques, the papyrus script can be reconstructed digitally so that its content can be retrieved and stored in digital media. Balinese papyrus is reconstructed through several steps: scanning it into a digital image, preprocessing to improve image quality, segmenting the Balinese characters in the image, recognizing characters using the LDA algorithm, rearranging the recognition results in accordance with the original content of the papyrus, and translating the resulting characters into Latin. The LDA algorithm performs the classification for handwritten character recognition quite successfully.

4 citations

Cites methods from "Searching OCR'ed Text: An LDA Based..."

...The general LDA approach is very similar to a Principal Component Analysis (PCA) [13], [15]....
[...]
...LDA is one of the methods used in statistics, pattern recognition [11] in general to find a linear combination of features that characterize or separating two or more classes of objects or events [12-15]....
[...]
...Hassan E [13] had also utilizing the SDA to make OCR (Optical Character Recognition)....
[...]
...[ ] (1) Where i = 1,2,3 of the class Now, compute the two 4x4-dimensional matrices: between class scatter matrix SB and within class scatter matrix Sw. if in the PCA is computed the average a whole images only, then in the LDA we should compute the average image contained in one class....
[...]

Proceedings Article•DOI•

Document indexing framework for retrieval of degraded document images

Ritu Garg¹, Ehtesham Hassan², Santanu Chaudhury¹•Institutions (2)

Indian Institutes of Technology¹, Harvard University²

23 Aug 2015

TL;DR: This paper presents an indexing methodology that uses multiple kernel learning to combine features from different modalities by joint optimization of search time and accuracy; it is demonstrated on document images of Bangla and Devanagari script.

Abstract: With the availability of large collections of document images in Indian languages, image based retrieval has gained popularity. The performance of such systems is affected by the presence of degraded and noisy images. Moreover, Optical character recognition systems for Indian scripts are not yet robust, leading to noisy OCR'ed text. An information retrieval system designed using inputs from both modalities (image features and OCR based recognition data) will lead to better retrieval performance than using either modality alone. In this paper we present an indexing methodology that uses multiple kernel learning to combine features from different modalities by joint optimization of search time and accuracy. The evaluation of the proposed methodology is demonstrated on document images of Bangla and Devanagari script.

3 citations

References

Journal Article•DOI•

Latent dirichlet allocation

David M. Blei¹, Andrew Y. Ng², Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Stanford University²

01 Mar 2003-Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
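
The three levels of the model (corpus-level parameters, per-document topic proportions, per-word topic assignments) can be illustrated by sampling one document from the generative process. This is a minimal sketch of generation only; it does not implement the variational inference or EM estimation the abstract mentions. The vocabulary, the two topic-word distributions, the Dirichlet parameter values, and the function names are made up for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, topics, vocab, length, rng):
    """Sample one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # document level: topic proportions
    words = []
    for _ in range(length):
        # word level: pick a topic, then a word from that topic's distribution
        z = rng.choices(range(len(topics)), weights=theta)[0]
        w = rng.choices(vocab, weights=topics[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["river", "bank", "loan", "water"]
topics = [  # corpus level: one word distribution per topic (illustrative values)
    [0.45, 0.10, 0.00, 0.45],
    [0.00, 0.50, 0.50, 0.00],
]
theta, doc = generate_document([0.5, 0.5], topics, vocab, length=8, rng=rng)
```

Learning runs this story in reverse: given only the words, it estimates the topic distributions and each document's `theta`.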

30,570 citations

"Searching OCR'ed Text: An LDA Based..." refers background in this paper

...The topic model based indexing groups different terms occurring in the text document based on their semantic relationship [1][2][3]....
[...]
...Latent Dirichlet Allocation (LDA) defines a generative probabilistic model over the document collection [3]....
[...]

Proceedings Article•

Latent Dirichlet Allocation

David M. Blei¹, Andrew Y. Ng¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

03 Jan 2001

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal Article•DOI•

Indexing by Latent Semantic Analysis

Scott Deerwester¹, Susan T. Dumais², George W. Furnas², Thomas K. Landauer², Richard A. Harshman³•Institutions (3)

University of Chicago¹, Telcordia Technologies², University of Western Ontario³

01 Sep 1990-Journal of the Association for Information Science and Technology

TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.

Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

12,443 citations

Journal Article•DOI•

Probabilistic latent semantic indexing

Thomas Hofmann¹•Institutions (1)

International Computer Science Institute¹

01 Aug 1999

TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.

Abstract: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.

4,577 citations

Proceedings Article•DOI•

Multi-Document Summarization using Sentence-based Topic Models

Dingding Wang¹, Shenghuo Zhu, Tao Li¹, Yihong Gong•Institutions (1)

Florida International University¹

04 Aug 2009

TL;DR: A new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations is proposed and an efficient variational Bayesian algorithm is derived for model parameter estimation.

Abstract: Most of the existing multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure. In this paper, we propose a new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations. An efficient variational Bayesian algorithm is derived for model parameter estimation. Experimental results on benchmark data sets show the effectiveness of the proposed model for the multi-document summarization task.

175 citations