Proceedings Article

Searching OCR'ed Text: An LDA Based Approach

18 Sep 2011, pp. 1210-1214
TL;DR: A novel document indexing framework, based on topic modeling using Latent Dirichlet Allocation (LDA), that accounts for document digitization errors in the indexing process to improve overall retrieval accuracy.
Abstract: Indexing and retrieval performance over a digitized document collection depends significantly on the performance of the available Optical Character Recognition (OCR) system. The paper presents a novel document indexing framework which accounts for document digitization errors in the indexing process to improve the overall retrieval accuracy. The proposed indexing framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The experimental evaluation of the proposed framework is presented on a collection of documents in Devanagari script.
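
As a rough illustration of how OCR confidence could enter the indexing step, the sketch below replaces unit term counts with confidence-weighted counts that a topic model could consume. It is not the authors' implementation: the function names, the token layout, and the choice of multiplying per-symbol confidences into one word weight are all assumptions made for this example.

```python
from collections import defaultdict
from math import prod


def word_confidence(symbol_confidences):
    """Collapse per-symbol OCR confidences into one word weight.

    Assumption: the word weight is the product of its symbol confidences.
    """
    return prod(symbol_confidences)


def weighted_term_counts(tokens):
    """Accumulate confidence-weighted counts for one document.

    tokens: iterable of (word, [symbol confidences]) pairs (hypothetical layout).
    """
    counts = defaultdict(float)
    for word, confidences in tokens:
        # A confidently recognized word adds close to 1; a doubtful one adds less.
        counts[word] += word_confidence(confidences)
    return dict(counts)


# Hypothetical OCR output: the same word seen twice, plus a low-confidence variant.
doc = [
    ("ram", [0.9, 0.8, 1.0]),
    ("ram", [1.0, 1.0, 1.0]),
    ("rama", [0.5, 0.5, 1.0, 1.0]),
]
counts = weighted_term_counts(doc)
```

With this weighting, a word that the OCR engine is unsure about contributes less evidence to whichever topic it is assigned to, which is one simple way to keep frequently confused words from being grouped on the strength of recognition errors.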
Citations
Proceedings Article
16 Dec 2012
TL;DR: A search scheme which fuses noisy OCR output and holistic visual features for content level access to the DLI pages and shows that the fusion scheme improves over the individual methods in terms of mean Average Precision (mAP) and mean precision at 10 (mPrec@10).
Abstract: In this paper, we propose a framework for content level access to the scanned pages of Digital Library of India (DLI). The current Optical Character Recognition (OCR) systems are not robust and reliable enough for generating accurate text from DLI pages. We propose a search scheme which fuses noisy OCR output and holistic visual features for content level access to the DLI pages. Visual content is captured using Bag of Visual Words (BoVW) approach. We show that our fusion scheme improves over the individual methods in terms of mean Average Precision (mAP) and mean precision at 10 (mPrec@10). We exploit the fact that OCR has a high precision while BoVW has a high recall. We use a modified edit distance to improve the order of results ranked by BoVW. Experiments are carried out on large datasets of DLI pages in Hindi and Telugu languages. We validate our method on more than 10,000 pages and 4 Million words, and report a mAP of around 0.8 and mPrec@10 of more than 0.9. We show improvements over BoVW by introducing query expansion. We also demonstrate a textual query interface for the search system.
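
The two reported measures can be sketched as follows. This is a minimal illustration, not the evaluation code of the cited paper: the function names are hypothetical, relevance is given as a 0/1 list in ranked order, and average precision is normalized by the number of relevant items found in the ranked list (which assumes every relevant item appears somewhere in that list).

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results that are relevant (1) rather than not (0)."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance):
    """Mean of the precision values taken at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over all queries (gives mAP or mean precision at k)."""
    return sum(metric(run) for run in runs) / len(runs)


# Hypothetical ranked results for two queries.
runs = [[1, 0, 1, 0], [0, 1]]
m_ap = mean_over_queries(average_precision, runs)
m_prec_at_2 = mean_over_queries(lambda run: precision_at_k(run, 2), runs)
```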

10 citations


Cites methods from "Searching OCR'ed Text: An LDA Based..."

  • ...Second Column Shows the Retrieved Results in decreasing order of Rank. across the dataset....

    [...]

  • ...To improve recall, we have also indexed word images and used query expansion as explained in Section 2.3 to formulate query histogram based on initial results of text query index....

    [...]

  • ...Second category does it mostly in the image domain by matching them in some appropriate feature space....

    [...]

  • ...Learn to improve from annotated dataset In Section 2.5, we explained the use of confusion matrix to deal with the errors given by OCR....

    [...]

  • ...Since OCRs are not an immediate feasibility (see Section 3 for quantitative performance of OCR on DLI pages) for building search engines, we naturally move towards the image based search and retrieval techniques....

    [...]

Proceedings Article
25 Aug 2013
TL;DR: A novel multi-modal document indexing framework is proposed for retrieval of old and degraded text documents by combining OCR'ed text and image based representation using learning.
Abstract: The paper proposes a novel multi-modal document image retrieval framework that exploits the information in text and graphics regions. The framework applies a multiple kernel learning based hashing formulation to generate composite document indexes using different modalities. The existing multimedia management methods for imaged text documents have not addressed the requirements of old and degraded documents. In the subsequent contribution, we propose a novel multi-modal document indexing framework for retrieval of old and degraded text documents by combining OCR'ed text and image based representation using learning. The evaluation of the proposed concepts is demonstrated on sampled magazine cover pages and documents in Devanagari script.

5 citations


Cites background from "Searching OCR'ed Text: An LDA Based..."

  • ...In [22], topic model based indexing and retrieval framework is proposed for text documents....

    [...]

Journal Article
TL;DR: Balinese papyrus is reconstructed through several processes: scanning into a digital image, preprocessing to improve image quality, segmenting the Balinese characters in the image, recognizing the characters using the LDA algorithm, rearranging the recognition results to match the original content of the papyrus, and translating the resulting characters into Latin.
Abstract: The Balinese people have a civilization history and cultural heritage that are handwritten in Balinese script on palm leaves, known as Balinese Papyrus (Lontar Aksara Bali). Efforts to preserve this cultural heritage continue even as its use is being abandoned in public life. Some Balinese papyri are now beginning to rot and fade with age. Information technology can be a tool for solving the problems faced in preserving the Balinese papyrus. Using digital image processing techniques, the papyrus script can be reconstructed digitally so that its content can be retrieved and stored in digital media. Balinese papyrus is reconstructed through several processes: scanning into a digital image, preprocessing to improve image quality, segmenting the Balinese characters in the image, recognizing the characters using the LDA algorithm, rearranging the recognition results to match the original content of the papyrus, and translating the resulting characters into Latin. The LDA algorithm performs the classification associated with handwritten character recognition quite successfully.

4 citations


Cites methods from "Searching OCR'ed Text: An LDA Based..."

  • ...The general LDA approach is very similar to a Principal Component Analysis (PCA) [13], [15]....

    [...]

  • ...LDA is one of the methods used in statistics, pattern recognition [11] in general to find a linear combination of features that characterize or separating two or more classes of objects or events [12-15]....

    [...]

  • ...Hassan E [13] had also utilizing the SDA to make OCR (Optical Character Recognition)....

    [...]

  • ...[ISSN: 2502-4752, IJEECS Vol. 4, No. 2, November 2016: 479–485] [ ] (1) where i = 1, 2, 3 indexes the class. Now, compute the two 4x4-dimensional matrices: the between-class scatter matrix SB and the within-class scatter matrix Sw. Whereas PCA computes only the average of all images, in LDA we should compute the average image of each class....

    [...]

Proceedings Article
23 Aug 2015
TL;DR: This paper presents an indexing methodology that uses multiple kernel learning to combine features from different modalities by joint optimization of search time and accuracy, and demonstrates it on document images of Bangla and Devanagari script.
Abstract: With the availability of large collections of document images in Indian languages, image based retrieval has gained popularity. The performance of such systems is affected by the presence of degraded and noisy images. Moreover, Optical Character Recognition systems for Indian scripts are not yet robust, leading to noisy OCR'ed text. An information retrieval system designed using inputs from both modalities (image features and OCR based recognition data) will lead to better retrieval performance than one using an individual modality. In this paper we present an indexing methodology that uses multiple kernel learning to combine features from different modalities by joint optimization of search time and accuracy. The evaluation of the proposed methodology is demonstrated on document images of Bangla and Devanagari script.

3 citations

References
Journal Article
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
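
The generative view described in this abstract can be sketched in a few lines: draw per-document topic proportions from a Dirichlet, then for each word position draw a topic and then a word from that topic. This is an illustrative sketch only, using the Python standard library. The function names, the toy topic-word table, and the fixed document length are assumptions for the example, and no inference (variational or otherwise) is shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, topics, length, rng):
    """Generate one document as a list of word ids.

    alpha:  Dirichlet parameters, one per topic.
    topics: topics[z][w] is the probability of word w under topic z.
    length: number of words (fixed here for simplicity).
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)  # topic for this word position
        words.append(sample_categorical(topics[z], rng))  # word from topic z
    return theta, words


# Toy setup: two topics over a four-word vocabulary (hypothetical values).
rng = random.Random(0)
topics = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
theta, words = generate_document([0.5, 0.5], topics, 20, rng)
```

Here `theta` plays the role of the explicit document representation mentioned in the abstract: a low-dimensional vector of topic proportions rather than a vector over the whole vocabulary.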

30,570 citations


"Searching OCR'ed Text: An LDA Based..." refers background in this paper

  • ...The topic model based indexing groups different terms occurring in the text document based on their semantic relationship [1][2][3]....

    [...]

  • ...Latent Dirichlet Allocation (LDA) defines a generative probabilistic model over the document collection [3]....

    [...]

Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal Article
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
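
The decomposition and query step described in this abstract are commonly written as below. The notation is an assumption for illustration: X is the term-by-document matrix, k is the number of retained factors, q is a query term vector, and v_j is the factor-weight vector of document j. Implementations differ on whether document vectors are additionally scaled by the singular values.

```latex
% Rank-k truncated SVD of the term-by-document matrix X
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k \approx 100

% Fold a query term vector q into the factor space as a pseudo-document
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Score document j (row v_j of V_k) by cosine similarity with the query
\cos(\hat{q}, v_j) = \frac{\hat{q}^{\top} v_j}{\lVert \hat{q} \rVert \, \lVert v_j \rVert}
```

Documents whose cosine value exceeds a chosen threshold are returned.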

12,443 citations

Journal Article
01 Aug 1999
TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Abstract: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
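
The latent class model referred to here is usually written as follows, with d a document, w a word, z a latent class, and n(d, w) the count of w in d (notation assumed for illustration). The updates shown are those of standard EM; the generalization mentioned in the abstract is not reproduced.

```latex
% Joint probability of a document-word pair under the latent class model
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

% E-step: posterior over the latent class for each observed pair
P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}
                      {\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}

% M-step: re-estimate parameters from expected counts
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w)
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w)
P(z) \propto \sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)
```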

4,577 citations

Proceedings Article
04 Aug 2009
TL;DR: A new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations is proposed and an efficient variational Bayesian algorithm is derived for model parameter estimation.
Abstract: Most of the existing multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure. In this paper, we propose a new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations. An efficient variational Bayesian algorithm is derived for model parameter estimation. Experimental results on benchmark data sets show the effectiveness of the proposed model for the multi-document summarization task.

175 citations


"Searching OCR'ed Text: An LDA Based..." refers background in this paper

  • ...Topic models have been extensively applied for document summarization, and indexing applications [5][6][7][8][9]....

    [...]