
Showing papers in "International Journal on Document Analysis and Recognition in 2011"


Journal ArticleDOI
TL;DR: The contest details including the dataset, the ground truth and the evaluation criteria are described and the results of the 12 participating methods as well as of two state-of-the-art algorithms are presented.
Abstract: ICDAR 2009 Handwriting Segmentation Contest was organized in the context of ICDAR2009 conference in order to record recent advances in off-line handwriting segmentation. The contest includes handwritten document images produced by many writers in several languages (English, French, German and Greek). These images are manually annotated in order to produce the ground truth which corresponds to the correct text line and word segmentation result. For the evaluation, a well-established approach is used based on counting the number of matches between the entities detected by the segmentation algorithm and the entities in the ground truth. This paper describes the contest details including the dataset, the ground truth and the evaluation criteria and presents the results of the 12 participating methods as well as of two state-of-the-art algorithms. A description of the winning algorithms is also given.
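As a rough illustration of the match-counting evaluation described above (not the contest's official scoring code), the sketch below assumes regions are represented as sets of pixel coordinates and counts ground-truth entities covered by a detection whose overlap exceeds a threshold; the threshold value and the FM combination are illustrative.

```python
# Hedged sketch: counting ground-truth regions covered by a detected region,
# in the spirit of the ICDAR segmentation evaluation. Regions are assumed to
# be sets of (x, y) pixel coordinates; the official protocol additionally
# enforces one-to-one matching, which is omitted here for brevity.

def match_score(detected, ground_truth):
    """Overlap ratio between a detected region and a ground-truth region."""
    union = len(detected | ground_truth)
    return len(detected & ground_truth) / union if union else 0.0

def evaluate(detections, ground_truths, threshold=0.9):
    matches = sum(
        1 for gt in ground_truths
        if any(match_score(det, gt) >= threshold for det in detections)
    )
    detection_rate = matches / len(ground_truths) if ground_truths else 0.0
    recognition_accuracy = matches / len(detections) if detections else 0.0
    denom = detection_rate + recognition_accuracy
    fm = 2 * detection_rate * recognition_accuracy / denom if denom else 0.0
    return detection_rate, recognition_accuracy, fm

# Toy example with tiny pixel-set "regions":
gt = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}]
det = [{(0, 0), (0, 1)}, {(9, 9)}]
print(evaluate(det, gt, threshold=0.8))  # (0.5, 0.5, 0.5)
```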

139 citations


Journal ArticleDOI
TL;DR: A very important result of this competition is the continuous improvement of the recognition rate from competition to competition, amounting to more than 5%.
Abstract: This paper describes the Arabic handwriting recognition competition held at ICDAR 2009. This third competition (the first two were held at ICDAR 2005 and 2007, respectively) again used the IfN/ENIT-database with Arabic handwritten Tunisian town names. This very successful database is used today by more than 82 research groups from universities, research centers, and industries worldwide. At ICDAR 2009, 7 groups with 17 systems participated in the competition. The system evaluation was made on one known dataset and on two datasets unknown to the participants. The systems were compared based on the recognition rates achieved. Additionally, the relative speeds of the systems were compared. A description of the participating groups, their systems, and the results achieved is presented. As a very important result of this competition, a continuous improvement of the recognition rate of more than 5% from competition to competition can be observed.

108 citations


Journal ArticleDOI
TL;DR: The contest details including the evaluation measures used as well as the performance of the 43 submitted methods along with a short description of the top five algorithms are described.
Abstract: DIBCO 2009 is the first International Document Image Binarization Contest organized in the context of ICDAR 2009 conference. The general objective of the contest is to identify current advances in document image binarization using established evaluation performance measures. This paper describes the contest details including the evaluation measures used as well as the performance of the 43 submitted methods along with a short description of the top five algorithms.

81 citations


Journal ArticleDOI
Lianwen Jin1, Yan Gao1, Gang Liu1, Yunyang Li1, Kai Ding1 
TL;DR: The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples, and some evaluation results on the database are reported using state-of-the-art recognizers for benchmarking.
Abstract: A comprehensive online unconstrained Chinese handwriting dataset, SCUT-COUCH2009, is introduced in this paper. As a revision of SCUT-COUCH2008 [1], the SCUT-COUCH2009 database consists of more datasets with larger vocabularies and more writers. The database is built to facilitate the research of unconstrained online Chinese handwriting recognition. It is comprehensive in the sense that it consists of 11 datasets of different vocabularies, named GB1, GB2, TradGB1, Big5, Pinyin, Letters, Digit, Symbol, Word8888, Word17366 and Word44208. In particular, the SCUT-COUCH2009 database contains handwritten samples of 6,763 single Chinese characters in the GB2312-80 standard, 5,401 traditional Chinese characters of the Big5 standard, 1,384 traditional Chinese characters corresponding to the level 1 characters of the GB2312-80 standard, 8,888 frequently used Chinese words, 17,366 daily-used Chinese words, 44,208 complete words from the Fourth Edition of “The Contemporary Chinese Dictionary”, 2,010 Pinyin and 184 daily-used symbols. The samples were collected using PDAs (Personal Digital Assistants) and smart phones with touch screens and were contributed by more than 190 persons. The total number of character samples is over 3.6 million. The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples. We report some evaluation results on the database using state-of-the-art recognizers for benchmarking.

72 citations


Journal ArticleDOI
TL;DR: This paper describes the on-line Arabic handwriting recognition competition held at the tenth International Conference on Document Analysis and Recognition (ICDAR 2009; Proceedings of the 10th International Conference on Document Analysis and Recognition, vol 3, pp 1388–1392, 2009).
Abstract: This paper describes the on-line Arabic handwriting recognition competition held at the tenth International Conference on Document Analysis and Recognition (ICDAR 2009; Proceedings of the 10th International Conference on Document Analysis and Recognition, vol 3, pp 1388–1392, 2009). This first competition used the so-called ADAB database of Arabic on-line handwritten words. At this first competition, 3 groups with 7 different systems participated. The systems were tested on known data (training datasets made available to the participants, sets 1 to 3) and on one test dataset unknown to all participants (set 4). The systems are compared on the most important characteristic of classification systems, the recognition rate. Additionally, the relative speed of the different systems was compared. A short description of the participating groups, their systems, the experimental setup, and the results obtained is presented.

55 citations


Journal ArticleDOI
TL;DR: The character confusion-based prototype of Text-Induced Corpus Clean-up is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents, showing that the system is not sensitive to domain variation.
Abstract: We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string, some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting errors, and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed over its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.
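A minimal sketch of the anagram-hashing idea follows; the key function (a fixed power of each character's code point) and the toy vocabulary are assumptions for illustration, not the TICCL implementation. The point is that a given character confusion shifts the key by a constant, so all word pairs exhibiting that confusion can be retrieved globally from the key index.

```python
# Hedged sketch of anagram hashing for global character-confusion retrieval.
# The exact key function and exponent are assumptions; anagrams share a key,
# and a given confusion (e.g. 'c' misrecognised for 'e') shifts the key by a
# fixed amount, so all word pairs showing that confusion are found in one pass.

from collections import defaultdict

POWER = 5

def anagram_key(word: str) -> int:
    return sum(ord(c) ** POWER for c in word)

def build_index(vocabulary):
    index = defaultdict(set)
    for word in vocabulary:
        index[anagram_key(word)].add(word)
    return index

def pairs_for_confusion(index, old: str, new: str):
    """Yield (variant, candidate-correction) pairs differing by one confusion."""
    delta = anagram_key(new) - anagram_key(old)
    for key, words in index.items():
        for other in index.get(key + delta, ()):
            for word in words:
                # Cheap character-level check to weed out accidental key matches.
                if sorted(word.replace(old, new, 1)) == sorted(other):
                    yield word, other

vocab = {"scanned", "scanncd", "house", "housc"}          # toy OCR output
index = build_index(vocab)
print(sorted(pairs_for_confusion(index, "c", "e")))
# [('housc', 'house'), ('scanncd', 'scanned')]
```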

52 citations


Journal ArticleDOI
TL;DR: The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time and derived a general form for the probability that a sequence of blocks contains the searched information.
Abstract: We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We evaluated our proposal experimentally using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
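The sketch below is a much-simplified, hypothetical illustration of maximum-likelihood scoring of candidate OCR blocks: each block is described by a few geometric features modelled as independent Gaussians whose parameters are estimated from the operator-provided samples. The paper's actual model covers sequences of blocks and a richer probability form.

```python
# Hedged, much-simplified sketch of maximum-likelihood scoring of candidate
# blocks. Each candidate is a single block described by geometric features,
# each modelled as an independent Gaussian whose parameters are the ML
# estimates (sample mean and variance) from the operator's examples.

import math

def ml_estimate(samples):
    """Per-feature mean and variance from a handful of labelled examples."""
    n = len(samples)
    means = [sum(col) / n for col in zip(*samples)]
    variances = [
        max(sum((x - m) ** 2 for x in col) / n, 1e-6)  # floor avoids zero variance
        for col, m in zip(zip(*samples), means)
    ]
    return means, variances

def log_likelihood(block, means, variances):
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(block, means, variances)
    )

# Features: (x, y, width, height) of the OCR block holding, say, an invoice total.
training_blocks = [(510, 705, 80, 18), (505, 710, 82, 17), (512, 700, 79, 18)]
means, variances = ml_estimate(training_blocks)

candidates = [(100, 50, 300, 20), (508, 707, 81, 18), (400, 400, 60, 15)]
best = max(candidates, key=lambda b: log_likelihood(b, means, variances))
print(best)  # (508, 707, 81, 18)
```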

49 citations


Journal ArticleDOI
TL;DR: This work shows how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches, and asks if matching procedures alone suffice to lift IR on historical texts to a satisfactory level.
Abstract: Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In order to improve recall for search engines, modern words used in queries have to be associated with corresponding historical variants found in the documents. In the literature, the use of (1) special matching procedures and (2) lexica for historical language have been suggested as two alternative ways to solve this problem. In the first part of the paper, we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented where matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. In the second part of the paper, we ask if matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over centuries, it is not simple to obtain an answer. We present experiments where the performance of matching procedures in text collections from four centuries is studied. After classifying missed vocabulary, we measure precision and recall of the matching procedure for each period. Results indicate that for earlier periods, matching procedures alone do not lead to satisfactory results. We then describe experiments where the gain for recall obtained from historical lexica of distinct sizes is estimated.
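A minimal sketch of a rule-based matching procedure is given below; the rewrite rules and the toy historical vocabulary are illustrative assumptions, not the rule set constructed by the tool described in the paper.

```python
# Hedged sketch of a rule-based matching procedure for historical spelling
# variants. The rewrite rules below are illustrative assumptions (loosely in
# the style of early-modern German patterns), not the rules from the paper.

from itertools import product

RULES = {          # modern substring -> possible historical spellings
    "t":  ["t", "th"],
    "ei": ["ei", "ey"],
    "ä":  ["ä", "ae", "e"],
}

def variants(word):
    """Enumerate historical candidates by optionally applying each rule."""
    pieces = [word]
    for modern, olds in RULES.items():
        next_pieces = []
        for piece in pieces:
            parts = piece.split(modern)
            # For every occurrence, choose one of the historical spellings.
            for combo in product(olds, repeat=len(parts) - 1):
                rebuilt = parts[0]
                for filler, rest in zip(combo, parts[1:]):
                    rebuilt += filler + rest
                next_pieces.append(rebuilt)
        pieces = next_pieces
    return set(pieces)

historical_vocab = {"theyl", "thun", "stadt"}      # toy document vocabulary
query = "teil"                                     # modern query word
print(variants(query) & historical_vocab)          # {'theyl'}
```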

45 citations


Journal ArticleDOI
TL;DR: A general approach to creating large, ground-truthed corpora for structured sketch domains such as mathematics, where random sketch templates are generated automatically using a grammar model of the sketch domain, and annotated with ground-truth.
Abstract: Although publicly available ground-truthed corpora have proven useful for training, evaluating, and comparing recognition systems in many domains, the availability of such corpora for sketch recognizers, and math recognizers in particular, is currently quite poor. This paper presents a general approach to creating large, ground-truthed corpora for structured sketch domains such as mathematics. In the approach, random sketch templates are generated automatically using a grammar model of the sketch domain. These templates are transcribed manually, then automatically annotated with ground-truth. The annotation procedure uses the generated sketch templates to find a matching between transcribed and generated symbols. A large, ground-truthed corpus of handwritten mathematical expressions presented in the paper illustrates the utility of the approach.
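The sketch below illustrates grammar-driven template generation under assumed productions; the grammar, depth limit, and LaTeX-flavoured tokens are placeholders, not the paper's grammar model.

```python
# Hedged sketch of generating random expression templates from a small
# grammar, in the spirit of the corpus-generation approach described above.
# The grammar and the depth limit are illustrative assumptions.

import random

GRAMMAR = {
    "EXPR": [["TERM"], ["TERM", "+", "EXPR"], ["TERM", "-", "EXPR"]],
    "TERM": [["FACTOR"], ["FACTOR", "^", "DIGIT"], ["\\frac{", "EXPR", "}{", "EXPR", "}"]],
    "FACTOR": [["DIGIT"], ["VAR"], ["(", "EXPR", ")"]],
    "DIGIT": [[d] for d in "0123456789"],
    "VAR": [[v] for v in "xyzab"],
}

def generate(symbol="EXPR", depth=0, max_depth=4):
    """Expand a non-terminal; beyond max_depth prefer the shortest production."""
    if symbol not in GRAMMAR:
        return symbol                      # terminal token
    productions = GRAMMAR[symbol]
    production = (min(productions, key=len) if depth >= max_depth
                  else random.choice(productions))
    return "".join(generate(s, depth + 1, max_depth) for s in production)

random.seed(0)
for _ in range(3):
    # e.g. templates such as "\frac{x}{3}+y^2", to be transcribed by writers
    print(generate())
```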

43 citations


Journal ArticleDOI
TL;DR: In this paper, a novel and general nonlinear model for canceling the show-through phenomenon is proposed based on a new recursive and extendible structure, and a refined separating architecture is introduced for simultaneously removing the show-through and blurring effects.
Abstract: Digital documents are usually degraded during the scanning process due to the contents of the backside of the scanned manuscript. This is often caused by the show-through effect, i.e. the backside image that interferes with the main front side picture due to the intrinsic transparency of the paper. This phenomenon is one of the degradations that one would like to remove, especially in the field of Optical Character Recognition (OCR) or document digitization, which require denoised texts as inputs. In this paper, we first propose a novel and general nonlinear model for canceling the show-through phenomenon. A nonlinear blind source separation algorithm is used for this purpose based on a new recursive and extendible structure. However, the results are limited by a blurring effect that appears during the scanning process because of the light transfer function of the paper. Consequently, to improve the results, we introduce a refined separating architecture for simultaneously removing the show-through and blurring effects.

39 citations


Journal ArticleDOI
TL;DR: A word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine is proposed and has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries.
Abstract: In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step is performed in order to improve the quality of the document images, while word segmentation is accomplished with the use of two complementary segmentation methodologies. In the proposed methodology, synthetic word images are created from keywords, and these images are compared to all the words in the digitized documents. A user feedback process is used in order to refine the search procedure. The methodology has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries. In order to improve the efficiency of access and search, natural language processing techniques have been incorporated, comprising a morphological generator that enables searching the documents using only a base word-form to locate all the corresponding inflected word-forms, and a synonym dictionary that further facilitates access to the semantic context of the documents.

Journal ArticleDOI
TL;DR: This work presents a novel confidence- and margin-based discriminative training approach for model adaptation of a hidden Markov model (HMM)-based handwriting recognition system to handle different handwriting styles and their variations.
Abstract: We present a novel confidence- and margin-based discriminative training approach for model adaptation of a hidden Markov model (HMM)-based handwriting recognition system to handle different handwriting styles and their variations. Most current approaches are maximum-likelihood (ML) trained HMM systems and try to adapt their models to different writing styles using writer adaptive training, unsupervised clustering, or additional writer-specific data. Here, discriminative training based on the maximum mutual information (MMI) and minimum phone error (MPE) criteria are used to train writer-independent handwriting models. For model adaptation during decoding, an unsupervised confidence-based discriminative training on a word and frame level within a two-pass decoding process is proposed. The proposed methods are evaluated for closed-vocabulary isolated handwritten word recognition on the IFN/ENIT Arabic handwriting database, where the word error rate is decreased by 33% relative compared to a ML trained baseline system. On the large-vocabulary line recognition task of the IAM English handwriting database, the word error rate is decreased by 25% relative.
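For background, the textbook form of the MMI objective used in such discriminative training is reproduced below, with \Lambda the model parameters, x_r and w_r the r-th training observation sequence and its transcription, and p(w) the prior over word sequences; the paper's confidence- and margin-based criterion extends this basic form, so the formula is given as general context rather than as the authors' exact objective.

```latex
% Textbook MMI objective (general background; the paper's criterion adds
% confidence- and margin-based terms on top of discriminative training):
\mathcal{F}_{\mathrm{MMI}}(\Lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\Lambda}(x_r \mid w_r)\, p(w_r)}
         {\sum_{w} p_{\Lambda}(x_r \mid w)\, p(w)}
```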

Journal ArticleDOI
TL;DR: This paper deals with the difficult problem of indexing drop caps and specifically, considers the problem of letter extraction from this complex graphic images and proposes an original strategy based on an analysis of the features of the images to be indexed.
Abstract: This paper deals with the difficult problem of indexing ancient graphic images. It tackles the particular case of indexing drop caps (also called Lettrines) and specifically, considers the problem of letter extraction from this complex graphic images. Based on an analysis of the features of the images to be indexed, an original strategy is proposed. This approach relies on filtering the relevant information, on the basis of Meyer decomposition. Then, in order to accommodate the variability of representation of the information, a Zipf’s law modeling enables detection of the regions belonging to the letter, what allows it to be segmented. The overall process is evaluated using a relevant set of images, which shows the relevance of the approach.

Journal ArticleDOI
TL;DR: The setup of the Book Structure Extraction competition run at ICDAR 2009 is described, the book collection used in the task, the collaborative construction of the ground truth, the evaluation measures, and the evaluation results are discussed.
Abstract: This paper describes the setup of the Book Structure Extraction competition run at ICDAR 2009. The goal of the competition was to evaluate and compare automatic techniques for deriving structure information from digitized books, which could then be used to aid navigation inside the books. More specifically, the task that participants faced was to construct hyperlinked tables of contents for a collection of 1,000 digitized books. This paper describes the setup of the competition and its challenges. It introduces and discusses the book collection used in the task, the collaborative construction of the ground truth, the evaluation measures, and the evaluation results. The paper also introduces a data set to be used freely for research evaluation purposes.

Journal ArticleDOI
TL;DR: Performance evaluation of mathematical expression recognition systems is attempted and the changes required to convert the tree corresponding to the expression generated by the recognizer into the groundtruthed one are noted.
Abstract: Performance evaluation of mathematical expression recognition systems is attempted. The proposed method assumes expressions (input as well as recognition output) are coded following MathML or TeX/LaTeX (which also gets converted into MathML) format. Since any MathML representation follows a tree structure, evaluation of performance has been modeled as a tree-matching problem. The tree corresponding to the expression generated by the recognizer is compared with the groundtruthed one by comparing the corresponding Euler strings. The changes required to convert the tree corresponding to the expression generated by the recognizer into the groundtruthed one are noted. The number of changes required to make such a conversion is basically the distance between the trees. This distance gives the performance measure for the system under testing. The proposed algorithm also pinpoints the positions of the changes in the output MathML file. Testing of the proposed evaluation method considers a set of example groundtruthed expressions and their corresponding recognized results produced by an expression recognition system.
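A minimal sketch of the Euler-string comparison is shown below: each expression tree is serialised into its Euler string and the edit distance between the two strings serves as the measure. The tree encoding and the closing-marker convention are assumptions for illustration.

```python
# Hedged sketch of the Euler-string comparison idea: serialise each expression
# tree into its Euler string (label on entering a node, a paired marker on
# leaving), then take the edit distance between the two strings as the
# performance measure. Tree and label conventions here are assumptions.

def euler_string(node):
    """node = (label, [children]); emit label on entry, label+'~' on exit."""
    label, children = node
    out = [label]
    for child in children:
        out.extend(euler_string(child))
    out.append(label + "~")
    return out

def edit_distance(a, b):
    """Standard Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

# Ground truth: x^2 + 1    Recognised: x^2 + 7  (MathML-like trees, simplified)
gt  = ("plus", [("power", [("x", []), ("2", [])]), ("1", [])])
rec = ("plus", [("power", [("x", []), ("2", [])]), ("7", [])])
print(edit_distance(euler_string(gt), euler_string(rec)))  # 2
```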

Journal ArticleDOI
TL;DR: A new pair of evaluation metrics that better suit document analysis’ needs is proposed and applied to several table tasks, and a road-map for creating Hidden Markov Models for the task is drawn.
Abstract: Is an algorithm with high precision and recall at identifying table-parts also good at locating tables? Several document analysis tasks require merging or splitting certain document elements to form others. The suitability of the commonly used precision and recall for such division/aggregation tasks is arguable, since their underlying assumption is that the granularity of the items at input is the same as at output. We propose a new pair of evaluation metrics that better suit document analysis’ needs and show their application to several table tasks. In the process, we present a number of robust table location algorithms with which we draw a road-map for creating Hidden Markov Models for the task.

Journal ArticleDOI
TL;DR: A word indexing and retrieval technique that does not require word segmentation and is tolerant to errors in character segmentation is proposed and tested on four copies of the Gutenberg Bibles.
Abstract: Retrieving text from early printed books is particularly difficult because in these documents, the words are very close to one another and, similarly to medieval manuscripts, there is a large use of ligatures and abbreviations. To address these problems, we propose a word indexing and retrieval technique that does not require word segmentation and is tolerant to errors in character segmentation. Two main principles characterize the approach. First, characters are identified in the pages and clustered with a self-organizing map (SOM). During the retrieval, the similarity of characters is estimated considering the proximity of cluster centroids in the SOM space, rather than directly comparing the character images. Second, query words are matched with the indexed sequence of characters by means of a dynamic time warping (DTW)-based approach. The proposed technique integrates the SOM similarity and the information about the width of characters in the string matching process. The best path in the DTW array is identified considering the widths of matching words with respect to the query so as to deal with broken or touching symbols. The proposed method is tested on four copies of the Gutenberg Bibles.
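The sketch below illustrates the DTW matching step under assumed inputs: characters are represented by their SOM cluster centroids and the local cost is the centroid distance; the centroid vectors are toy values and the character-width information used by the paper is omitted.

```python
# Hedged sketch of DTW matching between a query character sequence and an
# indexed character sequence, where the local cost is a distance between
# SOM cluster centroids rather than between the character images themselves.
# Centroid vectors and the cost definition here are illustrative assumptions.

import math

def centroid_distance(c1, c2):
    return math.dist(c1, c2)   # proximity of SOM cluster centroids

def dtw(query, indexed, cost=centroid_distance):
    n, m = len(query), len(indexed)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(query[i - 1], indexed[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy 2-D "centroids" for the characters of a query word and two candidates.
query      = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
candidate1 = [(0.1, 0.0), (1.0, 0.9), (2.1, 0.1)]          # similar word
candidate2 = [(5.0, 5.0), (6.0, 5.0)]                      # dissimilar word
print(dtw(query, candidate1) < dtw(query, candidate2))     # True
```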

Journal ArticleDOI
TL;DR: A novel term frequency estimation technique that incorporates word segmentation information inside the retrieval framework to improve overall system performance is described, together with a taxonomy of techniques for noisy text retrieval and their performance measured using standard IR evaluation metrics.
Abstract: With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance thus proving to be a major hurdle in providing robust search experience in handwritten documents. In this paper, we describe our recent research with focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique incorporating the word segmentation information inside the retrieval framework to improve the overall system performance. Second, we outline a taxonomy of different techniques used for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR’ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR’ed text. We describe these techniques in detail and also discuss their performance measures using standard IR evaluation metrics.

Journal ArticleDOI
TL;DR: Experimental results on a handwritten character recognition task demonstrate that MCE-trained and compressed PCGM-based classifiers can achieve much higher recognition accuracies than their counterparts based on traditional modified quadratic discriminant function (MQDF) when the footprint of the classifiers has to be made very small, e.g., less than 2 MB.
Abstract: In our previous work, a so-called precision constrained Gaussian model (PCGM) was proposed for character modeling to design compact recognizers of handwritten Chinese characters. A maximum likelihood training procedure was developed to estimate model parameters from training data. In this paper, we extend the above-mentioned work by using minimum classification error (MCE) training to improve recognition accuracy and using both split vector quantization and scalar quantization techniques to further compress model parameters. Experimental results on a handwritten character recognition task with a vocabulary of 2,965 Kanji characters demonstrate that MCE-trained and compressed PCGM-based classifiers can achieve much higher recognition accuracies than their counterparts based on traditional modified quadratic discriminant function (MQDF) when the footprint of the classifiers has to be made very small, e.g., less than 2 MB.

Journal ArticleDOI
TL;DR: A new approach based on Linear Discriminant Analysis to reject less reliable classifier outputs is presented; the resulting measurement is more comprehensive than traditional rejection measurements such as the First Rank Measurement and the First Two Ranks Measurement.
Abstract: In document recognition, it is often important to obtain high accuracy or reliability and to reject patterns that cannot be classified with high confidence. This is the case for applications such as the processing of financial documents in which errors can be very costly and therefore far less tolerable than rejections. This paper presents a new approach based on Linear Discriminant Analysis (LDA) to reject less reliable classifier outputs. To implement the rejection, which can be considered a two-class problem of accepting the classification result or otherwise, an LDA-based measurement is used to determine a new rejection threshold. This measurement (LDAM) is designed to take into consideration the confidence values of the classifier outputs and the relations between them, and it represents a more comprehensive measurement than traditional rejection measurements such as First Rank Measurement and First Two Ranks Measurement. Experiments are conducted on the CENPARMI database of numerals, the CENPARMI Arabic Isolated Numerals Database, and the numerals in the NIST Special Database 19. The results show that LDAM is more effective, and it can achieve a higher reliability while maintaining a high recognition rate on these databases of very different origins and sizes.
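A minimal sketch of an LDA-based accept/reject rule follows, using scikit-learn on synthetic confidence features (top-1 confidence and the top-1 vs. top-2 margin); the feature choice and the data are assumptions for illustration, not the LDAM formulation itself.

```python
# Hedged sketch of an LDA-based rejection rule: treat accept/reject as a
# two-class problem over the classifier's confidence outputs (here, the
# top-1 confidence and the top-1 vs. top-2 margin). The features and the
# synthetic training data are assumptions for illustration only.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Rows = [top1_confidence, top1_minus_top2_margin];
# label 1 = classifier output was correct (accept), 0 = it was wrong (reject).
correct = np.column_stack([rng.uniform(0.7, 1.0, 200), rng.uniform(0.3, 0.9, 200)])
wrong   = np.column_stack([rng.uniform(0.2, 0.8, 200), rng.uniform(0.0, 0.3, 200)])
X = np.vstack([correct, wrong])
y = np.concatenate([np.ones(200), np.zeros(200)])

lda = LinearDiscriminantAnalysis().fit(X, y)

def accept(top1_conf, margin, threshold=0.0):
    """Accept the classification if the LDA score exceeds the threshold."""
    score = lda.decision_function([[top1_conf, margin]])[0]
    return score > threshold

print(accept(0.95, 0.6))   # confident, well-separated output -> likely accepted
print(accept(0.45, 0.05))  # weak, ambiguous output -> likely rejected
```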

Journal ArticleDOI
TL;DR: An automatic method to find high-quality maps for a given geographic region using a Content-Based Image Retrieval approach that uses a new set of features for classification in order to capture the defining characteristics of a map.
Abstract: Maps are one of the most valuable documents for gathering geospatial information about a region. Yet, finding a collection of diverse, high-quality maps is a significant challenge because there is a dearth of content-specific metadata available to identify them from among other images on the Web. For this reason, it is desirable to analyze the content of each image. The problem is further complicated by the variations between different types of maps, such as street maps and contour maps, and also by the fact that many high-quality maps are embedded within other documents such as PDF reports. In this paper, we present an automatic method to find high-quality maps for a given geographic region. Not only does our method find documents that are maps, but also those that are embedded within other documents. We have developed a Content-Based Image Retrieval (CBIR) approach that uses a new set of features for classification in order to capture the defining characteristics of a map. This approach is able to identify all types of maps irrespective of their subject, scale, and color in a highly scalable and accurate way. Our classifier achieves an F1-measure of 74%, which is an 18% improvement over the previous work in the area.

Journal ArticleDOI
TL;DR: An algorithm for ontology-guided entity disambiguation that uses existing knowledge sources, such as domain-specific taxonomies and other structured data, to help develop a robust and dynamic reasoning system to be used as a repair adviser by service technicians is presented.
Abstract: Domain-specific knowledge is often recorded by experts in the form of unstructured text. For example, in the medical domain, clinical notes from electronic health records contain a wealth of information. Similar practices are found in other domains. The challenge we discuss in this paper is how to identify and extract part names from technicians’ repair notes, a noisy unstructured text data source from General Motors’ archives of solved vehicle repair problems, with the goal of developing a robust and dynamic reasoning system to be used as a repair adviser by service technicians. In the present work, we discuss two approaches to this problem. We present an algorithm for ontology-guided entity disambiguation that uses existing knowledge sources, such as domain-specific taxonomies and other structured data. We illustrate its use in the automotive domain, using the GM parts ontology and the unit structure of repair manuals text to build context models, which are then used to disambiguate mentions of part-related entities in the text. We also describe extraction of part names with a small amount of annotated data using hidden Markov models (HMM) with shrinkage, achieving an F-score of approximately 80%. Next, we used linear-chain conditional random fields (CRF) in order to model observation dependencies present in the repair notes. Using CRF did not lead to improved performance, but a slight improvement over the HMM results was obtained by using a weighted combination of the HMM and CRF models.

Journal ArticleDOI
TL;DR: A form of iterative contextual modeling that learns character models directly from the document it is trying to recognize in an incremental, iterative process to address difficult cases of optical character recognition.
Abstract: Despite ubiquitous claims that optical character recognition (OCR) is a “solved problem,” many categories of documents continue to break modern OCR software such as documents with moderate degradation or unusual fonts. Many approaches rely on pre-computed or stored character models, but these are vulnerable to cases when the font of a particular document was not part of the training set or when there is so much noise in a document that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable with those of a commercial OCR system on a subset of characters from a difficult test document in both English and Greek.
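The toy sketch below conveys the incremental idea of learning character models from the document being recognized: a few trusted glyphs seed per-class templates, the remaining glyphs are relabelled by nearest template, and the templates are re-estimated. The glyph vectors are synthetic, and the real system also performs segmentation, so this is an illustration of the iterative principle rather than the paper's system.

```python
# Hedged sketch of learning character models from the document itself:
# seed per-class templates from a few confidently labelled glyphs, relabel
# the remaining glyphs by nearest template, re-estimate, and iterate.

import numpy as np

rng = np.random.default_rng(1)

# Toy "glyph images" as 4-D vectors drawn around two unknown font prototypes.
proto = {"a": np.array([1.0, 0.0, 1.0, 0.0]), "b": np.array([0.0, 1.0, 0.0, 1.0])}
glyphs = np.vstack([proto[c] + rng.normal(0, 0.1, 4) for c in "aabbabab" * 5])
seed_labels = {0: "a", 2: "b"}          # the only glyphs we trust initially

templates = {c: glyphs[i].copy() for i, c in seed_labels.items()}
labels = dict(seed_labels)

for _ in range(5):                      # iterate: relabel, then re-estimate
    for i, g in enumerate(glyphs):
        labels[i] = min(templates, key=lambda c: np.linalg.norm(g - templates[c]))
    for c in templates:
        members = [glyphs[i] for i, lab in labels.items() if lab == c]
        templates[c] = np.mean(members, axis=0)

recovered = "".join(labels[i] for i in range(8))
print(recovered)   # should reproduce the first 8 true characters, "aabbabab"
```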

Journal ArticleDOI
TL;DR: This paper addresses the challenge of named entity detection in noisy OCR output and shows that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search.
Abstract: In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
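As a rough illustration (not the authors' system), the sketch below searches a simplified lattice, here a confusion network with one list of scored alternatives per word position, for gazetteer entities and filters hits by a confidence threshold, showing how a second-best hypothesis can yield an entity that the 1-best output misses.

```python
# Hedged sketch of named-entity search in a recognition lattice rather than
# the 1-best output. The lattice is simplified to a confusion network, and
# the confidence filter mirrors the false-alarm reduction idea; the gazetteer
# and the scores are toy data.

from itertools import product

# Each position: list of (hypothesis, confidence); first item = 1-best choice.
lattice = [
    [("meeting", 0.9)],
    [("in", 0.8)],
    [("Bagdad", 0.5), ("Baghdad", 0.4)],   # correct entity only in 2nd-best
    [("today", 0.9)],
]
gazetteer = {("Baghdad",), ("New", "York")}

def ne_hits(lattice, gazetteer, min_conf=0.3, max_len=2):
    hits = []
    for start in range(len(lattice)):
        for length in range(1, max_len + 1):
            span = lattice[start:start + length]
            if len(span) < length:
                break
            for combo in product(*span):
                tokens = tuple(t for t, _ in combo)
                conf = sum(c for _, c in combo) / length
                if tokens in gazetteer and conf >= min_conf:
                    hits.append((tokens, start, conf))
    return hits

print(ne_hits(lattice, gazetteer))     # lattice search finds ('Baghdad',)
best_only = [[pos[0]] for pos in lattice]
print(ne_hits(best_only, gazetteer))   # 1-best search misses it -> []
```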

Journal ArticleDOI
TL;DR: Evaluation on the Dutch dataset shows that these reconstructions become two orders of magnitude smaller and still resemble the original to a high degree, and are easier to speed-read and evaluate for relevance, due to added hyperlinks and a presentation optimized for reading from a terminal.
Abstract: A web portal providing access to over 250,000 scanned and OCRed cultural heritage documents is analyzed. The collection consists of the complete Dutch Hansard from 1917 to 1995. Each document consists of facsimile images of the original pages plus hidden OCRed text. The inclusion of images yields large file sizes of which less than 2% is the actual text. The search user interface of the portal provides poor ranking and not very informative document summaries (snippets). Thus, users are instrumental in weeding out non-relevant results. For that, they must assess the complete documents. This is a time-consuming and frustrating process because of long download and processing times of the large files. Instead of using the scanned images for relevance assessment, we propose to use a reconstruction of the original document from a purely semantic representation. Evaluation on the Dutch dataset shows that these reconstructions become two orders of magnitude smaller and still resemble the original to a high degree. In addition, they are easier to speed-read and evaluate for relevance, due to added hyperlinks and a presentation optimized for reading from a terminal. We describe the reconstruction process and evaluate the costs, the benefits, and the quality.

Journal ArticleDOI
TL;DR: In this paper, the authors measured the dissimilarities among several printed characters of a single page in the Gutenberg 42-line bible, and proved statistically the existence of several different matrices from which the metal types were constructed.
Abstract: We have measured the dissimilarities among several printed characters of a single page in the Gutenberg 42-line bible, and we prove statistically the existence of several different matrices from which the metal types were constructed. This is in contrast with the prevailing theory, which states that only one matrix per character was used in the printing process of Gutenberg’s greatest work. The main mathematical tool for this purpose is cluster analysis, combined with a statistical test for outliers. We carry out the research with two letters, ‘i’ and ‘a’. In the first case, an exact clustering method is employed; in the second, with more specimens to be classified, we resort to an approximate agglomerative clustering method. The results show that the letters form clusters according to their shape, with significant shape differences among clusters, and allow us to conclude, with a very small probability of error, that indeed the metal types used to print them were cast from several different matrices.
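A minimal sketch of the clustering step is given below using SciPy's agglomerative clustering on a precomputed dissimilarity matrix; the dissimilarity values are toy numbers, not the measurements reported in the paper.

```python
# Hedged sketch of the clustering step: agglomerative clustering of glyph
# specimens from a precomputed dissimilarity matrix, then cutting the
# dendrogram to see how many groups (candidate matrices) emerge.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric dissimilarities among 6 specimens of the letter 'i':
# specimens 0-2 resemble one another, as do specimens 3-5.
D = np.array([
    [0.0, 0.1, 0.2, 0.9, 1.0, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.2, 0.1, 0.0, 1.0, 0.8, 0.9],
    [0.9, 0.9, 1.0, 0.0, 0.2, 0.1],
    [1.0, 0.9, 0.8, 0.2, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.1, 0.0],
])

Z = linkage(squareform(D), method="average")       # agglomerative clustering
labels = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram
print(labels)   # two clusters, e.g. [1 1 1 2 2 2], suggesting two matrices
```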

Journal ArticleDOI
TL;DR: This document is a report of the discussions by the participants in the working group on “Noisy Text Datasets” organized during the Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009), held in Barcelona, Spain, on July 23–24, 2009.
Abstract: This document is a report of the discussions by the participants in the working group on “Noisy Text Datasets” organized during the Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009), held in Barcelona, Spain, on July 23–24, 2009.

Journal ArticleDOI
TL;DR: An innovative protocol for characterizing upstream the complementarity of shape descriptors is proposed; its originality is to be as independent of the final application as possible, and it relies on new quantitative and qualitative measures.
Abstract: Most document analysis applications rely on the extraction of shape descriptors, which may be grouped into different categories, each category having its own advantages and drawbacks (O.R. Terrades et al. in Proceedings of ICDAR’07, pp. 227–231, 2007). In order to improve the richness of their description, many authors choose to combine multiple descriptors. Yet, most of the authors who propose a new descriptor content themselves with comparing its performance to the performance of a set of single state-of-the-art descriptors in a specific applicative context (e.g. symbol recognition, symbol spotting...). This results in a proliferation of the shape descriptors proposed in the literature. In this article, we propose an innovative protocol, the originality of which is to be as independent of the final application as possible and which relies on new quantitative and qualitative measures. We introduce two types of measures: while the measures of the first type are intended to characterize the descriptive power (in terms of uniqueness, distinctiveness and robustness towards noise) of a descriptor, the second type of measures characterizes the complementarity between multiple descriptors. Characterizing upstream the complementarity of shape descriptors is an alternative to the usual approach where the descriptors to be combined are selected by trial and error, considering the performance characteristics of the overall system. To illustrate the contribution of this protocol, we performed experimental studies using a set of descriptors and a set of symbols which are widely used by the community namely ART and SC descriptors and the GREC 2003 database.

Journal ArticleDOI
TL;DR: A hybrid model for mining text relations between named entities is presented, which can deal with data highly affected by linguistic noise and is robust to non-conventional languages such as dialects, jargon expressions or coded words typically contained in such text.
Abstract: In this paper, we present models for mining text relations between named entities, which can deal with data highly affected by linguistic noise. Our models are made robust by: (a) the exploitation of state-of-the-art statistical algorithms such as support vector machines (SVMs) along with effective and versatile pattern mining methods, e.g. word sequence kernels; (b) the design of specific features capable of capturing long distance relationships; and (c) the use of domain prior knowledge in the form of ontological constraints, e.g. bounds on the type of relation arguments given by the semantic categories of the involved entities. This property allows the training data required by SVMs to be kept small, consequently lowering the system design costs. We empirically tested our hybrid model in the very complex domain of business intelligence, where the textual data are constituted by reports on investigations into criminal enterprises based on police interrogatory reports, electronic eavesdropping and wiretaps. The target relations are typically established between entities as they are mentioned in these information sources. The experiments on mining such relations show that our approach with small training data is robust to non-conventional languages such as dialects, jargon expressions or coded words typically contained in such text.