
Showing papers in "International Journal on Document Analysis and Recognition in 2011"


Journal ArticleDOI
TL;DR: The contest details including the dataset, the ground truth and the evaluation criteria are described and the results of the 12 participating methods as well as of two state-of-the-art algorithms are presented.
Abstract: ICDAR 2009 Handwriting Segmentation Contest was organized in the context of ICDAR2009 conference in order to record recent advances in off-line handwriting segmentation. The contest includes handwritten document images produced by many writers in several languages (English, French, German and Greek). These images are manually annotated in order to produce the ground truth which corresponds to the correct text line and word segmentation result. For the evaluation, a well-established approach is used based on counting the number of matches between the entities detected by the segmentation algorithm and the entities in the ground truth. This paper describes the contest details including the dataset, the ground truth and the evaluation criteria and presents the results of the 12 participating methods as well as of two state-of-the-art algorithms. A description of the winning algorithms is also given.
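As a rough illustration of the match-counting evaluation described above (not the contest's official scoring code), the sketch below assumes regions are represented as sets of pixel coordinates and counts ground-truth entities covered by a detection whose overlap exceeds a threshold; the threshold value and the FM combination are illustrative.

```python
# Hedged sketch: counting ground-truth regions covered by a detected region,
# in the spirit of the ICDAR segmentation evaluation. Regions are assumed to
# be sets of (x, y) pixel coordinates; the official protocol additionally
# enforces one-to-one matching, which is omitted here for brevity.

def match_score(detected, ground_truth):
    """Overlap ratio between a detected region and a ground-truth region."""
    union = len(detected | ground_truth)
    return len(detected & ground_truth) / union if union else 0.0

def evaluate(detections, ground_truths, threshold=0.9):
    matches = sum(
        1 for gt in ground_truths
        if any(match_score(det, gt) >= threshold for det in detections)
    )
    detection_rate = matches / len(ground_truths) if ground_truths else 0.0
    recognition_accuracy = matches / len(detections) if detections else 0.0
    denom = detection_rate + recognition_accuracy
    fm = 2 * detection_rate * recognition_accuracy / denom if denom else 0.0
    return detection_rate, recognition_accuracy, fm

# Toy example with tiny pixel-set "regions":
gt = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}]
det = [{(0, 0), (0, 1)}, {(9, 9)}]
print(evaluate(det, gt, threshold=0.8))  # (0.5, 0.5, 0.5)
```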

139 citations


Journal ArticleDOI
TL;DR: A very important result of this competition is the continuous improvement of the recognition rate from competition to competition, amounting to more than 5%.
Abstract: This paper describes the Arabic handwriting recognition competition held at ICDAR 2009. This third competition (the first two were held at ICDAR 2005 and 2007, respectively) again used the IfN/ENIT-database with Arabic handwritten Tunisian town names. This very successful database is used today by more than 82 research groups from universities, research centers, and industries worldwide. At ICDAR 2009, 7 groups with 17 systems participated in the competition. The system evaluation was made on one known dataset and on two datasets unknown to the participants. The systems were compared based on the recognition rates achieved. Additionally, the relative speeds of the systems were compared. A description of the participating groups, their systems, and the results achieved is presented. As a very important result of this competition, a continuous improvement of the recognition rate of more than 5% from competition to competition can be observed.

108 citations


Journal ArticleDOI
TL;DR: The contest details including the evaluation measures used as well as the performance of the 43 submitted methods along with a short description of the top five algorithms are described.
Abstract: DIBCO 2009 is the first International Document Image Binarization Contest organized in the context of ICDAR 2009 conference. The general objective of the contest is to identify current advances in document image binarization using established evaluation performance measures. This paper describes the contest details including the evaluation measures used as well as the performance of the 43 submitted methods along with a short description of the top five algorithms.

81 citations


Journal ArticleDOI
Lianwen Jin1, Yan Gao1, Gang Liu1, Yunyang Li1, Kai Ding1 
TL;DR: The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples, and some evaluation results on the database are reported using state-of-the-art recognizers for benchmarking.
Abstract: A comprehensive online unconstrained Chinese handwriting dataset, SCUT-COUCH2009, is introduced in this paper. As a revision of SCUT-COUCH2008 [1], the SCUT-COUCH2009 database consists of more datasets with larger vocabularies and more writers. The database is built to facilitate the research of unconstrained online Chinese handwriting recognition. It is comprehensive in the sense that it consists of 11 datasets of different vocabularies, named GB1, GB2, TradGB1, Big5, Pinyin, Letters, Digit, Symbol, Word8888, Word17366 and Word44208. In particular, the SCUT-COUCH2009 database contains handwritten samples of 6,763 single Chinese characters in the GB2312-80 standard, 5,401 traditional Chinese characters of the Big5 standard, 1,384 traditional Chinese characters corresponding to the level 1 characters of the GB2312-80 standard, 8,888 frequently used Chinese words, 17,366 daily-used Chinese words, 44,208 complete words from the Fourth Edition of “The Contemporary Chinese Dictionary”, 2,010 Pinyin and 184 daily-used symbols. The samples were collected using PDAs (Personal Digital Assistants) and smart phones with touch screens and were contributed by more than 190 persons. The total number of character samples is over 3.6 million. The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples. We report some evaluation results on the database using state-of-the-art recognizers for benchmarking.

72 citations


Journal ArticleDOI
TL;DR: This paper describes the on-line Arabic handwriting recognition competition held at the tenth International Conference on Document Analysis and Recognition (ICDAR 2009; Proceedings of the 10th International Conference on Document Analysis and Recognition, vol 3, pp 1388–1392, 2009).
Abstract: This paper describes the on-line Arabic handwriting recognition competition held at the tenth International Conference on Document Analysis and Recognition (ICDAR 2009; Proceedings of the 10th International Conference on Document Analysis and Recognition, vol 3, pp 1388–1392, 2009). This first competition used the so-called ADAB database of Arabic on-line handwritten words. At this first competition, 3 groups with 7 different systems participated. The systems were tested on known data (training datasets made available to the participants, sets 1 to 3) and on one test dataset unknown to all participants (set 4). The systems are compared on the most important characteristic of classification systems, the recognition rate. Additionally, the relative speed of the different systems was compared. A short description of the participating groups, their systems, the experimental setup, and the results obtained is presented.

55 citations


Journal ArticleDOI
TL;DR: The character confusion-based prototype of Text-Induced Corpus Clean-up is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents, showing that the system is not sensitive to domain variation.
Abstract: We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string, some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting errors, and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed over its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.
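A minimal sketch of the anagram-hashing idea follows; the key function (a fixed power of each character's code point) and the toy vocabulary are assumptions for illustration, not the TICCL implementation. The point is that a given character confusion shifts the key by a constant, so all word pairs exhibiting that confusion can be retrieved globally from the key index.

```python
# Hedged sketch of anagram hashing for global character-confusion retrieval.
# The exact key function and exponent are assumptions; anagrams share a key,
# and a given confusion (e.g. 'c' misrecognised for 'e') shifts the key by a
# fixed amount, so all word pairs showing that confusion are found in one pass.

from collections import defaultdict

POWER = 5

def anagram_key(word: str) -> int:
    return sum(ord(c) ** POWER for c in word)

def build_index(vocabulary):
    index = defaultdict(set)
    for word in vocabulary:
        index[anagram_key(word)].add(word)
    return index

def pairs_for_confusion(index, old: str, new: str):
    """Yield (variant, candidate-correction) pairs differing by one confusion."""
    delta = anagram_key(new) - anagram_key(old)
    for key, words in index.items():
        for other in index.get(key + delta, ()):
            for word in words:
                # Cheap character-level check to weed out accidental key matches.
                if sorted(word.replace(old, new, 1)) == sorted(other):
                    yield word, other

vocab = {"scanned", "scanncd", "house", "housc"}          # toy OCR output
index = build_index(vocab)
print(sorted(pairs_for_confusion(index, "c", "e")))
# [('housc', 'house'), ('scanncd', 'scanned')]
```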

52 citations


Journal ArticleDOI
TL;DR: The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time and derived a general form for the probability that a sequence of blocks contains the searched information.
Abstract: We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We evaluated our proposal experimentally using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
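The sketch below is a much-simplified, hypothetical illustration of maximum-likelihood scoring of candidate OCR blocks: each block is described by a few geometric features modelled as independent Gaussians whose parameters are estimated from the operator-provided samples. The paper's actual model covers sequences of blocks and a richer probability form.

```python
# Hedged, much-simplified sketch of maximum-likelihood scoring of candidate
# blocks. Each candidate is a single block described by geometric features,
# each modelled as an independent Gaussian whose parameters are the ML
# estimates (sample mean and variance) from the operator's examples.

import math

def ml_estimate(samples):
    """Per-feature mean and variance from a handful of labelled examples."""
    n = len(samples)
    means = [sum(col) / n for col in zip(*samples)]
    variances = [
        max(sum((x - m) ** 2 for x in col) / n, 1e-6)  # floor avoids zero variance
        for col, m in zip(zip(*samples), means)
    ]
    return means, variances

def log_likelihood(block, means, variances):
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(block, means, variances)
    )

# Features: (x, y, width, height) of the OCR block holding, say, an invoice total.
training_blocks = [(510, 705, 80, 18), (505, 710, 82, 17), (512, 700, 79, 18)]
means, variances = ml_estimate(training_blocks)

candidates = [(100, 50, 300, 20), (508, 707, 81, 18), (400, 400, 60, 15)]
best = max(candidates, key=lambda b: log_likelihood(b, means, variances))
print(best)  # (508, 707, 81, 18)
```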

49 citations


Journal ArticleDOI
TL;DR: This work shows how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches, and asks if matching procedures alone suffice to lift IR on historical texts to a satisfactory level.
Abstract: Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In order to improve recall for search engines, modern words used in queries have to be associated with corresponding historical variants found in the documents. In the literature, the use of (1) special matching procedures and (2) lexica for historical language have been suggested as two alternative ways to solve this problem. In the first part of the paper, we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented where matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. In the second part of the paper, we ask if matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over centuries, it is not simple to obtain an answer. We present experiments where the performance of matching procedures in text collections from four centuries is studied. After classifying missed vocabulary, we measure precision and recall of the matching procedure for each period. Results indicate that for earlier periods, matching procedures alone do not lead to satisfactory results. We then describe experiments where the gain for recall obtained from historical lexica of distinct sizes is estimated.
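A minimal sketch of a rule-based matching procedure is given below; the rewrite rules and the toy historical vocabulary are illustrative assumptions, not the rule set constructed by the tool described in the paper.

```python
# Hedged sketch of a rule-based matching procedure for historical spelling
# variants. The rewrite rules below are illustrative assumptions (loosely in
# the style of early-modern German patterns), not the rules from the paper.

from itertools import product

RULES = {          # modern substring -> possible historical spellings
    "t":  ["t", "th"],
    "ei": ["ei", "ey"],
    "ä":  ["ä", "ae", "e"],
}

def variants(word):
    """Enumerate historical candidates by optionally applying each rule."""
    pieces = [word]
    for modern, olds in RULES.items():
        next_pieces = []
        for piece in pieces:
            parts = piece.split(modern)
            # For every occurrence, choose one of the historical spellings.
            for combo in product(olds, repeat=len(parts) - 1):
                rebuilt = parts[0]
                for filler, rest in zip(combo, parts[1:]):
                    rebuilt += filler + rest
                next_pieces.append(rebuilt)
        pieces = next_pieces
    return set(pieces)

historical_vocab = {"theyl", "thun", "stadt"}      # toy document vocabulary
query = "teil"                                     # modern query word
print(variants(query) & historical_vocab)          # {'theyl'}
```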

45 citations


Journal ArticleDOI
TL;DR: A general approach to creating large, ground-truthed corpora for structured sketch domains such as mathematics, where random sketch templates are generated automatically using a grammar model of the sketch domain, and annotated with ground-truth.
Abstract: Although publicly available ground-truthed corpora have proven useful for training, evaluating, and comparing recognition systems in many domains, the availability of such corpora for sketch recognizers, and math recognizers in particular, is currently quite poor. This paper presents a general approach to creating large, ground-truthed corpora for structured sketch domains such as mathematics. In the approach, random sketch templates are generated automatically using a grammar model of the sketch domain. These templates are transcribed manually, then automatically annotated with ground-truth. The annotation procedure uses the generated sketch templates to find a matching between transcribed and generated symbols. A large, ground-truthed corpus of handwritten mathematical expressions presented in the paper illustrates the utility of the approach.
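The sketch below illustrates grammar-driven template generation under assumed productions; the grammar, depth limit, and LaTeX-flavoured tokens are placeholders, not the paper's grammar model.

```python
# Hedged sketch of generating random expression templates from a small
# grammar, in the spirit of the corpus-generation approach described above.
# The grammar and the depth limit are illustrative assumptions.

import random

GRAMMAR = {
    "EXPR": [["TERM"], ["TERM", "+", "EXPR"], ["TERM", "-", "EXPR"]],
    "TERM": [["FACTOR"], ["FACTOR", "^", "DIGIT"], ["\\frac{", "EXPR", "}{", "EXPR", "}"]],
    "FACTOR": [["DIGIT"], ["VAR"], ["(", "EXPR", ")"]],
    "DIGIT": [[d] for d in "0123456789"],
    "VAR": [[v] for v in "xyzab"],
}

def generate(symbol="EXPR", depth=0, max_depth=4):
    """Expand a non-terminal; beyond max_depth prefer the shortest production."""
    if symbol not in GRAMMAR:
        return symbol                      # terminal token
    productions = GRAMMAR[symbol]
    production = (min(productions, key=len) if depth >= max_depth
                  else random.choice(productions))
    return "".join(generate(s, depth + 1, max_depth) for s in production)

random.seed(0)
for _ in range(3):
    # e.g. templates such as "\frac{x}{3}+y^2", to be transcribed by writers
    print(generate())
```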

43 citations


Journal ArticleDOI
TL;DR: In this paper, a novel and general nonlinear model for canceling the show-through phenomenon is proposed based on a new recursive and extendible structure, and a refined separating architecture is introduced for simultaneously removing the show-through and blurring effects.
Abstract: Digital documents are usually degraded during the scanning process due to the contents of the backside of the scanned manuscript. This is often caused by the show-through effect, i.e. the backside image that interferes with the main front side picture due to the intrinsic transparency of the paper. This phenomenon is one of the degradations that one would like to remove, especially in the field of Optical Character Recognition (OCR) or document digitization, which require denoised texts as inputs. In this paper, we first propose a novel and general nonlinear model for canceling the show-through phenomenon. A nonlinear blind source separation algorithm is used for this purpose based on a new recursive and extendible structure. However, the results are limited by a blurring effect that appears during the scanning process because of the light transfer function of the paper. Consequently, to improve the results, we introduce a refined separating architecture for simultaneously removing the show-through and blurring effects.

39 citations


Journal ArticleDOI
TL;DR: A word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine is proposed and has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries.
Abstract: In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step is performed in order to improve the quality of the document images, while word segmentation is accomplished with the use of two complementary segmentation methodologies. In the proposed methodology, synthetic word images are created from keywords, and these images are compared to all the words in the digitized documents. A user feedback process is used in order to refine the search procedure. The methodology has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries. In order to improve the efficiency of access and search, natural language processing techniques have been incorporated, comprising a morphological generator that enables searching the documents using only a base word-form to locate all the corresponding inflected word-forms, and a synonym dictionary that further facilitates access to the semantic context of the documents.

Journal ArticleDOI
TL;DR: This work presents a novel confidence- and margin-based discriminative training approach for model adaptation of a hidden Markov model (HMM)-based handwriting recognition system to handle different handwriting styles and their variations.
Abstract: We present a novel confidence- and margin-based discriminative training approach for model adaptation of a hidden Markov model (HMM)-based handwriting recognition system to handle different handwriting styles and their variations. Most current approaches are maximum-likelihood (ML) trained HMM systems and try to adapt their models to different writing styles using writer adaptive training, unsupervised clustering, or additional writer-specific data. Here, discriminative training based on the maximum mutual information (MMI) and minimum phone error (MPE) criteria are used to train writer-independent handwriting models. For model adaptation during decoding, an unsupervised confidence-based discriminative training on a word and frame level within a two-pass decoding process is proposed. The proposed methods are evaluated for closed-vocabulary isolated handwritten word recognition on the IFN/ENIT Arabic handwriting database, where the word error rate is decreased by 33% relative compared to a ML trained baseline system. On the large-vocabulary line recognition task of the IAM English handwriting database, the word error rate is decreased by 25% relative.
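For background, the textbook form of the MMI objective used in such discriminative training is reproduced below, with \Lambda the model parameters, x_r and w_r the r-th training observation sequence and its transcription, and p(w) the prior over word sequences; the paper's confidence- and margin-based criterion extends this basic form, so the formula is given as general context rather than as the authors' exact objective.

```latex
% Textbook MMI objective (general background; the paper's criterion adds
% confidence- and margin-based terms on top of discriminative training):
\mathcal{F}_{\mathrm{MMI}}(\Lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\Lambda}(x_r \mid w_r)\, p(w_r)}
         {\sum_{w} p_{\Lambda}(x_r \mid w)\, p(w)}
```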

Journal ArticleDOI
TL;DR: This paper deals with the difficult problem of indexing drop caps and specifically, considers the problem of letter extraction from this complex graphic images and proposes an original strategy based on an analysis of the features of the images to be indexed.
Abstract: This paper deals with the difficult problem of indexing ancient graphic images. It tackles the particular case of indexing drop caps (also called Lettrines) and specifically, considers the problem of letter extraction from this complex graphic images. Based on an analysis of the features of the images to be indexed, an original strategy is proposed. This approach relies on filtering the relevant information, on the basis of Meyer decomposition. Then, in order to accommodate the variability of representation of the information, a Zipf’s law modeling enables detection of the regions belonging to the letter, what allows it to be segmented. The overall process is evaluated using a relevant set of images, which shows the relevance of the approach.

Journal ArticleDOI
TL;DR: The setup of the Book Structure Extraction competition run at ICDAR 2009 is described, the book collection used in the task, the collaborative construction of the ground truth, the evaluation measures, and the evaluation results are discussed.
Abstract: This paper describes the setup of the Book Structure Extraction competition run at ICDAR 2009. The goal of the competition was to evaluate and compare automatic techniques for deriving structure information from digitized books, which could then be used to aid navigation inside the books. More specifically, the task that participants faced was to construct hyperlinked tables of contents for a collection of 1,000 digitized books. This paper describes the setup of the competition and its challenges. It introduces and discusses the book collection used in the task, the collaborative construction of the ground truth, the evaluation measures, and the evaluation results. The paper also introduces a data set to be used freely for research evaluation purposes.

Journal ArticleDOI
TL;DR: Performance evaluation of mathematical expression recognition systems is attempted and the changes required to convert the tree corresponding to the expression generated by the recognizer into the groundtruthed one are noted.
Abstract: Performance evaluation of mathematical expression recognition systems is attempted. The proposed method assumes expressions (input as well as recognition output) are coded following MathML or TeX/LaTeX (which also gets converted into MathML) format. Since any MathML representation follows a tree structure, evaluation of performance has been modeled as a tree-matching problem. The tree corresponding to the expression generated by the recognizer is compared with the groundtruthed one by comparing the corresponding Euler strings. The changes required to convert the tree corresponding to the expression generated by the recognizer into the groundtruthed one are noted. The number of changes required to make such a conversion is basically the distance between the trees. This distance gives the performance measure for the system under testing. The proposed algorithm also pinpoints the positions of the changes in the output MathML file. Testing of the proposed evaluation method considers a set of example groundtruthed expressions and their corresponding recognized results produced by an expression recognition system.
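A minimal sketch of the Euler-string comparison is shown below: each expression tree is serialised into its Euler string and the edit distance between the two strings serves as the measure. The tree encoding and the closing-marker convention are assumptions for illustration.

```python
# Hedged sketch of the Euler-string comparison idea: serialise each expression
# tree into its Euler string (label on entering a node, a paired marker on
# leaving), then take the edit distance between the two strings as the
# performance measure. Tree and label conventions here are assumptions.

def euler_string(node):
    """node = (label, [children]); emit label on entry, label+'~' on exit."""
    label, children = node
    out = [label]
    for child in children:
        out.extend(euler_string(child))
    out.append(label + "~")
    return out

def edit_distance(a, b):
    """Standard Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

# Ground truth: x^2 + 1    Recognised: x^2 + 7  (MathML-like trees, simplified)
gt  = ("plus", [("power", [("x", []), ("2", [])]), ("1", [])])
rec = ("plus", [("power", [("x", []), ("2", [])]), ("7", [])])
print(edit_distance(euler_string(gt), euler_string(rec)))  # 2
```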

Journal ArticleDOI
TL;DR: A new pair of evaluation metrics that better suit document analysis’ needs is proposed and applied to several table tasks, and a road-map for creating Hidden Markov Models for the task is drawn.
Abstract: Is an algorithm with high precision and recall at identifying table-parts also good at locating tables? Several document analysis tasks require merging or splitting certain document elements to form others. The suitability of the commonly used precision and recall for such division/aggregation tasks is arguable, since their underlying assumption is that the granularity of the items at input is the same as at output. We propose a new pair of evaluation metrics that better suit document analysis’ needs and show their application to several table tasks. In the process, we present a number of robust table location algorithms with which we draw a road-map for creating Hidden Markov Models for the task.

Journal ArticleDOI
TL;DR: A word indexing and retrieval technique that does not require word segmentation and is tolerant to errors in character segmentation is proposed and tested on four copies of the Gutenberg Bibles.
Abstract: Retrieving text from early printed books is particularly difficult because in these documents, the words are very close to one another and, similarly to medieval manuscripts, there is a large use of ligatures and abbreviations. To address these problems, we propose a word indexing and retrieval technique that does not require word segmentation and is tolerant to errors in character segmentation. Two main principles characterize the approach. First, characters are identified in the pages and clustered with a self-organizing map (SOM). During the retrieval, the similarity of characters is estimated considering the proximity of cluster centroids in the SOM space, rather than directly comparing the character images. Second, query words are matched with the indexed sequence of characters by means of a dynamic time warping (DTW)-based approach. The proposed technique integrates the SOM similarity and the information about the width of characters in the string matching process. The best path in the DTW array is identified considering the widths of matching words with respect to the query so as to deal with broken or touching symbols. The proposed method is tested on four copies of the Gutenberg Bibles.
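The sketch below illustrates the DTW matching step under assumed inputs: characters are represented by their SOM cluster centroids and the local cost is the centroid distance; the centroid vectors are toy values and the character-width information used by the paper is omitted.

```python
# Hedged sketch of DTW matching between a query character sequence and an
# indexed character sequence, where the local cost is a distance between
# SOM cluster centroids rather than between the character images themselves.
# Centroid vectors and the cost definition here are illustrative assumptions.

import math

def centroid_distance(c1, c2):
    return math.dist(c1, c2)   # proximity of SOM cluster centroids

def dtw(query, indexed, cost=centroid_distance):
    n, m = len(query), len(indexed)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(query[i - 1], indexed[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy 2-D "centroids" for the characters of a query word and two candidates.
query      = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
candidate1 = [(0.1, 0.0), (1.0, 0.9), (2.1, 0.1)]          # similar word
candidate2 = [(5.0, 5.0), (6.0, 5.0)]                      # dissimilar word
print(dtw(query, candidate1) < dtw(query, candidate2))     # True
```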

Journal ArticleDOI
TL;DR: A novel term frequency estimation technique that incorporates word segmentation information inside the retrieval framework to improve overall system performance is described, together with a taxonomy of techniques for noisy text retrieval and their performance measured using standard IR evaluation metrics.
Abstract: With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance thus proving to be a major hurdle in providing robust search experience in handwritten documents. In this paper, we describe our recent research with focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique incorporating the word segmentation information inside the retrieval framework to improve the overall system performance. Second, we outline a taxonomy of different techniques used for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR’ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR’ed text. We describe these techniques in detail and also discuss their performance measures using standard IR evaluation metrics.

Journal ArticleDOI
TL;DR: Experimental results on a handwritten character recognition task demonstrate that MCE-trained and compressed PCGM-based classifiers can achieve much higher recognition accuracies than their counterparts based on traditional modified quadratic discriminant function (MQDF) when the footprint of the classifiers has to be made very small, e.g., less than 2 MB.
Abstract: In our previous work, a so-called precision constrained Gaussian model (PCGM) was proposed for character modeling to design compact recognizers of handwritten Chinese characters. A maximum likelihood training procedure was developed to estimate model parameters from training data. In this paper, we extend the above-mentioned work by using minimum classification error (MCE) training to improve recognition accuracy and using both split vector quantization and scalar quantization techniques to further compress model parameters. Experimental results on a handwritten character recognition task with a vocabulary of 2,965 Kanji characters demonstrate that MCE-trained and compressed PCGM-based classifiers can achieve much higher recognition accuracies than their counterparts based on traditional modified quadratic discriminant function (MQDF) when the footprint of the classifiers has to be made very small, e.g., less than 2 MB.

Journal ArticleDOI
TL;DR: A new approach based on Linear Discriminant Analysis to reject less reliable classifier outputs is presented; the resulting measurement is more comprehensive than traditional rejection measurements such as the First Rank Measurement and the First Two Ranks Measurement.
Abstract: In document recognition, it is often important to obtain high accuracy or reliability and to reject patterns that cannot be classified with high confidence. This is the case for applications such as the processing of financial documents in which errors can be very costly and therefore far less tolerable than rejections. This paper presents a new approach based on Linear Discriminant Analysis (LDA) to reject less reliable classifier outputs. To implement the rejection, which can be considered a two-class problem of accepting the classification result or otherwise, an LDA-based measurement is used to determine a new rejection threshold. This measurement (LDAM) is designed to take into consideration the confidence values of the classifier outputs and the relations between them, and it represents a more comprehensive measurement than traditional rejection measurements such as First Rank Measurement and First Two Ranks Measurement. Experiments are conducted on the CENPARMI database of numerals, the CENPARMI Arabic Isolated Numerals Database, and the numerals in the NIST Special Database 19. The results show that LDAM is more effective, and it can achieve a higher reliability while maintaining a high recognition rate on these databases of very different origins and sizes.
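A minimal sketch of an LDA-based accept/reject rule follows, using scikit-learn on synthetic confidence features (top-1 confidence and the top-1 vs. top-2 margin); the feature choice and the data are assumptions for illustration, not the LDAM formulation itself.

```python
# Hedged sketch of an LDA-based rejection rule: treat accept/reject as a
# two-class problem over the classifier's confidence outputs (here, the
# top-1 confidence and the top-1 vs. top-2 margin). The features and the
# synthetic training data are assumptions for illustration only.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Rows = [top1_confidence, top1_minus_top2_margin];
# label 1 = classifier output was correct (accept), 0 = it was wrong (reject).
correct = np.column_stack([rng.uniform(0.7, 1.0, 200), rng.uniform(0.3, 0.9, 200)])
wrong   = np.column_stack([rng.uniform(0.2, 0.8, 200), rng.uniform(0.0, 0.3, 200)])
X = np.vstack([correct, wrong])
y = np.concatenate([np.ones(200), np.zeros(200)])

lda = LinearDiscriminantAnalysis().fit(X, y)

def accept(top1_conf, margin, threshold=0.0):
    """Accept the classification if the LDA score exceeds the threshold."""
    score = lda.decision_function([[top1_conf, margin]])[0]
    return score > threshold

print(accept(0.95, 0.6))   # confident, well-separated output -> likely accepted
print(accept(0.45, 0.05))  # weak, ambiguous output -> likely rejected
```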

Journal ArticleDOI
TL;DR: An automatic method to find high-quality maps for a given geographic region using a Content-Based Image Retrieval approach that uses a new set of features for classification in order to capture the defining characteristics of a map.
Abstract: Maps are one of the most valuable documents for gathering geospatial information about a region. Yet, finding a collection of diverse, high-quality maps is a significant challenge because there is a dearth of content-specific metadata available to identify them from among other images on the Web. For this reason, it is desirable to analyze the content of each image. The problem is further complicated by the variations between different types of maps, such as street maps and contour maps, and also by the fact that many high-quality maps are embedded within other documents such as PDF reports. In this paper, we present an automatic method to find high-quality maps for a given geographic region. Not only does our method find documents that are maps, but also those that are embedded within other documents. We have developed a Content-Based Image Retrieval (CBIR) approach that uses a new set of features for classification in order to capture the defining characteristics of a map. This approach is able to identify all types of maps irrespective of their subject, scale, and color in a highly scalable and accurate way. Our classifier achieves an F1-measure of 74%, which is an 18% improvement over the previous work in the area.

Journal ArticleDOI
TL;DR: An algorithm for ontology-guided entity disambiguation that uses existing knowledge sources, such as domain-specific taxonomies and other structured data, to help develop a robust and dynamic reasoning system to be used as a repair adviser by service technicians is presented.
Abstract: Domain-specific knowledge is often recorded by experts in the form of unstructured text. For example, in the medical domain, clinical notes from electronic health records contain a wealth of information. Similar practices are found in other domains. The challenge we discuss in this paper is how to identify and extract part names from technicians’ repair notes, a noisy unstructured text data source from General Motors’ archives of solved vehicle repair problems, with the goal of developing a robust and dynamic reasoning system to be used as a repair adviser by service technicians. In the present work, we discuss two approaches to this problem. We present an algorithm for ontology-guided entity disambiguation that uses existing knowledge sources, such as domain-specific taxonomies and other structured data. We illustrate its use in the automotive domain, using the GM parts ontology and the unit structure of repair manuals text to build context models, which are then used to disambiguate mentions of part-related entities in the text. We also describe extraction of part names with a small amount of annotated data using hidden Markov models (HMM) with shrinkage, achieving an F-score of approximately 80%. Next, we used linear-chain conditional random fields (CRF) in order to model observation dependencies present in the repair notes. Using CRF did not lead to improved performance, but a slight improvement over the HMM results was obtained by using a weighted combination of the HMM and CRF models.

Journal ArticleDOI
TL;DR: A form of iterative contextual modeling that learns character models directly from the document it is trying to recognize in an incremental, iterative process to address difficult cases of optical character recognition.
Abstract: Despite ubiquitous claims that optical character recognition (OCR) is a “solved problem,” many categories of documents continue to break modern OCR software such as documents with moderate degradation or unusual fonts. Many approaches rely on pre-computed or stored character models, but these are vulnerable to cases when the font of a particular document was not part of the training set or when there is so much noise in a document that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable with those of a commercial OCR system on a subset of characters from a difficult test document in both English and Greek.
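The toy sketch below conveys the incremental idea of learning character models from the document being recognized: a few trusted glyphs seed per-class templates, the remaining glyphs are relabelled by nearest template, and the templates are re-estimated. The glyph vectors are synthetic, and the real system also performs segmentation, so this is an illustration of the iterative principle rather than the paper's system.

```python
# Hedged sketch of learning character models from the document itself:
# seed per-class templates from a few confidently labelled glyphs, relabel
# the remaining glyphs by nearest template, re-estimate, and iterate.

import numpy as np

rng = np.random.default_rng(1)

# Toy "glyph images" as 4-D vectors drawn around two unknown font prototypes.
proto = {"a": np.array([1.0, 0.0, 1.0, 0.0]), "b": np.array([0.0, 1.0, 0.0, 1.0])}
glyphs = np.vstack([proto[c] + rng.normal(0, 0.1, 4) for c in "aabbabab" * 5])
seed_labels = {0: "a", 2: "b"}          # the only glyphs we trust initially

templates = {c: glyphs[i].copy() for i, c in seed_labels.items()}
labels = dict(seed_labels)

for _ in range(5):                      # iterate: relabel, then re-estimate
    for i, g in enumerate(glyphs):
        labels[i] = min(templates, key=lambda c: np.linalg.norm(g - templates[c]))
    for c in templates:
        members = [glyphs[i] for i, lab in labels.items() if lab == c]
        templates[c] = np.mean(members, axis=0)

recovered = "".join(labels[i] for i in range(8))
print(recovered)   # should reproduce the first 8 true characters, "aabbabab"
```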

Journal ArticleDOI
TL;DR: This paper addresses the challenge of named entity detection in noisy OCR output and shows that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search.
Abstract: In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
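As a rough illustration (not the authors' system), the sketch below searches a simplified lattice, here a confusion network with one list of scored alternatives per word position, for gazetteer entities and filters hits by a confidence threshold, showing how a second-best hypothesis can yield an entity that the 1-best output misses.

```python
# Hedged sketch of named-entity search in a recognition lattice rather than
# the 1-best output. The lattice is simplified to a confusion network, and
# the confidence filter mirrors the false-alarm reduction idea; the gazetteer
# and the scores are toy data.

from itertools import product

# Each position: list of (hypothesis, confidence); first item = 1-best choice.
lattice = [
    [("meeting", 0.9)],
    [("in", 0.8)],
    [("Bagdad", 0.5), ("Baghdad", 0.4)],   # correct entity only in 2nd-best
    [("today", 0.9)],
]
gazetteer = {("Baghdad",), ("New", "York")}

def ne_hits(lattice, gazetteer, min_conf=0.3, max_len=2):
    hits = []
    for start in range(len(lattice)):
        for length in range(1, max_len + 1):
            span = lattice[start:start + length]
            if len(span) < length:
                break
            for combo in product(*span):
                tokens = tuple(t for t, _ in combo)
                conf = sum(c for _, c in combo) / length
                if tokens in gazetteer and conf >= min_conf:
                    hits.append((tokens, start, conf))
    return hits

print(ne_hits(lattice, gazetteer))     # lattice search finds ('Baghdad',)
best_only = [[pos[0]] for pos in lattice]
print(ne_hits(best_only, gazetteer))   # 1-best search misses it -> []
```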

Journal ArticleDOI
TL;DR: Evaluation on the Dutch dataset shows that these reconstructions become two orders of magnitude smaller and still resemble the original to a high degree, and are easier to speed-read and evaluate for relevance, due to added hyperlinks and a presentation optimized for reading from a terminal.
Abstract: A web portal providing access to over 250,000 scanned and OCRed cultural heritage documents is analyzed. The collection consists of the complete Dutch Hansard from 1917 to 1995. Each document consists of facsimile images of the original pages plus hidden OCRed text. The inclusion of images yields large file sizes of which less than 2% is the actual text. The search user interface of the portal provides poor ranking and not very informative document summaries (snippets). Thus, users are instrumental in weeding out non-relevant results. For that, they must assess the complete documents. This is a time-consuming and frustrating process because of long download and processing times of the large files. Instead of using the scanned images for relevance assessment, we propose to use a reconstruction of the original document from a purely semantic representation. Evaluation on the Dutch dataset shows that these reconstructions become two orders of magnitude smaller and still resemble the original to a high degree. In addition, they are easier to speed-read and evaluate for relevance, due to added hyperlinks and a presentation optimized for reading from a terminal. We describe the reconstruction process and evaluate the costs, the benefits, and the quality.

Journal ArticleDOI
TL;DR: In this paper, the authors measured the dissimilarities among several printed characters of a single page in the Gutenberg 42-line bible, and proved statistically the existence of several different matrices from which the metal types were constructed.
Abstract: We have measured the dissimilarities among several printed characters of a single page in the Gutenberg 42-line bible, and we prove statistically the existence of several different matrices from which the metal types were constructed. This is in contrast with the prevailing theory, which states that only one matrix per character was used in the printing process of Gutenberg’s greatest work. The main mathematical tool for this purpose is cluster analysis, combined with a statistical test for outliers. We carry out the research with two letters, ‘i’ and ‘a’. In the first case, an exact clustering method is employed; in the second, with more specimens to be classified, we resort to an approximate agglomerative clustering method. The results show that the letters form clusters according to their shape, with significant shape differences among clusters, and allow us to conclude, with a very small probability of error, that indeed the metal types used to print them were cast from several different matrices.
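A minimal sketch of the clustering step is given below using SciPy's agglomerative clustering on a precomputed dissimilarity matrix; the dissimilarity values are toy numbers, not the measurements reported in the paper.

```python
# Hedged sketch of the clustering step: agglomerative clustering of glyph
# specimens from a precomputed dissimilarity matrix, then cutting the
# dendrogram to see how many groups (candidate matrices) emerge.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric dissimilarities among 6 specimens of the letter 'i':
# specimens 0-2 resemble one another, as do specimens 3-5.
D = np.array([
    [0.0, 0.1, 0.2, 0.9, 1.0, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.2, 0.1, 0.0, 1.0, 0.8, 0.9],
    [0.9, 0.9, 1.0, 0.0, 0.2, 0.1],
    [1.0, 0.9, 0.8, 0.2, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.1, 0.0],
])

Z = linkage(squareform(D), method="average")       # agglomerative clustering
labels = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram
print(labels)   # two clusters, e.g. [1 1 1 2 2 2], suggesting two matrices
```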

Journal ArticleDOI
TL;DR: This document is a report of the discussions by the participants in the working group on “Noisy Text Datasets” organized during the Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009), held in Barcelona, Spain, on July 23–24, 2009.
Abstract: This document is a report of the discussions by the participants in the working group on “Noisy Text Datasets” organized during the Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009), held in Barcelona, Spain, on July 23–24, 2009.

Journal ArticleDOI
TL;DR: An innovative protocol for characterizing upstream the complementarity of shape descriptors is proposed; its originality is to be as independent of the final application as possible, and it relies on new quantitative and qualitative measures.
Abstract: Most document analysis applications rely on the extraction of shape descriptors, which may be grouped into different categories, each category having its own advantages and drawbacks (O.R. Terrades et al. in Proceedings of ICDAR’07, pp. 227–231, 2007). In order to improve the richness of their description, many authors choose to combine multiple descriptors. Yet, most of the authors who propose a new descriptor content themselves with comparing its performance to the performance of a set of single state-of-the-art descriptors in a specific applicative context (e.g. symbol recognition, symbol spotting...). This results in a proliferation of the shape descriptors proposed in the literature. In this article, we propose an innovative protocol, the originality of which is to be as independent of the final application as possible and which relies on new quantitative and qualitative measures. We introduce two types of measures: while the measures of the first type are intended to characterize the descriptive power (in terms of uniqueness, distinctiveness and robustness towards noise) of a descriptor, the second type of measures characterizes the complementarity between multiple descriptors. Characterizing upstream the complementarity of shape descriptors is an alternative to the usual approach where the descriptors to be combined are selected by trial and error, considering the performance characteristics of the overall system. To illustrate the contribution of this protocol, we performed experimental studies using a set of descriptors and a set of symbols which are widely used by the community namely ART and SC descriptors and the GREC 2003 database.

Journal ArticleDOI
TL;DR: A hybrid model for mining text relations between named entities is presented, which can deal with data highly affected by linguistic noise and is robust to non-conventional languages such as dialects, jargon expressions or coded words typically contained in such text.
Abstract: In this paper, we present models for mining text relations between named entities, which can deal with data highly affected by linguistic noise. Our models are made robust by: (a) the exploitation of state-of-the-art statistical algorithms such as support vector machines (SVMs) along with effective and versatile pattern mining methods, e.g. word sequence kernels; (b) the design of specific features capable of capturing long distance relationships; and (c) the use of domain prior knowledge in the form of ontological constraints, e.g. bounds on the type of relation arguments given by the semantic categories of the involved entities. This property allows the training data required by SVMs to be kept small, consequently lowering the system design costs. We empirically tested our hybrid model in the very complex domain of business intelligence, where the textual data are constituted by reports on investigations into criminal enterprises based on police interrogatory reports, electronic eavesdropping and wiretaps. The target relations are typically established between entities as they are mentioned in these information sources. The experiments on mining such relations show that our approach with small training data is robust to non-conventional languages such as dialects, jargon expressions or coded words typically contained in such text.