Proceedings Article

A Data Mining Approach to Reading Order Detection

TLDR
This paper investigates the problem of detecting the reading order of layout components with a data mining approach that acquires domain-specific knowledge from a set of training examples and induces a probabilistic classifier, based on the Bayesian framework, which is used to reconstruct either single or multiple chains of layout components.
Abstract
Determining the reading order of layout components extracted from a document image can be a crucial problem for several applications. It enables the reconstruction of a single textual element from the texts associated with multiple layout components, and it makes both information extraction and content-based retrieval of documents more effective. A common aspect of all methods reported in the literature is that they depend strongly on the specific domain and are scarcely reusable when the classes of documents or the task at hand change. In this paper, we investigate the problem of detecting the reading order of layout components by resorting to a data mining approach that acquires the domain-specific knowledge from a set of training examples. The input of the learning method is the description of the "chains" of layout components defined by the user. Only spatial information is exploited to describe a chain, which makes the proposed approach also applicable to cases in which no text can be associated with a layout component. The method induces a probabilistic classifier based on the Bayesian framework, which is used for reconstructing either single or multiple chains of layout components. It has been evaluated on a set of document images.
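
The abstract does not give the feature set or the reconstruction procedure, so the following is only a minimal sketch of the general idea under stated assumptions: layout components are hypothetical `(x, y, width, height)` boxes, each ordered pair is described by three invented spatial features, a naive Bayes model with Laplace smoothing estimates how likely one component is to directly follow another, and a single chain is rebuilt greedily. The names `spatial_features`, `PairwiseNaiveBayes`, and `reconstruct_chain` are illustrative and do not come from the paper; the multiple-chain case is not covered.

```python
from collections import defaultdict
from itertools import permutations
from math import exp, hypot, log

# Number of possible values for each assumed feature (horizontal, vertical, distance).
ARITIES = (3, 3, 2)


def spatial_features(a, b):
    """Discretized position of box b relative to box a (an assumed feature set)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    horiz = "right" if bx >= ax + aw else "left" if bx + bw <= ax else "overlap"
    vert = "below" if by >= ay + ah else "above" if by + bh <= ay else "overlap"
    # Centre-to-centre distance, scaled by the taller box; 2.5 is an arbitrary cut-off.
    gap = hypot((bx + bw / 2) - (ax + aw / 2), (by + bh / 2) - (ay + ah / 2))
    dist = "near" if gap / max(ah, bh) < 2.5 else "far"
    return (horiz, vert, dist)


class PairwiseNaiveBayes:
    """Estimates P(b directly follows a | spatial features of the pair)."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # key: (label, feature index, value)

    def fit(self, chains):
        # Each chain is a list of boxes in the reading order defined by the user.
        for chain in chains:
            successors = set(zip(chain, chain[1:]))
            for a, b in permutations(chain, 2):
                label = (a, b) in successors
                self.class_counts[label] += 1
                for i, value in enumerate(spatial_features(a, b)):
                    self.feature_counts[(label, i, value)] += 1
        return self

    def prob_follows(self, a, b):
        total = sum(self.class_counts.values())
        scores = {}
        for label in (True, False):
            # Log prior and log likelihoods, all with Laplace smoothing.
            score = log((self.class_counts[label] + 1) / (total + 2))
            for i, value in enumerate(spatial_features(a, b)):
                num = self.feature_counts[(label, i, value)] + 1
                score += log(num / (self.class_counts[label] + ARITIES[i]))
            scores[label] = score
        top = max(scores.values())
        weights = {k: exp(v - top) for k, v in scores.items()}
        return weights[True] / (weights[True] + weights[False])


def reconstruct_chain(model, components):
    """Greedy single-chain reconstruction (one possible strategy, not the paper's)."""
    remaining = list(components)
    # Start from the component least likely to be anyone's successor.
    first = min(remaining, key=lambda c: max(
        (model.prob_follows(o, c) for o in remaining if o != c), default=0.0))
    chain = [first]
    remaining.remove(first)
    while remaining:
        nxt = max(remaining, key=lambda c: model.prob_follows(chain[-1], c))
        chain.append(nxt)
        remaining.remove(nxt)
    return chain


# Toy usage: train on one single-column page, then order a shuffled page.
training_chain = [(0, 0, 100, 20), (0, 30, 100, 20), (0, 60, 100, 20)]
model = PairwiseNaiveBayes().fit([training_chain])
shuffled_page = [(10, 40, 80, 10), (10, 0, 80, 10), (10, 20, 80, 10)]
print(reconstruct_chain(model, shuffled_page))
```

Because the features are purely geometric, the sketch never looks at text, which mirrors the abstract's point that the approach remains applicable when no text can be associated with a layout component. A greedy walk is only one way to turn pairwise probabilities into an ordering; it is used here for brevity.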


Citations
Proceedings Article

XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

TL;DR: A robust layout-aware multimodal network named XYLayoutLM is proposed to capture and leverage rich layout information from proper reading orders produced by the Augmented XY Cut and achieves competitive results on document understanding tasks.
Proceedings Article

Document layout analysis and reading order determination for a reading robot

TL;DR: The proposed algorithm is applied to a large number of document images, and the experimental results show that it enables the reading robot to read paper documents in different languages, even those with complex layout structure.
Journal Article

Ranking Sentences for Keyphrase Extraction: A Relational Data Mining Approach

TL;DR: A probabilistic relational data mining method is presented to model preference relations on sentences of document images and this is used to rank the sentences which will form the final summary.
Book Chapter

Learning to order: a relational approach

TL;DR: This work assumes that the correct succession of elements in a training sequence is given, so that it is possible to induce the definition of two predicates, first/1 and succ/2, which are then used to establish an ordering relationship.
Posted Content

LayoutReader: Pre-training of Text and Layout for Reading Order Detection

TL;DR: The dataset introduced by the authors contains reading order, text, and layout information for 500,000 document images covering a wide spectrum of document types; in their experiments, LayoutReader performs almost perfectly in reading order detection and significantly improves the results of both open-source and commercial OCR engines.