A Data Mining Approach to Reading Order Detection

doi:10.1109/ICDAR.2007.4377050

Proceedings Article

Michelangelo Ceci, +3 more

Vol. 2, pp. 924-928

TLDR

This paper investigates the problem of detecting the reading order of layout components using a data mining approach that acquires domain-specific knowledge from a set of training examples and induces a probabilistic classifier, based on the Bayesian framework, which is used to reconstruct either single or multiple chains of layout components.

Abstract:

Determining the reading order of layout components extracted from a document image can be a crucial problem for several applications. It enables the reconstruction of a single textual element from texts associated with multiple layout components and makes both information extraction and content-based retrieval of documents more effective. A common trait of the methods reported in the literature is that they depend strongly on the specific domain and are scarcely reusable when the class of documents or the task at hand changes. In this paper, we investigate the problem of detecting the reading order of layout components by means of a data mining approach that acquires domain-specific knowledge from a set of training examples. The input to the learning method is a description of the "chains" of layout components defined by the user. Only spatial information is used to describe a chain, so the proposed approach is also applicable when no text can be associated with a layout component. The method induces a probabilistic classifier based on the Bayesian framework, which is used to reconstruct either single or multiple chains of layout components. It has been evaluated on a set of document images.
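To make the idea concrete, the following is a minimal sketch of learning a reading order from user-defined chains using spatial information only. It is an illustration under stated assumptions, not the method of the paper: the `Block` schema, the two discretised spatial features, the naive Bayes formulation with Laplace smoothing, and the greedy reconstruction step are all hypothetical choices, and the sketch rebuilds a single chain only.

```python
from collections import Counter
from itertools import permutations
from typing import NamedTuple


class Block(NamedTuple):
    """A layout component described only by its bounding box (hypothetical schema)."""
    name: str
    x: float  # left edge
    y: float  # top edge (y grows downward)
    w: float
    h: float


def spatial_features(a: Block, b: Block) -> tuple[str, str]:
    """Discretise where b lies relative to a (illustrative features, assumed here)."""
    horiz = "right" if b.x >= a.x + a.w else "left" if b.x + b.w <= a.x else "overlap"
    vert = "below" if b.y >= a.y + a.h else "above" if b.y + b.h <= a.y else "overlap"
    return horiz, vert


class PairwiseNaiveBayes:
    """Estimates P(b directly follows a | spatial features) with naive Bayes."""

    def __init__(self) -> None:
        self.class_counts = Counter()
        self.feature_counts = Counter()

    def fit(self, chains: list[list[Block]]) -> None:
        # Each training chain lists the blocks of one page in reading order.
        for chain in chains:
            for i, j in permutations(range(len(chain)), 2):
                label = j == i + 1  # positive only for direct successors
                self.class_counts[label] += 1
                for k, value in enumerate(spatial_features(chain[i], chain[j])):
                    self.feature_counts[label, k, value] += 1

    def prob_follows(self, a: Block, b: Block) -> float:
        total = sum(self.class_counts.values())
        scores = {}
        for label in (True, False):
            # Laplace smoothing: 2 classes, 3 possible values per feature.
            p = (self.class_counts[label] + 1) / (total + 2)
            for k, value in enumerate(spatial_features(a, b)):
                p *= (self.feature_counts[label, k, value] + 1) / (self.class_counts[label] + 3)
            scores[label] = p
        return scores[True] / (scores[True] + scores[False])


def reconstruct_chain(model: PairwiseNaiveBayes, blocks: list[Block]) -> list[Block]:
    """Greedy reconstruction: start at the top-left block, then repeatedly
    append the remaining block most likely to follow the last one."""
    remaining = list(blocks)
    chain = [min(remaining, key=lambda b: (b.y, b.x))]
    remaining.remove(chain[0])
    while remaining:
        nxt = max(remaining, key=lambda b: model.prob_follows(chain[-1], b))
        remaining.remove(nxt)
        chain.append(nxt)
    return chain


# Training example: a two-column page read column by column (made-up coordinates).
training_chain = [
    Block("A", 0, 0, 40, 40), Block("B", 0, 50, 40, 40),
    Block("C", 50, 0, 40, 40), Block("D", 50, 50, 40, 40),
]
model = PairwiseNaiveBayes()
model.fit([training_chain])

# Unseen page with a similar layout, given in arbitrary order.
page = [
    Block("S", 60, 30, 30, 20), Block("Q", 10, 30, 30, 20),
    Block("R", 60, 5, 30, 20), Block("P", 10, 5, 30, 20),
]
order = [b.name for b in reconstruct_chain(model, page)]
```

Because the features are purely geometric, the same sketch runs on components that carry no text at all, which mirrors the property the abstract highlights. Handling multiple chains per page would need an additional step that this sketch leaves out.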

Citations

XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

Document layout analysis and reading order determination for a reading robot

Ranking Sentences for Keyphrase Extraction: A Relational Data Mining Approach

Learning to order: a relational approach

LayoutReader: Pre-training of Text and Layout for Reading Order Detection
