Showing papers on "Document layout analysis" published in 2023


Posted ContentDOI
23 Apr 2023
TL;DR: Paragraph2Graph is a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets while remaining adaptable to business scenarios that require a strict, non-overlapping separation of layout components.
Abstract: Document layout analysis has a wide range of requirements across various domains, languages, and business scenarios. However, most current state-of-the-art algorithms are language-dependent, with architectures that rely on transformer encoders or language-specific text encoders, such as BERT, for feature extraction. These approaches are limited in their ability to handle very long documents due to input sequence length constraints and are closely tied to language-specific tokenizers. Additionally, training a cross-language text encoder can be challenging due to the lack of labeled multilingual document datasets that consider privacy. Furthermore, some layout tasks require a clean separation between different layout components without overlap, which can be difficult for image segmentation-based algorithms to achieve. In this paper, we present Paragraph2Graph, a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets while being adaptable to business scenarios with strict separation. With only 19.95 million parameters, our model is suitable for industrial applications, particularly in multi-language scenarios.

Book ChapterDOI
01 Jan 2023
TL;DR: The authors present a fast and efficient method, based on hybrid modeling with graphs and structural analysis, for separating text from non-textual components in document images. It uses the RLSA smoothing algorithm for segmentation and the DTW (Dynamic Time Warping) algorithm to match word features.
Abstract: Libraries contain huge quantities of printed Arabic historical documents. These documents usually contain both text and graphics. To digitize them automatically, it is essential to first separate the text from the graphics and then to apply OCR for text recognition. This separation of text and graphics is a crucial step that conditions the performance of document recognition and indexing systems. It involves identifying the text areas and automatically separating them from the graphical components in a document image. This work presents a fast and efficient method based on hybrid modeling using graphs and structural analysis for separating textual from non-textual components in document images. We propose to use the RLSA smoothing algorithm for segmentation and the DTW (Dynamic Time Warping) algorithm to match word features. Simulation results show the efficiency of the proposed approach.
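
The two algorithms named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `rlsa_row` and `dtw_distance`, the convention that 1 marks ink and 0 marks background, the rule that only gaps bounded by ink on both sides are filled, and the absolute-difference local cost are all assumptions made for illustration. RLSA (run-length smoothing) merges nearby ink pixels into blocks by filling short background runs, and DTW computes an alignment cost between two feature sequences of possibly different lengths.

```python
from typing import List, Sequence


def rlsa_row(row: Sequence[int], threshold: int) -> List[int]:
    """Horizontal run-length smoothing of one binary row (1 = ink, 0 = background).

    Assumption: a background run is filled only if it is bounded by ink on
    both sides and its length is <= threshold; other pixels are unchanged.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    # Fill the short background run between two ink pixels.
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic Time Warping distance between two 1-D feature sequences.

    Assumption: the local cost is the absolute difference of the two values.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] matched again with an earlier b
                cost[i][j - 1],      # b[j-1] matched again with an earlier a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


if __name__ == "__main__":
    print(rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], threshold=2))
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

In a full pipeline the row smoothing would typically be applied to every row and column of a binarized page, and the DTW distance would compare per-word feature profiles; how the paper combines the two is not specified in the abstract.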

Proceedings ArticleDOI
01 Jan 2023
TL;DR: The authors propose an efficient few-shot learning framework for layout analysis of ancient handwritten documents that achieves performance comparable to current state-of-the-art fully supervised methods on the publicly available DIVA-HisDB dataset.
Abstract: Layout analysis is a task of utmost importance in ancient handwritten document analysis and represents a fundamental step toward the simplification of subsequent tasks such as optical character recognition and automatic transcription. However, many of the approaches adopted to solve this problem rely on a fully supervised learning paradigm. While these systems achieve very good performance on this task, the drawback is that pixel-precise text labeling of the entire training set is a very time-consuming process, which makes this type of information rarely available in a real-world scenario. In the present paper, we address this problem by proposing an efficient few-shot learning framework that achieves performance comparable to current state-of-the-art fully supervised methods on the publicly available DIVA-HisDB dataset.

Posted ContentDOI
03 Jun 2023
TL;DR: The authors propose an end-to-end framework for offline processing of handwritten semi-structured documents and benchmark it on their FIR dataset, which is more challenging than most existing document analysis datasets because it combines a wide variety of handwritten text with printed text.
Abstract: State-of-the-art offline Optical Character Recognition (OCR) frameworks perform poorly on semi-structured handwritten domain-specific documents due to their inability to localize and label form fields with domain-specific semantics. Existing techniques for semi-structured document analysis have primarily used datasets comprising invoices, purchase orders, receipts, and identity-card documents for benchmarking. In this work, we build the first semi-structured document analysis dataset in the legal domain by collecting a large number of First Information Report (FIR) documents from several police stations in India. This dataset, which we call the FIR dataset, is more challenging than most existing document analysis datasets, since it combines a wide variety of handwritten text with printed text. We also propose an end-to-end framework for offline processing of handwritten semi-structured documents, and benchmark it on our novel FIR dataset. Our framework uses an encoder-decoder architecture for localizing and labelling the form fields and for recognizing the handwritten content. The encoder consists of Faster-RCNN and Vision Transformers. Further, the Transformer-based decoder is trained with a domain-specific tokenizer. We also propose a post-correction method to handle recognition errors pertaining to the domain-specific terms. Our proposed framework achieves state-of-the-art results on the FIR dataset, outperforming several existing models.

Journal ArticleDOI
TL;DR: The authors propose a novel framework for semantic layout analysis and characterization of handwritten manuscripts. It enables the derivation of implicit information and semantic characteristics that can be used in many practical applications.
Abstract: A document layout can be more informative than merely a document’s visual and structural appearance. Thus, document layout analysis (DLA) is considered a necessary prerequisite for advanced processing and detailed document image analysis to be further used in several applications and different objectives. This research extends the traditional approaches of DLA and introduces the concept of semantic document layout analysis (SDLA) by proposing a novel framework for semantic layout analysis and characterization of handwritten manuscripts. The proposed SDLA approach enables the derivation of implicit information and semantic characteristics, which can be effectively utilized in dozens of practical applications for various purposes, in a way bridging the semantic gap and providing more understandable high-level document image analysis and more invariant characterization via absolute and relative labeling. This approach is validated and evaluated on a large dataset of Arabic handwritten manuscripts comprising complex layouts. The experimental work shows promising results in terms of accurate and effective semantic characteristic-based clustering and retrieval of handwritten manuscripts. It also indicates the expected efficacy of using the capabilities of the proposed approach in automating and facilitating many functional, real-life tasks such as effort estimation and pricing of transcription or typing of such complex manuscripts.