scispace - formally typeset
Topic

Document layout analysis

About: Document layout analysis is a research topic. Over the lifetime, 1462 publications have been published within this topic receiving 34021 citations.


Papers
Journal ArticleDOI
TL;DR: To achieve robustness and avoid brittleness when applying the system to real-life document images, the notion of a thick boundary interpretation for a qualitative relation is introduced.

15 citations

Patent
06 Nov 2008
TL;DR: A method in a document analysis system automatically extracts image and text features from each received electronic document; the image features indicate how the document is laid out or textually organized and are therefore indicative of a corresponding document category.
Abstract: A method in a document analysis system automatically extracts from each received electronic document image and text features, in which the image features are indicative of how the document is laid out or textually-organized and therefore indicative of a corresponding document category, next compares the extracted image and text features with feature sets associated with each document category, and then classifies each document to a document category, the feature set of which best matches the extracted features of the document.

15 citations

Proceedings ArticleDOI
23 Sep 2007
TL;DR: A flexible and effective example-based approach for labeling title pages which can be used for automated extraction of bibliographic data and has equivalent and partially better performance when compared to other more complex labeling methods known from the literature.
Abstract: This paper presents a flexible and effective example-based approach for labeling title pages which can be used for automated extraction of bibliographic data. The labels of interest are "title", "author", "abstract" and "affiliation". The method takes a set of labeled document layouts and a single unlabeled document layout as input and finds the best matching layout in the set. The labels of this layout are used to label the new layout. The similarity measure for layouts combines structural layout similarity and textural similarity on the block-level. Experimental results yield accuracy rates from 94.8% to 99.6% obtained on the publicly available MARG dataset. This shows that our lightweight method has equivalent and partially better performance when compared to other more complex labeling methods known from the literature.

15 citations
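
The example-based labeling idea in the abstract above (find the best matching labeled layout, then reuse its labels) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: blocks are represented as (x0, y0, x1, y1) boxes, and the similarity is a plain mean of best-overlap (intersection over union) scores. The paper's actual measure combines structural and textural similarity and is not reproduced here; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def layout_similarity(unlabeled, labeled):
    """Assumed similarity: mean of each unlabeled block's best IoU in `labeled`.

    `unlabeled` is a list of boxes; `labeled` is a list of (box, label) pairs.
    """
    if not unlabeled or not labeled:
        return 0.0
    total = sum(max(iou(box, ref_box) for ref_box, _ in labeled) for box in unlabeled)
    return total / len(unlabeled)


def label_by_example(unlabeled, examples):
    """Pick the most similar labeled layout and copy labels from its closest blocks.

    `examples` must be a non-empty list of labeled layouts.
    """
    best = max(examples, key=lambda layout: layout_similarity(unlabeled, layout))
    if not best:
        return [(box, None) for box in unlabeled]
    result = []
    for box in unlabeled:
        _, ref_label = max(best, key=lambda item: iou(box, item[0]))
        result.append((box, ref_label))
    return result


# Toy data, invented for this sketch.
examples = [
    [((0, 0, 10, 2), "title"), ((0, 3, 10, 4), "author")],
    [((0, 0, 5, 5), "abstract")],
]
new_page = [(0, 0, 10, 2), (0, 3, 10, 4)]
labels = [label for _, label in label_by_example(new_page, examples)]
```

With the toy data, the first example layout overlaps the new page exactly, so its labels are transferred block by block. A real system would replace the overlap score with a similarity measure that also accounts for block content.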

Proceedings ArticleDOI
18 Apr 2020
TL;DR: HJDataset is a large-scale dataset of historical Japanese documents with complex layouts; it contains over 250,000 layout element annotations of seven types, with bounding boxes and masks of the content regions as well as hierarchical structures and reading orders for the layout elements.
Abstract: Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exist for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts. A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models. And we demonstrate the usefulness of the dataset on real-world document digitization tasks. The dataset is available at https://dell-research-harvard.github.io/HJDataset/.

15 citations

Proceedings ArticleDOI
22 Dec 2011
TL;DR: A modified RLSA, called Spiral Run Length Smearing Algorithm (SRLSA), is applied to suppress non-text components in handwritten document images; the components are then classified into text/non-text groups using a Support Vector Machine (SVM) classifier.
Abstract: Document layout analysis is a pre-processing step to convert handwritten/printed documents into electronic form through an Optical Character Recognition (OCR) system. Handwritten documents are usually unstructured, i.e., they do not have a specific layout, and most documents may contain some non-text regions, e.g., graphs, tables, diagrams, etc. Therefore, such documents cannot be directly given as input to the OCR system without suppressing the non-text regions in the documents. The traditional Run Length Smoothing Algorithm (RLSA) does not produce good results for handwritten document pages, since the text components in them have lower pixel density than those in printed text. In the present work, a modified RLSA, called Spiral Run Length Smearing Algorithm (SRLSA), is applied to suppress the non-text components from text ones in handwritten document images. The components in the document pages are then classified into text/non-text groups using a Support Vector Machine (SVM) classifier. The method shows a success rate of 83.3% on a dataset of 3000 components.

15 citations
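
For readers unfamiliar with the traditional RLSA that the abstract above builds on, the following is a minimal sketch of run length smoothing on a binary image (1 = foreground, 0 = background). It is not the paper's spiral variant (SRLSA), which is not described in enough detail here to reproduce. The sketch assumes that only background runs bounded by foreground on both sides are filled, and it combines the horizontal and vertical passes with a logical AND; the function names and thresholds are illustrative.

```python
def rlsa_row(row, threshold):
    """Fill runs of 0s no longer than `threshold` that lie between two 1s."""
    out = list(row)
    last_one = None  # index of the most recent foreground pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_one is not None:
                gap = i - last_one - 1
                if 0 < gap <= threshold:
                    for j in range(last_one + 1, i):
                        out[j] = 1
            last_one = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply the row smoothing to every row of the image."""
    return [rlsa_row(row, threshold) for row in image]


def rlsa_vertical(image, threshold):
    """Apply the same smoothing along columns by transposing twice."""
    columns = [list(col) for col in zip(*image)]
    smeared = [rlsa_row(col, threshold) for col in columns]
    return [list(row) for row in zip(*smeared)]


def rlsa(image, h_threshold, v_threshold):
    """Combine the horizontal and vertical passes with a pixelwise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]


smoothed_row = rlsa_row([1, 0, 0, 1, 0, 0, 0, 1], 2)
```

In the example row, the gap of two background pixels is filled while the gap of three is left open, which is how nearby characters merge into word or line blobs while wider gaps keep separate regions apart. Lower pixel density, as in the handwritten pages mentioned in the abstract, means larger gaps, which is one way to see why a fixed threshold can struggle there.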


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  5
2022  19
2021  34
2020  19
2019  14
2018  9