Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Proceedings Article
31 Aug 2005
TL;DR: A filter-based method organizes the input features into clusters, which makes it possible to find a good subset of input features during each cycle and thereby reduces computation.
Abstract: The purpose of this work is to develop a pattern recognition system that simulates human vision. A transparent neural network with context returns is used. The context returns consist in using global vision to correct local vision (i.e., input data are corrected according to the neural network outputs). To avoid computing all the input features during these context returns, a filter-based method was designed to organize the features into clusters. This makes it possible to find a good subset of input features during each cycle, which reduces computation. The value of the method is demonstrated on logical document structure retrieval.

5 citations

Journal Article
04 Aug 2022 - Signals
TL;DR: The paper introduces VML-AHTE, a new challenging dataset of Arabic historical manuscripts with numerous diacritics, and shows that the presented Mask R-CNN-based method can successfully segment text lines even in such a challenging scenario.
Abstract: Text line extraction is an essential preprocessing step in many handwritten document image analysis tasks. It includes detecting text lines in a document image and segmenting the regions of each detected line. Deep learning-based methods are frequently used for text line detection. However, only a limited number of methods tackle the problems of detection and segmentation together. This paper proposes a holistic method that applies Mask R-CNN for text line extraction. A Mask R-CNN model is trained to extract text line fragments from document patches, which are then merged to form the text lines of an entire page. The presented method was evaluated on two well-known datasets of historical documents, DIVA-HisDB and ICDAR 2015-HTR, and achieved state-of-the-art results. In addition, we introduce a new challenging dataset of Arabic historical manuscripts, VML-AHTE, where numerous diacritics are present. We show that the presented Mask R-CNN-based method can successfully segment text lines, even in such a challenging scenario.

5 citations

Proceedings Article
01 Nov 2017
TL;DR: A novel technique for layout analysis of documents with complex Manhattan layouts that requires only one parameter, the number of Gaussians fitted to the height histogram data, and is therefore easy to automate and adapt to many documents.
Abstract: This paper proposes a novel technique for layout analysis of documents with complex Manhattan layouts. The technique is designed for Indic script newspapers but also works on many other types of documents with Manhattan layouts, not necessarily in Indic scripts. The main idea behind the algorithm is to categorise the physical elements of a document into noise, text, titles and graphics based on their heights. A histogram of heights is computed from the bounding boxes of connected components, and a multi-Gaussian fit is used to discover optimal split points between the categories. The Gaussian with the highest peak is assumed to correspond to running text. Running text regions are grouped into blocks using nearest neighbour analysis. These initial regions are further refined using a second-level classification of the other elements into graphics, light-coloured text on a dark background, and graphical separators. The resulting layouts show accuracies comparable to some of the best and most popular algorithms, such as MHS (winner of the ICDAR-RDCL2015 competition) and PRImA's Aletheia (a tool developed by the PRImA Research Lab). Its performance is shown by testing on many Indic script newspapers and other documents, and by comparison with Aletheia and MHS on the ICDAR dataset. Our initial results on an Indic document dataset show high performance in identifying running text (> 98%), with an accuracy of 82% on identifying the other elements. Ground truth data for the Indic script newspaper documents is being generated for more extensive quantitative testing. The strength of our algorithm is that it requires only one parameter, the number of Gaussians fitted to the height histogram data, and is therefore easy to automate and adapt to many documents.
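
The height-based categorisation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (fit_height_mixture, running_text_component, classify_height), the widest-gap initialisation, the plain EM loop, and the sample heights are all assumptions made for illustration. It fits a one-dimensional Gaussian mixture to connected-component heights, takes the component with the highest peak as running text, and assigns a height to its most likely component.

```python
import math

def fit_height_mixture(heights, k, iters=50, min_var=1e-3):
    """Fit a k-component 1D Gaussian mixture to heights with plain EM.

    Hypothetical helper: the initialisation (cutting the sorted heights at
    the k-1 widest gaps) is an assumption, not taken from the paper.
    """
    xs = sorted(heights)
    n = len(xs)
    cuts = sorted(range(1, n), key=lambda i: xs[i] - xs[i - 1], reverse=True)[:k - 1]
    bounds = [0] + sorted(cuts) + [n]
    weights, means, variances = [], [], []
    for lo, hi in zip(bounds, bounds[1:]):
        seg = xs[lo:hi]
        m = sum(seg) / len(seg)
        weights.append(len(seg) / n)
        means.append(m)
        variances.append(max(sum((x - m) ** 2 for x in seg) / len(seg), min_var))
    for _ in range(iters):
        # E-step: responsibility of each component for each height.
        resp = []
        for x in xs:
            p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances

def running_text_component(weights, variances):
    """Index of the component whose weighted density has the highest peak."""
    peaks = [w / math.sqrt(2 * math.pi * v) for w, v in zip(weights, variances)]
    return max(range(len(peaks)), key=peaks.__getitem__)

def classify_height(h, weights, means, variances):
    """Index of the most likely component for height h (log-domain scores)."""
    scores = [math.log(w) - (h - m) ** 2 / (2 * v) - 0.5 * math.log(2 * math.pi * v)
              for w, m, v in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up component heights in pixels: a few specks, many text glyphs, a few title glyphs.
sample_heights = [2, 3, 2, 3, 2] + [11, 12, 13] * 20 + [38, 40, 42, 41, 39]
w, m, v = fit_height_mixture(sample_heights, k=3)
text_idx = running_text_component(w, v)
print("running-text component mean height:", round(m[text_idx], 2))
```

The only value a caller has to choose here is k, the number of Gaussians, which mirrors the single parameter the abstract mentions. The boundaries between neighbouring components then act as the split points between categories.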

5 citations

Patent
29 Mar 2005
TL;DR: The copying machine 1 specifies a character carrying portion in the image data outputted by the scanner, performs character recognition processing to generate first character information on the basis of the recognized character data, and compares the first character information with second character information, i.e. information on characters included in a document to be controlled that is referred to from document management index data 18b, thereby deciding whether the document and the document to be controlled have similar document contents.
Abstract: PROBLEM TO BE SOLVED: To provide an information processor that can suitably restrain copying processing and FAX processing of various kinds of documents, including general documents generated and used in an office, as well as an information processing method and a program therefor. SOLUTION: A scanner 16 reads the document and outputs image data. A copying machine 1 specifies a character carrying portion in the image data outputted by the scanner, performs character recognition processing to generate first character information on the basis of the recognized character data, and compares the first character information with second character information, i.e. information on characters included in a document to be controlled that is referred to from document management index data 18b, thereby deciding whether the document and the document to be controlled have similar document contents. Then the copying machine 1 specifies a document to be controlled which has document contents similar to those of the document in accordance with the decision, and refers to the control information of the specified document to be controlled in the document management index data 18b to decide whether document processing in response to a request to process the document is performed. COPYRIGHT: (C)2007,JPO&INPIT

5 citations

Proceedings Article
23 Sep 2007
TL;DR: This paper presents its work on automatically locating charts from document pages, which is an important stage in the chart image recognition and understanding system currently being developed, and proposes a set of simple statistical features for building the classifier.
Abstract: This paper presents our work on automatically locating charts from document pages, which is an important stage in our chart image recognition and understanding system currently being developed. To achieve this, there are two sub-goals to be reached: locating figure blocks in a given document image, and building a classifier to differentiate charts from non-chart figures. For the first sub-goal, besides traditional logical block labelling, relevant text blocks such as text descriptions and labels in a figure must be included in the located figure blocks to facilitate the interpretation processes in the following stages. For the second sub-goal, we propose a set of simple statistical features for building the classifier. We tested our system with the entire collection of scanned journal pages in the University of Washington database I. The experimental results are discussed in this paper.

5 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9