Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Patent
Kenichi Okihara
01 Oct 2008
TL;DR: A watermark information embedding apparatus generates a document image from input electronic document data, modifies the electronic document data based upon the document image, and embeds information in the modified electronic document data.
Abstract: A watermark information embedding apparatus generates a document image from electronic document data that has been input thereto, modifies the electronic document data based upon the document image and embeds information in the electronic document data. The apparatus includes a document image generator for generating a document image from the electronic document data; a document analyzer for detecting layout information of each constituent image in the generated document image; a normalization information calculation unit for calculating normalization information, which is for normalizing placement of each constituent image, based upon the detected layout information; a modification unit for modifying the electronic document data; and an embedding unit for embedding information in the modified electronic document data.

5 citations

Journal Article
TL;DR: Experiments show that, using the proposed methods, HTML documents can be generated with the same visual layout as the corresponding document images, and a structured table of contents page can also be produced, with the hierarchically ordered section titles hyperlinked to the contents.

5 citations

Proceedings Article
01 Dec 2013
TL;DR: This paper proposes a generalized scheme for detection and removal of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves from a scanned document page.
Abstract: The performance of an OCR system is badly affected by the presence of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves. Such annotation lines are usually drawn free-hand by a reader in order to summarize some text or to mark the keywords within a document page. In this paper, we propose a generalized scheme for detection and removal of these hand-drawn annotations from a scanned document page. An underline drawn by hand is roughly horizontal or has a tolerable undulation, whereas for a hand-drawn curved line, the slope usually changes at a gradual pace. Based on this observation, we detect the cover of an annotation object, be it straight or curved, as a sequence of straight edge segments. The novelty of the proposed method lies in its ability to compute the exact cover of the annotation object, even when it touches or passes through any text character. After getting the annotation cover, an effective method of inpainting is used to quantify the regions where text reconstruction is needed. We have experimented with various documents written in English, and some results are presented here to show the efficiency and robustness of the proposed method.

5 citations

Proceedings Article
Zongyi Liu, Ray Smith
25 Aug 2013
TL;DR: This paper presents an equation detector built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, and it has been built into the open source Tesseract that can be accessed and used by the OCR community.
Abstract: Detecting equation regions from scanned books has received attention in the document image research community in the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so we cannot simply use text lines to model them. On the other hand, these regions consist of text symbols that can be reflowed, so OCR engines should parse them instead of rasterizing them like image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required; (ii) it has been built into the open source Tesseract, where it can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books, magazines, and newspapers in over thirty languages. We show that Tesseract performance is improved after enabling the detector.

5 citations
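
The Liu and Smith abstract above describes detection driven by the density of special symbols, with no extra classifier. The following is only a minimal sketch of that general idea applied to plain text lines, not the Tesseract implementation: the symbol set `MATH_SYMBOLS`, the function names, and the threshold of 0.2 are all assumptions made for illustration, since the abstract does not specify them.

```python
# Hypothetical symbol set; the abstract does not say which symbols count as "special".
MATH_SYMBOLS = set("=+-*/^_<>()[]{}|∑∫√±×÷≤≥≠∞∂∈")


def special_symbol_density(line: str) -> float:
    """Fraction of non-whitespace characters in `line` that are special symbols."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in MATH_SYMBOLS for c in chars) / len(chars)


def is_equation_candidate(line: str, threshold: float = 0.2) -> bool:
    """Flag a line as a possible equation when its symbol density reaches the threshold.

    The default threshold is an illustrative guess, not a value from the paper.
    """
    return special_symbol_density(line) >= threshold


if __name__ == "__main__":
    for text in ["x = (a + b) / 2", "The quick brown fox"]:
        print(f"{text!r}: density={special_symbol_density(text):.2f}, "
              f"candidate={is_equation_candidate(text)}")
```

A real detector would work on recognized symbols and their layout within a page rather than on raw strings, so this sketch only conveys why a density measure can stand in for a trained classifier.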

Proceedings Article
Fabrice Matulic
01 Nov 2008
TL;DR: A technique to represent a document as a selection of its most eye-catching pages, intended as part of a document catalogue system and user interface, in which multiple page thumbnails are shown for each document.
Abstract: Document summarization is a task which is difficult to perform automatically, especially if the document is only available as raw pixel data. This paper presents a technique to represent a document as a selection of its most eye-catching pages. The algorithm looks for salient features such as illustrations, diagrams, large titles, headings etc. that cause a page to stand out and ranks its conspicuousness according to the colour, size and number of such elements. A filter function can also be applied to introduce some spread in the selection process, if desired, in order to avoid cases where the extracted pages are too close to each other. The algorithm is intended as part of a document catalogue system and user interface, in which multiple page thumbnails are shown for each document. The aim is to broaden and enrich a document's visual profile beyond the traditional front cover icon and generally to increase its appeal to potential readers during their browsing experience.

5 citations
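
The Matulic abstract above mentions ranking pages by conspicuousness and optionally filtering the selection so that chosen pages are not too close together. As a rough sketch under stated assumptions, the selection step could look like the code below. It assumes per-page saliency scores have already been computed; the function name `select_pages`, the greedy strategy, and the `min_gap` parameter are hypothetical stand-ins, not the paper's actual filter function.

```python
from typing import List, Sequence


def select_pages(scores: Sequence[float], k: int, min_gap: int = 0) -> List[int]:
    """Pick up to k page indices, highest score first.

    A page is skipped when it lies within `min_gap` pages of one already
    chosen, which spreads the selection across the document. With
    min_gap=0 this reduces to a plain top-k selection.
    """
    # Rank page indices by descending score (ties keep document order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in ranked:
        if len(chosen) >= k:
            break
        if all(abs(i - j) > min_gap for j in chosen):
            chosen.append(i)
    # Return thumbnails in reading order.
    return sorted(chosen)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]  # made-up saliency values
    print(select_pages(scores, k=3))             # plain top-3
    print(select_pages(scores, k=3, min_gap=1))  # with spread
```

With these made-up scores, the spread filter trades the second-highest page for a lower-scoring one further away, which is the behaviour the abstract describes as avoiding extracted pages that are too close to each other.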


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9