
Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have been published within this topic, receiving 34021 citations.


Papers published on a yearly basis (chart; years shown: 1969, 1982-2023)

Papers


Patent•

Information processing apparatus and method, computer program and computer-readable recording medium


Kenichi Okihara¹•Institutions (1)

Canon Inc.¹

01 Oct 2008

TL;DR: A watermark information embedding apparatus generates a document image from input electronic document data, modifies the electronic document data based upon the document image, and embeds information in the modified data.


Abstract: A watermark information embedding apparatus generates a document image from electronic document data that has been input thereto, modifies the electronic document data based upon the document image and embeds information in the electronic document data. The apparatus includes a document image generator for generating a document image from the electronic document data; a document analyzer for detecting layout information of each constituent image in the generated document image; a normalization information calculation unit for calculating normalization information, which is for normalizing placement of each constituent image, based upon the detected layout information; a modification unit for modifying the electronic document data; and an embedding unit for embedding information in the modified electronic document data.


5 citations

Journal Article•DOI•

Automatic generation of structured hyperdocuments from document images


Jiyeon Lee¹, Jeong-Seon Park¹, Hyeran Byun², Jongsub Moon¹, Seong-Whan Lee¹•Institutions (2)

Korea University¹, Yonsei University²

01 Feb 2002-Pattern Recognition

TL;DR: Experiments show that the proposed methods can generate HTML documents with the same visual layout as the source document images, along with a structured table-of-contents page whose hierarchically ordered section titles are hyperlinked to the contents.


5 citations

Proceedings Article•DOI•

Removal of hand-drawn annotation lines from document images by digital-geometric analysis and inpainting


Sanjoy Pratihar¹, Partha Bhowmick¹, Shamik Sural¹, Jayanta Mukhopadhyay¹•Institutions (1)

Indian Institute of Technology Kharagpur¹

01 Dec 2013

TL;DR: This paper proposes a generalized scheme for detection and removal of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves from a scanned document page.


Abstract: Performance of an OCR system is badly affected due to presence of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves. Such annotation lines are drawn by a reader usually in free hand in order to summarize some text or to mark the keywords within a document page. In this paper, we propose a generalized scheme for detection and removal of these hand-drawn annotations from a scanned document page. An underline drawn by hand is roughly horizontal or has a tolerable undulation, whereas for a hand-drawn curved line, the slope usually changes at a gradual pace. Based on this observation, we detect the cover of an annotation object, be it straight or curved, as a sequence of straight edge segments. The novelty of the proposed method lies in its ability to compute the exact cover of the annotation object, even when it touches or passes through any text character. After getting the annotation cover, an effective method of inpainting is used to quantify the regions where text reconstruction is needed. We have done our experimentation with various documents written in English, and some results are presented here to show the efficiency and robustness of the proposed method.


5 citations

Proceedings Article•DOI•

A Simple Equation Region Detector for Printed Document Images in Tesseract


Zongyi Liu¹, Ray Smith¹•Institutions (1)

Google¹

25 Aug 2013

TL;DR: This paper presents an equation detector built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, and it has been built into the open source Tesseract that can be accessed and used by the OCR community.


Abstract: Detecting equation regions from scanned books has received attention in the document image research community in the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so we cannot simply use text lines to model them. On the other hand, these regions consist of text symbols that can be reflowed, so that the OCR engines should parse them instead of rasterizing them like image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, (ii) it has been built into the open source Tesseract that can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books/magazines/newspapers of over thirty languages. We show that Tesseract performance is improved after enabling the detector.


5 citations
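
The "density of special symbols" idea in the equation-detector abstract above can be illustrated with a minimal sketch. It is not the Tesseract implementation: it works on already-recognized text lines rather than on page blobs, and the symbol set, the `threshold` value, and the function names are all assumptions made for illustration, since the abstract does not specify them.

```python
import unicodedata

# Hypothetical set of ASCII operators and brackets treated as "special".
# The abstract does not list the symbols the real detector uses.
MATH_CHARS = set("=+-*/^_<>|(){}[]")


def is_special(ch: str) -> bool:
    """True for the ASCII symbols above or any Unicode math symbol (category Sm)."""
    return ch in MATH_CHARS or unicodedata.category(ch) == "Sm"


def special_density(line: str) -> float:
    """Fraction of non-whitespace characters in the line that are special symbols."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_special(c) for c in chars) / len(chars)


def looks_like_equation(line: str, threshold: float = 0.2) -> bool:
    """Flag a line as a candidate equation when its density reaches the
    (assumed) threshold. No trained classifier is involved."""
    return special_density(line) >= threshold
```

A line such as `x = (a + b) / 2` has a high share of special symbols and is flagged, while ordinary prose is not. A real detector would also need to merge neighbouring flagged lines into regions, which this sketch leaves out.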

Proceedings Article•DOI•

Automatic selection of visually attractive pages for thumbnail display in document list view


Fabrice Matulic¹•Institutions (1)

ETH Zurich¹

01 Nov 2008

TL;DR: A technique to represent a document as a selection of its most eye-catching pages, intended as part of a document catalogue system and user interface, in which multiple page thumbnails are shown for each document.


Abstract: Document summarization is a task which is difficult to perform automatically, especially if the document is only available as raw pixel data. This paper presents a technique to represent a document as a selection of its most eye-catching pages. The algorithm looks for salient features such as illustrations, diagrams, large titles, headings etc. that cause a page to stand out and ranks its conspicuousness according to the colour, size and number of such elements. A filter function can also be applied to introduce some spread in the selection process, if desired, in order to avoid cases where the extracted pages are too close to each other. The algorithm is intended as part of a document catalogue system and user interface, in which multiple page thumbnails are shown for each document. The aim is to broaden and enrich a document's visual profile beyond the traditional front cover icon and generally to increase its appeal to potential readers during their browsing experience.


5 citations
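
The ranking-plus-spread step described in the thumbnail-selection abstract above could look roughly like the sketch below. It assumes each page already has a conspicuousness score (the abstract derives this from the colour, size and number of salient elements, which is not reproduced here). The greedy strategy and the names `select_pages`, `k` and `min_gap` are hypothetical, not taken from the paper.

```python
def select_pages(scores, k, min_gap=0):
    """Pick up to k page indices with the highest scores.

    scores  -- per-page conspicuousness scores (assumed precomputed)
    k       -- maximum number of pages to return
    min_gap -- assumed spread filter: a page is skipped when it lies within
               min_gap pages of one already chosen (0 disables the filter)
    """
    # Rank pages by descending score; ties go to the earlier page.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    chosen = []
    for i in ranked:
        if len(chosen) >= k:
            break
        if all(abs(i - j) > min_gap for j in chosen):
            chosen.append(i)
    # Return the selection in reading order.
    return sorted(chosen)
```

With scores `[0.9, 0.8, 0.1, 0.7, 0.2]` and `k=2`, the unfiltered selection is pages 0 and 1. Setting `min_gap=1` skips page 1 because it is adjacent to page 0 and picks page 3 instead.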


Network Information

Performance

Metrics

Papers: 1,488

Citations: 35,779

No. of papers in the topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9
