Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.

Papers

Proceedings Article

Evaluating invariances in document layout functions

Alexander J. Macdonald¹, David F. Brailsford¹, John Lumley²

University of Nottingham¹, Hewlett-Packard²

10 Oct 2006

TL;DR: This work looks at deriving and exploiting the invariant properties of layout functions from their formal specifications, and proposes future work on generic extraction of invariance from such properties for certain classes of layout functions.

Abstract: With the development of variable-data-driven digital presses, where each document printed is potentially unique, there is a need for pre-press optimization to identify material that is invariant from document to document. In this way, rasterisation can be confined solely to those areas which change between successive documents, thereby alleviating a potential performance bottleneck. Given a template document specified in terms of layout functions, where actual data is bound at the last possible moment before printing, we look at deriving and exploiting the invariant properties of layout functions from their formal specifications. We propose future work on generic extraction of invariance from such properties for certain classes of layout functions.

6 citations

Semantic text mining and its application in biomedical domain

Xiaohua Hu¹, Illhoi Yoo¹

Drexel University¹

01 Jan 2006

TL;DR: A semantic way to generate reasonable hypotheses based on evidence from biomedical literature using the complementary structures in disjoint literatures is introduced.

Abstract: A huge amount of biomedical knowledge and novel discoveries have been produced and collected in text databases or digital libraries, such as MEDLINE, because the most natural form to store information is text. In order to cope with this pressing text information overload, text mining is employed. However, traditional text mining approaches have several problems, such as the use of the vector representation for documents. In this thesis, we introduce a semantic text mining approach that can overcome the traditional problems. This approach consists of important text mining components. Those components are a graphical representation method for documents that relies on domain ontologies, document clustering that takes advantage of scale-free network theory to mine the corpus-level graphical representation, text summarization, and a semantic version of Swanson's ABC model. The primary contributions of this dissertation are four-fold. First, we introduce a graphical representation method for documents that takes advantage of domain ontology. Second, the semantic document clustering approach is unique in that it provides users with document cluster models from an ontology-enriched scale-free representation of a set of documents, which are the summaries for each document cluster, and which also explain document categorization. Third, in order to maximize the usefulness of document clustering, we introduce a text summarization approach that makes use of document cluster models. Finally, we introduce a semantic way to generate reasonable hypotheses based on evidence from biomedical literature using the complementary structures in disjoint literatures.
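
The hypothesis-generation step mentioned in the abstract builds on Swanson's ABC model: if term A is linked to B in one literature and B is linked to C in another, while A and C are never linked directly, then A-C is a candidate hypothesis. The following is a minimal sketch of the plain co-occurrence form of that idea, not the semantic, ontology-based version the thesis describes. The function names and the toy documents are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Map each term to the set of terms it shares at least one document with."""
    links = defaultdict(set)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def abc_hypotheses(docs, a_term):
    """Return {C: set of bridging B terms} for every C reachable from A only via B."""
    links = cooccurrence(docs)
    direct = links.get(a_term, set())
    hypotheses = defaultdict(set)
    for b in direct:
        for c in links[b]:
            # Keep C only if it never co-occurs with A directly.
            if c != a_term and c not in direct:
                hypotheses[c].add(b)
    return dict(hypotheses)


# Toy "disjoint literatures" (hypothetical data): each set is the terms of one document.
example_docs = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud disease"},
]
example_result = abc_hypotheses(example_docs, "fish oil")
```

Here `example_result` links "fish oil" to "raynaud disease" through the bridging term "blood viscosity". A real system would presumably rank candidates and filter B terms, which this sketch omits.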

6 citations

Patent

System, method and article for applying temporal elements to the attributes of a static document object

John Busfield, Gregory Pulier

18 Feb 2000

TL;DR: A system, method and article of manufacture for applying temporal elements to the attributes of a static document object, preferably implemented as a software program operating on a computer system such as a personal computer or workstation.

Abstract: A system, method and article of manufacture for applying temporal elements to the attributes of a static document object is preferably implemented as a software program operating on a computer system, such as a personal computer or workstation. The software program includes computer implemented steps for: (i) importing an existing document (or alternatively authoring the document directly using the program), wherein the document is structured according to an object model; (ii) scanning the document for the defined structure and extracting the various elements (or nodes) and their attributes; (iii) displaying the extracted document structure to the user of the system; (iv) adding one or more temporal elements to the attributes of the document; and (v) saving the document along with any additional information necessary to execute and implement the temporal elements so as to create a dynamic document object.

6 citations

Journal Article

Searching for web information more efficiently using presentational layout analysis

Miloš Kovačević, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic

01 Jan 2003-International Journal of Electronic Business

TL;DR: A new, hierarchical representation that includes browser screen coordinates for every HTML object on a page is proposed, together with a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.

Abstract: Extracting and processing information from web pages is an important task in many areas such as constructing search engines, information retrieval, and data mining from the web. A common approach in the extraction process is to represent a page as a bag of words and then to perform additional processing on such a flat representation. In this paper, we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object on a page. Using visual information, one is able to define heuristics for recognition of common page areas such as a header, left and right menu, footer and the centre of a page. Initial experiments have shown that, using our heuristics, defined areas are recognised properly in 73% of cases. Finally, we introduce a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.
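
To make the idea of coordinate-based area heuristics concrete, here is a minimal sketch that assigns an HTML object's bounding box to one of the five page areas named in the abstract. The abstract does not give the authors' actual rules, so the `Box` type, the function name and the threshold fractions (`band`, `side`) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Screen-coordinate bounding box of one HTML object (hypothetical type)."""
    x: float       # left edge, pixels
    y: float       # top edge, pixels (y grows downwards)
    width: float
    height: float


def classify_area(box, page_width, page_height, band=0.2, side=0.25):
    """Label a box as header, footer, left menu, right menu or centre.

    `band` and `side` are assumed fractions of the page height and width;
    they are illustrative values, not thresholds taken from the paper.
    """
    if box.y + box.height <= band * page_height:
        return "header"          # box lies entirely in the top band
    if box.y >= (1 - band) * page_height:
        return "footer"          # box lies entirely in the bottom band
    if box.x + box.width <= side * page_width:
        return "left menu"       # box lies entirely in the left strip
    if box.x >= (1 - side) * page_width:
        return "right menu"      # box lies entirely in the right strip
    return "centre"
```

For example, on an assumed 1000 x 2000 pixel page, a full-width box at the very top is labelled "header", while a narrow box hugging the left edge in the middle of the page is labelled "left menu". The rules are checked in a fixed order, so a box that satisfies both a vertical and a horizontal rule receives the vertical label.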

6 citations

Proceedings Article

Document skew detection based on Hough space derivatives

Felix Stahlberg¹, Stephan Vogel¹

Qatar Computing Research Institute¹

23 Aug 2015

TL;DR: A very generic and robust method which can cope with a wide variety of document types and writing systems is proposed, which uses derivatives in the Hough space to identify directions with sudden changes in their projection profiles.

Abstract: One of the basic challenges in page layout analysis of scanned document images is the estimation of the document skew. Precise skew correction is particularly important when the document is to be passed to an optical character recognition system. In this paper, we propose a very generic and robust method which can cope with a wide variety of document types and writing systems. It uses derivatives in the Hough space to identify directions with sudden changes in their projection profiles. We show that this criterion is useful to identify the horizontal and vertical direction with respect to the document. We test our method on the DISEC'13 data set for document skew detection. Our results are comparable to the best systems in the literature.
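
The core idea in the abstract, picking the direction whose projection profile changes most abruptly, can be sketched as follows. Each column of a Hough accumulator (fixed angle, varying offset rho) is the projection profile of the page in that direction, so the sketch scores every candidate angle by the summed squared difference between neighbouring rho bins. This is a simplified illustration and not the authors' exact criterion. The function names, the scoring formula, the sign convention for the angle and the synthetic test page are all assumptions.

```python
import math


def hough_derivative_skew(points, candidate_angles_deg, rho_step=1.0):
    """Return the candidate angle (degrees) with the sharpest projection profile.

    `points` is a non-empty iterable of (x, y) foreground pixel coordinates.
    """
    points = list(points)
    best_angle, best_score = None, -1.0
    for angle in candidate_angles_deg:
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # One Hough column: count pixels per offset rho along the line normal.
        column = {}
        for x, y in points:
            rho = math.floor((y * cos_t - x * sin_t) / rho_step)
            column[rho] = column.get(rho, 0) + 1
        lo, hi = min(column), max(column)
        # Pad with one empty bin on each side so edge jumps are counted too.
        profile = [column.get(r, 0) for r in range(lo - 1, hi + 2)]
        # Discrete derivative along rho: large when the profile has sudden changes.
        score = sum((b - a) ** 2 for a, b in zip(profile, profile[1:]))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_text_lines(skew_deg, n_lines=10, line_gap=20, width=200):
    """Hypothetical test page: thin parallel 'text lines' tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(k * line_gap + x * slope))
            for k in range(n_lines) for x in range(width)]
```

A possible usage is `hough_derivative_skew(synthetic_text_lines(5), range(-15, 16))`, which searches whole-degree angles between -15 and 15. On real scans the foreground pixels would come from a binarised image, and a finer angle grid and a coarser `rho_step` would likely be needed.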

6 citations

Performance

Metrics

Papers: 1,488
Citations: 35,779

No. of papers in the topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9
