Topic

Document layout analysis

About: Document layout analysis is a research topic. To date, 1462 publications on this topic have received a total of 34021 citations.


Papers
Proceedings Article
01 Jan 2009
TL;DR: This work introduces into the DOMINUS document processing framework qualitative techniques for text categorization and keyword extraction, based on the lexical taxonomy WordNet and its extension WordNet Domains, to support the quantitative techniques already embedded in the framework.
Abstract: The wide availability of documents in digital format poses the problem of efficient and effective retrieval mechanisms. Retrieval requires the ability to process natural language, which is a significantly complex task. Traditional algorithms based on term matching between the document and the query, although efficient, cannot capture the intended meaning of either, and hence cannot ensure effectiveness. To move toward semantics, problems such as polysemy and synonymy must be tackled automatically by text processing systems. This work aims at introducing in the document processing framework of DOMINUS qualitative techniques based on the lexical taxonomy WordNet and its extension WordNet Domains for text categorization and keyword extraction, which can support the currently embedded techniques based on quantitative approaches. In particular, a density function is exploited to assign the proper importance to the involved concepts and domains. Preliminary results on texts of different subjects confirm its effectiveness.
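The limitation of pure term matching mentioned in this abstract can be illustrated with a minimal sketch. It is not the DOMINUS method: the `CONCEPTS` table below is a small hand-made, hypothetical stand-in for a lexical resource such as WordNet, and the abstract's density function is not reproduced here.

```python
# Hypothetical word-to-concept table, standing in for a lexical taxonomy.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "cheap": "low_cost",
    "affordable": "low_cost",
}

def term_match_score(query, document):
    """Fraction of query terms that literally occur in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0

def concept_match_score(query, document):
    """Same overlap measure, after mapping each term to its concept (if known)."""
    q = {CONCEPTS.get(t, t) for t in query.lower().split()}
    d = {CONCEPTS.get(t, t) for t in document.lower().split()}
    return len(q & d) / len(q) if q else 0.0

query = "cheap car"
document = "affordable automobile for sale"
literal = term_match_score(query, document)
conceptual = concept_match_score(query, document)
```

With this toy table, the literal score misses the match entirely because the query and document share no surface terms, while the concept-level score recovers it through the synonym mapping.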

8 citations

Patent
Kosei Fume
22 Jan 2009
TL;DR: A document processing apparatus includes an extracting unit that extracts text document information from document data; an analyzing unit that analyzes a modification relation of a character string included in the text document information; and an attribute unit that assigns an attribute indicating details of the modification relation to the character string and embeds the attribute in the text document information.
Abstract: A document processing apparatus includes an extracting unit that extracts text document information from a document data; an analyzing unit that analyzes a modification relation of a character string included in the text document information; an attribute unit that assigns an attribute indicating details of the modification relation to the character string, and embeds the attribute in the text document information; a document specifying unit that specifies a document-specifying character string that specifies other text document information, using the text document information in which the attribute is embedded by the attribute unit; and a document-identification unit that assigns document identification information to the document-specifying character string, and embeds the document identification information in the text document information.

8 citations

Patent
27 Sep 2000
TL;DR: A method and apparatus for formatting a computer-generated document for output, such as printing, is provided. Information necessary to generate the document is extracted from a database, and a layout program assigns specific layout parameters to each layout identifier, which specify the placement of an associated print data record within the document.
Abstract: A method and apparatus for formatting a computer-generated document for output, such as printing, is provided. Information necessary to generate a document is extracted from a database. The extraction program assigns a layout identifier to each data record retrieved from the database based on the type of information contained within the data record and how the information is to be formatted in the document. A layout program assigns specific layout parameters to each layout identifier, which specify the placement of an associated print data record within a document. Next, a formatting program applies the set of layout parameters to a data stream containing a plurality of data records to create a formatted document. The various elements of the invention, such as the data extraction program, the database, the layout program and the formatter, may be integrated into a single software program, co-resident on a single computer system, or distributed across various computer systems on a network. It is also contemplated that one or more of the various elements of the invention, such as the formatter, the extraction program, or the layout program, could be embodied as hardware instead of software.
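The three stages described in this abstract (extraction assigns a layout identifier, a layout program maps identifiers to placement parameters, a formatter applies them to the record stream) can be sketched roughly as below. Every name here (`DataRecord`, `ID_BY_KIND`, `LAYOUT_PARAMS`, the `x`/`y` fields and the identifier strings) is a hypothetical illustration and is not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class DataRecord:
    kind: str            # type of information, e.g. "customer_name"
    value: str           # the print data itself
    layout_id: str = ""  # filled in by the extraction step

# Hypothetical mapping from record type to layout identifier.
ID_BY_KIND = {"customer_name": "HDR", "line_item": "BODY", "total": "FTR"}

# Hypothetical layout parameters per identifier (position on the page).
LAYOUT_PARAMS = {
    "HDR": {"x": 10, "y": 5},
    "BODY": {"x": 10, "y": 20},
    "FTR": {"x": 10, "y": 90},
}

def assign_layout_ids(records):
    """Extraction step: tag each record with a layout identifier by its type."""
    for record in records:
        record.layout_id = ID_BY_KIND.get(record.kind, "BODY")
    return records

def format_document(records, layout_params):
    """Formatting step: place each record using its identifier's parameters."""
    placed = []
    for record in records:
        params = layout_params[record.layout_id]
        placed.append((params["x"], params["y"], record.value))
    # Order the output top to bottom by vertical position.
    return sorted(placed, key=lambda item: item[1])

records = [DataRecord("total", "42.00"), DataRecord("customer_name", "Ada")]
page = format_document(assign_layout_ids(records), LAYOUT_PARAMS)
```

Keeping the identifier-to-parameter table separate from the records means the same data stream can be reformatted by swapping only `LAYOUT_PARAMS`, which is the separation of concerns the abstract describes.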

8 citations

Book Chapter
20 Dec 2005
TL;DR: A hierarchical framework for document segmentation is proposed as an optimization problem, and the novelty of this approach lies in learning the segmentation parameters in the absence of ground truth.
Abstract: A hierarchical framework for document segmentation is proposed as an optimization problem. The model incorporates the dependencies between various levels of the hierarchy, unlike traditional document segmentation algorithms. This framework is applied to learn the parameters of the document segmentation algorithm using optimization methods like gradient descent and Q-learning. The novelty of our approach lies in learning the segmentation parameters in the absence of ground truth.

8 citations

Journal Article
01 Jan 2014
TL;DR: An approach for Document Layout Analysis based on local correlation features that identifies and extracts illustrations in digitized documents by learning the discriminative patterns of textual and pictorial regions.
Abstract: In this paper we propose an approach for Document Layout Analysis based on local correlation features. We identify and extract illustrations in digitized documents by learning the discriminative patterns of textual and pictorial regions. The proposal has been demonstrated to be effective on historical datasets and to outperform the state of the art in the presence of challenging documents with a large variety of pictorial elements.

8 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9