Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have been published within this topic, receiving 34,021 citations.

Papers

Book•

Adaptive subject description in document retrieval

Michael David Gordon

01 Jan 1984

TL;DR: This study sees that communication (feedback) about the queries of inquirers searching for a given document can be incorporated by a retrieval system in order to redescribe that document so that its description matches better those queries.

Abstract: The central problem in document retrieval is that the subject of a document may be described in many different ways and, similarly, different inquirers may express similar information needs by a variety of different queries. This variance makes it difficult to get the "right" documents into the hands of the "right" inquirers, for retrieving a document by means of its subject description depends on that subject description adequately matching an inquirer's query. Document descriptions comprise only one part of a retrieval system, and a "good" document description is one that describes the subject of a document in a way that will match the queries of inquirers who will find that document relevant to their information need. In this study, we see that communication (feedback) about the queries of inquirers searching for a given document can be incorporated by a retrieval system in order to redescribe that document so that its description matches better those queries. An adaptive (genetic) algorithm, responsible for such redescription, achieves two aims: first, it increases the probability of a document's subject description matching a query to which the document is relevant (equivalently, it increases the degree of association between a document and a relevant query); second, the algorithm decreases the probability of a document's subject description matching a query to which the document is not relevant (equivalently, it decreases the degree of association between a document and a non-relevant query). Simulation experiments demonstrate the success of adaptive subject redescription in achieving these aims. The simulation technique, itself, is novel: By establishing a set of queries, (to some of which a document is relevant, the rest of which it is not), and measuring the association between the document's description and each of these queries, we obtain estimates of system recall and fallout without building an actual document collection. 
The method of obtaining such "simulated queries" is described. The simulation technique may help provide a solution to the problem of predicting the performance of a large-scale retrieval system based on its operation in a smaller-scale experimental setting.
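
The recall and fallout estimates mentioned in the abstract can be illustrated with a minimal sketch. It assumes each simulated query already has a numeric association score with the document's description, and that a query counts as "matched" when its score reaches a threshold; the function name, the scores, and the threshold are illustrative assumptions, not the study's actual procedure.

```python
from typing import Sequence, Tuple


def recall_and_fallout(
    relevant_scores: Sequence[float],
    nonrelevant_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Estimate recall and fallout for one document from simulated queries.

    relevant_scores:    association scores for queries the document is relevant to
    nonrelevant_scores: association scores for queries it is not relevant to
    threshold:          assumed cut-off at which a description "matches" a query
    """
    if not relevant_scores or not nonrelevant_scores:
        raise ValueError("both query sets must be non-empty")

    # Recall: share of relevant queries that the description matches.
    matched_relevant = sum(score >= threshold for score in relevant_scores)
    # Fallout: share of non-relevant queries that the description matches.
    matched_nonrelevant = sum(score >= threshold for score in nonrelevant_scores)

    recall = matched_relevant / len(relevant_scores)
    fallout = matched_nonrelevant / len(nonrelevant_scores)
    return recall, fallout


# Made-up association scores for a single document description.
recall, fallout = recall_and_fallout(
    relevant_scores=[0.9, 0.7, 0.4, 0.8],
    nonrelevant_scores=[0.2, 0.6, 0.1, 0.3, 0.05],
    threshold=0.5,
)
print(recall, fallout)
```

Under these assumptions, a successful redescription would raise the first number and lower the second, which corresponds to the two aims stated in the abstract.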

6 citations

Proceedings Article•DOI•

Abstract argumentation for reading order detection

Stefano Ferilli¹, Domenico Grieco¹, Domenico Redavid, Floriana Esposito¹•Institutions (1)

University of Bari¹

16 Sep 2014

TL;DR: Experimental results show that the automatic strategy for identifying the correct reading order of a document page's components based on abstract argumentation is effective in more complex cases, and requires less background knowledge, than previous solutions that have been proposed in the literature.

Abstract: Detecting the reading order among the layout components of a document's page is fundamental to ensure effectiveness or even applicability of subsequent content extraction steps. While in single-column documents the reading flow can be straightforwardly determined, in more complex documents the task may become very hard. This paper proposes an automatic strategy for identifying the correct reading order of a document page's components based on abstract argumentation. The technique is unsupervised, and works on any kind of document based only on general assumptions about how humans behave when reading documents. Experimental results show that it is effective in more complex cases, and requires less background knowledge, than previous solutions that have been proposed in the literature.

6 citations

Proceedings Article•DOI•

Profile Based Information Retrieval from Printed Document Images

S. Abirami¹, D. Manjula¹•Institutions (1)

Anna University¹

14 Aug 2007

TL;DR: This paper performs profile-based information retrieval from printed document image collections, using word profiles to match word images in bilingual (English and Tamil) document images.

Abstract: This paper performs profile-based information retrieval from printed document image collections. Keywords are valuable indexing tools, and if they can be identified at the image level, extensive computation during recognition will be avoided. Printed documents can be scanned to produce document images. Instead of converting entire document images into their text equivalent, word profiles are identified to match the word images in bilingual (English and Tamil) document images. During retrieval, the same profile can be extracted from the user-specified word and matched with the word images in the document. This yields a faster result even in a quality-degraded document. This kind of information retrieval (keyword-based search) can be adapted for digital libraries, which employ digitized documents instead of text processing. This promotes efficient search in document images irrespective of the language.

6 citations

Patent•

Electronic file device for document image

Yosuke Furukawa, Takuya Sugita

25 Jun 1986

TL;DR: In this patent, a document image reducing means 15 reduces the document image in a text document image file means 14 and displays their list on an image display means 13, so that the operator can utilize the image information directly and easily read the document images.

Abstract: PURPOSE: To facilitate the retrieval of a document image by using an image obtained by reducing an actual filed document image itself as a means for indexing in addition to key words. CONSTITUTION: A document image reducing means 15 reduces the document image in a document image file means 14. The command for file retrieval is inputted by a command input means 12 and then the key word for document images is inputted, so that a CPU 16 displays the number of document images corresponding to the key word in the document image file means 14. When an operator commands the reference of the reduced image, the CPU 16 reduces said number of document images in the document image file means 14 and displays their list on an image display means 13. The operator, therefore, utilizes the image information directly and easily reads the document images.

6 citations

Journal Article•DOI•

Personal Document Management and Retrieval: A Knowledge-Based Approach

Xien Fan, Peter A. Ng¹•Institutions (1)

New Jersey Institute of Technology¹

01 Sep 1998-Journal of Systems Integration

TL;DR: This paper incorporates the notions of document type hierarchy and folder organization into the multilevel architecture of document storage and proposes a knowledge-based query-preprocessing algorithm, which reduces the search space.

Abstract: This paper presents a knowledge-based approach to managing and retrieving personal documents. The dual document models consist of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization mimics the user's real-world document filing system for organizing and storing documents in an office environment. Predicate-based representation of documents is formalized for specifying knowledge about documents. Document filing and retrieval are predicate-driven. The filing criteria for the folders, which are specified in terms of predicates, govern the grouping of frame instances, regardless of their document types. We incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage. This architecture supports various text-based information retrieval techniques and content-based multimedia information retrieval techniques. The paper also proposes a knowledge-based query-preprocessing algorithm, which reduces the search space. For automating the document filing and retrieval, a predicate evaluation engine with a knowledge base is proposed. The learning agent is responsible for acquiring the knowledge needed by the evaluation engine.
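
The idea of predicate-driven filing, where a folder's criteria decide which frame instances it groups regardless of document type, might be sketched as follows. This is only an illustration under stated assumptions: the `Folder` class, the dictionary-based frame representation, and the example attributes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical frame instance: a mapping from attribute names to values.
Frame = Dict[str, Any]


@dataclass
class Folder:
    """A folder whose filing criterion is a predicate over frame instances."""
    name: str
    criterion: Callable[[Frame], bool]
    contents: List[Frame] = field(default_factory=list)


def file_document(frame: Frame, folders: List[Folder]) -> List[str]:
    """File a frame into every folder whose criterion it satisfies.

    The document type plays no special role: only the predicate decides.
    Returns the names of the folders the frame was filed into.
    """
    filed_in = []
    for folder in folders:
        if folder.criterion(frame):
            folder.contents.append(frame)
            filed_in.append(folder.name)
    return filed_in


# Illustrative folders and documents (all values are made up).
folders = [
    Folder("From Ng", lambda f: f.get("sender") == "Ng"),
    Folder("1998", lambda f: f.get("year") == 1998),
]
memo = {"type": "memo", "sender": "Ng", "year": 1998}
letter = {"type": "letter", "sender": "Fan", "year": 1998}

memo_folders = file_document(memo, folders)
letter_folders = file_document(letter, folders)
print(memo_folders, letter_folders)
```

In this sketch the memo and the letter have different types but both land in the "1998" folder, since grouping depends only on the folder's predicate.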

6 citations

Metrics

Papers: 1,488
Citations: 35,779

No. of papers in the topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9
