Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Book
01 Jan 1984
TL;DR: This study shows that communication (feedback) about the queries of inquirers searching for a given document can be incorporated by a retrieval system to redescribe that document so that its description better matches those queries.
Abstract: The central problem in document retrieval is that the subject of a document may be described in many different ways and, similarly, different inquirers may express similar information needs by a variety of different queries. This variance makes it difficult to get the "right" documents into the hands of the "right" inquirers, for retrieving a document by means of its subject description depends on that subject description adequately matching an inquirer's query. Document descriptions comprise only one part of a retrieval system, and a "good" document description is one that describes the subject of a document in a way that will match the queries of inquirers who will find that document relevant to their information need. In this study, we see that communication (feedback) about the queries of inquirers searching for a given document can be incorporated by a retrieval system in order to redescribe that document so that its description better matches those queries. An adaptive (genetic) algorithm, responsible for such redescription, achieves two aims: first, it increases the probability of a document's subject description matching a query to which the document is relevant (equivalently, it increases the degree of association between a document and a relevant query); second, the algorithm decreases the probability of a document's subject description matching a query to which the document is not relevant (equivalently, it decreases the degree of association between a document and a non-relevant query). Simulation experiments demonstrate the success of adaptive subject redescription in achieving these aims. The simulation technique itself is novel: by establishing a set of queries (to some of which a document is relevant, the rest of which it is not) and measuring the association between the document's description and each of these queries, we obtain estimates of system recall and fallout without building an actual document collection.
The method of obtaining such "simulated queries" is described. The simulation technique may help provide a solution to the problem of predicting the performance of a large-scale retrieval system based on its operation in a smaller-scale experimental setting.
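The recall and fallout estimates described in this abstract can be illustrated with a minimal sketch. It is not the study's method: the Jaccard term-overlap association measure, the retrieval threshold, and the names `association` and `recall_and_fallout` are all assumptions made for illustration, and the adaptive (genetic) redescription step is not modelled.

```python
def association(description, query):
    """Term-overlap (Jaccard) association between a description and a query.

    Assumption: the abstract does not name its association measure.
    """
    d, q = set(description), set(query)
    union = d | q
    return len(d & q) / len(union) if union else 0.0


def recall_and_fallout(description, relevant_queries, nonrelevant_queries,
                       threshold=0.3):
    """Estimate recall and fallout for one document over simulated queries.

    A query is counted as matching the document when its association with
    the description reaches the (assumed) threshold.
    """
    hits = sum(association(description, q) >= threshold
               for q in relevant_queries)
    false_hits = sum(association(description, q) >= threshold
                     for q in nonrelevant_queries)
    recall = hits / len(relevant_queries) if relevant_queries else 0.0
    fallout = (false_hits / len(nonrelevant_queries)
               if nonrelevant_queries else 0.0)
    return recall, fallout


# Hypothetical description and simulated queries.
description = ["retrieval", "genetic", "adaptive"]
relevant = [["adaptive", "retrieval"], ["genetic", "algorithm"]]
nonrelevant = [["image", "layout"], ["retrieval", "image", "scan", "page"]]
result = recall_and_fallout(description, relevant, nonrelevant)
```

A redescription step would then be judged by whether it raises the first number and lowers the second for the same set of simulated queries.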

6 citations

Proceedings Article
16 Sep 2014
TL;DR: Experimental results show that the automatic strategy for identifying the correct reading order of a document page's components based on abstract argumentation is effective in more complex cases, and requires less background knowledge, than previous solutions that have been proposed in the literature.
Abstract: Detecting the reading order among the layout components of a document's page is fundamental to ensure effectiveness or even applicability of subsequent content extraction steps. While in single-column documents the reading flow can be straightforwardly determined, in more complex documents the task may become very hard. This paper proposes an automatic strategy for identifying the correct reading order of a document page's components based on abstract argumentation. The technique is unsupervised, and works on any kind of document based only on general assumptions about how humans behave when reading documents. Experimental results show that it is effective in more complex cases, and requires less background knowledge, than previous solutions that have been proposed in the literature.

6 citations

Proceedings Article
S. Abirami1, D. Manjula1
14 Aug 2007
TL;DR: This paper performs profile-based information retrieval from printed document image collections, using word profiles to match word images in bilingual (English and Tamil) document images.
Abstract: This paper performs profile-based Information Retrieval from printed document image collections. Keywords are valuable indexing tools, and if they can be identified at the image level, extensive computation during recognition will be avoided. Printed documents can be scanned to produce document images. Instead of converting entire document images into their text equivalent, word profiles are identified to match the word images in bilingual (English and Tamil) document images. During retrieval, the same profile can be extracted from the user-specified word and matched with the word images in the document. This yields a faster result even in a quality-degraded document. This kind of Information Retrieval (keyword-based search) can be adopted in Digital Libraries, which employ digitized documents instead of text processing. This promotes efficient search in document images irrespective of the language.
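Matching a query word against word images by profile, without recognition, might look like the sketch below. The abstract does not say which profile is used, so a vertical projection profile (ink pixels per column) is assumed here; the resampling length, the distance measure, the threshold, and all function names are likewise illustrative.

```python
def vertical_profile(image):
    """Ink pixels per column of a non-empty binary word image (rows of 0/1)."""
    return [sum(column) for column in zip(*image)]


def resample(profile, length):
    """Nearest-neighbour resampling so words of different widths compare."""
    n = len(profile)
    return [profile[min(n - 1, i * n // length)] for i in range(length)]


def profile_distance(image_a, image_b, length=16):
    """Mean absolute difference between two resampled profiles."""
    pa = resample(vertical_profile(image_a), length)
    pb = resample(vertical_profile(image_b), length)
    return sum(abs(x - y) for x, y in zip(pa, pb)) / length


def find_matches(query_image, word_images, max_distance=0.5):
    """Indices of word images whose profile is close to the query's profile."""
    return [i for i, word in enumerate(word_images)
            if profile_distance(query_image, word) <= max_distance]


# Hypothetical 2x3 binary word images.
query = [[1, 0, 1],
         [1, 1, 1]]
words = [
    [[1, 0, 1], [1, 1, 1]],  # same shape as the query
    [[0, 0, 0], [0, 1, 0]],  # a different shape
]
matches = find_matches(query, words)
```

Because only pixel counts are compared, the same procedure applies to any script, which is consistent with the abstract's claim of language independence.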

6 citations

Patent
25 Jun 1986
TL;DR: A document image reducing means 15 reduces the document images held in a document image file means 14 and displays a list of them on an image display means 13, so that the operator can use the image information directly and easily read the document images.
Abstract: PURPOSE: To facilitate the retrieval of a document image by using an image obtained by reducing an actual filed document image itself as a means for indexing, in addition to key words. CONSTITUTION: A document image reducing means 15 reduces the document image in a document image file means 14. The command for file retrieval is input by a command input means 12, and then the key word for document images is input, so that a CPU 16 displays the number of document images corresponding to the key word in the document image file means 14. When an operator commands the reference of the reduced image, the CPU 16 reduces said number of document images in the document image file means 14 and displays their list on an image display means 13. The operator, therefore, utilizes the image information directly and easily reads the document images.

6 citations

Journal Article
TL;DR: This paper incorporates the notions of document type hierarchy and folder organization into the multilevel architecture of document storage and proposes a knowledge-based query-preprocessing algorithm, which reduces the search space.
Abstract: This paper presents a knowledge-based approach to managing and retrieving personal documents. The dual document models consist of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization mimics the user's real-world document filing system for organizing and storing documents in an office environment. Predicate-based representation of documents is formalized for specifying knowledge about documents. Document filing and retrieval are predicate-driven. The filing criteria for the folders, which are specified in terms of predicates, govern the grouping of frame instances, regardless of their document types. We incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage. This architecture supports various text-based information retrieval techniques and content-based multimedia information retrieval techniques. The paper also proposes a knowledge-based query-preprocessing algorithm, which reduces the search space. For automating the document filing and retrieval, a predicate evaluation engine with a knowledge base is proposed. The learning agent is responsible for acquiring the knowledge needed by the evaluation engine.
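The predicate-driven filing described in this abstract, where folder criteria group frame instances regardless of document type, can be sketched as follows. The frame fields, folder names, and the `file_documents` helper are hypothetical; the paper's predicate evaluation engine, knowledge base, and learning agent are not modelled.

```python
from typing import Callable, Dict, List

Frame = Dict[str, object]           # a frame instance: attribute -> value
Predicate = Callable[[Frame], bool]  # a folder's filing criterion


def file_documents(frames: List[Frame],
                   folders: Dict[str, Predicate]) -> Dict[str, List[Frame]]:
    """Place each frame in every folder whose criterion it satisfies."""
    filed: Dict[str, List[Frame]] = {name: [] for name in folders}
    for frame in frames:
        for name, criterion in folders.items():
            if criterion(frame):
                filed[name].append(frame)
    return filed


# Hypothetical frame instances of two different document types.
frames = [
    {"type": "memo", "sender": "Smith", "year": 1998},
    {"type": "letter", "sender": "Smith", "year": 1997},
    {"type": "memo", "sender": "Jones", "year": 1998},
]

# Filing criteria are predicates over frame attributes, not document types.
folders = {
    "from_smith": lambda f: f.get("sender") == "Smith",
    "year_1998": lambda f: f.get("year") == 1998,
}

filed = file_documents(frames, folders)
```

In this example the "from_smith" folder receives both a memo and a letter, since its criterion refers only to the sender.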

6 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9