Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.


Papers
Proceedings Article
18 Sep 2012
TL;DR: This work introduces an approach that segments text appearing in page margins from manuscripts with complex layout format, independent of block segmentation, as well as pixel level analysis.
Abstract: Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-notes text) from manuscripts with complex layout format. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are subsequently generated. A multilayer perceptron classifier is used to classify connected components into the relevant class of text. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation, as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-notes layout formats, achieving a segmentation accuracy of about 95%.

65 citations
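
The abstract above mentions a voting scheme that refines per-component classifications but does not say how the votes are collected. As a minimal sketch only, assuming a spatial nearest-neighbour majority vote (an assumption, not the authors' stated method), the refinement step might look like this. The function name `refine_by_voting`, the `(x, y, label)` tuple layout, and the parameter `k` are all hypothetical.

```python
import math
from collections import Counter

def refine_by_voting(components, k=4):
    """Relabel each connected component by majority vote.

    components: list of (x, y, label) tuples, where (x, y) is the component
    centroid and label is the classifier's initial prediction.
    Assumption: votes come from the k nearest components plus the component
    itself; the abstract does not specify the actual voting rule.
    """
    refined = []
    for i, (x, y, label) in enumerate(components):
        # All other components, sorted by centroid distance to this one.
        others = [c for j, c in enumerate(components) if j != i]
        others.sort(key=lambda c: math.hypot(c[0] - x, c[1] - y))
        # Count the labels of the k nearest neighbours.
        votes = Counter(c[2] for c in others[:k])
        # The component's own initial prediction also counts as one vote.
        votes[label] += 1
        refined.append(votes.most_common(1)[0][0])
    return refined
```

With two classes and an even `k`, the total number of votes is odd, so ties cannot occur. Votes are always read from the initial labels, so the result does not depend on the order in which components are processed.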

Patent
14 Nov 1997
TL;DR: In this document search and retrieval system, document images are segmented into layout objects, and attributes and features are computed for each object; before any document images are transmitted between a client and a server, users specify which attributes and features are most relevant to their task.
Abstract: In a document search and retrieval system, document images are segmented into layout objects. Each layout object identifies different structural elements in a document image. In addition, the system computes attributes and features for each segmented layout object. Before any document images are transmitted between a client and a server, users specify which document image attributes and features are most relevant to their browsing or searching tasks. Transmission (and/or display) of document images is then divided into two stages. During the first stage, those layout objects which are identified as having the specified features or attributes are transmitted at a first or high resolution; the remaining layout objects in an image are transmitted at a second or lower resolution (or in the form of bounding polygons). If the second stage is invoked, those remaining layout objects are re-transmitted at the first or high resolution. The second stage of transmission may be invoked when either a user request is received or when there is a system timeout.

64 citations
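
The two-stage transmission described in the abstract above can be illustrated with a short sketch. It assumes each layout object carries a set of feature tags and that the user's choice is a set of wanted tags; the names `LayoutObject`, `first_stage`, and `second_stage` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class LayoutObject:
    name: str            # hypothetical identifier for a segmented region
    features: frozenset  # hypothetical feature/attribute tags

def first_stage(objects, wanted):
    """Stage one: matching objects go out at high resolution, the rest at low."""
    plan = []
    for obj in objects:
        resolution = "high" if obj.features & wanted else "low"
        plan.append((obj.name, resolution))
    return plan

def second_stage(plan):
    """Stage two: names of the objects to re-transmit at high resolution.

    Per the abstract, this stage runs only on a user request or a timeout.
    """
    return [name for name, resolution in plan if resolution == "low"]
```

For example, with a title tagged `heading` and a photo tagged `image`, asking for `{"heading"}` sends the title at high resolution first and leaves the photo for the second stage.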

Patent
23 Feb 2010
TL;DR: A system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a business document thereon, to create a bitmap.
Abstract: A system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a business document thereon, to create a bitmap. A network server carries out a segmentation process to segment the scan generated bitmap into a bitmap object, the bitmap object corresponding to the scanned business document; a bitmap to text conversion process to convert the bitmap object into a block of text; a semantic recognition process to generate a structured representation of semantic entities corresponding to the scanned business document; and a document generation process to convert the structured representation into a structure text file. The semantic recognition process includes the processes of generating, for each line of text having a keyword therein, a terminal symbol corresponding to the keyword therein; generating, for each line of text not having a keyword therein and absent of numeric characters, an alphabetic terminal symbol; generating, for each line of text not having a keyword therein and having a numeric character therein, an alphanumeric terminal symbol; generating a string of terminal symbols from the generated terminal symbols; determining a probable parsing of the generated string of terminal symbols; labeling each text line, according to a determined function, with non-terminal symbols; and parsing the business document information text into fields of business document information text based upon the non-terminal symbol of each text line and the determined probable parsing of the generated string of terminal symbols.

63 citations
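
The terminal-symbol rules in the abstract above (a keyword symbol for lines containing a keyword, an alphabetic symbol for keyword-free lines without digits, an alphanumeric symbol for keyword-free lines with digits) can be sketched as follows. The keyword table and the symbol names are assumptions for illustration; the abstract does not list them, and the later parsing and labeling steps are not shown.

```python
# Hypothetical keyword table: the abstract does not name the actual keywords.
KEYWORDS = {"phone": "PHONE", "fax": "FAX", "email": "EMAIL"}

def terminal_symbol(line):
    """Map one line of recognized text to a terminal symbol."""
    lowered = line.lower()
    # Rule 1: a line containing a keyword gets that keyword's symbol.
    for keyword, symbol in KEYWORDS.items():
        if keyword in lowered:
            return symbol
    # Rule 2: no keyword, but at least one numeric character.
    if any(ch.isdigit() for ch in line):
        return "ALNUM"
    # Rule 3: no keyword and no numeric characters.
    return "ALPHA"

def terminal_string(lines):
    """Build the string of terminal symbols for the non-empty text lines."""
    return [terminal_symbol(line) for line in lines if line.strip()]
```

Calling `terminal_string(["Jane Doe", "Phone: 555-0100", "12 Main Street"])` yields `["ALPHA", "PHONE", "ALNUM"]`, which is the kind of symbol sequence a grammar-based parser could then consume.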

Journal Article
TL;DR: This paper proposes a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents.
Abstract: Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

62 citations

Patent
28 Dec 2001
TL;DR: This patent presents a system and method for automatically generating a hierarchical table of contents or outline for indexing a document and identifying clusters of related information in it, combining latent semantic indexing techniques with scale space segmentation techniques to identify related blocks and topic changes of various sizes within the document.
Abstract: A system and method for automatically generating a hierarchical table of contents or outline for indexing a document and identifying clusters of related information in the document. The document may comprise text, audio, video, or a multimedia presentation. The invention employs a unique and novel combination of latent semantic indexing techniques to identify related blocks and major topic changes within the document with scale space segmentation techniques to respectively identify self-similar blocks within the document and to thus find topic changes of various sizes at block edges. The invention then produces a visual presentation of the semantic structure of the document.

62 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  5
2022  19
2021  34
2020  19
2019  14
2018  9