Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Patent
04 Aug 2005
TL;DR: A translation device comprising a character recognition unit that recognizes text data in a text region of an input image, a translator that translates that text data, and a layout configuration processor that generates output containing the translated text and the graphics of the input image while maintaining the input image's layout.
Abstract: A translation device comprises a character recognition unit that recognizes text data in a text region of an input image; a translator that translates the text data in the text region; and a layout configuration processor that generates data containing the translated text data in the text region and graphics in the input image, wherein a layout of the input image is maintained in a layout of the image of the data generated by the layout configuration processor.

32 citations

Patent
24 Jan 2000
TL;DR: The style of an example document is determined by examining the example file for syntax patterns that are required in a document of its type; each pattern is used to create a section template (a sub-template of a larger document template).
Abstract: A system and method of using an example document to create another document with the same style. The style is determined by examining the example file for syntax patterns that are required in a document of this type. Each pattern is used to create a section template (a sub-template for a larger template). After all the required sub-templates have been defined, by examining the example, we have a document template that may be used to format new documents. Along with user-specific content, a document generator uses the captured document template to generate sections of a new document. When a section of a document is generated, the sub-template that corresponds to that section of a document is inserted with user-specific content. The generated file ends up with the same kind of text spacing and positioning, ordering of sections, presence of annotations and other nonfunctional attributes as the example.

32 citations

Patent
06 Jul 2012
TL;DR: A method, a storage medium, and a system for document content reconstruction are provided in a digital content delivery and online education services platform to enable delivery of textbooks and other copyrighted material to multi-platform web browser applications.
Abstract: A method, a storage medium and a system for document content reconstruction are provided in a digital content delivery and online education services platform to enable delivery of textbooks and other copyrighted material to multi-platform web browser applications. The method comprises ingesting a document page in an unstructured document format. The method further comprises extracting one or more images and metadata associated with the images and text and fonts associated with the texts from the document page. In addition, the method comprises coalescing text into paragraphs and creating a structured document page in a markup language format using the extracted images, text and fonts rendered with layout fidelity to the original ingested document page.

32 citations
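
The "coalescing text into paragraphs" step in the abstract above can be illustrated with a minimal sketch. The patent does not say how lines are grouped, so the vertical-gap heuristic here is an assumption, and the names `TextLine`, `coalesce_paragraphs`, and `max_gap_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    top: float     # y coordinate of the line's top edge (assumed unit: points)
    height: float  # line height in the same unit


def coalesce_paragraphs(lines, max_gap_ratio=0.6):
    """Group extracted lines into paragraphs.

    Assumed heuristic (not taken from the patent): a new paragraph starts
    when the vertical gap to the previous line exceeds
    max_gap_ratio times the previous line's height.
    """
    paragraphs = []
    current = []
    prev = None
    for line in sorted(lines, key=lambda item: item.top):
        if prev is not None:
            gap = line.top - (prev.top + prev.height)
            if gap > max_gap_ratio * prev.height:
                paragraphs.append(" ".join(current))
                current = []
        current.append(line.text)
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


lines = [
    TextLine("First line", top=0, height=10),
    TextLine("of a paragraph.", top=12, height=10),
    TextLine("A new paragraph.", top=40, height=10),
]
paragraphs = coalesce_paragraphs(lines)
```

A real system would likely also use indentation, font changes, and column boundaries; the single threshold only shows the general shape of the grouping step.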

Patent
04 Dec 2004
TL;DR: Manifold representations of content are multiple versions of anything that might appear in a document, from text to images to stylistic conventions; a layout engine selects and formats them dynamically to suit a given viewing situation.
Abstract: A user interface for a system and method for improving document layout on arbitrary devices of different resolutions and size using manifold representations of content. Manifold representations of content are: multiple versions of anything that might appear in a document, from text, to images, to even such things as stylistic conventions. The specific content is selected and formatted dynamically, on the fly, by a layout engine in order to best adapt to a given viewing situation.

32 citations
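
As a rough illustration of a layout engine selecting among multiple versions of the same content, the sketch below picks the largest variant that fits a viewport. The selection rule and the names `Variant` and `pick_variant` are assumptions made for illustration; the patent abstract does not specify how the choice is made.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    width: int   # rendered width in pixels (assumed unit)
    height: int  # rendered height in pixels


def pick_variant(variants, viewport_width, viewport_height):
    """Pick the largest variant that fits the viewport.

    Assumed rule: if nothing fits, fall back to the smallest variant.
    """
    fitting = [
        v for v in variants
        if v.width <= viewport_width and v.height <= viewport_height
    ]
    if not fitting:
        return min(variants, key=lambda v: v.width * v.height)
    return max(fitting, key=lambda v: v.width * v.height)


# Three hypothetical versions of the same figure.
figure_versions = [
    Variant("thumbnail", 100, 100),
    Variant("medium", 400, 300),
    Variant("full", 1200, 900),
]
chosen_for_phone = pick_variant(figure_versions, 480, 320)
chosen_for_watch = pick_variant(figure_versions, 50, 50)
```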

Journal Article
TL;DR: A novel three-stage hybrid method for document layout analysis (page segmentation) that combines connected component analysis with a multilevel homogeneity structure; the authors report higher accuracy than other methods.
Abstract: Document layout analysis, or page segmentation, is the task of decomposing document images into different regions such as texts, images, separators, and tables. It is still a challenging problem due to the variety of document layouts. In this paper, we propose a novel hybrid method that deals with this problem in three main stages. In the first stage, text and non-text elements are classified using a minimum homogeneity algorithm, which combines connected component analysis with a multilevel homogeneity structure. In the second stage, a new homogeneity structure is combined with adaptive mathematical morphology on the text document to obtain a set of text regions, while on the non-text document, further classification of non-text elements yields separator regions, table regions, image regions, etc. In the final stage, a region refinement and noise detection process, all regions in both the text document and the non-text document are refined to eliminate noise and obtain the geometric layout of each region. The proposed method has been tested on the dataset of the ICDAR2009 page segmentation competition and on many other databases in different languages. The results of these tests showed that our proposed method achieves higher accuracy than other methods, which demonstrates its effectiveness.

31 citations
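
The first stage described above relies on connected component analysis. The sketch below shows only that ingredient on a tiny binary image, followed by a deliberately simple size threshold standing in for the paper's minimum homogeneity algorithm, which is not reproduced here. The function names and the `max_text_side` parameter are hypothetical.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    regions of 1-pixels in a binary image given as a list of rows."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def split_text_non_text(boxes, max_text_side=3):
    """Assumed stand-in classifier: small components count as text,
    large ones as non-text. Not the paper's homogeneity criterion."""
    text, non_text = [], []
    for box in boxes:
        top, left, bottom, right = box
        longest_side = max(bottom - top + 1, right - left + 1)
        (text if longest_side <= max_text_side else non_text).append(box)
    return text, non_text


page = [
    [1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
boxes = connected_components(page)
text_boxes, non_text_boxes = split_text_non_text(boxes)
```

On real scans, a library routine for component labeling would normally replace the hand-written flood fill, and the classification would use features richer than bounding-box size.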


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9