Topic
Document layout analysis
About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have been published within this topic, receiving 34021 citations.
Papers
30 May 2006
TL;DR: In this article, an original compound document and a modified compound document are analyzed to determine and mark the location of embedded objects, and a comparison is performed between an original primary document and the modified primary document, the output of which is a comparison output document.
Abstract: A method and system for comparing compound documents. An original compound document and a modified compound document are analyzed to determine and mark the location of embedded objects. A comparison is performed between an original primary document and the modified primary document, ignoring the embedded objects, the output of which is a comparison output document. The embedded objects are compared by copying the contents of the embedded objects to compatible documents, comparing the embedded object from the original compound document and the embedded object from the modified compound document, the output of which is inserted into the comparison output document using the location markers of the embedded objects.
31 citations
11 Jun 2006
TL;DR: An HTML Web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-converted-HTML files), shows that segmenting the entire Web page into zones can significantly expedite and increase the accuracy of the subsequent information retrieval steps.
Abstract: We describe an HTML web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-Converted-HTML files). The web page content is modeled by a zone tree structure based primarily on the geometric layout of the web page. For a given journal article, a zone tree is generated by combining DOM tree analysis and the recursive X-Y cut algorithm. Using other visual cues as well, such as background color, font size, and font color, the page is segmented into homogeneous regions. Evaluation is conducted with 104 articles from 11 journals. Out of 9726 ground-truth zones, 9376 zones are correctly segmented, for an accuracy of 96.40%. Segmenting the entire web page into zones can significantly expedite and increase the accuracy of the subsequent information retrieval steps.
31 citations
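The recursive X-Y cut step named in the abstract above can be illustrated with a minimal sketch. It assumes the page has already been rasterized into a binary grid (1 = content, 0 = background), which is a simplification: the paper combines the cut with DOM tree analysis and visual cues, none of which is modeled here. The names `xy_cut`, `_widest_gap`, and `min_gap` are hypothetical and chosen only for this example.

```python
def _widest_gap(profile, min_gap):
    """Return (start, end) of the longest run of zeros in profile, or None."""
    best = None
    start = None
    for i, v in enumerate(profile + [1]):  # sentinel closes a trailing run
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            length = i - start
            if length >= min_gap and (best is None or length > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(grid, box=None, min_gap=1):
    """Split a binary grid into zones; boxes are (top, left, bottom, right),
    with bottom and right exclusive."""
    if box is None:
        box = (0, 0, len(grid), len(grid[0]))
    top, left, bottom, right = box
    # Shrink the box to the bounding rectangle of its content.
    rows = [r for r in range(top, bottom) if any(grid[r][left:right])]
    if not rows:
        return []
    cols = [c for c in range(left, right)
            if any(grid[r][c] for r in range(top, bottom))]
    top, bottom, left, right = rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

    # Try a horizontal cut at the widest band of blank rows.
    row_profile = [sum(grid[r][left:right]) for r in range(top, bottom)]
    gap = _widest_gap(row_profile, min_gap)
    if gap:
        return (xy_cut(grid, (top, left, top + gap[0], right), min_gap)
                + xy_cut(grid, (top + gap[1], left, bottom, right), min_gap))

    # Otherwise try a vertical cut at the widest band of blank columns.
    col_profile = [sum(grid[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    gap = _widest_gap(col_profile, min_gap)
    if gap:
        return (xy_cut(grid, (top, left, bottom, left + gap[0]), min_gap)
                + xy_cut(grid, (top, left + gap[1], bottom, right), min_gap))

    return [box if False else (top, left, bottom, right)]  # no cut: leaf zone


page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
zones = xy_cut(page)
```

On this toy page the sketch first cuts along the blank row, then splits the upper band at the blank column, giving two column zones and one full-width zone. A larger `min_gap` would keep narrow gaps (for example, inter-word spacing) from triggering cuts.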
31 Aug 2005
TL;DR: The proposed method, named selective CRLA, has been successfully applied to extraction of text from commercial magazine pages with complicated layouts and is capable of processing documents with both Manhattan and non-Manhattan layouts.
Abstract: The constrained run-length algorithm (CRLA) is a well-known technique for page segmentation. The algorithm is fast and can be used to partition documents with Manhattan layouts. It is not, however, suited to deal with pages with layouts beyond the Manhattan format, e.g. irregular halftone images embedded in text paragraphs. A modified version of the CRLA, named selective CRLA, is presented in this paper. The selective CRLA is capable of processing documents with both Manhattan and non-Manhattan layouts. The selective CRLA is performed twice with different sets of parameters on a label image derived from the input document image. After both of its executions, the yielded text regions are extracted. The proposed method has been successfully applied to extraction of text from commercial magazine pages with complicated layouts.
31 citations
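For orientation, the classical run-length smoothing that the CRLA builds on can be sketched as follows. This is not the selective CRLA from the paper: it shows only the baseline idea of filling short background runs between content pixels, horizontally and vertically, and combining the two results with a logical AND. The function names and the thresholds `c_h` and `c_v` are assumptions made for this example.

```python
def smooth_run(line, c):
    """Fill runs of 0s no longer than c that lie between two 1s."""
    out = list(line)
    last_ink = None
    for i, v in enumerate(line):
        if v:
            if last_ink is not None and 0 < i - last_ink - 1 <= c:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def run_length_smooth(image, c_h, c_v):
    """Smooth rows with threshold c_h and columns with c_v, then AND them."""
    horiz = [smooth_run(row, c_h) for row in image]
    smoothed_cols = [smooth_run(list(col), c_v) for col in zip(*image)]
    vert = [list(row) for row in zip(*smoothed_cols)]  # transpose back
    return [[h & v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horiz, vert)]


row = smooth_run([1, 0, 0, 1, 0, 0, 0, 1], 2)
blocks = run_length_smooth([[1, 0, 1], [0, 0, 0], [1, 0, 1]], c_h=1, c_v=1)
```

In the first call, the gap of two background pixels is filled and the gap of three is left open. Because the thresholds are fixed for the whole page, this baseline works best on rectangular (Manhattan) layouts, which is the limitation the abstract describes.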
08 Jan 2009
TL;DR: In this article, a combined image and text document is described, where a scanned image of a document can be generated utilizing a scanning application, and text representations of text that is included in the document can also be generated using a character recognition application.
Abstract: A combined image and text document is described. In embodiment(s), a scanned image of a document can be generated utilizing a scanning application, and text representations of text that is included in the document can be generated utilizing a character recognition application. Position data of the text representations can be correlated with locations of corresponding text in the scanned image of the document. The scanned image can then be rendered for display overlaid with the text representations as a transparent overlay, where the scanned image and the text representations are independently user-selectable for display. A user-selectable input can be received to display the text representations without the scanned image, the scanned image without the text representations, or to display the text representations adjacent the scanned image.
30 citations
17 Mar 1999
TL;DR: A layout analysis section analyzes a layout structure of an input image and a layout information memory section stores layout information representing a relationship between the layout structure and a corresponding area in the input image.
Abstract: A document image processing apparatus. A layout analysis section analyzes a layout structure of an input image. A layout information memory section stores layout information representing a relationship between the layout structure and a corresponding area in the input image. An image display section displays the corresponding area in the input image according to the layout information. An indication input section inputs an indication to modify the corresponding area in the input image displayed. A modification section modifies the corresponding area in the input image and the layout structure of the corresponding area in the layout information according to the indication.
30 citations