Showing papers on "Document layout analysis" published in 2003

Proceedings Article

Document structure analysis algorithms: a literature survey

Song Mao¹, Azriel Rosenfeld¹, Tapas Kanungo²

University of Maryland, College Park¹, IBM²

13 Jan 2003

TL;DR: This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches.

Abstract: Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
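
The ordered tree mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the survey: the `Region` class, the region labels, and the `reading_order` helper are hypothetical names chosen for illustration. Containment is modeled by nesting and order by the position of each child in its parent's list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A page component; children are kept in reading order (hypothetical model)."""
    label: str
    children: List["Region"] = field(default_factory=list)


def reading_order(node: Region) -> List[str]:
    """Return the labels of the leaf regions in left-to-right tree order."""
    if not node.children:
        return [node.label]
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(reading_order(child))
    return leaves


# Illustrative page: a title, a two-column body, and a footer.
page = Region("page", [
    Region("title"),
    Region("body", [Region("column-1"), Region("column-2")]),
    Region("footer"),
])

print(reading_order(page))
```

Walking the leaves in order recovers a reading sequence, while the internal nodes record which blocks contain which.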

278 citations

High Performance Document Layout Analysis

Thomas M. Breuel

01 Jan 2003

TL;DR: This paper summarizes research in document layout analysis carried out over the last few years in the author's laboratory, which has developed a number of novel geometric algorithms and statistical methods that are applicable to a wide variety of languages and layouts.

Abstract: In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in document capture conversions into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spurious features in a page image. The system itself consists of reusable and independent software modules that can be reconfigured to be adapted to different languages and applications. Currently, we are using them for electronic book and document capture applications. If there is commercial or government demand, we are interested in adapting these tools to information retrieval and intelligence applications.

114 citations

Patent

Electronic document modification

Alexander K. Schowtka, Eliza H. Royal, Daniel R. Malone, Robert L. Dulaney

30 May 2003

TL;DR: A number of possible document layouts and possible document designs are stored, and a document is based on the combination of one of the stored layouts and one of the stored designs. While the user is viewing the electronic document, controls allow the user to view and select among alternate layouts and designs for that document.

Abstract: Electronic document design methods and computer programs allowing a user to separately control and modify the layout and the design components of an electronic document. A number of possible document layouts and possible document designs are stored. A document is based on the combination of one of the stored layouts and one of the stored designs. While the user is viewing the electronic document, controls are available to the user allowing the user to view and select among alternate layouts and designs for that document. Color schemes and font schemes for the document may also be separately controlled.

96 citations

Patent

Document structure identifier

David Slocombe

20 May 2003

TL;DR: A method of automated document structure identification based on visual cues is disclosed; it can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.

Abstract: A method of automated document structure identification based on visual cues is disclosed herein. The two dimensional layout of the document is analyzed to discern visual cues related to the structure of the document, and the text of the document is tokenized so that similarly structured elements are treated similarly. The method can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.

91 citations

Proceedings Article

Discerning structure from freeform handwritten notes

Michael Shilman¹, Zile Wei¹, Sashi Raghupathy¹, Patrice Y. Simard¹, D. Jones¹

Microsoft¹

03 Aug 2003

TL;DR: This paper presents an integrated approach to parsing textual structure in freeform handwritten notes that solves the layout analysis and classification problems simultaneously: the problems are so tightly coupled that it is not possible to solve one without the other for real user notes.

Abstract: This paper presents an integrated approach to parsing textual structure in freeform handwritten notes. Text-graphics classification and text layout analysis are classical problems in printed document analysis, but the irregularity in handwriting and content in freeform notes reveals limitations in existing approaches. We advocate an integrated technique that solves the layout analysis and classification problems simultaneously: the problems are so tightly coupled that it is not possible to solve one without the other for real user notes. We tune and evaluate our approach on a large corpus of unscripted user files and reflect on the difficult recognition scenarios that we have encountered in practice.

87 citations

Journal Article

Document retrieval from compressed images

Yue Lu¹, Chew Lim Tan¹

National University of Singapore¹

01 Apr 2003 - Pattern Recognition

TL;DR: Preliminary experimental results with document images captured from students’ theses show that the proposed approach for retrieving documents from CCITT Group 4 compressed document images achieves promising performance.

85 citations

Patent

Apparatus, method, and computer program product for document manipulation which embeds information in document data

Hiroyuki Sayuda¹, Norio Yamamoto¹

Fuji Xerox¹

13 Mar 2003

TL;DR: A method for document manipulation embeds additional information in document data in which the layout and position of an element have been defined. It generates rendered image data by rendering the region of the document where the additional information is to be embedded, embeds the additional information in a part of the rendered image data, and merges an image of that part with a predetermined region of the original document data.

Abstract: A method for document manipulation which embeds additional information in document data in which the layout and position of an element have been defined comprises a process of generating rendered image data by rendering a region where additional information is to be embedded in the document, a process of embedding additional information in a part of the rendered image data, and a process of merging an image of the part of the rendered image data in which the additional information is embedded with a predetermined region in the original document data.

59 citations

Patent

Method, system, and apparatus for generating structured document files

Jinhong Katherine Guo¹, Yue Ma

Panasonic¹

20 Aug 2003

TL;DR: A method, system, apparatus, and graphical user interface (GUI) for generating structured document files from a document image are disclosed; the files are generated by segmenting the document image into one or more zones containing respective text images.

Abstract: A method, system, apparatus, and graphical user interface (GUI) for generating structured document files from a document image is disclosed. Structured document files are generated by segmenting the document image into one or more zones containing respective text images, converting the respective text images to digital text, automatically identifying layout information for each of the one or more zones, labeling each of the one or more zones in accordance with a schema, and automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.

55 citations

Proceedings Article

Script-based classification of hand-written text documents in a multilingual environment

V. Singhal¹, N. Navin, D. Ghosh

Indian Institute of Technology Guwahati¹

10 Mar 2003

TL;DR: This paper proposes to preprocess the input document images so as to compensate for the variations due to writing style, thereby making them suitable for analysis on the basis of their visual appearances, and applies denoising, thinning, pruning, m-connectivity and text size normalization in sequence.

Abstract: Script-based text document classification is an important field of research in the context of multilingual textual document processing. However, the script identification techniques available in the literature so far do not consider handwritten documents. Variations in the writing style, character size, inter-line and inter-word spacings, etc. make the recognition process difficult and unreliable when these script identification algorithms, more specifically visual appearance based approaches, are applied directly to hand-written documents. Therefore, in this paper, we propose to preprocess the input document images so as to compensate for the variations due to writing style, thereby making them suitable for analysis on the basis of their visual appearances. Accordingly, we apply denoising, thinning, pruning, m-connectivity and text size normalization in sequence. Multi-channel Gabor filtering is used to extract texture features that characterize the visual appearances of the document images. Experimental results demonstrate the potential of our proposed method of script identification for hand-written text document classification.

51 citations

Proceedings Article

Arabic newspaper page segmentation

Karim Hadjar¹, Rolf Ingold¹

University of Fribourg¹

03 Aug 2003

TL;DR: The performance of segmentation algorithms and their adaptation in order to treat complex structured Arabic documents such as newspapers, as well as promising experimental results are described.

Abstract: The aim of layout analysis is to extract the geometric structure from a document image. It consists of labeling homogenous regions of a document image. This paper describes the performance of segmentation algorithms and their adaptation in order to treat complex structured Arabic documents such as newspapers. Experimental tests have been carried out on four different phases of newspaper image analysis: thread recognition, frame recognition, image text separation, text line recognition, and line merging into blocks. Some promising experimental results are reported.

47 citations

Patent

Semantics-based indexing in a distributed data processing system

Brandon Brockway¹, Tiffany Durham¹, Cheryl Malatras¹, Gregory Roberts¹

IBM¹

05 Jun 2003

TL;DR: Information is indexed in a distributed data processing system by providing document structure templates comprising model document structures and semantics for those structures, identifying the structure of a document, selecting a template in dependence upon the structure of the document and the model document structures in the templates, and storing search keywords from the document in records in a semantics-based search index according to the semantics from the selected template.

Abstract: Indexing information in a distributed data processing system, including providing document structure templates comprising model document structures and semantics for the model document structures; identifying the structure of a document; selecting a document structure template in dependence upon the structure of the document and the model document structures in the document structure templates; and storing search keywords from the document in records in a semantics-based search index according to the semantics from the selected document structure template. Selecting a document structure template in dependence upon the structure of the document and the model document structures in the document structure templates typically further comprises comparing the structure of the document and the model document structures in the templates; and selecting a template whose model document structure matches the structure of the document.

Proceedings Article

Automated detection and segmentation of table of contents page from document images

Sekhar Mandal, Shyama Prosad Chowdhury, Amit Kumar Das, Bhabatosh Chanda

03 Aug 2003

TL;DR: This work presents fully automatic identification and segmentation of a table of contents (TOC) page from a scanned document, to help develop a digital document library.

Abstract: With the aim of extracting the structural information from the table of contents (TOC) to help develop a digital document library, the need to identify and segment the TOC page is obvious. The objective of creating a digital document library is to provide a non-labour-intensive, cheap and flexible way of storing, representing and managing paper documents in electronic form to facilitate indexing, viewing, printing and extracting the intended portions. Information from the TOC pages is to be extracted for use in a document database for effective retrieval of the required pages. We present fully automatic identification and segmentation of a table of contents (TOC) page from a scanned document.

Proceedings Article

Automated segmentation of math-zones from document images

Shyama Prosad Chowdhury, Sekhar Mandal, Amit Kumar Das, Bhabatosh Chanda

03 Aug 2003

TL;DR: This paper presents fully automatic segmentation of displayed-math zones from the document image, using only the spatial layout information of math-formulas and equations, so as to help commercial OCR systems, which cannot discern math-zones, and also to support the identification and arrangement of math symbols by others.

Abstract: With the aim of high-level understanding of the mathematical contents in a document image, the need for a math-zone extraction and recognition technique is obvious. In this paper we present fully automatic segmentation of displayed-math zones from the document image, using only the spatial layout information of math-formulas and equations, so as to help commercial OCR systems, which cannot discern math-zones, and also to support the identification and arrangement of math symbols by others.

Proceedings Article

An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis

T.M. Breuel

03 Aug 2003

TL;DR: An algorithm is presented that finds globally maximal whitespace rectangles on page images at arbitrary orientations and eliminates the need for page rotation correction prior to background analysis and can be applied to considerably more complex page layouts than previously possible.

Abstract: The analysis of the background structure (whitespace) of page images has become an important technique for physical document layout analysis. Globally maximal whitespace rectangles have been previously demonstrated to constitute a concise representation of the major layout features of documents. However, previous methods for computing maximal whitespace rectangles were limited to axis-aligned rectangles. This paper presents an algorithm that finds globally maximal whitespace rectangles on page images at arbitrary orientations. The new algorithm eliminates the need for page rotation correction prior to background analysis and can be applied to considerably more complex page layouts than previously possible. The algorithm is resolution independent and takes as input a list of foreground shapes (e.g., character or word bounding boxes or polygons) and a set of parameter ranges; it outputs the N largest non-overlapping maximal whitespace rectangles whose parameters (location, width, height, orientation) fall within the required parameter ranges. Examples of applications of the method to severely skewed documents, as well as the UW3 database, are presented.
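
To make the notion of a globally maximal whitespace rectangle concrete, here is a minimal sketch of the simpler axis-aligned case only. It is not the arbitrary-orientation algorithm of this paper. It assumes a best-first branch-and-bound search over sub-rectangles, with obstacles given as `(x0, y0, x1, y1)` bounding boxes; the function names and the pivot rule (the obstacle nearest the rectangle centre) are illustrative assumptions.

```python
import heapq


def area(r):
    """Area of rectangle r = (x0, y0, x1, y1); zero if degenerate."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def overlaps(r, o):
    """True if the interiors of rectangles r and o intersect."""
    return o[0] < r[2] and o[2] > r[0] and o[1] < r[3] and o[3] > r[1]


def largest_whitespace(bounds, obstacles):
    """Best-first search for the largest axis-aligned obstacle-free rectangle."""
    start = [o for o in obstacles if overlaps(bounds, o)]
    # The area of a candidate is an upper bound on any empty rectangle inside it.
    heap = [(-area(bounds), bounds, start)]
    while heap:
        _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # largest remaining candidate is already empty
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        pivot = min(obs, key=lambda o: ((o[0] + o[2]) / 2 - cx) ** 2
                    + ((o[1] + o[3]) / 2 - cy) ** 2)
        x0, y0, x1, y1 = r
        # Any empty rectangle must lie entirely to one side of the pivot.
        for sub in ((x0, y0, pivot[0], y1), (pivot[2], y0, x1, y1),
                    (x0, y0, x1, pivot[1]), (x0, pivot[3], x1, y1)):
            if area(sub) > 0:
                kept = [o for o in obs if overlaps(sub, o)]
                heapq.heappush(heap, (-area(sub), sub, kept))
    return None


# A vertical bar of "text" splits a 10x10 page into a narrow and a wide gap.
print(largest_whitespace((0, 0, 10, 10), [(3, 0, 5, 10)]))
```

Because candidates are expanded in order of decreasing area, the first obstacle-free candidate popped from the queue is a globally largest one. Handling arbitrary orientations, as the paper does, requires searching over an orientation parameter as well, which this sketch does not attempt.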

Proceedings Article

Conversion of PDF documents into HTML: a case study of document image analysis

Fuad Rahman, Hassan Alam

09 Nov 2003

TL;DR: This paper discusses how image-processing techniques can be used to perform document layout analysis of complex multiple-column PDF documents, which allows the conversion of these documents into the HTML format keeping the logical and physical layout intact.

Abstract: Portable document format (PDF) has become the de facto standard in many fields because of its independence of local formatting restrictions and its accurate reproducibility. On the other hand, HTML documents are becoming an integral part of our lives by being the dominant form for information exchange within the World Wide Web environment. This paper discusses how image-processing techniques can be used to perform document layout analysis of complex multiple-column PDF documents. This analysis allows the conversion of these documents into the HTML format keeping the logical and physical layout intact.

Patent

Document information display system and method, and document search method

Osamu Imaichi¹, Tetsuo Nishikawa¹, Toru Hisamitsu¹, Makoto Iwayama¹, Masakazu Fujio¹

Hitachi¹

27 Feb 2003

TL;DR: Two document units are extracted from a document database, relevance degrees between individual elements of a group of the document units are calculated, and the results are displayed on a two-dimensional coordinate plane depending on the relevance degree.

Abstract: The present invention visualizes the contents of a plurality of documents without a lack of the listing property. Two document units are extracted from a document database and relevance degrees between individual elements of a group of the document units are calculated. The results are displayed on a two-dimensional coordinate plane depending on the relevance degree.

Proceedings Article

Correcting the document layout: a machine learning approach

Donato Malerba¹, Floriana Esposito¹, O. Altamura¹, Michelangelo Ceci¹, Margherita Berardi¹

University of Bari¹

03 Aug 2003

TL;DR: A machine learning approach is proposed to support the user during the correction of the layout analysis: the user corrects the results of the global analysis, and rules for layout correction are then learned from the sequence of user actions.

Abstract: In this paper, a machine learning approach to support the user during the correction of the layout analysis is proposed. Layout analysis is the process of extracting a hierarchical structure describing the layout of a page. In our approach, the layout analysis is performed in two steps: firstly, the global analysis determines possible areas containing paragraphs, sections, columns, figures and tables, and secondly, the local analysis groups together blocks that possibly fall within the same area. The result of the local analysis process strongly depends on the quality of the results of the first step. We investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the sequence of user actions. Experimental results on a set of multi-page documents are reported and commented.

Patent

Presentation data-generating device, presentation data-generating system, data-management device, presentation data-generating method and machine-readable storage medium

Atsuko Yagi¹

Ricoh¹

25 Mar 2003

TL;DR: A presentation data-generating device acquires document data sets containing contents to be presented, transformation data sets defining transformation rules, and layout data, and transforms the document data sets based on the layout data and the transformation data sets to generate a unified presentation data set.

Abstract: A presentation data-generating device includes a first part for acquiring document data sets, each containing contents to be presented, a second part for acquiring transformation data sets, each defining a transformation rule between the document data sets and a presentation data set, and a third part for acquiring layout data containing layout information that defines a layout of the document data sets. The device further includes a fourth part for transforming the document data sets based on the layout data and the transformation data sets to generate a unified presentation data set that presents contents of the document data sets based on the layout defined in the layout data.

Proceedings Article

SmartNails: Display and image dependent thumbnails

Kathrin Berkner¹, Edward L. Schwartz¹, Christophe Marle²

Ricoh¹, Microsoft²

19 Dec 2003

TL;DR: In this article, a novel image representation of compound document images, called SmartNails, is presented to overcome poor readability of text and recognizability of image features in low resolution thumbnails.

Abstract: In order to overcome poor readability of text and recognizability of image features in low resolution thumbnails, a novel image representation of compound document images - a SmartNail representation - is presented. SmartNails are replacements or supplements to traditional thumbnails for compound documents and contain cropped and scaled image and text segments. Image- and text-based analysis are merged to generate a layout for a particular display size with selected readable text and recognizable image regions. The analysis is efficiently performed by using information from document layout analysis and JPEG 2000 compressed file headers.

Journal Article

Text Retrieval from Document Images Based on Word Shape Analysis

Chew Lim Tan¹, Weihua Huang¹, Sam Yuan Sung¹, Zhaohui Yu¹, Yi Xu¹

National University of Singapore¹

01 May 2003 - Applied Intelligence

TL;DR: This paper proposes a method of text retrieval from document images using a similarity measure based on word shape analysis that directly extracts image features instead of using optical character recognition.

Abstract: In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We directly extract image features instead of using optical character recognition. Document images are segmented into word units and then features called vertical bar patterns are extracted from these word units through local extrema points detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images using this method was compared with the result for the ASCII versions of those documents based on the N-gram algorithm for text documents.
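
The last step of this pipeline, comparing documents by the scalar product of their vectors, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each document is assumed to be already reduced to a list of pattern tokens (the strings below are made-up placeholders for vertical bar patterns), and the vectors are normalized to unit length, which the abstract does not specify.

```python
import math
from collections import Counter


def document_vector(patterns):
    """Map pattern tokens to a unit-length sparse vector of pattern counts."""
    counts = Counter(patterns)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {p: c / norm for p, c in counts.items()}


def similarity(vec_a, vec_b):
    """Scalar product of two sparse vectors stored as dicts."""
    return sum(w * vec_b.get(p, 0.0) for p, w in vec_a.items())


# Hypothetical pattern tokens extracted from the word images of three pages.
doc1 = document_vector(["|.|", "||.", "|.|", ".||"])
doc2 = document_vector(["|.|", "||.", "|.|", ".||"])
doc3 = document_vector(["...", "|||"])

print(similarity(doc1, doc2), similarity(doc1, doc3))
```

With unit-length vectors the scalar product is the cosine of the angle between the documents, so identical pattern distributions score close to 1 and documents with no shared patterns score 0.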

Patent

Electronic equipment, server, and presentation method of layout script text

Naomasa Takahashi

21 Jul 2003

TL;DR: Electronic equipment is provided that can display input data from an external interface in a display region defined by a layout script text and that enables a user to easily select the layout to be adopted.

Abstract: Electronic equipment is provided that can display input data from an external interface in a display region defined by a layout script text and that enables a user to easily select the layout to be adopted. The electronic equipment, such as a television set, takes in a layout script text from outside through, for example, a network and stores it. The layout script text defines at least media information and a display region for that media information, which is input through various interfaces such as a video input terminal, broadcast reception, or a reader of a detachably mountable storage medium. The user selects an arbitrary layout script text from among the stored layout script texts. The electronic equipment then reproduces and displays media information in accordance with the selected layout script text.

Skew estimation of binary document images using static and dynamic thresholds useful for document image mosaicing.

Palaiahnakote Shivakumara, G. Hemantha Kumar, Devanur S. Guru, P. Nagabhushan

01 Jan 2003

TL;DR: This paper presents a computationally efficient procedure for skew detection in digitized text documents which is based on Linear Regression Analysis and provides good and accurate results.

Abstract: This paper presents a computationally efficient procedure for skew detection in digitized text documents which is based on Linear Regression Analysis. The determination of the skew angle in text documents is essential in Optical Character Recognition (OCR) systems and Document Image Mosaicing (DIM). We use the Linear Regression formula to estimate a skew angle for each text line segment of the given skewed text document. The part of the text line is extracted using static and dynamic thresholds of a projection-profile-based method. The proposed method is tested on a variety of text documents and it provides good and accurate results.
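
The regression step can be sketched as below. This is a minimal illustration, not the paper's procedure: it assumes the text line segment has already been extracted and reduced to a list of `(x, y)` points (for example, character centroids), fits an ordinary least-squares line, and takes the arctangent of its slope as the skew angle. The function name and the sample points are hypothetical.

```python
import math


def skew_angle_degrees(points):
    """Estimate the skew of one text line from (x, y) points by least squares."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points are vertically aligned; slope is undefined")
    slope = sxy / sxx  # least-squares slope of y on x
    return math.degrees(math.atan(slope))


# Synthetic centroids lying on a line skewed by 5 degrees.
line = [(x, x * math.tan(math.radians(5.0))) for x in range(0, 200, 20)]
print(round(skew_angle_degrees(line), 3))
```

Estimating one angle per line segment, as the abstract describes, would mean calling this function once per extracted segment and then combining the per-line angles; how they are combined is not stated in the abstract.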

Patent

System for processing structured document

Kazuyoshi Tanaka, 一義田中

18 Nov 2003

TL;DR: In this paper, the authors propose a system for processing a structured document that can reduce labor for programming, in a software program for handling structured documents, by eliminating data conversion processing for processing of document contents and data extraction processing for selection of necessary data.

Abstract: PROBLEM TO BE SOLVED: To provide a system for processing a structured document that can reduce labor for programming, in a software program for handling a structured document, by eliminating data conversion processing for processing of document contents and data extraction processing for selection of necessary data. SOLUTION: A structured document holds a document structure definition represented by declarations of document elements forming document contents and by a set of semantic relations defined between the document elements, and a set of instances of the respective document elements matching the document structure definition. The system for processing a structured document, which comprises data reading means for reading the instances of document elements from the structured document and editing them into data processible by a software program to provide them, comprises basic data structure selecting means for selecting and specifying a data structure of the data provided by the data reading means, from known basic data structures or object structures, such as array, set, list, tree, graph and table structures.

Journal Article

A non-contact method of capturing low-resolution text for OCR

Majid Mirmehdi¹, Paul Clark¹, J. Lam¹

University of Bristol¹

22 Apr 2003 - Pattern Analysis and Applications

TL;DR: A novel automatic text reading system is introduced using an active camera focused on text regions already located in the scene (using the authors' recent work), and a number of images are captured over the text region to construct a high-resolution mosaic composite of the whole region.

Abstract: Document recognition is a lively research area with much effort concentrated on optical character recognition. Less attention is paid to locating and extracting text from the general (non-desktop, non-scanner) environment. Such contact-free extraction of text from a general scene has applications in the context of wearable computing, robotic vision, point and click document capture, or as an aid for visually handicapped people. Here, a novel automatic text reading system is introduced using an active camera focused on text regions already located in the scene (using our recent work). Initially, a located region of text is analysed to determine the optimal zoom that would foveate onto it. Then a number of images are captured over the text region to construct a high-resolution mosaic composite of the whole region. This magnified image of the text is suitable for reading by humans or for recognition by OCR, or even for text-to-speech synthesis. Although we employed a low resolution camera, we still obtained very good results.

Patent

Document search/browse method and document search/browse system

Takeshi Eisaki, Katsumi Marukawa, Takeuchi Sayaka, 勝美丸川, 健永崎, 沙弥香竹内

29 Oct 2003

TL;DR: In this article, the authors present a method that enables a search and browse of document image groups through the application of a document structure analysis technique and a character recognition technique as searching/browsing means for paper documents and document images.

Abstract: PROBLEM TO BE SOLVED: To provide a method that enables a search and browse of a document image group through the application of a document structure analysis technique and a character recognition technique as searching/browsing means for paper documents and document images. SOLUTION: A highly functional document image search/browse system separates an OCR and a document processing apparatus, adopts as OCR output formats data (reading hypothesis data) holding multiple hypotheses of character line extraction, character segmentation and character recognition, and document structure data having ruled line information, frame information, character line information, browse attribute information and the like about a document image, and provides a function of important keyword extraction and document search from typed and handwritten character strings using OCR-added data, and of document display intended by a browser using the document structure data.

Patent

Apparatus and program for designing form layout

Masaaki Kimura, Nobuyuki Miyagaki, Okamoto Mitsuru, Akira Tanaka, Makoto Wada, 和田真, 宮垣信幸, 岡本充, 木村正明, 田中明

27 Mar 2003

TL;DR: Layout information for printing ruled lines or symbols on forms is designed so that it can be handled independently of the writing data, which is obtained by converting the handwritten entries on the form into electronic positional information.

Abstract: PROBLEM TO BE SOLVED: To provide a system capable of original layout design and small-lot printing of forms that are used when converting handwritten input into electronic data. SOLUTION: The layout information for printing ruled lines or symbols on the forms is designed so that it can be treated independently of the writing data, which is obtained by converting the information filled in on the surface of the form by handwriting into electronic positional information. The system is provided with a layout information receiving means for receiving the input of the layout information, which is created so as to be treatable independently of the writing data and is to be set on the form, and an identifier giving means for giving an identifier to each piece of layout information received by the layout information receiving means. COPYRIGHT: (C)2004,JPO

Proceedings Article•DOI•

A system for the automatic layout segmentation and classification of digital documents

Luigi Cinque, Stefano Levialdi, Alessio Malizia

17 Sep 2003

TL;DR: A new system for document recognition is presented that follows open source methodologies and uses an XML description for document segmentation and classification, which turns out to be beneficial in terms of classification precision and general-purpose availability.

Abstract: Paper document recognition is fundamental for office automation, becoming every day a more powerful tool in those fields where information is still on paper. Document recognition follows from data acquisition, from both journals and entire books, in order to transform them into digital objects. We present a new system for document recognition that follows open source methodologies and uses an XML description for document segmentation and classification, which turns out to be beneficial in terms of classification precision and general-purpose availability.

Journal Article•DOI•

Managing very large document collections using semantics

Guoren Wang¹, Hongjun Lu², Ge Yu¹, Yubin Bao¹•Institutions (2)

Northeastern University (China)¹, Hong Kong University of Science and Technology²

01 May 2003-Journal of Computer Science and Technology

TL;DR: A semantic document management system, XBASE, is designed and implemented based on document semantics and the functions of three main modules: X-Loader, X-Explorer and X-Query.

Abstract: In this paper, a system is presented where documents are no longer identified by their file names. Instead, a document is represented by its semantics in terms of a descriptor and a content vector. The descriptor of a document consists of a set of attributes, such as its date of creation, type, size, annotations, etc. The content vector of a document consists of a set of terms extracted from the document. A semantic document management system, XBASE, is designed and implemented based on these semantics and the functions of three main modules: X-Loader, X-Explorer and X-Query.
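
The descriptor/content-vector representation described in this abstract can be illustrated with a minimal Python sketch. It is not XBASE's implementation: the names `SemanticDocument`, `build_document` and `matches`, the lowercase alphanumeric tokenization, and the use of raw term counts are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SemanticDocument:
    """Hypothetical record: a document identified by semantics, not file name."""
    descriptor: dict         # attributes such as creation date, type, size, annotations
    content_vector: Counter  # term -> number of occurrences (assumed weighting)


def build_document(text, **attributes):
    """Extract terms from the text and attach the given attributes."""
    terms = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenization
    return SemanticDocument(descriptor=dict(attributes),
                            content_vector=Counter(terms))


def matches(doc, required_attrs, required_terms):
    """True if the document has all given attribute values and contains all terms."""
    attrs_ok = all(doc.descriptor.get(k) == v for k, v in required_attrs.items())
    terms_ok = all(doc.content_vector[t] > 0 for t in required_terms)
    return attrs_ok and terms_ok
```

A query such as `matches(doc, {"type": "report"}, ["layout"])` then selects documents by attribute values and extracted terms, with no reference to a file name.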

Proceedings Article•DOI•

Graphics extraction in PDF document

Hui Chao¹•Institutions (1)

Hewlett-Packard¹

20 Jan 2003

TL;DR: This paper presents a bottom-up approach to recognizing graphic illustrations in PDF documents and shows how this technique can be used in automatic figure extraction, document re-flow and document transformation.

Abstract: PDF is a document format for final presentation. It preserves the original document layout but often not the document's logical structure. Graphic illustrations such as figures and tables in PDF often consist of ungrouped graphic primitives such as lines, curves and small text elements. In this paper, we present a bottom-up approach to recognizing graphic illustrations in PDF documents. The vicinity of page elements, both in 2D space and in their layer indexes, is used to understand the logical connections between elements. Graphics recognition and element grouping for illustrations are an important part of understanding the document's logical structure. This technique can be used in automatic figure extraction, document re-flow and document transformation.
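
The idea of grouping primitives that are close both on the page and in layer order can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the `Primitive` record, the bounding-box gap measure, the thresholds `max_gap` and `max_index_gap`, and the use of union-find to merge neighbours are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    """Hypothetical page element: bounding box plus its index in the layer order."""
    x0: float
    y0: float
    x1: float
    y1: float
    layer_index: int


def box_gap(a, b):
    """Largest axis-wise gap between two boxes (0 if they touch or overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def group_primitives(prims, max_gap=5.0, max_index_gap=3):
    """Merge primitives that are near in 2D space and near in layer index."""
    parent = list(range(len(prims)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(prims)):
        for j in range(i + 1, len(prims)):
            close_2d = box_gap(prims[i], prims[j]) <= max_gap
            close_layer = abs(prims[i].layer_index - prims[j].layer_index) <= max_index_gap
            if close_2d and close_layer:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(prims)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Requiring both conditions keeps an element that happens to lie near a figure, but was drawn much earlier or later in the page content, out of that figure's group. The pairwise loop is quadratic, which is acceptable for a sketch but would need a spatial index on dense pages.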

Patent•

Document retrieval system, document retrieval method, and storage medium

Eiichiro Toshima¹, 英一朗戸島•Institutions (1)

Canon Inc.¹

30 Apr 2003

TL;DR: Text feature information based on the text in each document and image feature information based on each document image are held for a plurality of documents, so that the original of an electronic document can be retrieved with practical response performance without placing a special burden on the user.

Abstract: PROBLEM TO BE SOLVED: To retrieve an original of an electronic document with practical response performance without imposing a special burden on a user. SOLUTION: Text feature information based on the text included in each document and image feature information based on each document image are held for each of a plurality of documents. Character recognition processing is executed on the image data of the retrieved document to acquire the text feature information based on the obtained text, and the image feature information (layout information) is acquired based on the image data of the retrieved document (S92-S94). A memory is searched using the text feature information and the image feature information acquired for the retrieved document, to retrieve the document corresponding to the retrieved document out of the plurality of documents (S98-S101). COPYRIGHT: (C)2005,JPO&NCIPI
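
Retrieval that combines text features with layout features, as this abstract outlines, might look roughly like the sketch below. The patent abstract does not say how the features are computed or compared, so everything here is an assumption: text features are modelled as sets of recognized words compared by Jaccard overlap, layout features as equal-length vectors of values in [0, 1], and the two scores are mixed by a hypothetical `text_weight`.

```python
def jaccard(a, b):
    """Overlap of two term collections, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def layout_similarity(u, v):
    """1 minus the mean absolute difference of two equal-length vectors in [0, 1]."""
    if not u or len(u) != len(v):
        raise ValueError("layout vectors must be non-empty and of equal length")
    return 1.0 - sum(abs(x - y) for x, y in zip(u, v)) / len(u)


def retrieve(query_terms, query_layout, index, text_weight=0.5):
    """Return (doc_id, score) of the stored document most similar to the query.

    `index` maps a document id to a (terms, layout_vector) pair.
    """
    best_id, best_score = None, -1.0
    for doc_id, (terms, layout) in index.items():
        score = (text_weight * jaccard(query_terms, terms)
                 + (1.0 - text_weight) * layout_similarity(query_layout, layout))
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score
```

Using both feature types means that a noisy OCR result can still be matched through layout, and two documents with similar layouts can still be told apart by their text.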
