Book Chapter

Document Classification and Information Extraction

01 Jan 1996-pp 97-145
TL;DR: In TEXPROS, the task of document classification is to determine the type of an office document: the document classification subsystem identifies the document's corresponding frame template.
Abstract: In Chapters 4 and 5, we turn our attention to the techniques used for document classification and information extraction [60, 61, 62, 174, 175]. In TEXPROS, the task of document classification is to determine the types of office documents. That is, given an office document, the document classification subsystem identifies the document's corresponding frame template. Once the defined type of a document is identified, efficient storage and access methods can be implemented to improve retrieval performance. The task of information extraction is to extract from the contents of the document the information most relevant to the user. That is, given an office document, the information extraction subsystem forms its frame instance by instantiating its corresponding frame template. Both document classification and information extraction can be achieved with the aid of document structure analysis.
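The two steps described in the abstract, identifying a document's frame template and then instantiating it as a frame instance, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not TEXPROS's actual method: the names `FrameTemplate`, `classify`, and `instantiate` are hypothetical, classification is reduced to simple cue-word overlap, and extraction is reduced to reading "Name: value" lines.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class FrameTemplate:
    """Hypothetical frame template: a document type plus its attribute names."""
    doc_type: str
    attributes: List[str]
    cue_words: Set[str]  # assumed keyword cues; not taken from the source


def classify(text: str, templates: List[FrameTemplate]) -> FrameTemplate:
    """Return the template whose cue words overlap most with the document's words."""
    words = set(text.lower().split())
    return max(templates, key=lambda t: len(t.cue_words & words))


def instantiate(text: str, template: FrameTemplate) -> Dict[str, Optional[str]]:
    """Form a frame instance by filling template attributes from 'Name: value' lines."""
    instance: Dict[str, Optional[str]] = {attr: None for attr in template.attributes}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in instance:
            instance[key] = value.strip()
    return instance


if __name__ == "__main__":
    templates = [
        FrameTemplate("memo", ["to", "from", "subject"], {"memo"}),
        FrameTemplate("letter", ["sender", "recipient"], {"dear", "sincerely"}),
    ]
    document = "MEMO\nTo: Alice\nFrom: Bob\nSubject: Budget"
    template = classify(document, templates)
    print(template.doc_type, instantiate(document, template))
```

Attributes that the document does not supply stay `None` in the instance, which keeps the instance's shape identical to its template.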
Citations
Proceedings Article
08 Dec 1997
TL;DR: A three-level conceptual architecture is presented as the reference model of HyTEXPROS, a hypermedia information retrieval system with hypertext functionalities that describes the whole information base as a network of nodes connected with links, including the metadata and the original documents.
Abstract: TEXPROS is an intelligent document processing and retrieval system, which supports storing, extracting, classifying, categorizing, retrieving and browsing information and documents. We extend TEXPROS to a hypermedia information retrieval system, called HyTEXPROS, with hypertext functionalities. HyTEXPROS describes the whole information base, including the metadata and the original documents, as a network of nodes connected with links. Through hypertext functionalities, a user can dynamically construct an information path by browsing through pieces of the information base. A three-level conceptual architecture is presented as the reference model of HyTEXPROS. A detailed description of HyTEXPROS using first-order logic is also proposed.

2 citations

Journal Article
TL;DR: An approach to creating the representatives of documents and queries is described as a basis for the proposed ranking model, and an open retrieval environment is created, which can be customized for different application domains.
Abstract: In this paper, a ranking-based document retrieval model is proposed for integration with the browsing process. In TEXPROS (TEXt PROcessing System), the interactive browsing process is designed to let the system and a user interact to form a strategy for retrieving documents and information from the document base. By gathering information, users can reformulate queries dynamically. During browsing sessions, a predicate-augmented infrastructure (called the Operation Network) is used to present information about the document types, the synopses of the documents and where documents are deposited. Using concept-based retrieval to search for requested documents with partial descriptions can return a large volume of documents. The ranking model is used to rank the returned documents according to their degree of relevance to the query. Based on TEXPROS's dual models, an approach to creating the representatives of documents and queries is described as a basis for the proposed ranking model. By integrating the ranking model and the browsing system as a whole, an open retrieval environment is created, which can be customized for different application domains.
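The general idea of ranking returned documents by their relevance to a query can be illustrated with a short Python sketch. The abstract does not specify the ranking function or how representatives are built from the dual models, so this sketch swaps in a generic stand-in: each representative is a bag of words, and relevance is cosine similarity. The names `representative`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def representative(text: str) -> Counter:
    """Assumed representative: a term-frequency bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query: str, documents: List[str]) -> List[Tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least relevant."""
    q = representative(query)
    scored = [(cosine(q, representative(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    returned = [
        "budget memo for the finance office",
        "holiday party invitation",
        "budget budget report",
    ]
    for score, doc in rank("budget report", returned):
        print(f"{score:.3f}  {doc}")
```

Sorting on the score alone keeps documents with equal scores in their original order, since Python's sort is stable.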