
Showing papers on "Document retrieval" published in 2003


Book ChapterDOI
20 Oct 2003
TL;DR: A simplistic upper-level ontology is introduced which starts with some basic philosophic distinctions and goes down to the most popular entity types, thus providing many of the inter-domain common sense concepts and allowing easy domain-specific extensions.
Abstract: The Semantic Web realization depends on the availability of a critical mass of metadata for the web content, linked to formal knowledge about the world. This paper presents our vision of a holistic system allowing annotation, indexing, and retrieval of documents with respect to real-world entities. A system (called KIM), partially implementing this concept, is briefly presented and used for evaluation and demonstration. Our understanding is that a system for semantic annotation should be based upon specific knowledge about the world, rather than being indifferent to any ontological commitments and general knowledge. To ensure efficiency and reusability of the metadata, we introduce a simplistic upper-level ontology which starts with some basic philosophic distinctions and goes down to the most popular entity types (people, companies, cities, etc.), thus providing many of the inter-domain common sense concepts and allowing easy domain-specific extensions. Based on the ontology, an extensive knowledge base of entity descriptions is maintained. A semantically enhanced information extraction system providing automatic annotation with references to classes in the ontology and instances in the knowledge base is presented. Based on these annotations, we perform IR-like indexing and retrieval, further extended using the ontology and knowledge about the specific entities.
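The entity-based indexing and retrieval this abstract describes can be pictured as an inverted index keyed by knowledge-base entities as well as plain terms, so a query can be restricted to, say, any instance of class "City". A minimal Python sketch; the entity URIs, class names, and documents are hypothetical, and the real KIM platform builds on GATE and RDF(S) repositories rather than anything this simple:

```python
from collections import defaultdict

class SemanticIndex:
    """Toy entity-aware inverted index: documents are indexed both by
    their terms and by the knowledge-base entities annotated in them."""
    def __init__(self):
        self.term_index = defaultdict(set)    # term -> doc ids
        self.entity_index = defaultdict(set)  # entity URI -> doc ids
        self.instances = defaultdict(set)     # class -> entity URIs

    def add(self, doc_id, terms, annotations):
        for t in terms:
            self.term_index[t.lower()].add(doc_id)
        for entity_uri, cls in annotations:
            self.entity_index[entity_uri].add(doc_id)
            self.instances[cls].add(entity_uri)

    def docs_mentioning_class(self, cls):
        """All documents mentioning any instance of the given class."""
        docs = set()
        for uri in self.instances[cls]:
            docs |= self.entity_index[uri]
        return docs

# Hypothetical annotated documents.
idx = SemanticIndex()
idx.add("d1", ["offices", "opened", "in"], [("kb:London", "City")])
idx.add("d2", ["meeting", "held", "in"], [("kb:Paris", "City")])
idx.add("d3", ["quarterly", "report"], [("kb:AcmeCorp", "Company")])
print(idx.docs_mentioning_class("City"))  # {'d1', 'd2'}
```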

366 citations


Proceedings ArticleDOI
28 Jul 2003
TL;DR: This work presents a quantitative evaluation of various passage retrieval algorithms for question answering, implemented in a framework called Pauchok, yielding three important findings, among them that Boolean querying schemes perform well in the question answering task.
Abstract: Passage retrieval is an important component common to many question answering systems. Because most evaluations of question answering systems focus on end-to-end performance, comparison of common components becomes difficult. To address this shortcoming, we present a quantitative evaluation of various passage retrieval algorithms for question answering, implemented in a framework called Pauchok. We present three important findings: (1) Boolean querying schemes perform well in the question answering task; (2) the performance differences between various passage retrieval algorithms vary with the choice of document retriever, which suggests significant interactions between document retrieval and passage retrieval; (3) the best algorithms in our evaluation employ density-based measures for scoring query terms. Our results reveal future directions for passage retrieval and question answering.
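A rough illustration of the density-based scoring idea behind the third finding: passages score higher when more distinct query terms fall inside a small window. This is an illustrative stand-in, not the exact measure of any algorithm evaluated in Pauchok:

```python
def density_score(passage_tokens, query_terms, window=20):
    """Score a passage by how densely query terms cluster: for each
    fixed-size window, count distinct matched query terms and keep the
    best window. Illustrative only; window size is an assumption."""
    query_terms = {t.lower() for t in query_terms}
    tokens = [t.lower() for t in passage_tokens]
    best = 0
    for start in range(max(1, len(tokens) - window + 1)):
        matched = {t for t in tokens[start:start + window] if t in query_terms}
        best = max(best, len(matched))
    return best / max(1, len(query_terms))

passage = "the first liquid fueled rocket was launched by Robert Goddard in 1926".split()
print(density_score(passage, ["rocket", "Goddard", "1926"]))  # 1.0
```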

345 citations


Book ChapterDOI
20 Oct 2003
TL;DR: The KIM platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases.
Abstract: The KIM platform provides a novel Knowledge and Information Management infrastructure and services for automatic semantic annotation, indexing, and retrieval of documents. It provides a mature infrastructure for scalable and customizable information extraction (IE) as well as annotation and document management, based on GATE. In order to provide a basic level of performance and allow easy bootstrapping of applications, KIM is equipped with an upper-level ontology and a knowledge base providing extensive coverage of entities of general importance. The ontologies and knowledge bases involved are handled using cutting-edge Semantic Web technology and standards, including RDF(S) repositories, ontology middleware, and reasoning. From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases. This paper presents the KIM platform, with emphasis on its architecture, interfaces, tools, and other technical issues.

291 citations


Proceedings ArticleDOI
20 May 2003
TL;DR: This paper proposes a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page and achieves 27% performance improvement on Web Track dataset.
Abstract: In contrast to traditional document retrieval, a web page as a whole is not a good information unit to search because it often contains multiple topics and a lot of irrelevant information from the navigation, decoration, and interaction parts of the page. In this paper, we propose a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page. Compared with a simple DOM-based segmentation method, our page segmentation scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. By using our VIPS algorithm to assist the selection of query expansion terms in pseudo-relevance feedback in web information retrieval, we achieve a 27% performance improvement on the Web Track dataset.
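The feedback step that the 27% figure rests on can be sketched as standard pseudo-relevance feedback run over segmented blocks rather than whole pages. A sketch that assumes segmentation has already produced blocks (the VIPS visual analysis itself is not reproduced); the tf-idf style weighting, stopword list, and sample data are assumptions:

```python
from collections import Counter
import math

def expansion_terms(top_blocks, collection_df, n_docs, k=10,
                    stopwords=frozenset({"the", "a", "of", "and", "in", "to", "by"})):
    """Pick expansion terms from top-ranked page blocks, weighting by
    term frequency in the blocks times inverse document frequency.
    A generic PRF term selector, not the paper's exact weighting."""
    tf = Counter()
    for block in top_blocks:
        tf.update(w.lower() for w in block.split() if w.lower() not in stopwords)
    scored = {t: f * math.log(n_docs / (1 + collection_df.get(t, 0)))
              for t, f in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical top-ranked blocks and document frequencies.
blocks = ["web page segmentation by visual layout",
          "segmentation improves feedback quality"]
df = {"web": 500, "segmentation": 20, "visual": 60, "feedback": 80}
print(expansion_terms(blocks, df, n_docs=1000, k=3))  # e.g. ['segmentation', 'page', 'layout']
```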

265 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: The experimental results demonstrate that using content-based retrieval in hybrid peer-to-peer networks is both more accurate and more efficient for some digital library environments than more common alternatives such as Gnutella 0.6.
Abstract: Hybrid peer-to-peer architectures use special nodes to provide directory services for regions of the network ("regional directory services"). Hybrid peer-to-peer architectures are a potentially powerful model for developing large-scale networks of complex digital libraries, but peer-to-peer networks have so far tended to use very simple methods of resource selection and document retrieval. In this paper, we study the application of content-based resource selection and document retrieval to hybrid peer-to-peer networks. The directory nodes that provide regional directory services construct and use the content models of neighboring nodes to determine how to route query messages through the network. The leaf nodes that provide information use content-based retrieval to decide which documents to retrieve for queries. The experimental results demonstrate that using content-based retrieval in hybrid peer-to-peer networks is both more accurate and more efficient for some digital library environments than more common alternatives such as Gnutella 0.6.
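Query routing at a directory node can be pictured as ranking neighbor content models against the query and forwarding to the best few. A toy sketch with smoothed unigram models; the scoring formula, smoothing constant, and fanout are assumptions, not the paper's actual resource-selection method:

```python
import math
from collections import Counter

def route_query(query, neighbor_models, fanout=2):
    """Rank neighbors by a smoothed query-likelihood score over each
    neighbor's aggregated term counts, and forward to the top `fanout`.
    Illustrative resource selection, not the paper's exact formula."""
    scores = {}
    for node, counts in neighbor_models.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p = (counts.get(term, 0) + 1) / (total + 100)  # add-one style smoothing
            score += math.log(p)
        scores[node] = score
    return sorted(scores, key=scores.get, reverse=True)[:fanout]

# Hypothetical content models held by a directory node.
models = {
    "nodeA": Counter({"digital": 40, "library": 55, "archive": 20}),
    "nodeB": Counter({"music": 70, "audio": 30}),
    "nodeC": Counter({"library": 10, "catalog": 25}),
}
print(route_query("digital library", models))  # ['nodeA', 'nodeC']
```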

220 citations


Proceedings Article
01 Jan 2003
TL;DR: The first year of the TREC Genomics Track featured two tasks, ad hoc retrieval and information extraction, both centered around the Gene Reference into Function (GeneRIF) resource of the National Library of Medicine.
Abstract: The first year of the TREC Genomics Track featured two tasks: ad hoc retrieval and information extraction. Both tasks centered around the Gene Reference into Function (GeneRIF) resource of the National Library of Medicine, which was used both as pseudo-relevance judgments for ad hoc document retrieval and as target text for information extraction. The track attracted 29 groups who participated in one or both tasks.

157 citations


Proceedings Article
01 Jan 2003
TL;DR: NLP needs to be optimized for IR in order to be effective and document retrieval is not an ideal application for NLP, at least given the current state-of-the-art in NLP.
Abstract: Many Natural Language Processing (NLP) techniques have been used in Information Retrieval. The results are not encouraging. Simple methods (stopwording, Porter-style stemming, etc.) usually yield significant improvements, while higher-level processing (chunking, parsing, word sense disambiguation, etc.) yields only very small improvements or even a decrease in accuracy. At the same time, higher-level methods increase the processing and storage cost dramatically, which makes them hard to use on large collections. We review NLP techniques and come to the conclusion that (a) NLP needs to be optimized for IR in order to be effective and (b) document retrieval is not an ideal application for NLP, at least given the current state of the art in NLP. Other IR-related tasks, e.g., question answering and information extraction, seem to be better suited.

156 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: This work proposes a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection, that is effective for query expansion for web retrieval.
Abstract: Hundreds of millions of users each day use web search engines to meet their information needs. Advances in web search effectiveness are therefore perhaps the most significant public outcomes of IR research. Query expansion is one such method for improving the effectiveness of ranked retrieval by adding additional terms to a query. In previous approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We propose a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection. Our scheme is effective for query expansion for web retrieval: our results show relative improvements over unexpanded full text retrieval of 26%--29%, and 18%--20% over an optimised, conventional expansion approach.
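The proposed source of expansion terms is past queries associated with documents, not the documents' own text. A schematic sketch with a hypothetical query-log association map; the simple voting scheme stands in for the paper's term weighting:

```python
from collections import Counter

def expand_from_query_log(initial_ranking, doc_to_past_queries, original_query, k=3):
    """Collect terms from past user queries associated with top-ranked
    documents, and append the most frequent new ones to the query.
    Schematic log-based expansion, not the paper's exact weighting."""
    original = set(original_query.lower().split())
    votes = Counter()
    for doc_id in initial_ranking[:10]:
        for past_q in doc_to_past_queries.get(doc_id, []):
            for term in past_q.lower().split():
                if term not in original:
                    votes[term] += 1
    return original_query.split() + [t for t, _ in votes.most_common(k)]

# Hypothetical mapping from documents to past queries that led to them.
log = {"d1": ["cheap flights europe", "budget airline tickets"],
       "d2": ["discount air travel", "budget flights"]}
print(expand_from_query_log(["d1", "d2"], log, "flights"))
# ['flights', 'budget', 'cheap', 'europe']
```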

152 citations


Proceedings ArticleDOI
07 Jul 2003
TL;DR: This paper describes and evaluates a theoretically motivated method for removing unwanted meanings directly from the original query in vector models, with the same vector negation operator as used in quantum logic.
Abstract: Standard IR systems can process queries such as "web NOT internet", enabling users who are interested in arachnids to avoid documents about computing. The documents retrieved for such a query should be irrelevant to the negated query term. Most systems implement this by reprocessing results after retrieval to remove documents containing the unwanted string of letters. This paper describes and evaluates a theoretically motivated method for removing unwanted meanings directly from the original query in vector models, with the same vector negation operator as used in quantum logic. Irrelevance in vector spaces is modelled using orthogonality, so query vectors are made orthogonal to the negated term or terms. As well as removing unwanted terms, this form of vector negation reduces the occurrence of synonyms and neighbours of the negated terms by as much as 76% compared with standard Boolean methods. By altering the query vector itself, vector negation removes not only unwanted strings but unwanted meanings.
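The construction is a plain orthogonal projection: the query vector is replaced by its component orthogonal to the negated term's vector. A small numpy sketch of "a NOT b"; the toy 3-d vectors are hypothetical, whereas real term vectors would be learned from a corpus:

```python
import numpy as np

def negate(query_vec, negated_vec):
    """Return the component of the query vector orthogonal to the
    negated term's vector, so documents similar to the negated meaning
    score near zero: q' = q - ((q.b)/(b.b)) b."""
    q = np.asarray(query_vec, dtype=float)
    b = np.asarray(negated_vec, dtype=float)
    return q - (q @ b) / (b @ b) * b

web = np.array([1.0, 1.0, 0.0])
internet = np.array([0.0, 1.0, 0.0])
q = negate(web, internet)   # "web NOT internet"
print(q)                    # [1. 0. 0.]
print(float(q @ internet))  # 0.0: orthogonal to the negated term
```

For several negated terms, the same projection would be applied against the subspace they span.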

145 citations


01 Jan 2003
TL;DR: This paper summarizes research in document layout analysis carried out over the last few years in the author's laboratory, which has developed a number of novel geometric algorithms and statistical methods that are applicable to a wide variety of languages and layouts.
Abstract: In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in the conversion of captured documents into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spurious features in a page image. The system itself consists of reusable and independent software modules that can be reconfigured and adapted to different languages and applications. Currently, we are using them for electronic book and document capture applications. If there is commercial or government demand, we are interested in adapting these tools to information retrieval and intelligence applications.

114 citations


Journal ArticleDOI
01 Apr 2003
TL;DR: This work demonstrates how the World Wide Web can be mined in a fully automated manner for discovering the semantic similarity relationships among the concepts surfaced during an electronic brainstorming session, thus improving the accuracy of automatically clustering meeting messages.
Abstract: This work demonstrates how the World Wide Web can be mined in a fully automated manner for discovering the semantic similarity relationships among the concepts surfaced during an electronic brainstorming session, thus improving the accuracy of automatically clustering meeting messages. Our novel Context Sensitive Similarity Discovery (CSSD) method takes advantage of the meeting context when selecting a subset of Web pages for data mining, and then conducts regular concept co-occurrence analysis within that subset. Our results have implications on reducing information overload in applications of text technologies such as email filtering, document retrieval, text summarization, and knowledge management.
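The co-occurrence analysis step can be pictured as estimating concept similarity from how often concepts appear on the same pages in the context-selected subset. A sketch using Jaccard overlap; the similarity measure and sample data are assumptions, and the context-driven page selection is taken as already done:

```python
from itertools import combinations

def cooccurrence_similarity(pages, concepts):
    """Estimate pairwise concept similarity by co-occurrence within a
    context-selected subset of Web pages: Jaccard overlap of the page
    sets each concept appears in. A generic stand-in for CSSD's
    co-occurrence analysis step."""
    appears = {c: {i for i, p in enumerate(pages) if c in p.lower()}
               for c in concepts}
    sims = {}
    for a, b in combinations(concepts, 2):
        union = appears[a] | appears[b]
        sims[(a, b)] = len(appears[a] & appears[b]) / len(union) if union else 0.0
    return sims

pages = ["remote work and telecommuting policy",
         "telecommuting tools for remote teams",
         "office catering options"]
print(cooccurrence_similarity(pages, ["remote", "telecommuting", "catering"]))
# ('remote', 'telecommuting') -> 1.0; pairs with 'catering' -> 0.0
```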

11 Nov 2003
TL;DR: This thesis investigates the usefulness of different standard and novel document retrieval approaches in the context of question answering and compares them with respect to their ability to identify documents containing a correct answer.
Abstract: Information is one of the most valuable goods in modern society. With the rise of computers, storing huge amounts of data has become efficient and inexpensive. Although we are now in a position where we have unprecedented amounts of information at our fingertips, the question arises how to access these large amounts of data to find the information one is interested in. The issue of developing methods and tools for automatically finding relevant information is addressed by the research area of information retrieval, and, over the last decades, sophisticated document retrieval systems have been developed. One particular branch of information retrieval is question answering. Question answering systems enable users to pose full natural language questions, as opposed to the keyword-based queries commonly used in document retrieval. In recent years, question answering has witnessed a renaissance, which is mainly due to the availability of large corpora. Current question answering systems depend strongly on document retrieval as a means for identifying documents that are likely to contain an answer to a given question. This thesis investigates the usefulness of different standard and novel document retrieval approaches in the context of question answering. More specifically, it compares them with respect to their ability to identify documents containing a correct answer. In addition, we also investigate to what extent the quality of a particular document retrieval approach has an impact on the overall performance of a specific question answering system.

Proceedings ArticleDOI
13 Oct 2003
TL;DR: A re-ranking method to improve Web image retrieval by reordering the images retrieved from an image search engine based on a relevance model, which is a probabilistic model that evaluates the relevance of the HTML document linking to the image, and assigns a probability of relevance.
Abstract: Web image retrieval is a challenging task that requires efforts from image processing, link structure analysis, and Web text retrieval. Since content-based image retrieval is still considered very difficult, most current large-scale Web image search engines exploit text and link structure to "understand" the content of the Web images. However, local text information, such as captions, filenames, and adjacent text, is not always reliable and informative. Therefore, global information should be taken into account when a Web image retrieval system makes relevance judgments. We propose a re-ranking method to improve Web image retrieval by reordering the images retrieved from an image search engine. The re-ranking process is based on a relevance model, which is a probabilistic model that evaluates the relevance of the HTML document linking to the image and assigns a probability of relevance. The experimental results show that the re-ranked image retrieval achieved better performance than the original Web image retrieval, suggesting the effectiveness of the re-ranking method. The relevance model is learned from the Internet without preparing any training data and is independent of the underlying algorithm of the image search engines. The re-ranking process should be applicable to any image search engine with little effort.
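The re-ranking loop itself is simple: score the HTML document that links each image under a relevance model, then reorder. A sketch with a stand-in word-probability model; the paper's model is learned automatically from the Internet and is certainly richer than this:

```python
import math

def rerank(results, relevance_model):
    """Reorder image search results by the relevance score of the HTML
    document each image is embedded in. `relevance_model` maps a term
    to P(term | relevant); a simplistic stand-in for the learned model."""
    def doc_score(doc_text):
        terms = doc_text.lower().split()
        score = sum(math.log(relevance_model.get(t, 1e-6)) for t in terms)
        return score / max(1, len(terms))  # length-normalised
    return sorted(results, key=lambda r: doc_score(r["html_text"]), reverse=True)

# Hypothetical relevance model and search results for the query "tiger".
model = {"tiger": 0.05, "wildlife": 0.03, "stripes": 0.02, "car": 1e-4}
results = [
    {"url": "img1.jpg", "html_text": "tiger wildlife stripes"},
    {"url": "img2.jpg", "html_text": "tiger car dealership"},
]
print([r["url"] for r in rerank(results, model)])  # ['img1.jpg', 'img2.jpg']
```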

Proceedings ArticleDOI
15 Dec 2003
TL;DR: This work presents an effective and efficient approach for word image matching by using gradient-based binary features that has much higher retrieval accuracy and is 893 times faster than Dynamic Time Warping with profile-based shape features.
Abstract: Existing word image retrieval algorithms suffer from either low retrieval precision or high computational complexity. We present an effective and efficient approach for word image matching by using gradient-based binary features. Experiments over a large database of handwritten word images show that the proposed approach consistently outperforms the best existing handwritten word image retrieval algorithm, Dynamic Time Warping (DTW) with profile-based shape features. Not only does the proposed approach have much higher retrieval accuracy, but it is also 893 times faster than DTW.
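The speed gain over DTW comes from reducing each word image to a fixed-length binary vector, so matching becomes a Hamming distance instead of an alignment. A schematic numpy sketch; the grid size and thresholding rule are assumptions, not the paper's actual feature extraction:

```python
import numpy as np

def binary_gradient_features(img, grid=(4, 8)):
    """Reduce a grayscale word image to a fixed-length binary vector:
    threshold the mean gradient magnitude per grid cell. A schematic
    stand-in for the paper's gradient-based binary features."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    h, w = mag.shape
    cells = np.array([
        mag[i*h//grid[0]:(i+1)*h//grid[0], j*w//grid[1]:(j+1)*w//grid[1]].mean()
        for i in range(grid[0]) for j in range(grid[1])
    ])
    return (cells > cells.mean()).astype(np.uint8)

def hamming(a, b):
    # Cheap bit comparison; cost depends only on feature length,
    # not on image size or alignment as in DTW.
    return int(np.sum(a != b))

rng = np.random.default_rng(0)
img1, img2 = rng.random((32, 96)), rng.random((32, 96))
f1, f2 = binary_gradient_features(img1), binary_gradient_features(img2)
print(len(f1), hamming(f1, f2))
```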

Journal ArticleDOI
TL;DR: A document discovery tool based on Conceptual Clustering by Formal Concept Analysis that allows users to navigate e-mail using a visual lattice metaphor rather than a tree to aid knowledge discovery in document collections.
Abstract: This paper discusses a document discovery tool based on Conceptual Clustering by Formal Concept Analysis. The program allows users to navigate e-mail using a visual lattice metaphor rather than a tree. It implements a virtual file structure over e-mail where files and entire directories can appear in multiple positions. The content and shape of the lattice formed by the conceptual ontology can assist in e-mail discovery. The system described provides more flexibility in retrieving stored e-mails than what is normally available in e-mail clients. The paper discusses how conceptual ontologies can leverage traditional document retrieval systems and aid knowledge discovery in document collections.
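The lattice nodes are formal concepts: a maximal set of e-mails together with the attributes they all share, obtained by applying the derivation operator twice. A minimal sketch over a toy e-mail/keyword context (the mails and keywords are hypothetical):

```python
def derive_objects(attrs, context):
    """Objects having all the given attributes."""
    return {o for o, a in context.items() if attrs <= a}

def derive_attrs(objs, context):
    """Attributes shared by all the given objects."""
    if not objs:
        return set().union(*context.values())
    return set.intersection(*(context[o] for o in objs))

def concept_of(attrs, context):
    """Close an attribute set into a formal concept (extent, intent)."""
    extent = derive_objects(set(attrs), context)
    intent = derive_attrs(extent, context)
    return extent, intent

# Toy context: e-mails and the keywords/labels attached to them.
mail = {
    "m1": {"project", "budget"},
    "m2": {"project", "meeting"},
    "m3": {"project", "budget", "meeting"},
}
print(concept_of({"budget"}, mail))
# extent {'m1', 'm3'}, intent {'project', 'budget'}
```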

Patent
11 Aug 2003
TL;DR: In this paper, a document retrieval system capable of obtaining information requested by the user with a high degree of accuracy is presented, in which a query input section receives the user's query and a keyword extraction section extracts keywords.
Abstract: A document retrieval system capable of obtaining information requested by the user with a high degree of accuracy. In this system, the query input section 102 receives query input by the user. The keyword extraction section 104 analyzes the input query and extracts keywords. The keyword type assignment section 106 decides the type of each extracted keyword and assigns a keyword type. The question type decision section 108 decides the question type. The keyword classification section 110 classifies the keywords to which the keyword types are assigned into a major type and minor type with reference to the keyword classification rules stored in the keyword classification rule storage section 112. The document retrieval section 114 searches a document collection stored in the document storage section 116 using the classified keyword groups and obtains the document of the retrieved result.

Journal ArticleDOI
TL;DR: Preliminary experimental results with document images captured from students’ theses show that the proposed approach to retrieving documents from CCITT Group 4 compressed document images achieves promising performance.

Journal ArticleDOI
TL;DR: This work improves the current Web information retrieval approach by raising the efficiency of information retrieval, enhancing the precision and mobility of information services, and enabling intelligent information services.

Proceedings ArticleDOI
15 Dec 2003
TL;DR: This paper presents a lightweight but robust approach to combining topic and polarity, thus enabling content access systems to select content based on a certain opinion about a certain topic.
Abstract: Retrieving documents by subject matter is the general goal of information retrieval and other content access systems. There are aspects of textual content, however, which form equally valid selection criteria. One such aspect is that of sentiment or polarity - indicating the author's opinion or emotional relationship with some topic. Recent work in this area has treated polarity effectively as a discrete aspect of text. In this paper we present a lightweight but robust approach to combining topic and polarity, thus enabling content access systems to select content based on a certain opinion about a certain topic.
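Combining topic and polarity can be pictured as multiplying a topic-relevance score by an opinion-match score. A toy sketch; the word-list polarity scoring and multiplicative combination are assumptions, not the paper's models:

```python
def opinionated_score(doc, topic_terms, desired_polarity,
                      pos=frozenset({"good", "great", "excellent"}),
                      neg=frozenset({"bad", "poor", "terrible"})):
    """Combine a topic-relevance score with a polarity score so that
    documents matching the topic AND the desired opinion rank highest.
    Word-list polarity and multiplicative combination are assumptions."""
    words = doc.lower().split()
    topic_set = {t.lower() for t in topic_terms}
    topic = sum(w in topic_set for w in words) / len(words)
    polarity = sum(w in pos for w in words) - sum(w in neg for w in words)
    polarity = max(0.0, polarity * desired_polarity)  # +1 wants praise, -1 wants criticism
    return topic * polarity

docs = ["the camera has excellent lens quality", "the camera lens is terrible"]
for d in docs:
    print(d, "->", opinionated_score(d, ["camera", "lens"], desired_polarity=+1))
# first doc scores > 0, second scores 0 for desired_polarity=+1
```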

Proceedings Article
01 Jan 2003
TL;DR: MIT CSAIL’s entry in the TREC Question Answering track focused on integrating Web-based techniques with more traditional strategies based on document retrieval and named-entity detection, and identified this class of techniques as the knowledge mining approach to question answering.
Abstract: MIT CSAIL’s entry in this year’s TREC Question Answering track focused on integrating Web-based techniques with more traditional strategies based on document retrieval and named-entity detection. We believe that achieving high performance in the question answering task requires a combination of multiple strategies designed to capitalize on different characteristics of various resources. The system we deployed for the TREC evaluation last year relied exclusively on the World Wide Web to answer factoid questions (Lin et al., 2002). The advantages that the Web offers are well known and have been exploited by previous systems (Brill et al., 2001; Clarke et al., 2001; Dumais et al., 2002). The immense amount of freely available unstructured text provides data redundancy, which can be leveraged with simple pattern matching techniques involving the expected answer formulations. In many ways, we can utilize huge quantities of data to overcome many thorny problems in natural language processing such as lexical ambiguity and paraphrases. Furthermore, Web search engines such as Google provide a convenient front-end for accessing and filtering enormous amounts of Web data. We have identified this class of techniques as the knowledge mining approach to question answering (Lin and Katz, 2003). In addition to viewing the Web as a repository of unstructured documents, we can also take advantage of structured and semistructured sources available on the Web using knowledge annotation techniques (Katz, 1997; Lin and Katz, 2003). Through empirical analysis of real world natural language questions, we have noticed that large classes of commonly occurring queries can be parameterized and captured using a simple object–property–value data model (Katz et al., 2002). Furthermore, such a data model is easy to impose on Web resources through a framework of wrapper scripts. These techniques allow our system to view the Web as if it were a “virtual database” and use knowledge contained therein to answer user questions. While the Web is undeniably a useful resource for question answering, it is not without drawbacks. Useful knowledge on the Web is often drowned out by the sheer amount of irrelevant material, and statistical techniques are often insufficient to separate right answers from wrong ones. Overcoming these obstacles will require addressing many outstanding issues in computational linguistics: anaphora resolution, paraphrase normalization, temporal reference calculation, and lexical disambiguation, just to name a few. Furthermore, the setup of the TREC evaluations necessitates an extra step in the question answering process for systems that extract answers from external sources, typically known as answer projection. For every Web-derived answer, a system must find a supporting document from the AQUAINT corpus, even if the corpus was not used in the answer extraction process. This year’s main task included definition and list questions in addition to factoid questions. Although Web-based techniques have proven effective in handling factoid questions, they are less applicable to tackling definition and list questions. The data-driven approach implicitly assumes that each natural language question has a unique answer. Since a single answer instance is sufficient, algorithms were designed to trade recall for precision. For list and definition questions, however, a more balanced approach is required, since multiple answers are not only desired, but necessary.
We believe that the best strategy is to integrate Web-based approaches with more traditional question answering techniques driven by document retrieval and named-entity detection. Corpus- and Web-based strategies should play complementary roles in an overall question answering framework.

Journal ArticleDOI
TL;DR: The results indicate that the use of local link information improves precision by 74%, while global link information improves it by 35%; however, when only the first 10 documents in the ranking are considered, the average gain in precision obtained with global link information is higher than the gain obtained with local link information.
Abstract: Information derived from the cross-references among the documents in a hyperlinked environment, usually referred to as link information, is considered important since it can be used to effectively improve document retrieval. Depending on the retrieval strategy, link information can be local or global. Local link information is derived from the set of documents returned as answers to the current user query. Global link information is derived from all the documents in the collection. In this work, we investigate how the use of local link information compares to the use of global link information. For the comparison, we run a series of experiments using a large document collection extracted from the Web. For our reference collection, the results indicate that the use of local link information improves precision by 74%. When global link information is used, precision improves by 35%. However, when only the first 10 documents in the ranking are considered, the average gain in precision obtained with the use of global link information is higher than the gain obtained with the use of local link information. This is an interesting result since it provides insight and justification for the use of global link information in major Web search engines, where users are mostly interested in the first 10 answers. Furthermore, global information can be computed in the background, which allows speeding up query processing.
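The local/global contrast can be reduced to which links are allowed to vote: only links originating inside the current answer set (local), or links from anywhere in the collection (global). A sketch of that distinction; the paper's actual link evidence and its combination with content scores are richer than a raw in-link count:

```python
def inlink_scores(result_docs, links, local=True):
    """Count in-links for each result document. With local=True, only
    links originating inside the result set count; otherwise links from
    the whole collection count. A sketch of the local/global contrast,
    not the paper's full ranking formula."""
    results = set(result_docs)
    scores = {d: 0 for d in result_docs}
    for src, dst in links:
        if dst in scores and (not local or src in results):
            scores[dst] += 1
    return scores

# Hypothetical link graph; x9 is outside the answer set.
links = [("d1", "d2"), ("d3", "d2"), ("x9", "d2"), ("x9", "d1")]
answer_set = ["d1", "d2", "d3"]
print(inlink_scores(answer_set, links, local=True))   # {'d1': 0, 'd2': 2, 'd3': 0}
print(inlink_scores(answer_set, links, local=False))  # {'d1': 1, 'd2': 3, 'd3': 0}
```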

Patent
09 Jun 2003
TL;DR: In this paper, a document retrieval method and system for separately performing a process for correcting erroneously recognized characters existing in characteristic character strings within a seed document or the documents to be registered and tolerating inaccuracies existing in the documents targeted for retrieval is described.
Abstract: Disclosed are a document retrieval method and system for separately performing a process for correcting erroneously recognized characters existing in characteristic character strings within a seed document or the documents to be registered and a process for tolerating erroneously recognized characters existing in the documents targeted for retrieval. The process for correcting erroneously recognized characters existing in characteristic character strings extracts characteristic character strings from a read document, replaces the extracted characteristic character strings containing erroneously recognized characters with character strings appropriate for document retrieval, and selects characteristic character strings for use in actual document retrieval.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: Experiments using the TREC data show that incorporating user query history, as context information, consistently improves the retrieval performance in both average precision and precision at 20 documents.
Abstract: In this poster, we incorporate user query history, as context information, to improve the retrieval performance in interactive retrieval. Experiments using the TREC data show that incorporating such context information indeed consistently improves the retrieval performance in both average precision and precision at 20 documents.
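One common way to realize this is to interpolate the current query's term distribution with that of earlier queries in the session. A generic sketch; the mixing weight and the exact model are assumptions, not necessarily the poster's formulation:

```python
from collections import Counter

def contextual_query_model(current_query, history, alpha=0.8):
    """Mix the current query's term distribution with the session's
    past queries: p(t) = alpha*p(t|current) + (1-alpha)*p(t|history).
    Generic interpolation sketch; alpha is an assumed parameter."""
    def dist(text):
        c = Counter(text.lower().split())
        n = sum(c.values())
        return {t: f / n for t, f in c.items()}
    cur = dist(current_query)
    hist = dist(" ".join(history)) if history else {}
    terms = set(cur) | set(hist)
    return {t: alpha * cur.get(t, 0.0) + (1 - alpha) * hist.get(t, 0.0)
            for t in terms}

print(contextual_query_model("jaguar speed", ["jaguar habitat", "big cats range"]))
```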

Patent
06 Jan 2003
TL;DR: In this paper, a method and apparatus for document management including assigning labels to gaps in document production is presented, where the gaps may correspond to events that caused an individual to begin and/or cease production/modification of documents.
Abstract: A method and apparatus for document management including assigning labels to gaps in document production. The gaps may correspond to events that caused an individual to begin and/or cease production/modification of documents. Such events can be, for example, a vacation, a business trip, a meeting, etc. The labels can be used for document retrieval purposes. In one embodiment predetermined events, either predefined or user-defined, are used to automatically label gaps in document production. In one embodiment, the invention provides links to documents that correspond to gaps in document production where the links can be used for document retrieval purposes.

Proceedings ArticleDOI
03 Aug 2003
TL;DR: This work proposes a string-matching-based method for word-spotting in on-line documents that achieves a precision of 92.3% at a recall rate of 90% on a database of 6,672 words written by 10 different writers.
Abstract: Recent advances in on-line data capturing technologies and their widespread deployment in devices like PDAs and notebook PCs are creating large amounts of handwritten data that need to be archived and retrieved efficiently. Word-spotting, which is based on a direct comparison of a handwritten keyword to words in the document, is commonly used for indexing and retrieval. We propose a string-matching-based method for word-spotting in on-line documents. The retrieval algorithm achieves a precision of 92.3% at a recall rate of 90% on a database of 6,672 words written by 10 different writers. Indexing experiments show an accuracy of 87.5% using a database of 3,872 on-line words.

Journal ArticleDOI
TL;DR: A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes, which tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.
Abstract: Web-documents have a number of tags indicating the structure of texts. Text segments marked by HTML tags have specific meaning which can be utilized to improve the performance of document retrieval systems. In this paper, we present a machine learning approach to mine the structure of HTML documents for effective Web-document retrieval. A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. Experimental evidence supports that the tag weights are well trained by the proposed algorithm in accordance with the importance factors for retrieval, and indicates that the proposed approach significantly improves the performance in retrieval accuracy. In particular, the use of the document-structure mining approach tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.
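The learning loop is a standard genetic algorithm over a vector of tag weights, with retrieval quality as fitness. A compact sketch; the fitness function below is a self-contained stub, whereas the paper evaluates re-ranked retrieval on real collections:

```python
import random

TAGS = ["title", "h1", "b", "a", "p"]

def fitness(weights):
    """Stub: in the paper this would be retrieval accuracy after
    re-ranking documents with tag-weighted term scores. Here we fake a
    target vector to keep the sketch self-contained."""
    target = [0.9, 0.7, 0.4, 0.3, 0.1]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def evolve(pop_size=30, generations=50, mut=0.1):
    pop = [[random.random() for _ in TAGS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]               # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TAGS))    # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mut:               # point mutation
                child[random.randrange(len(TAGS))] = random.random()
            children.append(child)
        pop = parents + children
    return dict(zip(TAGS, max(pop, key=fitness)))

random.seed(1)
print(evolve())  # learned importance factor per HTML tag
```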

Proceedings ArticleDOI
28 Jul 2003
TL;DR: The process of producing a test collection for patent retrieval, the NTCIR-3 Patent Retrieval Collection, is described, which includes two years of Japanese patent applications and 31 topics produced by professional patent searchers, and experimental results obtained are reported.
Abstract: Reflecting the rapid growth in the utilization of large test collections for information retrieval since the 1990s, extensive comparative experiments have been performed to explore the effectiveness of various retrieval models. However, most collections were intended for retrieving newspaper articles and technical abstracts. In this paper, we describe the process of producing a test collection for patent retrieval, the NTCIR-3 Patent Retrieval Collection, which includes two years of Japanese patent applications and 31 topics produced by professional patent searchers. We also report experimental results obtained by using this collection to re-examine the effectiveness of existing retrieval models in the context of patent retrieval. The relative superiority among existing retrieval models did not significantly differ depending on the document genre, that is, patents and newspaper articles. Issues related to patent retrieval are also discussed.

Book ChapterDOI
21 Aug 2003
TL;DR: Monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection show that document translation-based retrieval is slightly better than query translation-based retrieval on this collection.
Abstract: This paper describes monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection. The paper compares query translation-based multilingual retrieval with document translation-based multilingual retrieval where the documents are translated into the query language by translating the document words individually using machine translation systems or statistical translation lexicons derived from parallel texts. The multilingual retrieval results show that document translation-based retrieval is slightly better than the query translation-based retrieval on the CLEF 2003 test collection. Furthermore, combining query translation and document translation in multilingual retrieval achieves even better performance.
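Combining the two translation directions can be as simple as fusing the two ranked runs. A generic score-interpolation sketch; the normalization and weight are assumptions, not the paper's combination method:

```python
def combine_runs(run_a, run_b, weight=0.5):
    """Merge two retrieval runs (doc -> score) by min-max normalisation
    and linear interpolation. A generic fusion sketch."""
    def normalise(run):
        lo, hi = min(run.values()), max(run.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 1.0
                for d, s in run.items()}
    a, b = normalise(run_a), normalise(run_b)
    docs = set(a) | set(b)
    return sorted(docs,
                  key=lambda d: weight * a.get(d, 0.0) + (1 - weight) * b.get(d, 0.0),
                  reverse=True)

# Hypothetical scores from the two retrieval strategies.
query_translation_run = {"d1": 12.0, "d2": 9.5, "d3": 7.0}
document_translation_run = {"d2": 3.1, "d3": 2.9, "d4": 2.2}
print(combine_runs(query_translation_run, document_translation_run))
# ['d2', 'd1', 'd3', 'd4']
```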

Proceedings ArticleDOI
05 Mar 2003
TL;DR: This work briefly reviews the major variations of the language model approach and how they have been used to develop a range of retrieval-related language technologies, including cross-lingual IR and distributed search.
Abstract: One of the major challenges in the field of information retrieval (IR) is to specify a formal framework that both describes the important processes involved in finding relevant information, and successfully predicts which techniques will provide good effectiveness in terms of accuracy. A recent approach that has shown considerable promise uses generative models of text (language models) to describe the IR processes. We briefly review the major variations of the language model approach and how they have been used to develop a range of retrieval-related language technologies, including cross-lingual IR and distributed search. We also discuss how this approach could be used with structured data extracted from text.
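The canonical instance of the approach is query-likelihood ranking: score a document by the probability of the query under a smoothed document language model. A standard sketch with Dirichlet smoothing, one of many variations the tutorial covers:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000):
    """Rank score = sum over query terms of log p(t | document model),
    with Dirichlet-prior smoothing against the collection model."""
    doc_tf = Counter(doc.split())
    col_tf = Counter(collection.split())
    dlen, clen = sum(doc_tf.values()), sum(col_tf.values())
    score = 0.0
    for t in query.split():
        p_col = col_tf.get(t, 0) / clen or 1e-9  # background probability
        p = (doc_tf.get(t, 0) + mu * p_col) / (dlen + mu)
        score += math.log(p)
    return score

docs = ["language models for retrieval", "distributed search over peers"]
collection = " ".join(docs)
ranked = sorted(docs, key=lambda d: query_likelihood("language retrieval", d, collection),
                reverse=True)
print(ranked[0])  # "language models for retrieval"
```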

Patent
21 Mar 2003
TL;DR: Systems and methods for determining the topic structure of a document use a Probabilistic Latent Semantic Analysis (PLSA) model and select segmentation points based on similarity values between pairs of adjacent text blocks, forming a framework for both text segmentation and topic identification.
Abstract: Systems and methods for determining the topic structure of a document including text utilize a Probabilistic Latent Semantic Analysis (PLSA) model and select segmentation points based on similarity values between pairs of adjacent text blocks. PLSA forms a framework for both text segmentation and topic identification. The use of PLSA provides an improved representation for the sparse information in a text block, such as a sentence or a sequence of sentences. Topic characterization of each text segment is derived from PLSA parameters that relate words to "topics", latent variables in the PLSA model, and "topics" to text segments. A system executing the method exhibits significant performance improvement. Once determined, the topic structure of a document may be employed for document retrieval and/or document summarization.
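Once the PLSA parameters are available, each block has a topic distribution p(z | block), and boundaries go where adjacent blocks' topic vectors are dissimilar. A sketch of that selection step with precomputed toy topic vectors; training the PLSA model is omitted, and the threshold rule is an assumption about the patent's selection criterion:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segmentation_points(topic_vectors, threshold=0.5):
    """Place a boundary between adjacent text blocks whose PLSA topic
    distributions are dissimilar. The fixed threshold is an assumed
    stand-in for the patent's selection rule."""
    sims = [cosine(topic_vectors[i], topic_vectors[i + 1])
            for i in range(len(topic_vectors) - 1)]
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# p(z | block) for 5 blocks over 3 latent topics: topic shift after block 2.
blocks = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
          [0.1, 0.8, 0.1], [0.0, 0.9, 0.1], [0.1, 0.7, 0.2]]
print(segmentation_points(blocks))  # [2]
```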