Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.


Papers
Proceedings Article
01 Jan 1994
TL;DR: This work uses a vector processing model for documents and queries, with N-gram frequencies rather than the more traditional term frequencies as the basis for the vector element values. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal.
Abstract: N-gram based representations for documents have several distinct advantages for various document processing tasks. First, they provide a more robust representation in the face of grammatical and typographical errors in the documents. Secondly, N-gram representations require no linguistic preparations such as word-stemming or stopword removal. Thus they are ideal in situations requiring multi-language operations. Vector processing retrieval models also have some unique advantages for information retrieval tasks. In particular, they provide a simple, uniform representation for documents and queries, and an intuitively appealing document similarity measure. Also, modern vector space models have good retrieval performance characteristics. In this work, we combine these two ideas by using a vector processing model for documents and queries, but using N-gram frequencies as the basis for the vector element values instead of more traditional term frequencies. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal. We also have begun testing the system on Spanish language documents.

133 citations
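
The N-gram vector model described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' system: the choice of character trigrams, raw counts as vector elements, cosine similarity as the document similarity measure, and the function names `ngram_counts`, `cosine`, and `rank` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, n=3):
    """Count overlapping character n-grams (assumed: lowercase, collapsed whitespace)."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, n=3):
    """Return (score, document index) pairs, best match first."""
    q = ngram_counts(query, n)
    scored = [(cosine(q, ngram_counts(d, n)), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)


docs = ["document retrieval with n-grams", "cooking pasta at home"]
# The query contains a typo ("retreival") and no stemming is applied.
ranking = rank("retreival of documents", docs)
```

Because the vectors are built from overlapping character sequences, the misspelled query still shares several trigrams with the first document, which is the robustness to typographical errors that the abstract mentions.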

Report
06 Mar 2000
TL;DR: Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time.
Abstract: In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can efficiently categorize and retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limit its applicability. In this paper we present a fast dimensionality reduction algorithm, called concept indexing (CI), that is equally effective for unsupervised and supervised dimensionality reduction. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time. Moreover, when CI is used to compute the dimensionality reduction in a supervised setting, it greatly improves the performance of traditional classification algorithms such as C4.5 and kNN.

133 citations
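
The concept indexing idea in the abstract above (group the documents into k clusters, then use the cluster centroids as the axes of the reduced space) can be sketched in a few lines of Python. This is a rough illustration, not the authors' algorithm: the cluster assignments are taken as given (they could come from any clustering method, or from class labels in the supervised case), and both the unit-length normalization of centroids and the dot-product projection are assumptions. The names `unit`, `centroid_axes`, and `project` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def centroid_axes(docs, assignments, k):
    """Average the document vectors in each of the k groups and normalize."""
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for vec, c in zip(docs, assignments):
        counts[c] += 1
        for j, x in enumerate(vec):
            sums[c][j] += x
    return [unit([s / max(counts[c], 1) for s in sums[c]]) for c in range(k)]


def project(vec, axes):
    """Map a document vector to k dimensions (assumed: dot product with each axis)."""
    return [sum(x * a for x, a in zip(vec, axis)) for axis in axes]


# Four toy term-frequency vectors over a four-term vocabulary, in two groups.
docs = [[2, 0, 0, 0], [1, 1, 0, 0], [0, 0, 3, 1], [0, 0, 1, 1]]
assignments = [0, 0, 1, 1]
axes = centroid_axes(docs, assignments, k=2)
reduced = [project(d, axes) for d in docs]
```

Each document ends up with one coordinate per cluster, so retrieval or classification can then run in the k-dimensional space instead of the full vocabulary space.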

Proceedings Article
24 Aug 2002
TL;DR: Topical relations expressed as lexical chains on extended WordNet improve the performance of a question answering system by increasing the document retrieval recall and by providing the much needed axioms that link question keywords with answers.
Abstract: The paper presents a method for finding topically related words on an extended WordNet. By exploiting the information in the WordNet glosses, the connectivity between the synsets is dramatically increased. Topical relations expressed as lexical chains on extended WordNet improve the performance of a question answering system by increasing the document retrieval recall and by providing the much needed axioms that link question keywords with answers.

133 citations

Journal Article
Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, Bing Lv
TL;DR: This article presents GHOST (GrapHical framewOrk for name diSambiguaTion), a framework for distinguishing publications in digital libraries that were written by authors with identical names. It relies on a novel similarity metric and uses only one type of attribute, coauthorship.
Abstract: Name ambiguity stems from the fact that many people or objects share identical names in the real world. Such name ambiguity decreases the performance of document retrieval, Web search, information integration, and may cause confusion in other applications. Due to the same name spellings and lack of information, it is a nontrivial task to distinguish them accurately. In this article, we focus on investigating the problem in digital libraries to distinguish publications written by authors with identical names. We present an effective framework named GHOST (abbreviation for GrapHical framewOrk for name diSambiguaTion), to solve the problem systematically. We devise a novel similarity metric, and utilize only one type of attribute (i.e., coauthorship) in GHOST. Given the similarity matrix, intermediate results are grouped into clusters with a recently introduced powerful clustering algorithm called Affinity Propagation. In addition, as a complementary technique, user feedback can be used to enhance the performance. We evaluated the framework on the real DBLP and PubMed datasets, and the experimental results show that GHOST can achieve both high precision and recall.

132 citations

Patent
18 Jan 1996
TL;DR: A document retrieval and display system that retrieves source documents in different languages from servers linked by a communication network, translates them as necessary, stores the translated documents, and displays the source and translated documents at a client device connected to the network.
Abstract: A document retrieval and display system for retrieving source documents in different languages from servers linked by a communication network, translating the retrieved source documents as necessary, storing the translated documents, and displaying the source documents and translated documents at a client device connected to the communication network. The translation process is activated automatically by a control module, and is carried out by a machine translation module. The control module decides when a translation is necessary, selects whether to display the source document or a translated document at the client device, and determines when the source document has been updated and must be retranslated.

132 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  9
2022  39
2021  107
2020  130
2019  144
2018  111