Topic
Document retrieval
About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.
Papers
28 Jul 2000
TL;DR: In this article, a system and method for text-based document retrieval is proposed, which is based on utilizing information contained in the document collection about the statistics of word relationships (context) to facilitate the specification of search queries and document comparison.
Abstract: A system and method for document retrieval is disclosed. The invention addresses a major problem in text-based document retrieval: rapidly finding a small subset of documents in a large document collection (e.g., Web pages on the Internet) that are relevant to a limited set of query terms supplied by the user. The invention is based on utilizing information contained in the document collection about the statistics of word relationships (“context”) to facilitate the specification of search queries and document comparison. The method consists of first compiling word relationships into a context database that captures the statistics of word proximity and occurrence throughout the document collection. At retrieval time, a search matrix is computed from a set of user-supplied keywords and the context database. For each document in the collection, a similar matrix is computed using the contents of the document and the context database. Document relevance is determined by comparing the similarity of the search and document matrices. The disclosed system therefore retrieves documents with contextual similarity rather than word frequency similarity, simplifying search specification while allowing greater search precision.
221 citations
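The context-based matching described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the actual construction of the context database or the matrices, so everything here is an assumption for illustration: the window size, the use of raw co-occurrence counts, the flattening of each "matrix" into a summed co-occurrence profile, and the cosine comparison. The function names `build_context_db`, `context_profile`, and `rank` are hypothetical.

```python
from collections import Counter, defaultdict
import math


def build_context_db(docs, window=2):
    """Count, for every word, which other words appear within `window` tokens of it.

    Assumption: whitespace tokenization and raw symmetric co-occurrence counts.
    """
    db = defaultdict(Counter)
    for doc in docs:
        toks = doc.lower().split()
        for i, w in enumerate(toks):
            for v in toks[i + 1 : i + 1 + window]:
                if v != w:
                    db[w][v] += 1
                    db[v][w] += 1
    return db


def context_profile(words, db):
    """Sum the co-occurrence rows of the given words into one sparse vector."""
    profile = Counter()
    for w in words:
        profile.update(db.get(w, {}))
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs, db):
    """Return (doc_index, score) pairs, most contextually similar first."""
    q = context_profile(query.lower().split(), db)
    scored = [
        (i, cosine(q, context_profile(doc.lower().split(), db)))
        for i, doc in enumerate(docs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a small collection, calling `rank("loan mortgage", docs, build_context_db(docs))` orders documents by how closely their summed contexts resemble the contexts of the query keywords, rather than by how often the keywords themselves occur.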
03 Nov 2003
TL;DR: The experimental results demonstrate that using content-based retrieval in hybrid peer-to-peer networks is both more accurate and more efficient for some digital library environments than more common alternatives such as Gnutella 0.6.
Abstract: Hybrid peer-to-peer architectures use special nodes to provide directory services for regions of the network ("regional directory services"). Hybrid peer-to-peer architectures are a potentially powerful model for developing large-scale networks of complex digital libraries, but peer-to-peer networks have so far tended to use very simple methods of resource selection and document retrieval. In this paper, we study the application of content-based resource selection and document retrieval to hybrid peer-to-peer networks. The directory nodes that provide regional directory services construct and use the content models of neighboring nodes to determine how to route query messages through the network. The leaf nodes that provide information use content-based retrieval to decide which documents to retrieve for queries. The experimental results demonstrate that using content-based retrieval in hybrid peer-to-peer networks is both more accurate and more efficient for some digital library environments than more common alternatives such as Gnutella 0.6.
220 citations
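The routing step in the abstract above, where a directory node uses content models of its neighbors to decide where to forward a query, might look roughly like the sketch below. The abstract does not say how the content models are built or scored, so this sketch assumes a unigram term-count model per neighbor and a smoothed query log-likelihood for ranking. The names `content_model`, `query_log_likelihood`, and `route`, and the smoothing weight `lam`, are hypothetical.

```python
import math
from collections import Counter


def content_model(documents):
    """Build a unigram term-count model from the documents a node holds."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def query_log_likelihood(query, model, background, lam=0.5):
    """Log-likelihood of the query under a node model mixed with a background model.

    Assumption: linear interpolation with add-one smoothing on the background,
    so the probability passed to log() is always positive.
    """
    total = sum(model.values())
    bg_total = sum(background.values())
    score = 0.0
    for term in query.lower().split():
        p_node = model[term] / total if total else 0.0
        p_bg = (background[term] + 1) / (bg_total + len(background) + 1)
        score += math.log(lam * p_node + (1 - lam) * p_bg)
    return score


def route(query, neighbor_models, k=1):
    """Return the names of the k neighbors whose content best matches the query."""
    background = Counter()
    for model in neighbor_models.values():
        background.update(model)
    ranked = sorted(
        neighbor_models,
        key=lambda name: query_log_likelihood(query, neighbor_models[name], background),
        reverse=True,
    )
    return ranked[:k]
```

A directory node would call `route(query, neighbor_models, k)` and forward the query only to the returned neighbors, instead of flooding every connection.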
18 May 2013
TL;DR: A recommender (called Refoqus) based on machine learning is proposed, which is trained with a sample of queries and relevant results and automatically recommends a reformulation strategy that should improve its performance, based on the properties of the query.
Abstract: There are more than twenty distinct software engineering tasks addressed with text retrieval (TR) techniques, such as traceability link recovery, feature location, refactoring, reuse, etc. A common issue with all TR applications is that the results of the retrieval depend largely on the quality of the query. When a query performs poorly, it has to be reformulated and this is a difficult task for someone who had trouble writing a good query in the first place. We propose a recommender (called Refoqus) based on machine learning, which is trained with a sample of queries and relevant results. Then, for a given query, it automatically recommends a reformulation strategy that should improve its performance, based on the properties of the query. We evaluated Refoqus empirically against four baseline approaches that are used in natural language document retrieval. The data used for the evaluation corresponds to changes from five open source systems in Java and C++ and it is used in the context of TR-based concept location in source code. Refoqus outperformed the baselines and its recommendations led to query performance improvement or preservation in 84% of the cases (on average).
215 citations
TL;DR: A modified technique is presented that attempts to match the likelihood of retrieving a document of a certain length to the likelihood of documents of that length being judged relevant, and it is shown that this technique yields significant improvements in retrieval effectiveness.
Abstract: In the TREC collection (a large full-text experimental text collection with widely varying document lengths), we observe that the likelihood of a document being judged relevant by a user increases with the document length. We show that a retrieval strategy, such as the vector-space cosine match, that retrieves documents of different lengths with roughly equal probability, will not optimally retrieve useful documents from such a collection. We present a modified technique that attempts to match the likelihood of retrieving a document of a certain length to the likelihood of documents of that length being judged relevant, and show that this technique yields significant improvements in retrieval effectiveness.
215 citations
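The length correction described in the abstract above can be sketched as follows. The abstract does not state a formula, so the sketch assumes the form commonly given for pivoted length normalization, in which the original normalization factor is rotated around a pivot point by a slope. The function names, the choice of the average norm as pivot, and the slope value are illustrative assumptions, not values from the source.

```python
import math


def cosine_norm(term_weights):
    """Standard cosine normalization factor for a document's term weights."""
    return math.sqrt(sum(w * w for w in term_weights.values()))


def pivoted_norm(old_norm, pivot, slope=0.75):
    """Rotate the normalization factor around `pivot`.

    Documents with old_norm above the pivot get a smaller factor than before
    (so they are penalized less); documents below the pivot get a larger one.
    The slope of 0.75 is an illustrative assumption.
    """
    return (1.0 - slope) * pivot + slope * old_norm


def score(query_terms, term_weights, norm):
    """Sum of matching term weights divided by the chosen normalization factor."""
    return sum(term_weights.get(t, 0.0) for t in query_terms) / norm


def pivoted_scores(query_terms, docs, slope=0.75):
    """Score every document, using the average cosine norm as the pivot."""
    norms = [cosine_norm(d) for d in docs]
    pivot = sum(norms) / len(norms)
    return [
        score(query_terms, d, pivoted_norm(n, pivot, slope))
        for d, n in zip(docs, norms)
    ]
```

With a slope below 1, a long document that matches the query scores higher under `pivoted_scores` than it would with plain cosine normalization, which is the direction of correction the abstract argues for.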
PARC
TL;DR: The results suggest that Scatter/Gather induces a more coherent conceptual image of a text collection, a richer vocabulary for constructing search queries, and communicates the distribution of relevant documents over clusters of documents in the collection.
Abstract: Scatter/Gather is a cluster-based browsing technique for large text collections. Users are presented with automatically computed summaries of the contents of clusters of similar documents and provided with a method for navigating through these summaries at different levels of granularity. The aim of the technique is to communicate information about the topic structure of very large collections. We tested the effectiveness of Scatter/Gather as a simple pure document retrieval tool, and studied its effects on the incidental learning of topic structure. When compared to interactions involving simple keyword-based search, the results suggest that Scatter/Gather induces a more coherent conceptual image of a text collection, a richer vocabulary for constructing search queries, and communicates the distribution of relevant documents over clusters of documents in the collection.
213 citations