
Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
01 Jan 1994
TL;DR: This dissertation examines the use of adaptive methods to automatically improve the performance of ranked text retrieval systems, and proposes and empirically validates general adaptive methods that improve the ability of a large class of retrieval systems to rank documents effectively.
Abstract: This dissertation examines the use of adaptive methods to automatically improve the performance of ranked text retrieval systems. The goal of a ranked retrieval system is to manage a large collection of text documents and to order documents for a user based on the estimated relevance of the documents to the user's information need (or query). The ordering enables the user to quickly find documents of interest. Ranked retrieval is a difficult problem because of the ambiguity of natural language, the large size of the collections, and because of the varying needs of users and varying collection characteristics. We propose and empirically validate general adaptive methods which improve the ability of a large class of retrieval systems to rank documents effectively. Our main adaptive method is to numerically optimize free parameters in a retrieval system by minimizing a non-metric criterion function. The criterion measures how well the system is ranking documents relative to a target ordering, defined by a set of training queries which include the users' desired document orderings. Thus, the system learns parameter settings which better enable it to rank relevant documents before irrelevant ones. The non-metric approach is interesting because it is a general adaptive method, an alternative to supervised methods for training neural networks in domains in which rank order or prioritization is important. A second adaptive method is also examined, which is applicable to a restricted class of retrieval systems but which permits an analytic solution. The adaptive methods are applied to a number of problems in text retrieval to validate their utility and practical efficiency. The applications include: a dimensionality reduction of vector-based document representations to a vector space in which inter-document similarity more accurately predicts semantic association; the estimation of a similarity measure which better predicts the relevance of documents to queries; and the estimation of a high-performance neural network combination of multiple retrieval systems into a single overall system. The applications demonstrate that the approaches improve performance and adapt to varying retrieval environments. We also compare the methods to numerous alternative adaptive methods in the text retrieval literature, with very positive results.
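The core idea, numerically tuning a retrieval system's free parameters against a rank-based criterion, can be sketched in a few lines. The snippet below is a minimal illustration under assumed inputs (a toy weighted dot-product scorer and a pairwise surrogate loss), not the dissertation's actual criterion function or optimizer.

```python
import numpy as np

def score(theta, query_vec, doc_vec):
    # Toy parameterized similarity: per-dimension weighted dot product.
    return float(np.dot(query_vec * theta, doc_vec))

def pairwise_rank_loss(theta, training_queries):
    """Smoothed count of relevant/irrelevant pairs ranked in the wrong order."""
    loss = 0.0
    for query_vec, relevant_docs, irrelevant_docs in training_queries:
        for d_pos in relevant_docs:
            for d_neg in irrelevant_docs:
                margin = score(theta, query_vec, d_neg) - score(theta, query_vec, d_pos)
                loss += 1.0 / (1.0 + np.exp(-margin))  # close to 1 when the pair is swapped
    return loss

def optimize(training_queries, dim, iters=100, lr=0.1, eps=1e-4):
    """Numerically minimize the rank loss over the free parameter vector theta."""
    theta = np.ones(dim)
    for _ in range(iters):
        base = pairwise_rank_loss(theta, training_queries)
        grad = np.zeros(dim)
        for i in range(dim):
            probe = theta.copy()
            probe[i] += eps
            grad[i] = (pairwise_rank_loss(probe, training_queries) - base) / eps
        theta -= lr * grad
    return theta
```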

50 citations

Journal ArticleDOI
TL;DR: Experimental results indicate that SSRank consistently, and almost always significantly, outperforms the baseline methods given the same amount of labeled data, because SSRank can effectively leverage unlabeled data in learning.
Abstract: This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to combine the advantages of both traditional Information Retrieval (IR) methods and the recently proposed supervised learning methods for IR. The advantages include the use of a limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, a Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently, and almost always significantly, outperforms the baseline methods (unsupervised and supervised learning methods) given the same amount of labeled data. This is because SSRank can effectively leverage unlabeled data in learning.
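As a rough illustration of the semi-supervised idea (pseudo-label unlabeled documents with a traditional IR model, then train a neural ranker on the combined data), the sketch below uses BM25 top-k labeling and a small scikit-learn network. The feature extraction, the top-k threshold, and the omission of the paper's stopping criterion are simplifying assumptions, not SSRank's exact procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from rank_bm25 import BM25Okapi  # assumed third-party helper for BM25 scoring

def pseudo_label(tokenized_docs, tokenized_query, top_k=5):
    """Label the top-k BM25 documents for the query as relevant (1.0), the rest as 0.0."""
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(tokenized_query)
    labels = np.zeros(len(tokenized_docs))
    labels[np.argsort(scores)[::-1][:top_k]] = 1.0
    return labels

def train_ranker(labeled_features, labels, unlabeled_features, pseudo_labels):
    """Fit a small neural ranker on the union of labeled and pseudo-labeled examples."""
    X = np.vstack([labeled_features, unlabeled_features])
    y = np.concatenate([labels, pseudo_labels])
    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500)
    model.fit(X, y)
    return model  # rank documents for a query by sorting model.predict(features) descending
```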

50 citations

Proceedings Article
01 Jan 1992
TL;DR: This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections; the results indicate that the particular expansion technique used here improves the performance of some queries but degrades the performance of others.
Abstract: This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections. The retrieval results indicate that the particular expansion technique used here improves the performance of some queries but degrades the performance of others. The variability of the method is attributable to two main factors: the choice of concepts that are expanded and the confounding effects expansion has on concept weights. Addressing these problems will require both a better method for determining the important concepts of a text and a better method for determining the correct sense of an ambiguous word.
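A minimal sketch of the kind of thesaurus-based expansion discussed here, with expanded concepts added at reduced weight; the toy thesaurus and the weighting scheme are assumptions for illustration, not the paper's technique.

```python
from collections import Counter

# Assumed toy thesaurus mapping a term to related concepts.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "doctor": ["physician", "clinician"],
}

def expand_query(query_terms, expansion_weight=0.3):
    """Return a term -> weight map: original terms at 1.0, expansions down-weighted."""
    weights = Counter({term: 1.0 for term in query_terms})
    for term in query_terms:
        for related in THESAURUS.get(term, []):
            # Down-weighting is one way to limit the confounding effect that
            # expansion can have on concept weights.
            weights[related] = max(weights[related], expansion_weight)
    return dict(weights)

print(expand_query(["car", "insurance"]))
# {'car': 1.0, 'insurance': 1.0, 'automobile': 0.3, 'vehicle': 0.3}
```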

50 citations

Journal ArticleDOI
TL;DR: This article describes how a document map that is automatically organized for browsing and visualization can also be successfully utilized to speed up document retrieval, and shows significantly improved performance compared to Salton's vector space model.
Abstract: A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This article describes how a document map that is automatically organized for browsing and visualization can also be successfully utilized to speed up document retrieval. Furthermore, experiments on the well-known CISI collection show significantly improved performance compared to Salton's vector space model, measured by average precision (AP) when retrieving a small, fixed number of best documents. Regarding the comparison with Latent Semantic Indexing, the results are inconclusive.
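The retrieval speed-up can be illustrated roughly: match the query against the map's node prototypes first, then score only documents assigned to the closest nodes. The sketch below assumes the SOM prototypes and document-to-node assignments already exist; it is not the article's exact procedure.

```python
import numpy as np

def retrieve_via_map(query_vec, node_prototypes, node_to_docs, doc_vecs,
                     n_nodes=3, top_k=10):
    """node_prototypes: (num_nodes, dim) array; node_to_docs: node index -> list of doc ids;
    doc_vecs: doc id -> vector. Only documents under the closest map nodes are scored."""
    # 1. Find the map nodes whose prototype vectors lie closest to the query.
    node_dists = np.linalg.norm(node_prototypes - query_vec, axis=1)
    candidate_nodes = np.argsort(node_dists)[:n_nodes]
    # 2. Score only the documents stored under those nodes; this is the source of the speed-up.
    candidates = [d for n in candidate_nodes for d in node_to_docs.get(int(n), [])]
    scored = [(d, float(np.dot(query_vec, doc_vecs[d]))) for d in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```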

50 citations

Book ChapterDOI
19 Aug 2002
TL;DR: It is shown how a rigorous evaluation of Document Transformation can be carried out using the referer logs kept by web servers, and a new strategy for Document Transformation is described that is suitable for long-term incremental learning.
Abstract: This paper considers how web search engines can learn from the successful searches recorded in their user logs. Document Transformation is a feasible approach that uses these logs to improve document representations. Existing test collections do not allow an adequate investigation of Document Transformation, but we show how a rigorous evaluation of this method can be carried out using the referer logs kept by web servers. We also describe a new strategy for Document Transformation that is suitable for long-term incremental learning. Our experiments show that Document Transformation improves retrieval performance over a medium-sized collection of web pages. Commercial search engines may be able to achieve similar improvements by incorporating this approach.
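A hedged sketch of the general Document Transformation idea: fold query terms from logged successful searches into document representations with a small boost and a gradual decay of older evidence. The specific weights and decay are assumptions for illustration, not the paper's new strategy.

```python
from collections import defaultdict

def transform_documents(doc_terms, click_log, boost=0.2, decay=0.99):
    """doc_terms: doc_id -> {term: weight}; click_log: iterable of (query_terms, doc_id)
    pairs recovered from referer logs. Returns updated term weights per document."""
    docs = defaultdict(dict, {d: dict(t) for d, t in doc_terms.items()})
    for query_terms, doc_id in click_log:
        # Gently decay the existing weights so the representation keeps adapting
        # over time (long-term incremental learning).
        for term in docs[doc_id]:
            docs[doc_id][term] *= decay
        # Fold the successful query's terms into the document at a small boost.
        for term in query_terms:
            docs[doc_id][term] = docs[doc_id].get(term, 0.0) + boost
    return dict(docs)
```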

50 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111