Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
Journal ArticleDOI
TL;DR: This article describes how retrieval models which use either independence or dependence assumptions can be extended to include document representatives containing term significance weights and indicates that search strategies based on models modified in this way can further improve the effectiveness of document retrieval systems.
Abstract: Probabilistic models of retrieval have provided insights into the document retrieval process and contain the basis for very effective search strategies. A major limitation of these models is that they assume that documents are represented by binary index terms. In many cases the index terms will be assigned weights, such as within-document frequency weights, which are derived from the content of the documents by the indexing process. These weights, which are referred to here as term significance weights, indicate the relative importance of the terms in individual documents. This article describes how retrieval models which use either independence or dependence assumptions can be extended to include document representatives containing term significance weights. Comparison with other research indicates that search strategies based on models modified in this way can further improve the effectiveness of document retrieval systems.

51 citations
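The extension described in the abstract above can be illustrated with a small sketch: a classic binary-independence term weight is computed from document frequencies, and each term's contribution to a document's score is then scaled by a within-document significance weight. This is a hypothetical simplification under assumed names and a linear tf scaling, not the article's exact formulation.

```python
import math

def rsj_weight(df, N):
    """Binary-independence term weight from document frequency df over
    N documents, with the usual 0.5 smoothing."""
    return math.log((N - df + 0.5) / (df + 0.5))

def score(query_terms, doc_tf, df, N):
    """Rank a document by summing term weights, each scaled by the term's
    within-document frequency acting as a significance weight."""
    s = 0.0
    for t in query_terms:
        if t in doc_tf:
            s += doc_tf[t] * rsj_weight(df.get(t, 0), N)
    return s
```

Under this sketch, a document in which a query term occurs more often receives a proportionally higher score than one where the same term occurs once, which is the behavioral change term significance weights introduce over the binary model.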

Proceedings ArticleDOI
01 May 1988
TL;DR: Experiments indicate that regression methods can help predict relevance, given query-document similarity values for each concept type, and the role of links is shown to be especially beneficial.
Abstract: This report considers combining information to improve retrieval. The vector space model has been extended so different classes of data are associated with distinct concept types and their respective subvectors. Two collections with multiple concept types are described, ISI-1460 and CACM-3204. Experiments indicate that regression methods can help predict relevance, given query-document similarity values for each concept type. After sampling and transformation of data, the coefficient of determination for the best model was .48 (.66) for ISI (CACM). Average precision for the two collections was 11% (31%) better for probabilistic feedback with all types versus with terms only. These findings may be of particular interest to designers of document retrieval or hypertext systems since the role of links is shown to be especially beneficial.

51 citations
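The regression approach above can be sketched as an ordinary least-squares fit that predicts relevance from per-concept-type similarity values. The data below is invented for illustration; the paper's actual collections, concept types, and transformations are not reproduced here.

```python
import numpy as np

# Rows: query-document pairs; columns: similarity per concept type
# (e.g. terms, links). Values here are hypothetical.
X = np.array([[0.9, 0.8], [0.2, 0.1], [0.7, 0.9], [0.1, 0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])  # relevance judgments

# Add an intercept column and solve ordinary least squares.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(sims):
    """Predicted relevance from a vector of per-type similarities."""
    return coef[0] + coef[1:] @ np.asarray(sims)
```

With relevance correlated to both similarity columns, the fitted model ranks a pair with high term and link similarity above one with low values, which is the sense in which regression "combines information" across concept types.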

Journal ArticleDOI
TL;DR: This paper presents an introduction and a survey of the use of logic for information retrieval modeling, first advanced in 1986 by Van Rijsbergen with the so-called logical uncertainty principle.
Abstract: Information retrieval is the science concerned with the efficient and effective storage of information for later retrieval and use by interested parties. During the last forty years, a plethora of information retrieval models and their variations have emerged. Logic-based models were introduced to provide a rich and uniform representation of information and its semantics, with the aim of improving information retrieval effectiveness. This approach was first advanced in 1986 by Van Rijsbergen with the so-called logical uncertainty principle. Since then, various logic-based models have been developed. This paper presents an introduction to and a survey of the use of logic for information retrieval modeling.

51 citations

Proceedings ArticleDOI
24 Jul 2011
TL;DR: This paper shows the first set of inverted indexes which work naturally for strings as well as phrase searching, and shows efficient top-k based retrieval under relevance metrics like frequency and tf-idf.
Abstract: Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which that word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming: only predefined pattern queries can be supported efficiently. For string documents, where word boundaries are undefined, indexing all substrings of a document makes the storage quickly become quadratic in the data size. Likewise, if the same type of index is applied to querying phrases or sequences of words, the inverted index ends up storing redundant information. In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching. The central idea is to exclude a document d from the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access for frequency- and document-id-based retrieval. Compression and speed trade-offs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k retrieval under relevance metrics such as frequency and tf-idf.

51 citations
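For context, the conventional word-level inverted index that the paper improves upon can be sketched in a few lines: each word maps to a postings list of (document id, frequency) pairs, and top-k retrieval by frequency is a selection over that list. Function names are assumptions; the paper's succinct-structure machinery is not shown.

```python
from collections import defaultdict
import heapq

def build_index(docs):
    """Map each word to a postings dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

def top_k(index, word, k):
    """Return the k documents with the highest frequency of `word`."""
    postings = index.get(word, {})
    return heapq.nlargest(k, postings.items(), key=lambda p: p[1])
```

This baseline makes the paper's problem visible: it only supports queries on predefined tokens, and extending it naively to all substrings or to phrases blows up or duplicates the postings, which is what the subsumption-based exclusion rule addresses.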

Proceedings ArticleDOI
20 Jul 2008
TL;DR: Experimental results on a speech corpus of conversational English show that the use of statistics from lattices for both documents and query exemplars results in better retrieval accuracy than using only 1-best transcripts for either documents, or queries, or both.
Abstract: Recent efforts on the task of spoken document retrieval (SDR) have made use of speech lattices: lattices contain information about alternative transcription hypotheses beyond the 1-best transcript, and this information can improve retrieval accuracy by overcoming recognition errors present in the 1-best transcription. In this paper, we use lattices for the query-by-example spoken document retrieval task: retrieving documents from a speech corpus, where the queries are themselves complete spoken documents (query exemplars). We extend a previously proposed method for SDR with short queries to the query-by-example task. Specifically, we use a retrieval method based on statistical modeling: we compute expected word counts from document and query lattices, estimate statistical models from these counts, and compute relevance scores as divergences between these models. Experimental results on a speech corpus of conversational English show that using statistics from lattices for both documents and query exemplars yields better retrieval accuracy than using only 1-best transcripts for documents, queries, or both. In addition, we investigate the effect of stop word removal, which further improves retrieval accuracy. To our knowledge, our work is the first to use a lattice-based approach for query-by-example spoken document retrieval.

51 citations
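The modeling step described above can be sketched as follows: expected word counts (possibly fractional, since they are posterior-weighted sums over lattice paths) are turned into smoothed unigram distributions, and documents are scored by a KL divergence between query and document models. The lattice posterior computation itself is omitted, and the add-alpha smoothing is an assumption, not necessarily the paper's choice.

```python
import math

def unigram_model(expected_counts, vocab, alpha=1.0):
    """Smoothed unigram distribution from (possibly fractional) expected
    word counts, such as posterior-weighted counts from a lattice."""
    total = sum(expected_counts.get(w, 0.0) for w in vocab) + alpha * len(vocab)
    return {w: (expected_counts.get(w, 0.0) + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary; lower means more similar."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)
```

A query exemplar's model can then be compared against each document model, ranking documents by ascending divergence; using lattice-derived expected counts on both sides is what distinguishes this setup from scoring 1-best transcripts.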


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations (81% related)
Metadata: 43.9K papers, 642.7K citations (79% related)
Recommender system: 27.2K papers, 598K citations (79% related)
Ontology (information science): 57K papers, 869.1K citations (78% related)
Natural language: 31.1K papers, 806.8K citations (77% related)
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111