Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
Patent
06 Aug 1999
TL;DR: This patent chooses facets for iterative document retrieval using lexical dispersion, a measure of the number of different words with which a particular search expression co-occurs within a given type of lexical construct (e.g., a noun phrase) appearing in the document set.
Abstract: Iterative information retrieval from a large database of textual or text-containing documents is facilitated by automatic construction of faceted representations. Facets are chosen heuristically based on lexical dispersion, a measure of the number of different words with which a particular search expression co-occurs within a given type of lexical construct (e.g., a noun phrase) appearing in the document set. Words having high dispersion rates represent “facets” that may be used to organize the documents conceptually in accordance with the search expression, effectively providing a concise, structured summary of the contents of a result set as well as presenting a set of candidate terms for query reformulation.
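The dispersion heuristic is simple enough to sketch. Below is a minimal Python illustration, assuming noun phrases have already been chunked out of the result set; the function and data names are illustrative, not taken from the patent.

from collections import defaultdict

def lexical_dispersion(noun_phrases, search_terms):
    # noun_phrases: list of token lists, assumed to be noun phrases already
    # chunked out of the result set; search_terms: lower-cased query words.
    co_words = defaultdict(set)
    for np in noun_phrases:
        tokens = [t.lower() for t in np]
        if not search_terms & set(tokens):
            continue  # only phrases that mention the search expression
        for w in tokens:
            if w in search_terms:
                continue
            # record every distinct word this candidate co-occurs with
            co_words[w].update(t for t in tokens if t != w)
    # dispersion = number of distinct co-occurring words; high values
    # mark candidate facets for organizing the result set
    return {w: len(others) for w, others in co_words.items()}

# toy usage for the query word "retrieval"
phrases = [["document", "retrieval", "system"],
           ["interactive", "retrieval", "interface"],
           ["document", "retrieval", "interface"]]
print(sorted(lexical_dispersion(phrases, {"retrieval"}).items(),
             key=lambda kv: -kv[1]))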

110 citations

Proceedings Article
01 Jan 1999
TL;DR: This paper describes the development of a prototype document retrieval system based on frequency calculations and corpora comparison techniques, and uses term identification and extraction techniques for identifying topics discussed in a given text.
Abstract: This paper describes the development of a prototype document retrieval system based on frequency calculations and corpora comparison techniques. The prototype, WILDER, generated simple frequency information based on which calculations of document relevance could be made. The prototype was built to allow the University of Surrey to debut in the U.S. Text REtrieval Conference (TREC). User queries as specified by the TREC organisers were converted into simple word-frequency lists and compared against values for the entire corpus. These relative frequency values provided an indication of document relevance. The application of morphological and empirical heuristics enabled WILDER to produce the ranked frequency lists required.

Introduction: The ad hoc task of TREC8 investigates the performance of systems in ranking a static set of documents against novel topics (queries). For each topic, the top 1000 documents satisfying the topic are submitted. Recall and precision techniques are used on these rankings to determine the results of the competition overall. We have used term identification and extraction techniques for identifying topics discussed in a given text. In this note we focus on the use of single-word terms for identifying topics. The techniques are based on differences between general language texts, texts used in an everyday context, and special language texts. The special language texts are texts written, for instance, by scientists, engineers, business persons and hobbyists in their respective languages of physics, chemistry, engineering, business, and hobbies. English-speaking physicists will use the English rendering of terms of physics and use their knowledge of English language, which they share with other speakers of English. Similarly, a Chinese-speaking physicist writing in Chinese will use the Chinese rendering of terms plus their knowledge of Chinese, which they share with other Chinese speakers. The special language texts can be distinguished from a collection of general language texts at different linguistic levels including lexical, morphological, syntactic and semantic. These differences can be measured quantitatively and qualitatively. Quantitative measures at the lexical level include frequency of usage of single and compound terms in special language texts and their equivalents in general language texts. Morphological differences can also be measured quantitatively by looking at the differences in the inflectional and derivational variants of terms; specialist texts comprise a larger number of plurals than used in general language, and specialists use nominalised verbs more extensively than in general language. The key difference at the lexical level, between specialist and general language texts, is in the distribution of the so-called open class words, typically nouns and adjectives, and the closed class words, typically determiners, conjunctions, prepositions and modal verbs. Consider the 100-million-word British National Corpus...
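The corpora-comparison idea can be illustrated with a short Python sketch that ranks documents by how over-represented the query words are relative to the whole collection. This is a loose paraphrase of the approach, not WILDER's actual code, and all names and data are illustrative.

from collections import Counter

def rank_by_relative_frequency(docs, query_words):
    # Build per-document and whole-corpus word counts.
    corpus_counts = Counter()
    doc_counts = []
    for text in docs:
        tokens = text.lower().split()
        counts = Counter(tokens)
        doc_counts.append((counts, len(tokens)))
        corpus_counts.update(tokens)
    corpus_total = sum(corpus_counts.values())

    # Score each document by the ratio of in-document to in-corpus
    # relative frequency, summed over the query words.
    scores = []
    for i, (counts, length) in enumerate(doc_counts):
        score = 0.0
        for w in query_words:
            if length and corpus_counts[w]:
                doc_rel = counts[w] / length
                corpus_rel = corpus_counts[w] / corpus_total
                score += doc_rel / corpus_rel
        scores.append((score, i))
    return sorted(scores, reverse=True)  # highest score first

docs = ["the physics of plasma waves",
        "cooking recipes for the weekend",
        "plasma physics and wave propagation"]
print(rank_by_relative_frequency(docs, {"plasma", "physics"}))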

110 citations

Journal ArticleDOI
TL;DR: Methods of calculating and presenting results from experimental retrieval comparisons are considered and illustrated by some new laboratory results, and two difficult performance comparisons are examined: Boolean versus ranked-output retrieval, and non-iterative versus relevance-feedback retrieval.
Abstract: Methods of calculating and presenting results from experimental retrieval comparisons are considered and illustrated by some new laboratory results. The measures used center on Recall and Precision. Topics of data calculation, single value measures, benchmark results, data aggregation, statistical significance, and the presentation of performance differences are discussed. User-oriented presentations can be used to simulate different needs such as high or low levels of Recall. Several methods of retrieval cutoff can be used as the control variable, but the document cutoff is the most useful. Two difficult performance comparisons are illustrated: Boolean versus Ranked output retrieval and non-iterative versus relevance feedback.
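For concreteness, here is a minimal Python sketch of Recall and Precision under a document cutoff, the cutoff variable the paper singles out as most useful; the ranking and relevance data are made up.

def precision_recall_at_cutoff(ranked_docs, relevant, k):
    # Cut the ranking off after the top k documents and measure both values.
    retrieved = ranked_docs[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranking = ["d3", "d7", "d1", "d9", "d4"]   # system output, best first
relevant = {"d1", "d3", "d8"}              # ground-truth relevant set
print(precision_recall_at_cutoff(ranking, relevant, k=3))  # (0.666..., 0.666...)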

110 citations

Proceedings ArticleDOI
TL;DR: This paper formulates the task of asking clarifying questions in open-domain information-seeking conversational systems, proposes an offline evaluation methodology for the task, and collects a dataset, called Qulac, through crowdsourcing; the proposed question selection model significantly outperforms competitive baselines.
Abstract: Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.
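The oracle analysis is easy to mimic on a toy scale. The Python sketch below measures P@1 before and after expanding the query with the best available question-answer pair; the retriever, corpus, and names are illustrative assumptions, not the Qulac setup or the paper's retrieval framework.

def precision_at_1(ranked_docs, relevant):
    # P@1: 1 if the top-ranked document is relevant, else 0.
    return 1.0 if ranked_docs and ranked_docs[0] in relevant else 0.0

def oracle_question_gain(retrieve, query, candidate_qas, relevant):
    # Oracle selection: try every candidate question-answer pair, re-run
    # retrieval with the expanded query, and keep the best P@1.
    base = precision_at_1(retrieve(query), relevant)
    best = base
    for question, answer in candidate_qas:
        expanded = f"{query} {question} {answer}"
        best = max(best, precision_at_1(retrieve(expanded), relevant))
    return base, best

# toy keyword-overlap "retriever" over a two-document corpus
corpus = {"d1": "apple pie recipe", "d2": "apple phone review"}
def retrieve(query):
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(corpus[d].split())))

print(oracle_question_gain(retrieve, "apple",
                           [("fruit or phone", "phone")],
                           relevant={"d2"}))   # (0.0, 1.0)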

109 citations

Proceedings ArticleDOI
25 Oct 2009
TL;DR: This framework gives a linear-space data structure with optimal query times for arbitrary score functions and, as a corollary, improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance.
Abstract: Given a set ${\cal D}=\{d_1, d_2, \ldots, d_D\}$ of $D$ strings of total length $n$, our task is to report the "most relevant" strings for a given query pattern $P$. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study such retrieval problems was given by [Muthukrishnan, 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking $O(n \log n)$ words of space. We study this problem in a slightly different framework of reporting the top $k$ most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
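As a point of reference, the query the paper indexes can be answered naively by scanning every document and counting pattern occurrences. The Python sketch below shows that baseline, which costs time proportional to the total text length per query; avoiding exactly this cost, with linear space, is the paper's contribution. The code is an illustrative baseline, not the paper's data structure.

import heapq

def top_k_by_frequency(docs, pattern, k):
    # Count (possibly overlapping) occurrences of the pattern in every
    # document and report the k documents with the highest counts.
    scored = []
    for doc_id, text in docs.items():
        freq, start = 0, 0
        while True:
            pos = text.find(pattern, start)
            if pos == -1:
                break
            freq += 1
            start = pos + 1
        if freq:
            scored.append((freq, doc_id))
    return heapq.nlargest(k, scored)   # sorted, most frequent first

docs = {"d1": "abracadabra", "d2": "banana band", "d3": "cabana"}
print(top_k_by_frequency(docs, "ana", k=2))   # [(2, 'd2'), (1, 'd3')]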

109 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations (81% related)
Metadata: 43.9K papers, 642.7K citations (79% related)
Recommender system: 27.2K papers, 598K citations (79% related)
Ontology (information science): 57K papers, 869.1K citations (78% related)
Natural language: 31.1K papers, 806.8K citations (77% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111