scispace - formally typeset
Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
Dissertation
01 Jan 2000
TL;DR: New methods for server selection and results merging are introduced, which do not require search servers to cooperate, yet are as effective as the best methods which do.
Abstract: Published methods for distributed information retrieval generally rely on cooperation from search servers. But most real servers, particularly the tens of thousands available on the Web, are not engineered for such cooperation. This means that the majority of methods proposed, and evaluated in simulated environments of homogeneous cooperating servers, are never applied in practice. This thesis introduces new methods for server selection and results merging. The methods do not require search servers to cooperate, yet are as effective as the best methods which do. Two large experiments evaluate the new methods against many previously published methods. In contrast to previous experiments they simulate a Web-like environment, where servers employ varied retrieval algorithms and tend not to sub-partition documents from a single source. The server selection experiment uses pages from 956 real Web servers, three different retrieval systems and TREC ad hoc topics. Results show that a broker using queries to sample servers’ documents can perform selection over non-cooperating servers without loss of effectiveness. However, using the same queries to estimate the effectiveness of servers, in order to favour servers with high quality retrieval systems, did not consistently improve selection effectiveness. The results merging experiment uses documents from five TREC sub-collections, five different retrieval systems and TREC ad hoc topics. Results show that a broker using a reference set of collection statistics, rather than relying on cooperation to collate true statistics, can perform merging without loss of effectiveness. Since application of the reference statistics method requires that the broker download the documents to be merged, experiments were also conducted on effective merging based on partial documents. The new ranking method developed was not highly effective on partial documents, but showed some promise on fully downloaded documents. 
Using the new methods, an effective search broker can be built, capable of addressing any given set of available search servers, without their cooperation.
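
The reference-statistics idea in this abstract can be illustrated with a minimal sketch. It is not the thesis's actual ranking method: it assumes a plain TF-IDF score, and the names `reference_idf`, `score`, `merge_results`, `ref_doc_freq`, and `ref_num_docs` are hypothetical. The point it shows is that the broker re-scores downloaded documents with one shared set of collection statistics, so scores from different servers become comparable without any cooperation.

```python
import math
from collections import Counter


def reference_idf(term, ref_doc_freq, ref_num_docs):
    """Smoothed IDF computed from the reference statistics (assumed formula)."""
    return math.log((ref_num_docs + 1) / (ref_doc_freq.get(term, 0) + 1)) + 1.0


def score(query_terms, doc_text, ref_doc_freq, ref_num_docs):
    """TF-IDF score of one downloaded document against the query."""
    tf = Counter(doc_text.lower().split())
    total = 0.0
    for term in query_terms:
        if tf[term] > 0:
            weight = 1.0 + math.log(tf[term])  # dampened term frequency
            total += weight * reference_idf(term, ref_doc_freq, ref_num_docs)
    return total


def merge_results(query, results_by_server, ref_doc_freq, ref_num_docs):
    """Merge per-server result lists into one ranking.

    results_by_server maps a server name to a list of (doc_id, text) pairs
    that the broker has already downloaded.
    """
    query_terms = query.lower().split()
    scored = []
    for server, docs in results_by_server.items():
        for doc_id, text in docs:
            s = score(query_terms, text, ref_doc_freq, ref_num_docs)
            scored.append((s, server, doc_id))
    # Highest score first; sorted() is stable, so ties keep insertion order.
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [(server, doc_id) for _, server, doc_id in ranked]


# Toy reference statistics: "the" is very common, "retrieval" is rare.
ref_doc_freq = {"the": 1000, "retrieval": 10}
results = {
    "A": [("a1", "the the the cat")],
    "B": [("b1", "document retrieval systems")],
}
merged = merge_results("the retrieval", results, ref_doc_freq, 1000)
```

In this toy run the document from server B ranks first, because its match is on the term that is rare in the reference statistics.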

52 citations

Proceedings Article
01 Jan 2005
TL;DR: In its third year, the TREC 2005 High Accuracy Retrieval from Documents (HARD) track investigated whether highly focused interaction with the searcher, through clarification forms, could improve the accuracy of document retrieval systems.
Abstract: TREC 2005 saw the third year of the High Accuracy Retrieval from Documents (HARD) track. The HARD track explores methods for improving the accuracy of document retrieval systems, with particular attention paid to the start of the ranked list. Although it has done so in a few different ways in the past, budget realities limited the track to “clarification forms” this year. The question investigated was whether highly focused interaction with the searcher could be used to improve the accuracy of a system. Participants created “clarification forms”, generated in response to a query and leveraging any information available in the corpus, that were filled out by the searcher. Typical clarification questions might ask whether some titles seem relevant, whether some words or names are on topic, or whether a short passage of text is related.

52 citations

Journal Article
TL;DR: The “broader-than” relationships of both a medical and a computer science thesaurus when coupled with a simple average path length algorithm are able to simulate the decisions of people regarding the conceptual similarity of documents and queries.
Abstract: Information retrieval systems often rely on thesauri or semantic nets in indexing documents and in helping users search for documents. Reasoning with these thesauri resembles traversing a graph. Several algorithms for matching documents to queries based on the distances between nodes on the graph (terms in the thesaurus) are compared to the evaluations of people. The “broader-than” relationships of both a medical and a computer science thesaurus when coupled with a simple average path length algorithm are able to simulate the decisions of people regarding the conceptual similarity of documents and queries. A graphical presentation of a thesaurus is connected to a multi-window document retrieval system and its ease of use is compared to a more traditional thesaurus-based information retrieval system. While substantial evidence exists that the graphics and multiple windows can be useful, our experiments have shown, as have many other human-computer interface experiments, that a multitude of factors come into play in determining the value of a particular interface.
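
As a rough illustration of the average path length idea mentioned in this abstract, the sketch below treats “broader-than” links as undirected graph edges and averages the shortest path length over all (document term, query term) pairs. This is one plausible reading, not the paper's exact algorithm; the toy thesaurus terms and the function names `build_graph`, `path_length`, and `average_path_length` are assumptions made for the example.

```python
from collections import deque
from itertools import product


def build_graph(broader_than):
    """Build an undirected graph from (narrower, broader) term pairs."""
    graph = {}
    for narrower, broader in broader_than:
        graph.setdefault(narrower, set()).add(broader)
        graph.setdefault(broader, set()).add(narrower)
    return graph


def path_length(graph, start, goal):
    """Number of edges on the shortest path (breadth-first search), or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two terms are not connected


def average_path_length(graph, doc_terms, query_terms):
    """Mean shortest-path length over all (document term, query term) pairs.

    Lower values mean the document is conceptually closer to the query.
    Returns None if either term list is empty or any pair is unconnected.
    """
    lengths = [path_length(graph, d, q) for d, q in product(doc_terms, query_terms)]
    if not lengths or any(length is None for length in lengths):
        return None
    return sum(lengths) / len(lengths)


# Hypothetical toy thesaurus: each pair is (narrower term, broader term).
edges = [
    ("pneumonia", "lung disease"),
    ("asthma", "lung disease"),
    ("lung disease", "disease"),
    ("hepatitis", "liver disease"),
    ("liver disease", "disease"),
]
graph = build_graph(edges)
near = average_path_length(graph, ["pneumonia"], ["asthma"])
far = average_path_length(graph, ["pneumonia"], ["hepatitis"])
```

Here `near` is smaller than `far`, since pneumonia and asthma share an immediate broader term while pneumonia and hepatitis only meet at the root.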

52 citations

Journal Article
01 Jun 1983
TL;DR: A variety of different organizations has been proposed to enhance processing of text retrieval operations, and the advantages and disadvantages inherent in each of these approaches are discussed, along with a number of proposed implementations.
Abstract: As databases become very large, conventional digital computers cannot provide satisfactory response time. This is particularly true for text databases, which must often be several orders of magnitude larger than formatted databases to store a useful amount of information. Even the standard techniques for improving system performance (such as inverted files) may not be sufficient to give the desired performance, and the use of an unconventional hardware organization may become necessary. A variety of different organizations has been proposed to enhance processing of text retrieval operations. Most of these have concentrated on the design of fast, efficient search engines. These can be divided into three classes: associative memories, cellular pattern matchers, and finite state automata. The advantages and disadvantages inherent in each of these approaches are discussed, along with a number of proposed implementations. Finally, the text retrieval system under development at the University of Utah is discussed in more detail.

52 citations

Journal Article
TL;DR: Results supported the hypothesis that there is a relationship between the completeness of the end user's mental model and both error behavior and total number of successful searches.

52 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111