Topic

Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.


Papers
Proceedings Article
01 Jan 2005
TL;DR: The TREC 2005 QA track, as discussed by the authors, contained three tasks: the main question answering task (the same as the single TREC 2004 QA task), the document ranking task, and the relationship task.
Abstract: The TREC 2005 Question Answering track contained three tasks: the main question answering task, the document ranking task, and the relationship task. The main task was the same as the single TREC 2004 QA task. In the main task, question series were used to define a set of targets. Each series was about a single target and contained factoid and list questions. The final question in the series was an "Other" question that asked for additional information about the target that was not covered by previous questions in the series. The document ranking task was to return a ranked list of documents for each question from a subset of the questions in the main task, where the documents were thought to contain an answer to the question. In the relationship task, systems were given TREC-like topic statements that ended with a question asking for evidence for a particular relationship.

The goal of the TREC question answering (QA) track is to foster research on systems that return answers themselves, rather than documents containing answers, in response to a question. The track started in TREC-8 (1999), with the first several editions of the track focused on factoid questions. A factoid question is a fact-based, short answer question such as "How many calories are there in a Big Mac?". The task in the TREC 2003 QA track was a combined task that contained list and definition questions in addition to factoid questions [1]. A list question asks for different instances of a particular kind of information to be returned, such as "List the names of chewing gums". Answering such questions requires a system to assemble an answer from information located in multiple documents. A definition question asks for interesting information about a particular person or thing, such as "Who is Vlad the Impaler?" or "What is a golden parachute?". Definition questions also require systems to locate information in multiple documents, but in this case the information of interest is much less crisply delineated.

The TREC 2004 test set contained factoid and list questions grouped into different series, where each series had the target of a definition associated with it [2]. Each question in a series asked for some information about the target. In addition, the final question in each series was an explicit "Other" question, which was to be interpreted as "Tell me other interesting things about this target I don't know enough to ask directly". This last question is roughly equivalent to the definition questions in the TREC 2003 task.

Several concerns regarding the TREC 2005 QA track were raised during the TREC 2004 QA breakout session. Since the TREC 2004 task was rather different from previous years' tasks, there was a desire to repeat the task largely unchanged. There was also a desire to build infrastructure that would allow a closer examination of the role document retrieval techniques play in supporting QA technology. As a result of this discussion, the main task for the 2005 QA track was decided to be essentially the same as the 2004 task, in that the test set would consist of a set of question series where each series asks for information regarding a particular target. As in TREC 2004, the targets included people, organizations, and other entities (things); unlike TREC 2004, the target could also be an event. Events were added since the document set from which the answers are to be drawn consists of newswire articles.

The runs were evaluated using the same methodology as in TREC 2004, except that the primary measure was the per-series score instead of the combined component score. The document ranking task was added to the TREC 2005 track to address the concern regarding document retrieval and QA. The task was to submit, for a subset of 50 of the questions in the main task, a ranked list of up to 1000 documents for each question. Groups whose primary emphasis was document retrieval rather than QA were allowed to participate in the document ranking task without submitting actual answers for the main task. However, all TREC 2005 submissions to the main task were required to include a ranked list of documents for each question in the document ranking task.
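As a rough illustration of what a document ranking submission involves, the sketch below writes a ranked list of up to 1000 documents per question in the standard TREC run-file format (question id, "Q0", document id, rank, score, run tag). The toy scorer and document ids are placeholders for illustration only, not the systems or collection used in the track.

```python
# Sketch only: write one ranked list of up to 1000 documents per question
# in the standard TREC run format:
#   <question id> Q0 <document id> <rank> <score> <run tag>

def write_run(questions, score_docs, out_path, tag="example_run", depth=1000):
    """questions: dict of question id -> question text.
    score_docs: callable returning (doc_id, score) pairs, best first."""
    with open(out_path, "w") as out:
        for qid, qtext in questions.items():
            for rank, (doc_id, score) in enumerate(score_docs(qtext)[:depth], 1):
                out.write(f"{qid} Q0 {doc_id} {rank} {score:.4f} {tag}\n")

# Toy scorer: counts query terms shared with each (hypothetical) document.
DOCS = {"NYT001": "big mac calories fast food", "NYT002": "chewing gum brands list"}

def toy_scorer(question):
    qterms = set(question.lower().split())
    scored = [(d, len(qterms & set(text.split()))) for d, text in DOCS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

write_run({"66.1": "how many calories are there in a big mac"}, toy_scorer, "example.run")
```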

130 citations

Proceedings ArticleDOI
Joel L. Fagan
01 Nov 1987
TL;DR: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented.
Abstract: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented. Problems related to this non-syntactic phrase construction method are discussed, and some possible solutions are proposed that make use of information about the syntactic structure of document and query texts.
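The abstract does not spell out the phrase-construction procedure, so the sketch below is only an illustrative stand-in: it builds non-syntactic two-word phrase index terms from adjacent content words, using a simple frequency threshold where the paper's method would apply term-discrimination statistics.

```python
# Illustrative only: non-syntactic phrase index terms from adjacent content
# words. The frequency threshold is a stand-in assumption; the paper selects
# phrase components using the term discrimination model.

from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "for", "on"}

def phrase_terms(text, min_component_freq=2):
    words = [w.strip(".,;:").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    freq = Counter(words)
    # Pair each content word with its immediate neighbour, keeping a pair only
    # when both components occur often enough to be plausible phrase components.
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])
            if freq[w1] >= min_component_freq and freq[w2] >= min_component_freq]

print(phrase_terms("information retrieval systems index documents; "
                   "retrieval systems rank documents for retrieval"))
```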

130 citations

Journal Article
TL;DR: A series of studies explored the effects of domain expertise and search expertise in hypertext or full-text CD-ROM databases to investigate how highly interactive electronic access to primary information affects information seeking.

130 citations

Proceedings ArticleDOI
J.R. Smith
21 Jun 1998
TL;DR: This work addresses the growing need for establishing a common content-based image retrieval test-bed and establishes a benchmark set of images and queries for this type of retrieval.
Abstract: One of the most significant problems in content-based image retrieval results from the lack of a common test-bed for researchers. Although many published articles report on content-based retrieval results using color photographs, there has been little effort in establishing a benchmark set of images and queries. Doing so would have many benefits in advancing the technology and utility of content-based image retrieval systems. We address the growing need for establishing a common content-based image retrieval test-bed.

129 citations

Journal ArticleDOI
01 Apr 2014
TL;DR: Overall, the findings suggest that language modeling techniques improve document retrieval, with lemmatization producing the best result.
Abstract: The current study compares document retrieval precision performance based on language modeling techniques, particularly stemming and lemmatization. Stemming is a procedure that reduces all words with the same stem to a common form, whereas lemmatization removes inflectional endings and returns the base or dictionary form of a word. Comparisons were also made between these two techniques and a baseline ranking algorithm (i.e. one with no language processing). A search engine was developed and the algorithms were tested on a test collection. Both mean average precision and histograms indicate that stemming and lemmatization outperform the baseline algorithm. Of the two language modeling techniques, lemmatization produced better precision than stemming, although the differences were not significant. Overall, the findings suggest that language modeling techniques improve document retrieval, with lemmatization producing the best result.
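For readers unfamiliar with the two normalization steps being compared, the minimal sketch below contrasts them; the paper does not name a toolkit, so NLTK's PorterStemmer and WordNetLemmatizer are assumptions made here for illustration.

```python
# Minimal sketch, assuming NLTK (the paper does not name its toolkit).
# Stemming truncates words to a common stem; lemmatization maps them to a
# dictionary form. Run nltk.download("wordnet") once for the lemmatizer data.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

terms = ["studies", "running", "retrieved", "corpora"]

print([stemmer.stem(t) for t in terms])          # e.g. ['studi', 'run', 'retriev', 'corpora']
print([lemmatizer.lemmatize(t) for t in terms])  # e.g. ['study', 'running', 'retrieved', 'corpus']
```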

129 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111