Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.


Papers
Proceedings Article
16 Jul 2006
TL;DR: The experimental results support the efficacy of the OntoSearch system by using domain ontology and user ontology for enhanced search performance.
Abstract: OntoSearch, a full-text search engine that exploits ontological knowledge for document retrieval, is presented in this paper. Unlike other ontology-based search engines, OntoSearch does not require users to specify the concepts associated with their queries. The domain ontology in OntoSearch takes the form of a semantic network. Given a keyword-based query, OntoSearch infers the related concepts through a spreading activation process in the domain ontology. To provide personalized information access, we further develop algorithms to learn and exploit a user ontology model based on a customized view of the domain ontology. The proposed system has been applied to the domain of searching scientific publications in the ACM Digital Library. The experimental results support the efficacy of the OntoSearch system in using domain ontology and user ontology for enhanced search performance.

45 citations
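
The spreading activation process mentioned in the OntoSearch abstract above can be illustrated with a minimal sketch. This is not the OntoSearch algorithm itself: the graph layout, the function name `spread_activation`, and the `decay`, `threshold`, and `max_steps` parameters are all assumptions chosen for illustration. The idea is only that activation starts at the concepts matched by the query keywords and flows along weighted links of the semantic network, fading with distance.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, threshold=0.05, max_steps=3):
    """Propagate activation from seed concepts through a weighted graph.

    graph: dict mapping a concept to a list of (neighbor, link_weight) pairs.
    seeds: concepts matched directly by the query keywords (hypothetical input).
    Returns a dict of concept -> accumulated activation.
    """
    activation = defaultdict(float)
    for seed in seeds:
        activation[seed] = 1.0

    # The frontier holds the energy that arrived at each node in the last step.
    frontier = {seed: 1.0 for seed in seeds}
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, []):
                passed = energy * weight * decay
                # Drop contributions that are too weak to matter.
                if passed >= threshold:
                    next_frontier[neighbor] += passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


# Toy semantic network (illustrative concepts only).
toy_graph = {
    "retrieval": [("search", 0.8), ("indexing", 0.4)],
    "search": [("ranking", 0.5)],
}
related = spread_activation(toy_graph, ["retrieval"])
```

In this toy network, concepts closer to the seed and connected by stronger links end up with higher activation, which a search engine could then use to expand or re-weight the original keyword query.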

Journal Article
TL;DR: This paper introduces a newly developed XML indexing and retrieval system on Okapi, extends Robertson's field-weighted BM25F for document retrieval to the element-level retrieval function BM25E, and shows how the weights for selected fields were tuned using INEX 2004 topics and assessments.
Abstract: This is the first year that the Centre for Interactive Systems Research has participated in INEX. Based on a newly developed XML indexing and retrieval system on Okapi, we extend Robertson's field-weighted BM25F for document retrieval to the element-level retrieval function BM25E. In this paper, we introduce this new function and our experimental method in detail, and then show how we tuned weights for our selected fields by using INEX 2004 topics and assessments. Based on the tuned models, we submitted our runs for CO.Thorough and CO.FetchBrowse. The methods we propose show real promise. Existing problems and future work are also discussed.

45 citations
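
The field-weighted BM25F function that the abstract above builds on can be sketched as follows. This is one common simple formulation of BM25F, not the paper's BM25E: the function name `bm25f_score`, the single shared `b` parameter, the smoothed IDF variant, and the toy inputs are all assumptions made for illustration. Each field's term frequency is length-normalized, scaled by a field weight, and summed into one pseudo-frequency before the usual BM25 saturation is applied.

```python
import math


def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one document against a query with a simple BM25F formulation.

    doc_fields: dict field name -> list of tokens in that field.
    field_weights: dict field name -> weight (the values that would be tuned).
    avg_field_len: dict field name -> average token count over the collection.
    doc_freq: dict term -> number of documents containing the term.
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Per-field length normalization (one shared b is an assumption).
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights.get(field, 1.0) * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        # Smoothed IDF that stays non-negative (an assumed variant).
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturation is applied once, after combining the fields.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Toy document with two fields (illustrative values only).
doc = {"title": ["xml", "retrieval"],
       "body": ["okapi", "xml", "indexing", "system"]}
weights = {"title": 3.0, "body": 1.0}
avg_len = {"title": 2.0, "body": 4.0}
score = bm25f_score(["xml"], doc, weights, avg_len, {"xml": 1}, num_docs=10)
```

Combining the fields before saturation is the design choice that distinguishes BM25F from scoring each field separately and adding the results; an element-level variant would apply the same idea to XML elements rather than whole documents.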

Book Chapter
01 Sep 2001
TL;DR: The paper hypothesizes that the robustness of spoken document retrieval to recognition errors results from the repetition of important words in the text (so losing one or two occurrences is not crippling) and from additional related words that provide greater context (so those words will still match even if the seemingly critical word is misrecognized).
Abstract: Several years of research have suggested that the accuracy of spoken document retrieval systems is not adversely affected by speech recognition errors. Even with error rates of around 40%, the effectiveness of an IR system falls by less than 10%. The paper hypothesizes that this robust behavior is the result of repetition of important words in the text (meaning that losing one or two occurrences is not crippling) and the result of additional related words providing a greater context (meaning that those words will match even if the seemingly critical word is misrecognized). This hypothesis is supported by examples from TREC's SDR track, the TDT evaluation, and some work showing the impact of recognition errors on spoken queries.

45 citations

Book Chapter
27 Jun 2007
TL;DR: The article gives an overview of the functionalities of the NL understanding environment and of the algorithms behind the text segmentation method, which requires an NLP parser and a semantic representation in Roget-based vectors.
Abstract: Information retrieval needs to match relevant texts with a given query. Selecting appropriate parts is useful when documents are long and only portions are interesting to the user. In this paper, we describe a method that extensively uses natural language techniques for text segmentation based on topic change detection. The method requires an NLP parser and a semantic representation in Roget-based vectors. We have run the experiment on French documents, for which we have the appropriate tools, but the method could be transposed to any other language with the same requirements. The article sketches an overview of the NL understanding environment functionalities and the algorithms related to our text segmentation method. An experiment in text segmentation is also presented, and its result in an information retrieval task is shown.

45 citations

Posted Content
TL;DR: A new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation is presented.
Abstract: Automated fact-checking based on machine learning is a promising approach to identify false information distributed on the web. In order to achieve satisfactory performance, machine learning methods require a large corpus with reliable annotations for the different tasks in the fact-checking process. Having analyzed existing fact-checking corpora, we found that none of them meets these criteria in full. They are either too small in size, do not provide detailed annotations, or are limited to a single domain. Motivated by this gap, we present a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation. To aid future corpus construction, we describe our methodology for corpus creation and annotation, and demonstrate that it results in substantial inter-annotator agreement. As baselines for future research, we perform experiments on our corpus with a number of model architectures that reach high performance in similar problem settings. Finally, to support the development of future models, we provide a detailed error analysis for each of the tasks. Our results show that the realistic, multi-domain setting defined by our data poses new challenges for the existing models, providing opportunities for considerable improvement by future systems.

45 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111