Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6821 publications have been published within this topic, receiving 214383 citations.


Papers
Journal Article
01 Jun 1987
TL;DR: Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach to text signatures, fixed-length bit string representations of document content.
Abstract: This paper considers the use of text signatures, fixed-length bit string representations of document content, in an experimental information retrieval system: such signatures may be generated from the list of keywords characterising a document or a query. A file of documents may be searched in a bit-serial parallel computer, such as the ICL Distributed Array Processor, using a two-level retrieval strategy in which a comparison of a query signature with the file of document signatures provides a simple and efficient means of identifying those few documents that need to undergo a computationally demanding, character matching search. Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach.

46 citations
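
The two-level strategy described in the 1987 abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 64-bit signature width, the three bits set per keyword, the use of SHA-256 for hashing, and all function names are assumptions, and the sketch runs serially rather than on a bit-serial parallel machine.

```python
import hashlib
import re

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_KEYWORD = 3  # assumed number of bit positions chosen per keyword


def keyword_mask(keyword):
    """Map one keyword to a bit mask with at most BITS_PER_KEYWORD bits set."""
    digest = hashlib.sha256(keyword.lower().encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_KEYWORD):
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIGNATURE_BITS
        mask |= 1 << position
    return mask


def signature(keywords):
    """Superimpose (OR together) the masks of all keywords."""
    sig = 0
    for keyword in keywords:
        sig |= keyword_mask(keyword)
    return sig


def extract_keywords(text):
    """Hypothetical keyword extraction: every lowercase alphanumeric token."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def two_level_search(query_keywords, documents):
    """Return ids of documents that contain every query keyword."""
    query_sig = signature(query_keywords)
    # Level 1: cheap signature comparison. It may admit false positives
    # but never rejects a document that holds all query keywords as tokens.
    # (A real system would precompute the document signatures.)
    candidates = [
        doc_id for doc_id, text in documents.items()
        if (signature(extract_keywords(text)) & query_sig) == query_sig
    ]
    # Level 2: costly character matching, run only on the candidates.
    return [
        doc_id for doc_id in candidates
        if all(kw.lower() in documents[doc_id].lower() for kw in query_keywords)
    ]
```

For example, with the documents `{1: "Text signature methods for document retrieval", 2: "Parallel computers", 3: "Retrieval of music"}`, the query `["signature", "retrieval"]` returns `[1]`: the signature filter narrows the file, and the character-matching pass removes any remaining false positives.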

Journal Article
03 Jan 2011
TL;DR: This thesis investigates heuristics for obtaining word-based representations from biomedical text for robust retrieval and proposes a cross-lingual framework for monolingual biomedical IR.
Abstract: In this thesis we investigate the possibility to integrate domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast growing interest in biomedical research, reflected by an exponential growth in scientific literature. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge is, however, far from trivial. The first research theme investigates heuristics for obtaining word-based representations from biomedical text for robust retrieval. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. Document preprocessing heuristics such as stop word removal, stemming, and breakpoint identification and normalization were shown to strongly affect retrieval performance. An effective combination of heuristics was identified to obtain a word-based representation from text for the remainder of this thesis. The second research theme deals with concept-based retrieval. We compared a word-based to a concept-based representation and determined to what extent a manual concept-based representation can be automatically obtained from text. Retrieval based on only concepts was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve word-only retrieval. In the third and last research theme we propose a cross-lingual framework for monolingual biomedical IR. In this framework, the integration of a concept-based representation is viewed as a cross-lingual matching problem involving a word-based and concept-based representation language. This framework gives us the opportunity to adopt a large set of established cross-lingual information retrieval methods and techniques for this domain. Experiments with basic term-to-term translation models demonstrate that this approach can significantly improve word-based retrieval. Directions for future work are using these concepts for communication between user and retrieval system, extending upon the translation models and extending CLIR-enhanced concept-based retrieval outside the biomedical domain. Available online from http://purl.utwente.nl/publications/72481.

46 citations

Journal Article
TL;DR: An interlingua-based indexing approach is proposed to account for the particular challenges that arise in the design and implementation of cross-language document retrieval systems for the medical domain and it is found that translating both documents and user queries into a language-independent, concept-like representation format is more beneficial to enhance cross-language retrieval performance.
Abstract: An interlingua-based indexing approach is proposed to account for the particular challenges that arise in the design and implementation of cross-language document retrieval systems for the medical domain. Methods: Documents, as well as queries, are mapped to a language-independent conceptual layer on which retrieval operations are performed. We contrast this approach with the direct translation of German queries to English ones which, subsequently, are matched against English documents. Results: We evaluate both approaches, interlingua-based and direct translation, on a large medical document collection, the OHSUMED corpus. A substantial benefit for interlingua-based document retrieval using German queries on English texts is found, which amounts to 93% of the (monolingual) English baseline. Conclusions: Most state-of-the-art cross-language information retrieval systems translate user queries to the language(s) of the target documents. In contradistinction to this approach, translating both documents and user queries into a language-independent, concept-like representation format is more beneficial to enhance cross-language retrieval performance.

46 citations
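
The idea of mapping both queries and documents to a language-independent conceptual layer, as described in the abstract above, can be illustrated with a toy sketch. Everything here is hypothetical: the two tiny lexicons, the concept identifiers, the whitespace tokenisation, and the overlap-count ranking are assumptions chosen for illustration and are not the indexing method or resources used in the paper.

```python
# Toy, hypothetical lexicons: surface word -> language-independent concept id.
GERMAN_TO_CONCEPT = {
    "herz": "C_HEART",
    "entzündung": "C_INFLAMMATION",
    "niere": "C_KIDNEY",
}
ENGLISH_TO_CONCEPT = {
    "heart": "C_HEART",
    "cardiac": "C_HEART",
    "inflammation": "C_INFLAMMATION",
    "kidney": "C_KIDNEY",
    "renal": "C_KIDNEY",
}


def to_concepts(text, lexicon):
    """Map whitespace-separated words to the set of concepts they denote."""
    return {lexicon[word] for word in text.lower().split() if word in lexicon}


def rank(german_query, english_documents):
    """Rank English documents by concept overlap with a German query."""
    query_concepts = to_concepts(german_query, GERMAN_TO_CONCEPT)
    scored = []
    for doc_id, text in english_documents.items():
        overlap = len(query_concepts & to_concepts(text, ENGLISH_TO_CONCEPT))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Because matching happens on concept identifiers, the German query "Herz Entzündung" can retrieve an English document that says "cardiac inflammation" even though no surface word is shared and the query itself is never translated into English.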

Journal Article
01 Oct 2016
TL;DR: It is shown that basic schemes are weak, but some of them can be made arbitrarily safe by composing them with large anonymity systems, and the security of each scheme is proved using a flexible differentially private definition for private queries that can capture notions of imperfect privacy.
Abstract: Private Information Retrieval (PIR), despite being well studied, is computationally costly and arduous to scale. We explore lower-cost relaxations of information-theoretic PIR, based on dummy queries, sparse vectors, and compositions with an anonymity system. We prove the security of each scheme using a flexible differentially private definition for private queries that can capture notions of imperfect privacy. We show that basic schemes are weak, but some of them can be made arbitrarily safe by composing them with large anonymity systems.

46 citations
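
The dummy-query relaxation mentioned in the abstract above can be pictured roughly as follows. This is only a sketch of the general idea of hiding one real request among decoys: the function names, the uniform choice of dummies, and the in-memory "server" are assumptions, and the code makes no claim about the privacy guarantees the paper proves.

```python
import random


def build_request(real_index, db_size, num_dummies, rng):
    """Return a shuffled list holding the real index plus distinct dummies."""
    others = [i for i in range(db_size) if i != real_index]
    dummies = rng.sample(others, num_dummies)  # assumed: uniform, no repeats
    request = dummies + [real_index]
    rng.shuffle(request)
    return request


def server_answer(database, request):
    """Hypothetical server: returns every requested record, keyed by index."""
    return {i: database[i] for i in request}


def fetch(database, real_index, num_dummies, rng=None):
    """Retrieve one record while sending num_dummies decoy indices as well."""
    rng = rng or random.SystemRandom()
    request = build_request(real_index, len(database), num_dummies, rng)
    answer = server_answer(database, request)
    # The client keeps only the record it wanted and discards the decoys.
    return answer[real_index], request
```

The server sees `num_dummies + 1` indices and cannot tell from the request alone which one is real. How much protection that gives depends on the number of dummies and on how they are chosen, which is the kind of question the paper's differentially private definition is meant to quantify.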

Proceedings Article
01 Jan 2000
TL;DR: An XML-like language is proposed to describe the user model for music information retrieval purposes and some paradigms to acquire, deploy and share the user information to improve current music information systems are proposed.
Abstract: To make multimedia data easily retrieved, we use metadata to describe the information, so that search engines or other information filter tools can effectively and efficiently locate and retrieve the multimedia content. Since many features of multimedia content are perceptual and user-dependent, user modeling is also necessary for multimedia information retrieval systems, e.g., music information retrieval systems. Furthermore, to make the user models sharable, we need standardized language to describe them. In this paper, an XML-like language is proposed to describe the user model for music information retrieval purposes. We also propose some paradigms to acquire, deploy and share the user information to improve current music information systems. A prototype system, MusicCat, is analyzed and implemented as a case.

45 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111