scispace - formally typeset
Search or ask a question
Topic

Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.


Papers
More filters
Journal ArticleDOI
10 Feb 1995-Science
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Abstract: A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.

630 citations

Book ChapterDOI
01 Jan 1992
TL;DR: A retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation is described.
Abstract: As larger and more heterogeneous text databases become available, information retrieval research will depend on the development of powerful, efficient and flexible retrieval engines. In this paper, we describe a retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation. INQUERY has been used successfully with databases containing nearly 400,000 documents.

629 citations

Book ChapterDOI
22 Nov 2009
TL;DR: This paper proposes a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
Abstract: Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

626 citations

Journal ArticleDOI
TL;DR: The proposed method was designed to disambiguate senses that are usually associated with different topics using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval.
Abstract: Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Both quantitive and qualitative methods have been tried, but much of this work has been stymied by difficulties in acquiring appropriate lexical resources. The availability of this testing and training material has enabled us to develop quantitative disambiguation methods that achieve 92% accuracy in discriminating between two very distinct senses of a noun. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then in the testing phase, we are given a new instance of the noun, and are asked to assign the instance to one of the senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval. The proposed method is probably most appropriate for those aspects of sense disambiguation that are closest to the information retrieval task. In particular, the proposed method was designed to disambiguate senses that are usually associated with different topics.

614 citations

Book
30 Sep 1998
TL;DR: This paper presents a meta-modelling architecture that automates the very labor-intensive and therefore time-heavy and expensive process of integrating structured data and text into a discrete-time system.
Abstract: List of Figures. Preface. 1. Introduction. 2. Retrieval Strategies. 3. Retrieval Utilities. 4. Efficiency Issues pertaining to Sequential IR Systems. 5. Integrating Structured Data and Text. 6. Parallel Information Retrieval Systems. 7. Distributed Information Retrieval. 8. The Text Retrieval Conference (TREC). 9. Future Directions. References.

561 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20239
202239
2021107
2020130
2019144
2018111