Topic
Document retrieval
About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.
Papers published on a yearly basis
Papers
More filters
••
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Abstract: A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
630 citations
••
01 Jan 1992TL;DR: A retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation is described.
Abstract: As larger and more heterogeneous text databases become available, information retrieval research will depend on the development of powerful, efficient and flexible retrieval engines. In this paper, we describe a retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation. INQUERY has been used successfully with databases containing nearly 400,000 documents.
629 citations
••
22 Nov 2009TL;DR: This paper proposes a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
Abstract: Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
626 citations
••
TL;DR: The proposed method was designed to disambiguate senses that are usually associated with different topics using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval.
Abstract: Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Both quantitive and qualitative methods have been tried, but much of this work has been stymied by difficulties in acquiring appropriate lexical resources. The availability of this testing and training material has enabled us to develop quantitative disambiguation methods that achieve 92% accuracy in discriminating between two very distinct senses of a noun. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then in the testing phase, we are given a new instance of the noun, and are asked to assign the instance to one of the senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval. The proposed method is probably most appropriate for those aspects of sense disambiguation that are closest to the information retrieval task. In particular, the proposed method was designed to disambiguate senses that are usually associated with different topics.
614 citations
•
30 Sep 1998TL;DR: This paper presents a meta-modelling architecture that automates the very labor-intensive and therefore time-heavy and expensive process of integrating structured data and text into a discrete-time system.
Abstract: List of Figures. Preface. 1. Introduction. 2. Retrieval Strategies. 3. Retrieval Utilities. 4. Efficiency Issues pertaining to Sequential IR Systems. 5. Integrating Structured Data and Text. 6. Parallel Information Retrieval Systems. 7. Distributed Information Retrieval. 8. The Text Retrieval Conference (TREC). 9. Future Directions. References.
561 citations