Topic

Document retrieval

About: Document retrieval is a research topic. To date, 6,821 publications on this topic have received 214,383 citations.


Papers
Proceedings ArticleDOI
11 Aug 2002
TL;DR: Builds on language-model approaches to document retrieval, in which documents are ranked by the probability that their language model generates the query, including smoothed unigram models and the combined bigram model of Song and Croft.
Abstract: Statistical language models (LMs) have been used in many natural language processing tasks, including speech recognition and machine translation [5, 2]. Recently, language models have been explored as a framework for information retrieval [9, 4, 7, 1, 6]. The basic idea is to view each document as having its own language model and to model querying as a generative process. Documents are ranked based on the probability of their language model generating the given query. Since documents are fixed entities in information retrieval, language models for documents suffer from the sparse data problem. Smoothed unigram models have been used to demonstrate better performance of language models against vector space or probabilistic retrieval models for document retrieval. Song and Croft [10] proposed a general language model that combined bigram language models with Good-Turing estimates and corpus-based smoothing of unigram probabilities. Improved performance was observed with combined bigram language models. The language models explored for information retrieval mimic those used for speech recognition. Specifically, in the bigram model a document d represented as a word sequence w1, w2, ..., wn is modeled as

117 citations
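
The ranking idea in the abstract above (score each document by the probability that its language model generates the query) can be sketched as follows. This is a minimal illustration, not the paper's method: it uses a unigram model with simple linear interpolation against corpus statistics rather than the Good-Turing and bigram combination the abstract mentions, and the function names and the `lam` weight are assumptions made for the example.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, corpus_counts, corpus_len, lam=0.5):
    """log P(query | doc) under a unigram model interpolated with the corpus model.

    lam is an assumed interpolation weight, not a value from the source.
    """
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc) if doc else 0.0
        p_corpus = corpus_counts[term] / corpus_len
        p = lam * p_doc + (1 - lam) * p_corpus
        if p == 0.0:
            # Term unseen in the whole corpus: smoothing cannot help.
            return float("-inf")
        score += math.log(p)
    return score

def rank(query, docs, lam=0.5):
    """Return document names ordered from most to least likely to generate the query."""
    corpus_counts = Counter(t for d in docs.values() for t in d)
    corpus_len = sum(corpus_counts.values())
    scored = [
        (query_log_likelihood(query, d, corpus_counts, corpus_len, lam), name)
        for name, d in docs.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]

docs = {
    "a": ["language", "model", "retrieval"],
    "b": ["sparse", "matrix", "index"],
}
ranking = rank(["language", "model"], docs)
```

Without the corpus term, document "b" would get probability zero for any query word it lacks, which is the sparse data problem the abstract refers to; the interpolation keeps its score finite.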

Patent
11 Jan 2002
TL;DR: A new data structure and algorithms offer at least equal performance in common sparse matrix tasks and improved performance in many; applied to a word-document index, they give fast build and query times for document retrieval.
Abstract: A new data structure and algorithms offer at least equal performance in common sparse matrix tasks, and improved performance in many. They are applied to a word-document index to produce fast build and query times for document retrieval.

117 citations
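
To make the notion of a word-document index as a sparse matrix concrete, here is a small sketch. The patent's actual data structure is not described in this listing, so the code below is only a generic inverted index (a dictionary of dictionaries storing the nonzero term counts); `build_index` and `query` are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: count}; only nonzero cells are stored."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def query(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {1: "sparse matrix index", 2: "sparse document"}
index = build_index(docs)
hits = query(index, ["sparse", "matrix"])
```

Each row of the conceptual term-by-document matrix is one inner dictionary, so build time and memory scale with the number of term occurrences rather than with vocabulary size times collection size.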

Patent
Tetsuya Morita
13 Jan 1989
TL;DR: A document retrieval system employs a keyword connection table containing relation information for connections between pairs of keywords used to retrieve registered documents; the relation information includes at least a relation name and a relationship describing the relation between the two keywords.
Abstract: A document retrieval system employs a keyword connection table which contains relation information of keyword connections, each coupling two arbitrary keywords which are used for retrieving registered documents. The relation information includes at least a relation name and a relationship describing the relation between the two arbitrary keywords. The relation information may dynamically change depending on a frequency of use of the keywords, that is, by a learning function.

116 citations
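
One way to picture the keyword connection table described above is a mapping from keyword pairs to relation records that are adjusted as the keywords are used. The sketch below is an assumption-laden illustration, not the patented design: the numeric `strength` field, the `step` increment, and all class and method names are hypothetical.

```python
class KeywordConnectionTable:
    """Toy table of relations between keyword pairs (illustrative only)."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def _key(a, b):
        # Order-independent key so (a, b) and (b, a) share one entry.
        return tuple(sorted((a, b)))

    def connect(self, a, b, relation_name, strength=1.0):
        self._table[self._key(a, b)] = {"relation": relation_name, "strength": strength}

    def record_use(self, a, b, step=0.1):
        """Assumed learning rule: strengthen a connection each time it is used."""
        entry = self._table.get(self._key(a, b))
        if entry is not None:
            entry["strength"] += step

    def related(self, keyword):
        """Return (other keyword, relation name, strength), strongest first."""
        out = []
        for (a, b), entry in self._table.items():
            if keyword in (a, b):
                other = b if keyword == a else a
                out.append((other, entry["relation"], entry["strength"]))
        return sorted(out, key=lambda r: -r[2])

table = KeywordConnectionTable()
table.connect("car", "automobile", "synonym")
table.connect("car", "engine", "part-of", strength=0.5)
table.record_use("engine", "car", step=1.0)
neighbours = table.related("car")
```

A retrieval front end could use `related` to suggest or expand query keywords, with frequently used connections rising to the top over time.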

Proceedings Article
01 Dec 2008
TL;DR: Two complementary models of blog retrieval that perform at comparable levels of precision and recall are developed, and a Wikipedia-based query expansion strategy yields consistent and significant improvement across all models.
Abstract: We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing and typically multifaceted interest in the topic rather than a passing ad hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user's initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.

114 citations
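
The contrast drawn above between treating the whole blog or its individual posts as the unit of retrieval can be illustrated with a toy scorer. This is only a sketch under strong assumptions: the term-overlap score and the mean aggregation are stand-ins chosen for brevity, and none of the function names come from the paper.

```python
def term_overlap(query, text):
    """Toy relevance score: fraction of query terms that appear in the text."""
    words = set(text.lower().split())
    terms = [t.lower() for t in query]
    return sum(t in words for t in terms) / len(terms) if terms else 0.0

def score_blog_as_unit(query, posts):
    """Treat the blog as one large document made of all its posts."""
    return term_overlap(query, " ".join(posts))

def score_posts_then_aggregate(query, posts):
    """Score each post separately, then average the post scores."""
    if not posts:
        return 0.0
    return sum(term_overlap(query, p) for p in posts) / len(posts)

posts = ["python retrieval tips", "holiday photos"]
query_terms = ["python", "retrieval"]
unit_score = score_blog_as_unit(query_terms, posts)
post_score = score_posts_then_aggregate(query_terms, posts)
```

In this example the blog-level view rewards a single on-topic post, while the post-level average penalizes the off-topic one, which hints at why the two views can behave as complementary models.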

Patent
18 Feb 2009
TL;DR: A translation direction specifying unit specifies a first language and a second language, and a keyword extracting unit extracts a keyword for document retrieval from the first-language or second-language character string, with which a document retrieving unit performs document retrieval.
Abstract: A translation direction specifying unit specifies a first language and a second language. A speech recognizing unit recognizes a speech signal of the first language and outputs a first language character string. A first translating unit translates the first language character string into a second language character string that will be displayed on a display device. A keyword extracting unit extracts a keyword for document retrieval from the first language character string or the second language character string, with which a document retrieving unit performs document retrieval. A second translating unit translates a retrieved document into the other language, which will be displayed on the display device.

114 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111