scispace - formally typeset
Journal ArticleDOI

A language modeling approach to information retrieval

Jay Ponte, +1 more
- Vol. 51, Iss: 2, pp 275-281
Reads0
Chats0
TLDR
It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection and provide further proof of concept for the use of language models for retrieval tasks.
Abstract
In today's world, there is no shortage of information. However, for a specific information need, only a small subset of all of the available information will be useful. The field of information retrieval (IR) is the study of methods to provide users with that small subset of information relevant to their needs and to do so in a timely fashion. Information sources can take many forms, but this thesis will focus on text based information systems and investigate problems germane to the retrieval of written natural language documents. Central to these problems is the notion of "topic." In other words, what are documents about? However, topics depend on the semantics of documents and retrieval systems are not endowed with knowledge of the semantics of natural language. The approach taken in this thesis will be to make use of probabilistic language models to investigate text based information retrieval and related problems. One such problem is the prediction of topic shifts in text, the topic segmentation problem. It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection. Two complementary sets of features are studied individually and then combined into a single language model. The language modeling approach allows this problem to be approached in a principled way without complex semantic modeling. Next, the problem of document retrieval in response to a user query will be investigated. Models of document indexing and document retrieval have been extensively studied over the past three decades. The integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. Much of the reason for this is that the indexing component requires inferences as to the semantics of documents. Instead, an approach to retrieval based on probabilistic language modeling will be presented. Models are estimated for each document individually. The approach to modeling is non-parametric and integrates the entire retrieval process into a single model. One advantage of this approach is that collection statistics, which are used heuristically for the assignment of concept probabilities in other probabilistic models, are used directly in the estimation of language model probabilities in this approach. The language modeling approach has been implemented and tested empirically and performs very well on standard test collections and query sets. In order to improve retrieval effectiveness, IR systems use additional techniques such as relevance feedback, unsupervised query expansion and structured queries. These and other techniques are discussed in terms of the language modeling approach and empirical results are given for several of the techniques developed. These results provide further proof of concept for the use of language models for retrieval tasks.

read more

Content maybe subject to copyright    Report

Citations
More filters
Book

Opinion Mining and Sentiment Analysis

TL;DR: This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems and focuses on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis.
Book

Learning to Rank for Information Retrieval

TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.
Journal ArticleDOI

Relevance-Based Language Models

TL;DR: This work proposes a novel technique for estimating a relevance model with no training data and demonstrates that it can produce highly accurate relevance models, addressing important notions of synonymy and polysemy.
Journal ArticleDOI

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

TL;DR: This paper examines the sensitivity of retrieval performance to the smoothing parameters and compares several popular smoothing methods on different test collection.
Journal ArticleDOI

Understanding inverse document frequency: on theoretical arguments for IDF

TL;DR: It is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
References
More filters
BookDOI

Density estimation for statistics and data analysis

TL;DR: The Kernel Method for Multivariate Data: Three Important Methods and Density Estimation in Action.
Journal ArticleDOI

On Estimation of a Probability Density Function and Mode

TL;DR: In this paper, the problem of the estimation of a probability density function and of determining the mode of the probability function is discussed. Only estimates which are consistent and asymptotically normal are constructed.
Book

Statistical analysis of finite mixture distributions

TL;DR: This course discusses Mathematical Aspects of Mixtures, Sequential Problems and Procedures, and Applications of Finite Mixture Models.
Related Papers (5)