A machine learning approach to building domain-specific search engines

Open Access Proceedings Article

Andrew McCallum, +3 more

- pp 662-667

TLDR

The use of machine learning techniques is proposed to greatly automate the creation and maintenance of domain-specific search engines, and new research in reinforcement learning, text classification, and information extraction is described that enables efficient spidering, populates topic hierarchies, and identifies informative text segments.

Abstract:

Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers available at www.cora.justresearch.com.

Citations

Journal Article

Web mining research: a survey

Raymond Kosala, +1 more

- 01 Jun 2000 -

Sigkdd Explorations

TL;DR: This paper surveys the research in the area of Web mining, points out some confusion regarding the usage of the term Web mining, suggests three Web mining categories, and situates some of the research with respect to these three categories.

Proceedings Article

Focused Crawling Using Context Graphs

Michelangelo Diligenti, +4 more

TL;DR: A focused crawling algorithm is presented that builds a model of the context within which topically relevant pages occur on the web. The context model can capture typical link hierarchies within which valuable pages occur, as well as model content in documents that frequently co-occur with relevant pages.

Journal Article

The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies

David M. Blei, +2 more

- 08 Feb 2010 -

Journal of the ACM

TL;DR: The nested Chinese restaurant process (nCRP) is a stochastic process that assigns probability distributions to ensembles of infinitely deep, infinitely branching trees, and it can be used as a prior distribution in a Bayesian nonparametric model of document collections.

Posted Content

The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies

David M. Blei, +2 more

- 03 Oct 2007 -

arXiv: Machine Learning

TL;DR: An application to information retrieval is presented in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP lead to clustering of documents according to sharing of topics at multiple levels of abstraction.

Learning Hidden Markov Model Structure for Information Extraction

Kristie Seymore, +1 more

TL;DR: It is demonstrated that a manually-constructed model that contains multiple states per extraction field outperforms a model with one state per field, and the use of distantly-labeled data to set model parameters provides a significant improvement in extraction accuracy.

References

Journal Article

Maximum likelihood from incomplete data via the EM algorithm

Arthur P. Dempster, +2 more

- 01 Sep 1977 -

Journal of the Royal Statistical Society...

Journal Article

A tutorial on hidden Markov models and selected applications in speech recognition

Lawrence R. Rabiner

TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.

Journal Article

Reinforcement learning: a survey

Leslie Pack Kaelbling, +2 more

- 01 Jan 1996 -

Journal of Artificial Intelligence Resea...

TL;DR: Central issues of reinforcement learning are discussed, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.

Journal Article

Error bounds for convolutional codes and an asymptotically optimum decoding algorithm

Andrew J. Viterbi

- 01 Apr 1967 -

IEEE Transactions on Information Theory

TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R_{0} and whose performance bears certain similarities to that of sequential decoding algorithms.

Posted Content

Reinforcement Learning: A Survey

Leslie Pack Kaelbling, +2 more

- 01 May 1996 -

arXiv: Artificial Intelligence

TL;DR: A survey of reinforcement learning from a computer science perspective can be found in this article, where the authors discuss the central issues of RL, including trading off exploration and exploitation, establishing the foundations of RL via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.

Related Papers (5)

The anatomy of a large-scale hypertextual Web search engine

Focused crawling: a new approach to topic-specific Web resource discovery

Enhanced hypertext categorization using hyperlinks

Authoritative sources in a hyperlinked environment

A re-examination of text categorization methods