Clustering the tagged web

doi:10.1145/1498759.1498809

Proceedings Article

- pp 54-63

Abstract:

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
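
As an illustration of the first approach described in the abstract, the sketch below runs K-means over vectors that combine page words and tags. It is a toy example under stated assumptions, not the authors' implementation: the sample documents, the `tag_weight` parameter, the choice to L2-normalize the word and tag blocks separately before concatenating them, and the farthest-first initialization are all hypothetical choices made here for clarity.

```python
import math
from collections import Counter


def unit_vector(tokens, vocab):
    """Term-frequency vector over `vocab`, scaled to unit L2 length."""
    counts = Counter(tokens)
    vec = [float(counts[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def combined_vectors(docs, tag_weight=0.5):
    """Concatenate independently normalized word and tag vectors.

    `tag_weight` (hypothetical) is the share of squared length given to tags.
    """
    word_vocab = sorted({w for d in docs for w in d["words"]})
    tag_vocab = sorted({t for d in docs for t in d["tags"]})
    w_scale = math.sqrt(1.0 - tag_weight)
    t_scale = math.sqrt(tag_weight)
    return [
        [w_scale * x for x in unit_vector(d["words"], word_vocab)]
        + [t_scale * x for x in unit_vector(d["tags"], tag_vocab)]
        for d in docs
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(vectors, k):
    """Deterministic initialization (an assumption): start from the first
    vector, then repeatedly add the vector farthest from all chosen centroids."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, iterations=20):
    """Plain K-means: alternate nearest-centroid assignment and mean update."""
    centroids = farthest_first(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy pages: page-text words plus user-assigned tags.
docs = [
    {"words": ["python", "code", "tutorial"], "tags": ["programming", "python"]},
    {"words": ["python", "library", "code"], "tags": ["programming", "dev"]},
    {"words": ["recipe", "pasta", "sauce"], "tags": ["cooking", "food"]},
    {"words": ["recipe", "bread", "oven"], "tags": ["cooking", "baking"]},
]

vectors = combined_vectors(docs, tag_weight=0.5)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two programming pages and the two cooking pages end up in separate clusters. Setting `tag_weight` to 0 reduces the representation to page text alone, which gives a simple way to compare clusterings with and without tags.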

Citations

Proceedings Article

Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora

Daniel Ramage, +3 more

TL;DR: Labeled LDA is introduced, a topic model that constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA's latent topics and user tags that allows Labeled LDA to directly learn word-tag correspondences.

Proceedings Article

#TwitterSearch: a comparison of microblog search and web search

Jaime Teevan, +2 more

TL;DR: This paper explores search behavior on the popular microblogging/social networking site Twitter and observes that people search Twitter to find temporally relevant information and information related to people, and the results returned from the different corpora support these different uses.

Journal Article

Survey on social tagging techniques

Manish Gupta, +3 more

TL;DR: Different techniques employed to study various aspects of tagging are summarized, including properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags and problems associated with tagging usage.

Journal Article

Improving Recommender Systems by Incorporating Social Contextual Information

Hao Ma, +3 more

- 01 Apr 2011 -

ACM Transactions on Information Systems

TL;DR: This article proposes a factor analysis approach based on probabilistic matrix factorization to alleviate the data sparsity and poor prediction accuracy problems by incorporating social contextual information, such as social networks and social tags.

Proceedings Article

Modeling Interestingness with Deep Neural Networks

Jianfeng Gao, +4 more

TL;DR: In this article, the authors use deep neural networks to learn deep semantic models (DSM) of "interestingness" in click transitions between source and target documents derived from web browser logs, which can be used for contextual entity search, automatic text highlighting, prefetching documents of likely interest, automated content recommendation, automated advertisement placement, etc.

References

Journal Article

Latent dirichlet allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

Indexing by Latent Semantic Analysis

Scott Deerwester, +4 more

- 01 Sep 1990 -

Journal of the Association for Informati...

TL;DR: A new method for automatic indexing and retrieval is described that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) to improve the detection of relevant documents on the basis of terms found in queries.

Book

Introduction to Information Retrieval

Christopher D. Manning, +2 more

TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.

Journal Article

Finding scientific topics

Thomas L. Griffiths, +1 more

- 06 Apr 2004 -

Proceedings of the National Academy of S...

TL;DR: A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.