Showing papers by "Ran El-Yaniv" published in 2005


Proceedings ArticleDOI
07 Aug 2005
TL;DR: An extensive empirical study of two-way, three-way and four-way applications of the MDC scheme using six real-world datasets, including the 20 Newsgroups and the Enron email collection, shows that the algorithms consistently and significantly outperform previous state-of-the-art information theoretic clustering algorithms.
Abstract: We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. Focusing on document clustering, we present an extensive empirical study of two-way, three-way and four-way applications of our scheme using six real-world datasets including the 20 Newsgroups (20NG) and the Enron email collection. Our multi-way distributional clustering (MDC) algorithms consistently and significantly outperform previous state-of-the-art information theoretic clustering algorithms.
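
The objective described in the abstract is built from mutual information between pairs of cluster variables. As a rough illustration only, the sketch below computes one such pairwise term from a co-occurrence count table and two cluster assignments. The function name `cluster_mutual_information`, its arguments, and the toy counts are hypothetical and not taken from the paper; the paper's algorithm (interleaved top-down and bottom-up clustering with a correction routine) is not reproduced here.

```python
from math import log2


def cluster_mutual_information(counts, row_clusters, col_clusters):
    """Mutual information (in bits) between row clusters and column clusters.

    counts[i][j] is the co-occurrence count of row item i (e.g. a document)
    and column item j (e.g. a word). row_clusters[i] and col_clusters[j]
    give the cluster label of each item. All names are illustrative.
    """
    # Aggregate item-level counts into a cluster-by-cluster table.
    agg = {}
    total = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c:
                key = (row_clusters[i], col_clusters[j])
                agg[key] = agg.get(key, 0.0) + c
                total += c
    if total == 0:
        raise ValueError("co-occurrence table is empty")

    # Marginal distributions over row clusters and column clusters.
    p_row, p_col = {}, {}
    for (r, c), n in agg.items():
        p_row[r] = p_row.get(r, 0.0) + n / total
        p_col[c] = p_col.get(c, 0.0) + n / total

    # Sum p(r, c) * log2(p(r, c) / (p(r) * p(c))) over nonzero cells.
    return sum(
        (n / total) * log2((n / total) / (p_row[r] * p_col[c]))
        for (r, c), n in agg.items()
    )


# Toy table: two documents, two words, perfectly separated usage.
toy_counts = [[4, 0], [0, 4]]
separate = cluster_mutual_information(toy_counts, [0, 1], [0, 1])
merged = cluster_mutual_information(toy_counts, [0, 0], [0, 1])
```

In this toy case, keeping the two documents in separate clusters preserves more information about the word clusters than merging them does, which is the kind of trade-off a mutual-information objective rewards. A multi-way objective in the spirit of the abstract would combine several such terms, one per interacting pair of variable types.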

122 citations


Journal ArticleDOI
TL;DR: Empirical examination of a recent transductive learning approach based on clustering, implemented with 'spectral clustering', on a suite of benchmark datasets from the UCI repository indicates that the new approach is effective and comparable with one of the best-known transductive learning algorithms to date.

14 citations


Journal ArticleDOI
TL;DR: A model is presented, based on divergence measures and statistics of the alignment structure, that corrects BLAST e-values for low complexity sequences without filtering or excluding them and generates scores that are more effective in distinguishing true similarities from chance similarities.
Abstract: The statistical estimates of BLAST and PSI-BLAST are of extreme importance to determine the biological relevance of sequence matches. While being very effective in evaluating most matches, these es...

12 citations