Journal Article
Enhancing the stability and efficiency of spectral ordering with partial supervision and feature selection
TL;DR: This work proposes a novel semi-supervised spectral ordering algorithm that modifies the Laplacian matrix such that domain knowledge is taken into account and demonstrates the effectiveness of the proposed framework on the seriation of Usenet newsgroup messages.
Abstract:
Several studies have demonstrated the prospects of spectral ordering for data mining. One successful application is seriation of paleontological findings, i.e. ordering the sites of excavation, using data on mammal co-occurrences only. However, spectral ordering ignores the background knowledge that is naturally present in the domain: paleontologists can derive the ages of the sites within some accuracy. On the other hand, the age information is uncertain, so the best approach would be to combine the background knowledge with the information on mammal co-occurrences. Motivated by this kind of partial supervision we propose a novel semi-supervised spectral ordering algorithm that modifies the Laplacian matrix such that domain knowledge is taken into account. Also, it performs feature selection by discarding features that contribute most to the unwanted variability of the data in bootstrap sampling. Moreover, we demonstrate the effectiveness of the proposed framework on the seriation of Usenet newsgroup messages, where the task is to find out the underlying flow of discussion. The theoretical properties of our algorithm are thoroughly analyzed and it is demonstrated that the proposed framework enhances the stability of the spectral ordering output and induces computational gains.
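For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of plain (unsupervised) spectral ordering: items are sorted by the eigenvector of the second-smallest eigenvalue of the graph Laplacian (the Fiedler vector). It is not the semi-supervised algorithm proposed in the paper, whose Laplacian modification and bootstrap-based feature selection are not specified in this abstract. The function name `spectral_order`, the toy occurrence matrix, and the choice of the unnormalized Laplacian with co-occurrence counts as similarity are all assumptions made for illustration.

```python
import numpy as np


def spectral_order(similarity):
    """Return item indices sorted by the Fiedler vector of L = D - S.

    Assumes `similarity` is a symmetric, non-negative matrix whose graph
    is connected. The direction of the ordering is arbitrary, because an
    eigenvector is only defined up to sign.
    """
    S = np.asarray(similarity, dtype=float)
    laplacian = np.diag(S.sum(axis=1)) - S
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue
    return np.argsort(fiedler)


# Hypothetical toy data: 6 sites (rows) and 5 taxa (columns), where each
# taxon occurs at exactly two chronologically adjacent sites.
n_sites = 6
occurrence = np.zeros((n_sites, n_sites - 1))
for k in range(n_sites - 1):
    occurrence[k:k + 2, k] = 1.0

# Hide the chronology: row r of `shuffled` is the site whose true position
# is shuffle[r].
shuffle = np.array([3, 0, 5, 1, 4, 2])
shuffled = occurrence[shuffle]

# Co-occurrence counts between sites serve as the similarity matrix.
similarity = shuffled @ shuffled.T
order = spectral_order(similarity)

# True positions listed in the recovered order: ascending or descending.
recovered = shuffle[order]
print(recovered)
```

In this toy case the similarity graph is a simple path, so the Fiedler vector is strictly monotone along the true chronology and the ordering is recovered up to reversal. Real co-occurrence data are noisy, which is the situation the abstract's partial supervision and feature selection are meant to address.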
Citations
Journal Article
Accelerating spectral clustering with partial supervision
TL;DR: A semi-supervised framework for spectral clustering that provably improves the efficiency of the Power Method for computing the Spectral Clustering solution and demonstrates that the efficiency can be enhanced not only by data compression but also by introducing the appropriate supervised bias to the input Laplacian matrix.
Journal Article
Live and learn from mistakes: A lightweight system for document classification
TL;DR: The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naive Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
Journal Article
Feature selection for k-means clustering stability: theoretical analysis and an algorithm
TL;DR: The proposed algorithmic setup is based on a Sparse PCA approach that greedily selects the features that maximize stability. It is demonstrated in the context of cancer research, where the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes.
Proceedings Article
Mind the eigen-gap, or how to accelerate semi-supervised spectral learning algorithms
TL;DR: It has been demonstrated that the appropriate use of partial supervision can bias the data Laplacian matrix such that the necessary eigenvector computations are provably accelerated.
Posted Content
Combinatorial algorithms for the seriation problem
TL;DR: This thesis studies the seriation problem, a combinatorial problem arising in data analysis, which asks to sequence a set of objects in such a way that similar objects are ordered close to each other, and focuses on the combinatorial structure and properties of Robinsonian matrices, a special class of structured matrices which best achieve the seriation goal.
References
Journal Article
The Anatomy of a Large-Scale Hypertextual Web Search Engine.
Sergey Brin, Lawrence Page
TL;DR: This paper presents Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and to produce much more satisfying search results than existing systems.
Journal Article
A tutorial on spectral clustering
TL;DR: In this article, the authors present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches, and discuss the advantages and disadvantages of these algorithms.
Book
The algebraic eigenvalue problem
TL;DR: Contents: theoretical background; perturbation theory; error analysis; solution of linear algebraic equations; Hermitian matrices; reduction of a general matrix to condensed form; eigenvalues of matrices of condensed forms; the LR and QR algorithms; iterative methods; bibliography.
Journal Article
Top 10 algorithms in data mining
Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus S. K. Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg
TL;DR: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
Book
Matrix perturbation theory
G. W. Stewart, Ji-guang Sun
TL;DR: This book covers norms and metrics, linear systems and least squares problems, the perturbation of eigenvalues, invariant subspaces, and generalized eigenvalue problems.
Related Papers (5)
Analysis of spectral clustering algorithms for community detection: the general bipartite setting
Zhixin Zhou, Arash A. Amini