
Showing papers by "Joydeep Ghosh" published in 2005


Proceedings ArticleDOI
01 Dec 2005
TL;DR: This paper proposes and analyzes parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences, and shows that there is a bijection between regular exponential families and a large class of Bregman divergences, called regular Bregman divergences.
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.
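
The hard clustering half of this result admits a very compact implementation. Below is a minimal sketch (ours, not the authors' code) of Bregman hard clustering with a pluggable divergence; the notable property from the paper is that the loss-minimizing cluster representative is the arithmetic mean for every Bregman divergence, so only the assignment step differs from classical kmeans:

```python
# Sketch of Bregman hard clustering; function names are ours.
import numpy as np

def squared_euclidean(x, mu):
    return np.sum((x - mu) ** 2, axis=-1)

def generalized_kl(x, mu, eps=1e-12):
    # Generalized I-divergence: the Bregman divergence of sum(x * log x)
    x, mu = x + eps, mu + eps
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_hard_cluster(X, k, divergence, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment: nearest centroid under the chosen Bregman divergence
        d = np.stack([divergence(X, mu) for mu in centroids], axis=1)
        labels = d.argmin(axis=1)
        # Update: arithmetic mean, regardless of which divergence is used
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy usage: nonnegative data so the KL-type divergence is well defined
X = np.abs(np.random.default_rng(1).normal(size=(200, 5)))
labels, mus = bregman_hard_cluster(X, k=3, divergence=generalized_kl)
```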

1,723 citations


Journal ArticleDOI
TL;DR: This work investigates two approaches based on the concept of random forests of classifiers implemented within a binary hierarchical multiclassifier system, with the goal of achieving improved generalization of the classifier in analysis of hyperspectral data, particularly when the quantity of training data is limited.
Abstract: Statistical classification of hyperspectral data is challenging because the inputs are high in dimension and represent multiple classes that are sometimes quite mixed, while the amount and quality of ground truth in the form of labeled data are typically limited. The resulting classifiers are often unstable and have poor generalization. This work investigates two approaches based on the concept of random forests of classifiers implemented within a binary hierarchical multiclassifier system, with the goal of achieving improved generalization of the classifier in analysis of hyperspectral data, particularly when the quantity of training data is limited. A new classifier is proposed that incorporates bagging of training samples and adaptive random subspace feature selection within a binary hierarchical classifier (BHC), such that the number of features selected at each node of the tree depends on the quantity of associated training data. Results are compared to a random forest implementation based on the framework of classification and regression trees. For both methods, classification results obtained from experiments on data acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer instrument over the Kennedy Space Center, Florida, and by Hyperion on the NASA Earth Observing 1 satellite over the Okavango Delta of Botswana are superior to those from the original best basis BHC algorithm and a random subspace extension of the BHC.
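
As a self-contained illustration of the two ingredients the paper combines, bagging of training samples and random subspace feature selection, here is a minimal sketch using flat decision trees rather than the paper's BHC tree; the class name and parameters are ours:

```python
# Sketch of bagging + random subspace selection; assumes integer labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomSubspaceBagging:
    def __init__(self, n_estimators=25, n_features=10, seed=0):
        self.n_estimators, self.n_features = n_estimators, n_features
        self.rng = np.random.default_rng(seed)
        self.models = []

    def fit(self, X, y):
        n, d = X.shape
        for _ in range(self.n_estimators):
            rows = self.rng.choice(n, n, replace=True)   # bootstrap sample
            cols = self.rng.choice(d, self.n_features, replace=False)  # random subspace
            tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
            self.models.append((tree, cols))
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X[:, cols]) for m, cols in self.models])
        # Majority vote over the ensemble, per test sample
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```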

984 citations


Journal ArticleDOI
TL;DR: A generative mixture-model approach to clustering directional data based on the von Mises-Fisher distribution, which arises naturally for data distributed on the unit hypersphere, and derives and analyzes two variants of the Expectation Maximization framework for estimating the mean and concentration parameters of this mixture.
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
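
The spherical kmeans special case mentioned above is easy to state in code. A minimal sketch (function name is ours): points and mean directions are kept on the unit hypersphere, and assignment uses cosine similarity:

```python
# Sketch of spherical kmeans (kmeans with cosine similarity).
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto unit hypersphere
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), k, replace=False)]        # initial unit-norm mean directions
    for _ in range(n_iter):
        labels = (X @ M.T).argmax(axis=1)              # assign by cosine similarity
        for j in range(k):
            if np.any(labels == j):
                m = X[labels == j].sum(axis=0)
                M[j] = m / np.linalg.norm(m)           # re-normalize the mean direction
    return labels, M
```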

869 citations


Journal ArticleDOI
TL;DR: This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher distributions.
Abstract: This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.

261 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This paper interprets an overlapping clustering model proposed by Segal et al. as a generalization of Gaussian mixture models and extends it to mixtures of any regular exponential family distribution and the corresponding Bregman divergence.
Abstract: While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of 20-Newsgroups and EachMovie datasets.
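
To make the model concrete, here is a rough sketch of the Gaussian (squared-loss) special case, in which each point is approximated by the sum of the means of its active clusters, X ≈ MA with binary memberships M. The greedy membership update is a simplification of ours, not the paper's exact procedure:

```python
# Sketch of overlapping clustering under squared loss; alternates between
# solving for cluster means given memberships and greedily updating the
# binary membership row of each point.
import numpy as np

def fit_overlapping(X, k, n_iter=20, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    M = rng.integers(0, 2, size=(n, k)).astype(float)
    M[M.sum(axis=1) == 0, 0] = 1                   # every point joins >= 1 cluster
    for _ in range(n_iter):
        A = np.linalg.lstsq(M, X, rcond=None)[0]   # cluster means given memberships
        for i in range(n):                         # greedy membership update
            m, best_err = np.zeros(k), np.inf
            while True:
                errs = [np.sum((X[i] - np.minimum(m + np.eye(k)[j], 1) @ A) ** 2)
                        for j in range(k)]
                j = int(np.argmin(errs))
                if errs[j] >= best_err:
                    break                          # adding a cluster no longer helps
                m[j], best_err = 1, errs[j]
            M[i] = m
    A = np.linalg.lstsq(M, X, rcond=None)[0]
    return M, A
```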

218 citations


Proceedings ArticleDOI
25 Jul 2005
TL;DR: The shortest path k-nearest neighbor classifier (SkNN), which utilizes nonlinear manifold learning, is proposed for analysis of hyperspectral data; its classification accuracies and generalization capability are compared to those achieved by the best basis binary hierarchical classifier, the hierarchical support vector machine classifier, and the k-nearest neighbor classifier on both the original data and a subset of its principal components.
Abstract: The shortest path k-nearest neighbor classifier (SkNN), which utilizes nonlinear manifold learning, is proposed for analysis of hyperspectral data. In contrast to classifiers that deal with the high dimensional feature space directly, this approach uses the pairwise distance matrix over a nonlinear manifold to classify novel observations. Because manifold learning preserves the local pairwise distances and updates distances of a sample to samples beyond the user-defined neighborhood along the shortest path on the manifold, similar samples are moved into closer proximity. High classification accuracies are achieved by using the simple k-nearest neighbor (kNN) classifier. SkNN was applied to hyperspectral data collected by the Hyperion sensor on the EO1 satellite over the Okavango Delta of Botswana. Classification accuracies and generalization capability are compared to those achieved by the best basis binary hierarchical classifier, the hierarchical support vector machine classifier, and the k-nearest neighbor classifier on both the original data and a subset of its principal components.
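
A minimal sketch of the shortest-path idea (ours, with simplified parameters): build a k-nearest-neighbor graph over training and test points together, compute geodesic distances along the graph, and run kNN classification in that distance:

```python
# Sketch of geodesic-distance kNN classification; assumes integer labels.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def sknn_predict(X_train, y_train, X_test, n_neighbors=8, k_classify=3):
    X = np.vstack([X_train, X_test])
    G = kneighbors_graph(X, n_neighbors, mode='distance')  # local Euclidean edges
    D = shortest_path(G, directed=False)                   # geodesic distances on the graph
    n = len(X_train)
    preds = []
    for i in range(n, len(X)):                             # each test point
        nn = np.argsort(D[i, :n])[:k_classify]             # nearest training points
        preds.append(np.bincount(y_train[nn]).argmax())    # majority vote
    return np.array(preds)
```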

61 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: This work proposes a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents and shows that it groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels.
Abstract: While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents. We show that our technique groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels. The learned taxonomy is then used along with binary SVMs for multi-class classification. We demonstrate the efficacy of our approach by testing it on the 20-Newsgroup dataset.
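
The core step, grouping similar classes into a tree, can be sketched as agglomerative clustering over class centroids; this simplified version (function name ours) omits the binary SVMs that would then be trained at each internal node of the learned taxonomy:

```python
# Sketch of deriving a class taxonomy from labeled documents: represent
# each class by its mean document vector and cluster the centroids.
import numpy as np
from scipy.cluster.hierarchy import linkage

def class_taxonomy(X, y):
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    # Ward linkage over class centroids yields a binary tree of classes
    return linkage(centroids, method='ward'), classes
```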

56 citations


Proceedings ArticleDOI
07 Aug 2005
TL;DR: This work presents several modifications to OC-IB and integrates it with a global search that results in several improvements such as deterministic results, optimality guarantees, control over cluster size and extension to other cost functions.
Abstract: Unsupervised learning methods often involve summarizing the data using a small number of parameters. In certain domains, only a small subset of the available data is relevant for the problem. One-Class Classification or One-Class Clustering attempts to find a useful subset by locating a dense region in the data. In particular, a recently proposed algorithm called One-Class Information Ball (OC-IB) shows the advantage of modeling a small set of highly coherent points as opposed to pruning outliers. We present several modifications to OC-IB and integrate it with a global search that results in several improvements such as deterministic results, optimality guarantees, control over cluster size and extension to other cost functions. Empirical studies yield significantly better results on various real and artificial data.
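
As a rough illustration of one-class clustering with a global search and explicit control over cluster size (a simplification of ours, not OC-IB itself): pick the candidate center whose s nearest points form the most coherent ball:

```python
# Sketch of dense-region one-class clustering via exhaustive global search.
import numpy as np

def one_class_cluster(X, s):
    best_cost, best = np.inf, None
    for i in range(len(X)):                          # global search over candidate centers
        d = np.sort(np.sum((X - X[i]) ** 2, axis=1))
        cost = d[:s].sum()                           # coherence of the s-point ball
        if cost < best_cost:
            best_cost, best = cost, i
    members = np.argsort(np.sum((X - X[best]) ** 2, axis=1))[:s]
    return X[best], members                          # center and cluster of fixed size s
```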

49 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of the local data, which are then integrated to obtain a global probabilistic model based on the union of the features available at all the sites.
Abstract: We present a probabilistic model-based framework for distributed learning that takes into account privacy restrictions and is applicable to scenarios where the different sites have diverse, possibly overlapping subsets of features. Our framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of the local data, which are then integrated to obtain a global probabilistic model based on the union of the features available at all the sites. We provide a mathematical formulation of the model integration problem using the maximum likelihood and maximum entropy principles and describe iterative algorithms that are guaranteed to converge to the optimal solution. For certain commonly occurring special cases involving hierarchically ordered feature sets or conditional independence, we obtain closed form solutions and use these to propose an efficient alternative scheme by recursive decomposition of the model integration problem. To address interpretability concerns, we also present a modified formulation where the global model is assumed to belong to a specified parametric family. Finally, to highlight the generality of our framework, we provide empirical results for various learning tasks such as clustering and classification on different kinds of datasets consisting of continuous vector, categorical and directional attributes. The results show that high quality global models can be obtained without much loss of privacy.
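
The conditional independence special case with a closed-form solution can be illustrated on discrete toy distributions; in the setup below (variable names ours), site 1 holds features (A, B) and site 2 holds (B, C), with A and C conditionally independent given the shared B:

```python
# Sketch of closed-form model integration: p(a,b,c) = p1(a,b) * p2(c|b).
# Only the local model parameters cross site boundaries, never raw data.
import numpy as np

rng = np.random.default_rng(0)
p_ab = rng.random((2, 3)); p_ab /= p_ab.sum()   # site 1's local model: p(A, B)
p_bc = rng.random((3, 4)); p_bc /= p_bc.sum()   # site 2's local model: p(B, C)

p_c_given_b = p_bc / p_bc.sum(axis=1, keepdims=True)
# Global model over the union of features A, B, C
p_abc = p_ab[:, :, None] * p_c_given_b[None, :, :]
assert np.isclose(p_abc.sum(), 1.0)             # a valid joint distribution
```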

42 citations


Journal ArticleDOI
TL;DR: A general framework for distributed clustering that takes privacy requirements into account by building probabilistic models of the data at each local site, whose parameters are then transmitted to a central location; experiments show that high quality global clusters can be achieved with little loss of privacy.

37 citations


Proceedings ArticleDOI
27 Nov 2005
TL;DR: A robust and efficient framework for unsupervised discovery of structure in data that summarizes the data with multiple prototypes; clustering the prototypes enables the algorithm to scale up to extremely large and high-dimensional domains such as text data.
Abstract: We introduce a robust and efficient framework called CLUMP (CLustering Using Multiple Prototypes) for unsupervised discovery of structure in data. CLUMP relies on finding multiple prototypes that summarize the data. Clustering the prototypes enables our algorithm to scale up to extremely large and high-dimensional domains such as text data. Other desirable properties include robustness to noise and parameter choices. In this paper, we describe the approach in detail, characterize its performance on a variety of datasets, and compare it to some existing model selection approaches.
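
A minimal sketch of the two-stage idea behind CLUMP (a loose approximation, not the published algorithm): summarize the data with many prototypes, then cluster only the small prototype set, which is what makes the scheme scale:

```python
# Sketch of prototype-based two-stage clustering.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def clump_like(X, n_prototypes=100, n_clusters=5):
    km = KMeans(n_clusters=n_prototypes, n_init=1, random_state=0).fit(X)
    protos = km.cluster_centers_                 # compact summary of the data
    meta = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(protos)
    return meta[km.labels_]                      # map each point through its prototype
```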

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter proposes a relationship-based approach to clustering such data that tries to sidestep the “curse-of-dimensionality” issue by working in a suitable similarity space instead of the original high-dimensional feature space.
Abstract: Transaction analysis, including clustering of market baskets, is a key application of data mining to the retail industry. This domain has some specific requirements, such as the need for obtaining easily interpretable and actionable results. It also exhibits some very challenging characteristics, mostly stemming from the fact that the data have thousands of features and are highly non-Gaussian and sparse. This chapter proposes a relationship-based approach to clustering such data that tries to sidestep the “curse-of-dimensionality” issue by working in a suitable similarity space instead of the original high-dimensional feature space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging clusters can be easily derived, and it also guides the user toward a suitable number of clusters. Results are presented on a real retail industry data set of several thousand customers and products.
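
The visualization step can be sketched as follows (using spectral clustering as a stand-in for the graph-partitioning method the chapter applies; names are ours): cluster in the similarity space, then permute the similarity matrix by cluster label so that clusters appear as bands:

```python
# Sketch of the banded similarity-matrix visualization; S is a precomputed
# (nonnegative) pairwise similarity matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering

def banded_similarity_plot(S, n_clusters):
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(S)
    order = np.argsort(labels)                   # group points by cluster
    plt.imshow(S[np.ix_(order, order)], cmap='viridis')
    plt.title('Permuted similarity matrix: clusters appear as bands')
    plt.show()
```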

Proceedings Article
09 Jul 2005
TL;DR: A maximum likelihood based framework which exploits the hierarchical structure of the taxonomies to obtain a more natural mapping between the source classes and the master taxonomy.
Abstract: Many approaches have been proposed for the problem of mapping categories (classes) from a source taxonomy to classes in a master taxonomy. Most of these techniques, however, ignore the hierarchical structure of the taxonomies. In this paper, we propose a maximum likelihood based framework which exploits the hierarchical structure to obtain a more natural mapping between the source classes and the master taxonomy. Furthermore, unlike previous work, our technique also inserts source classes into appropriate places of the master hierarchy creating new categories if required. We evaluate our approach on text and hyperspectral datasets.

Proceedings ArticleDOI
24 Oct 2005
TL;DR: An evaluation of clustering algorithms using these metrics shows that the CLARANS clustering algorithm produces better quality clusters in the feature space and more homogeneous phases for CPI compared to the popular k-means algorithm.
Abstract: We propose a set of statistical metrics for making a comprehensive, fair, and insightful evaluation of features, clustering algorithms, and distance measures in representative sampling techniques for microprocessor simulation. Our evaluation of clustering algorithms using these metrics shows that the CLARANS clustering algorithm produces better quality clusters in the feature space and more homogeneous phases for CPI compared to the popular k-means algorithm. We also propose a new micro-architecture independent data locality based feature, reuse distance distribution (RDD), for finding phases in programs, and show that the RDD feature consistently results in more homogeneous phases than basic block vector (BBV) for many SPEC CPU2000 benchmark programs.
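
The RDD feature itself is simple to compute. A minimal sketch (ours): the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address, and the RDD is the histogram of these distances over a trace:

```python
# Sketch of computing a reuse distance distribution from an address trace.
from collections import Counter

def reuse_distance_distribution(trace):
    stack, hist = [], Counter()          # LRU stack of addresses seen so far
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)        # distinct addresses since last use
            stack.remove(addr)
            hist[d] += 1
        else:
            hist['inf'] += 1             # first touch: infinite reuse distance
        stack.insert(0, addr)            # move address to top of the stack
    return hist

# e.g. Counter({'inf': 3, 2: 2, 1: 1})
print(reuse_distance_distribution(['a', 'b', 'a', 'c', 'b', 'a']))
```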


Book ChapterDOI
13 Jun 2005
TL;DR: A knowledge transfer framework is proposed that leverages the information extracted from existing labeled data to classify spatially separate and multitemporal test data; results show that in the absence of any labeled data in the new area, the approach is better than a direct application of the original classifier on the new data.
Abstract: Obtaining ground truth for hyperspectral data is an expensive task. In addition, a number of factors cause the spectral signatures of the same class to vary with location and/or time. Therefore, adapting a classifier designed from available labeled data to classify new hyperspectral images is difficult, but invaluable to the remote sensing community. In this paper, we use the Binary Hierarchical Classifier to propose a knowledge transfer framework that leverages the information gathered from existing labeled data to classify the data obtained from a spatially separate test area. Experimental results show that in the absence of any labeled data in the new area, our approach is better than a direct application of the old classifier on the new data. Moreover, when small amounts of labeled data are available from the new area, our framework offers further improvements through semi-supervised learning mechanisms.

Book ChapterDOI
27 Aug 2005
TL;DR: Classification results comparing the PPS to Gaussian Mixture Models and K-nearest neighbours show the PPS classifier to be promising, especially for high-D data.
Abstract: In this paper we propose using manifolds modeled by probabilistic principal surfaces (PPS) to characterize and classify high-D data. The PPS can be thought of as a nonlinear probabilistic generalization of principal components, as it is designed to pass through the "middle" of the data. In fact, the PPS can map a manifold of any simple topology (as long as it can be described by a set of ordered vector co-ordinates) to data in high-dimensional space. In classification problems, each class of data is represented by a PPS manifold of varying complexity. Experiments using various PPS topologies, from a 1-D line to a 3-D spherical shell, were conducted on two toy classification datasets and three UCI Machine Learning datasets. Classification results comparing the PPS to Gaussian Mixture Models and K-nearest neighbours show the PPS classifier to be promising, especially for high-D data.

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter presents a modular learning framework called the Binary Hierarchical Classifier (BHC) that takes a coarse-to-fine approach to dealing with a large number of output classes and yields more interpretable models.
Abstract: Many complex pattern classification problems involve high-dimensional inputs as well as a large number of classes. In this chapter, we present a modular learning framework called the Binary Hierarchical Classifier (BHC) that takes a coarse-to-fine approach to dealing with a large number of output classes. BHC decomposes a C-class problem into a set of C-1 two-(meta)class problems, arranged in a binary tree with C leaf nodes and C-1 internal nodes. Each internal node consists of a feature extractor and a classifier that discriminates between the two meta-classes represented by its two children. Both bottom-up and top-down approaches for building such a BHC are presented in this chapter. The Bottom-up Binary Hierarchical Classifier (BU-BHC) is built by applying agglomerative clustering to the set of C classes. The Top-down Binary Hierarchical Classifier (TD-BHC) is built by recursively partitioning a set of classes at any internal node into two disjoint groups or meta-classes. The coupled problems of finding a good partition and of searching for a linear feature extractor that best discriminates the two resulting meta-classes are solved simultaneously at each stage of the recursive algorithm. The hierarchical, multistage classification approach taken by the BHC also helps in dealing with high-dimensional data, since simpler feature spaces are often adequate for solving the two-(meta)class problems. In addition, it leads to the discovery of useful domain knowledge such as class hierarchies or ontologies, and yields more interpretable models.
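
A minimal sketch of the top-down construction (ours; it substitutes 2-means on class means and logistic regression for the chapter's coupled partitioning and feature extraction): recursively split the class set into two meta-classes and train one binary classifier per internal node, so C classes yield C-1 internal nodes:

```python
# Sketch of a top-down binary hierarchical classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def build_bhc(X, y, classes=None):
    classes = np.unique(y) if classes is None else classes
    if len(classes) == 1:
        return classes[0]                               # leaf node: a single class
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(means)
    left, right = classes[side == 0], classes[side == 1]
    mask = np.isin(y, classes)
    meta = np.isin(y[mask], right).astype(int)          # 0 = left meta-class, 1 = right
    clf = LogisticRegression(max_iter=1000).fit(X[mask], meta)
    return (clf, build_bhc(X, y, left), build_bhc(X, y, right))

def bhc_predict_one(node, x):
    while isinstance(node, tuple):                      # descend the binary tree
        clf, left, right = node
        node = right if clf.predict(x[None])[0] else left
    return node
```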