
Showing papers by "Joydeep Ghosh" published in 2005


Proceedings ArticleDOI
01 Dec 2005
TL;DR: This paper proposes and analyzes parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences, and shows that there is a bijection between regular exponential families and a large class of Bregman divergences, called regular Bregman divergences.
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.
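
The hard clustering half of this result admits a very compact implementation. Below is a minimal sketch (ours, not the authors' code) of Bregman hard clustering with a pluggable divergence; the notable property from the paper is that the loss-minimizing cluster representative is the arithmetic mean for every Bregman divergence, so only the assignment step differs from classical kmeans:

```python
# Sketch of Bregman hard clustering; function names are ours.
import numpy as np

def squared_euclidean(x, mu):
    return np.sum((x - mu) ** 2, axis=-1)

def generalized_kl(x, mu, eps=1e-12):
    # Generalized I-divergence: the Bregman divergence of sum(x * log x)
    x, mu = x + eps, mu + eps
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_hard_cluster(X, k, divergence, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment: nearest centroid under the chosen Bregman divergence
        d = np.stack([divergence(X, mu) for mu in centroids], axis=1)
        labels = d.argmin(axis=1)
        # Update: arithmetic mean, regardless of which divergence is used
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy usage: nonnegative data so the KL-type divergence is well defined
X = np.abs(np.random.default_rng(1).normal(size=(200, 5)))
labels, mus = bregman_hard_cluster(X, k=3, divergence=generalized_kl)
```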

1,723 citations


Journal ArticleDOI
TL;DR: This work investigates two approaches based on the concept of random forests of classifiers implemented within a binary hierarchical multiclassifier system, with the goal of achieving improved generalization of the classifier in analysis of hyperspectral data, particularly when the quantity of training data is limited.
Abstract: Statistical classification of hyperspectral data is challenging because the inputs are high in dimension and represent multiple classes that are sometimes quite mixed, while the amount and quality of ground truth in the form of labeled data are typically limited. The resulting classifiers are often unstable and have poor generalization. This work investigates two approaches based on the concept of random forests of classifiers implemented within a binary hierarchical multiclassifier system, with the goal of achieving improved generalization of the classifier in analysis of hyperspectral data, particularly when the quantity of training data is limited. A new classifier is proposed that incorporates bagging of training samples and adaptive random subspace feature selection within a binary hierarchical classifier (BHC), such that the number of features selected at each node of the tree depends on the quantity of associated training data. Results are compared to a random forest implementation based on the framework of classification and regression trees. For both methods, classification results obtained from experiments on data acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer instrument over the Kennedy Space Center, Florida, and by Hyperion on the NASA Earth Observing 1 satellite over the Okavango Delta of Botswana are superior to those from the original best basis BHC algorithm and a random subspace extension of the BHC.
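
As a self-contained illustration of the two ingredients the paper combines, bagging of training samples and random subspace feature selection, here is a minimal sketch using flat decision trees rather than the paper's BHC tree; the class name and parameters are ours:

```python
# Sketch of bagging + random subspace selection; assumes integer labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomSubspaceBagging:
    def __init__(self, n_estimators=25, n_features=10, seed=0):
        self.n_estimators, self.n_features = n_estimators, n_features
        self.rng = np.random.default_rng(seed)
        self.models = []

    def fit(self, X, y):
        n, d = X.shape
        for _ in range(self.n_estimators):
            rows = self.rng.choice(n, n, replace=True)   # bootstrap sample
            cols = self.rng.choice(d, self.n_features, replace=False)  # random subspace
            tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
            self.models.append((tree, cols))
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X[:, cols]) for m, cols in self.models])
        # Majority vote over the ensemble, per test sample
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```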

984 citations


Journal ArticleDOI
TL;DR: A generative mixture-model approach to clustering directional data based on the von Mises-Fisher distribution, which arises naturally for data distributed on the unit hypersphere, and derives and analyzes two variants of the Expectation Maximization framework for estimating the mean and concentration parameters of this mixture.
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
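
The spherical kmeans special case mentioned above is easy to state in code. A minimal sketch (function name is ours): points and mean directions are kept on the unit hypersphere, and assignment uses cosine similarity:

```python
# Sketch of spherical kmeans (kmeans with cosine similarity).
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto unit hypersphere
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), k, replace=False)]        # initial unit-norm mean directions
    for _ in range(n_iter):
        labels = (X @ M.T).argmax(axis=1)              # assign by cosine similarity
        for j in range(k):
            if np.any(labels == j):
                m = X[labels == j].sum(axis=0)
                M[j] = m / np.linalg.norm(m)           # re-normalize the mean direction
    return labels, M
```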

869 citations


Journal ArticleDOI
TL;DR: This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher distributions.
Abstract: This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.

261 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This paper interprets an overlapping clustering model proposed by Segal et al. as a generalization of Gaussian mixture models and extends it to mixtures of any regular exponential family distribution and the corresponding Bregman divergence.
Abstract: While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of 20-Newsgroups and EachMovie datasets.
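
To make the model concrete, here is a rough sketch of the Gaussian (squared-loss) special case, in which each point is approximated by the sum of the means of its active clusters, X ≈ MA with binary memberships M. The greedy membership update is a simplification of ours, not the paper's exact procedure:

```python
# Sketch of overlapping clustering under squared loss; alternates between
# solving for cluster means given memberships and greedily updating the
# binary membership row of each point.
import numpy as np

def fit_overlapping(X, k, n_iter=20, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    M = rng.integers(0, 2, size=(n, k)).astype(float)
    M[M.sum(axis=1) == 0, 0] = 1                   # every point joins >= 1 cluster
    for _ in range(n_iter):
        A = np.linalg.lstsq(M, X, rcond=None)[0]   # cluster means given memberships
        for i in range(n):                         # greedy membership update
            m, best_err = np.zeros(k), np.inf
            while True:
                errs = [np.sum((X[i] - np.minimum(m + np.eye(k)[j], 1) @ A) ** 2)
                        for j in range(k)]
                j = int(np.argmin(errs))
                if errs[j] >= best_err:
                    break                          # adding a cluster no longer helps
                m[j], best_err = 1, errs[j]
            M[i] = m
    A = np.linalg.lstsq(M, X, rcond=None)[0]
    return M, A
```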

218 citations


Proceedings ArticleDOI
25 Jul 2005
TL;DR: The shortest path k-nearest neighbor classifier (SkNN), which utilizes nonlinear manifold learning, is proposed for analysis of hyperspectral data; its classification accuracies and generalization capability are compared to those achieved by the best basis binary hierarchical classifier, the hierarchical support vector machine classifier, and the k-nearest neighbor classifier on both the original data and a subset of its principal components.
Abstract: The shortest path k-nearest neighbor classifier (SkNN), which utilizes nonlinear manifold learning, is proposed for analysis of hyperspectral data. In contrast to classifiers that deal with the high dimensional feature space directly, this approach uses the pairwise distance matrix over a nonlinear manifold to classify novel observations. Because manifold learning preserves the local pairwise distances and updates distances of a sample to samples beyond the user-defined neighborhood along the shortest path on the manifold, similar samples are moved into closer proximity. High classification accuracies are achieved by using the simple k-nearest neighbor (kNN) classifier. SkNN was applied to hyperspectral data collected by the Hyperion sensor on the EO1 satellite over the Okavango Delta of Botswana. Classification accuracies and generalization capability are compared to those achieved by the best basis binary hierarchical classifier, the hierarchical support vector machine classifier, and the k-nearest neighbor classifier on both the original data and a subset of its principal components.
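
A minimal sketch of the shortest-path idea (ours, with simplified parameters): build a k-nearest-neighbor graph over training and test points together, compute geodesic distances along the graph, and run kNN classification in that distance:

```python
# Sketch of geodesic-distance kNN classification; assumes integer labels.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def sknn_predict(X_train, y_train, X_test, n_neighbors=8, k_classify=3):
    X = np.vstack([X_train, X_test])
    G = kneighbors_graph(X, n_neighbors, mode='distance')  # local Euclidean edges
    D = shortest_path(G, directed=False)                   # geodesic distances on the graph
    n = len(X_train)
    preds = []
    for i in range(n, len(X)):                             # each test point
        nn = np.argsort(D[i, :n])[:k_classify]             # nearest training points
        preds.append(np.bincount(y_train[nn]).argmax())    # majority vote
    return np.array(preds)
```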

61 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: This work proposes a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents and shows that it groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels.
Abstract: While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents. We show that our technique groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels. The learned taxonomy is then used along with binary SVMs for multi-class classification. We demonstrate the efficacy of our approach by testing it on the 20-Newsgroup dataset.
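
The core step, grouping similar classes into a tree, can be sketched as agglomerative clustering over class centroids; this simplified version (function name ours) omits the binary SVMs that would then be trained at each internal node of the learned taxonomy:

```python
# Sketch of deriving a class taxonomy from labeled documents: represent
# each class by its mean document vector and cluster the centroids.
import numpy as np
from scipy.cluster.hierarchy import linkage

def class_taxonomy(X, y):
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    # Ward linkage over class centroids yields a binary tree of classes
    return linkage(centroids, method='ward'), classes
```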

56 citations


Proceedings ArticleDOI
07 Aug 2005
TL;DR: This work presents several modifications to OC-IB and integrates it with a global search that results in several improvements such as deterministic results, optimality guarantees, control over cluster size and extension to other cost functions.
Abstract: Unsupervised learning methods often involve summarizing the data using a small number of parameters. In certain domains, only a small subset of the available data is relevant for the problem. One-Class Classification or One-Class Clustering attempts to find a useful subset by locating a dense region in the data. In particular, a recently proposed algorithm called One-Class Information Ball (OC-IB) shows the advantage of modeling a small set of highly coherent points as opposed to pruning outliers. We present several modifications to OC-IB and integrate it with a global search that results in several improvements such as deterministic results, optimality guarantees, control over cluster size and extension to other cost functions. Empirical studies yield significantly better results on various real and artificial data.
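
As a rough illustration of one-class clustering with a global search and explicit control over cluster size (a simplification of ours, not OC-IB itself): pick the candidate center whose s nearest points form the most coherent ball:

```python
# Sketch of dense-region one-class clustering via exhaustive global search.
import numpy as np

def one_class_cluster(X, s):
    best_cost, best = np.inf, None
    for i in range(len(X)):                          # global search over candidate centers
        d = np.sort(np.sum((X - X[i]) ** 2, axis=1))
        cost = d[:s].sum()                           # coherence of the s-point ball
        if cost < best_cost:
            best_cost, best = cost, i
    members = np.argsort(np.sum((X - X[best]) ** 2, axis=1))[:s]
    return X[best], members                          # center and cluster of fixed size s
```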

49 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of the local data, which are then integrated to obtain a global probabilistic model based on the union of the features available at all the sites.
Abstract: We present a probabilistic model-based framework for distributed learning that takes into account privacy restrictions and is applicable to scenarios where the different sites have diverse, possibly overlapping subsets of features. Our framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of the local data, which are then integrated to obtain a global probabilistic model based on the union of the features available at all the sites. We provide a mathematical formulation of the model integration problem using the maximum likelihood and maximum entropy principles and describe iterative algorithms that are guaranteed to converge to the optimal solution. For certain commonly occurring special cases involving hierarchically ordered feature sets or conditional independence, we obtain closed form solutions and use these to propose an efficient alternative scheme by recursive decomposition of the model integration problem. To address interpretability concerns, we also present a modified formulation where the global model is assumed to belong to a specified parametric family. Finally, to highlight the generality of our framework, we provide empirical results for various learning tasks such as clustering and classification on different kinds of datasets consisting of continuous vector, categorical and directional attributes. The results show that high quality global models can be obtained without much loss of privacy.
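
The conditional independence special case with a closed-form solution can be illustrated on discrete toy distributions; in the setup below (variable names ours), site 1 holds features (A, B) and site 2 holds (B, C), with A and C conditionally independent given the shared B:

```python
# Sketch of closed-form model integration: p(a,b,c) = p1(a,b) * p2(c|b).
# Only the local model parameters cross site boundaries, never raw data.
import numpy as np

rng = np.random.default_rng(0)
p_ab = rng.random((2, 3)); p_ab /= p_ab.sum()   # site 1's local model: p(A, B)
p_bc = rng.random((3, 4)); p_bc /= p_bc.sum()   # site 2's local model: p(B, C)

p_c_given_b = p_bc / p_bc.sum(axis=1, keepdims=True)
# Global model over the union of features A, B, C
p_abc = p_ab[:, :, None] * p_c_given_b[None, :, :]
assert np.isclose(p_abc.sum(), 1.0)             # a valid joint distribution
```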

42 citations


Journal ArticleDOI
TL;DR: A general framework for distributed clustering that takes privacy requirements into account by building probabilistic models of the data at each local site, whose parameters are then transmitted to a central location; experiments show that high quality global clusters can be achieved with little loss of privacy.

37 citations


Proceedings ArticleDOI
27 Nov 2005
TL;DR: A robust and efficient framework for unsupervised discovery of structure in data that summarizes the data with multiple prototypes; clustering the prototypes enables the algorithm to scale up to extremely large and high-dimensional domains such as text data.
Abstract: We introduce a robust and efficient framework called CLUMP (CLustering Using Multiple Prototypes) for unsupervised discovery of structure in data. CLUMP relies on finding multiple prototypes that summarize the data. Clustering the prototypes enables our algorithm to scale up to extremely large and high-dimensional domains such as text data. Other desirable properties include robustness to noise and parameter choices. In this paper, we describe the approach in detail, characterize its performance on a variety of datasets, and compare it to some existing model selection approaches.
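
A minimal sketch of the two-stage idea behind CLUMP (a loose approximation, not the published algorithm): summarize the data with many prototypes, then cluster only the small prototype set, which is what makes the scheme scale:

```python
# Sketch of prototype-based two-stage clustering.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def clump_like(X, n_prototypes=100, n_clusters=5):
    km = KMeans(n_clusters=n_prototypes, n_init=1, random_state=0).fit(X)
    protos = km.cluster_centers_                 # compact summary of the data
    meta = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(protos)
    return meta[km.labels_]                      # map each point through its prototype
```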

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter proposes a relationship-based approach to clustering such data that tries to sidestep the “curse-of-dimensionality” issue by working in a suitable similarity space instead of the original high-dimensional feature space.
Abstract: Transaction analysis, including clustering of market baskets, is a key application of data mining to the retail industry. This domain has some specific requirements, such as the need for obtaining easily interpretable and actionable results. It also exhibits some very challenging characteristics, mostly stemming from the fact that the data have thousands of features and are highly non-Gaussian and sparse. This chapter proposes a relationship-based approach to clustering such data that tries to sidestep the “curse-of-dimensionality” issue by working in a suitable similarity space instead of the original high-dimensional feature space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging clusters can be easily derived, and it also guides the user toward a suitable number of clusters. Results are presented on a real retail industry data set of several thousand customers and products.
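
The visualization step can be sketched as follows (using spectral clustering as a stand-in for the graph-partitioning method the chapter applies; names are ours): cluster in the similarity space, then permute the similarity matrix by cluster label so that clusters appear as bands:

```python
# Sketch of the banded similarity-matrix visualization; S is a precomputed
# (nonnegative) pairwise similarity matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering

def banded_similarity_plot(S, n_clusters):
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(S)
    order = np.argsort(labels)                   # group points by cluster
    plt.imshow(S[np.ix_(order, order)], cmap='viridis')
    plt.title('Permuted similarity matrix: clusters appear as bands')
    plt.show()
```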

Proceedings Article
09 Jul 2005
TL;DR: A maximum likelihood based framework which exploits the hierarchical structure of the taxonomies to obtain a more natural mapping between the source classes and the master taxonomy.
Abstract: Many approaches have been proposed for the problem of mapping categories (classes) from a source taxonomy to classes in a master taxonomy. Most of these techniques, however, ignore the hierarchical structure of the taxonomies. In this paper, we propose a maximum likelihood based framework which exploits the hierarchical structure to obtain a more natural mapping between the source classes and the master taxonomy. Furthermore, unlike previous work, our technique also inserts source classes into appropriate places of the master hierarchy creating new categories if required. We evaluate our approach on text and hyperspectral datasets.

Proceedings ArticleDOI
24 Oct 2005
TL;DR: An evaluation of clustering algorithms using these metrics shows that the CLARANS clustering algorithm produces better quality clusters in the feature space and more homogeneous phases for CPI compared to the popular k-means algorithm.
Abstract: We propose a set of statistical metrics for making a comprehensive, fair, and insightful evaluation of features, clustering algorithms, and distance measures in representative sampling techniques for microprocessor simulation. Our evaluation of clustering algorithms using these metrics shows that the CLARANS clustering algorithm produces better quality clusters in the feature space and more homogeneous phases for CPI compared to the popular k-means algorithm. We also propose a new micro-architecture independent data locality based feature, reuse distance distribution (RDD), for finding phases in programs, and show that the RDD feature consistently results in more homogeneous phases than basic block vector (BBV) for many SPEC CPU2000 benchmark programs.
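
The RDD feature itself is simple to compute. A minimal sketch (ours): the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address, and the RDD is the histogram of these distances over a trace:

```python
# Sketch of computing a reuse distance distribution from an address trace.
from collections import Counter

def reuse_distance_distribution(trace):
    stack, hist = [], Counter()          # LRU stack of addresses seen so far
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)        # distinct addresses since last use
            stack.remove(addr)
            hist[d] += 1
        else:
            hist['inf'] += 1             # first touch: infinite reuse distance
        stack.insert(0, addr)            # move address to top of the stack
    return hist

# e.g. Counter({'inf': 3, 2: 2, 1: 1})
print(reuse_distance_distribution(['a', 'b', 'a', 'c', 'b', 'a']))
```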


Book ChapterDOI
13 Jun 2005
TL;DR: A knowledge transfer framework is proposed that leverages the information extracted from existing labeled data to classify spatially separate and multitemporal test data; results show that in the absence of any labeled data in the new area, the approach is better than a direct application of the original classifier on the new data.
Abstract: Obtaining ground truth for hyperspectral data is an expensive task. In addition, a number of factors cause the spectral signatures of the same class to vary with location and/or time. Therefore, adapting a classifier designed from available labeled data to classify new hyperspectral images is difficult, but invaluable to the remote sensing community. In this paper, we use the Binary Hierarchical Classifier to propose a knowledge transfer framework that leverages the information gathered from existing labeled data to classify the data obtained from a spatially separate test area. Experimental results show that in the absence of any labeled data in the new area, our approach is better than a direct application of the old classifier on the new data. Moreover, when small amounts of labeled data are available from the new area, our framework offers further improvements through semi-supervised learning mechanisms.

Book ChapterDOI
27 Aug 2005
TL;DR: Classification results comparing the PPS to Gaussian Mixture Models and K-nearest neighbours show the PPS classifier to be promising, especially for high-D data.
Abstract: In this paper we propose using manifolds modeled by probabilistic principal surfaces (PPS) to characterize and classify high-D data. The PPS can be thought of as a nonlinear probabilistic generalization of principal components, as it is designed to pass through the "middle" of the data. In fact, the PPS can map a manifold of any simple topology (as long as it can be described by a set of ordered vector co-ordinates) to data in high-dimensional space. In classification problems, each class of data is represented by a PPS manifold of varying complexity. Experiments using various PPS topologies, from a 1-D line to a 3-D spherical shell, were conducted on two toy classification datasets and three UCI Machine Learning datasets. Classification results comparing the PPS to Gaussian Mixture Models and K-nearest neighbours show the PPS classifier to be promising, especially for high-D data.

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter presents a modular learning framework called the Binary Hierarchical Classifier (BHC) that takes a coarse-to-fine approach to dealing with a large number of output classes and yields more interpretable models.
Abstract: Many complex pattern classification problems involve high-dimensional inputs as well as a large number of classes. In this chapter, we present a modular learning framework called the Binary Hierarchical Classifier (BHC) that takes a coarse-to-fine approach to dealing with a large number of output classes. BHC decomposes a C-class problem into a set of C-1 two-(meta)class problems, arranged in a binary tree with C leaf nodes and C-1 internal nodes. Each internal node consists of a feature extractor and a classifier that discriminates between the two meta-classes represented by its two children. Both bottom-up and top-down approaches for building such a BHC are presented in this chapter. The Bottom-up Binary Hierarchical Classifier (BU-BHC) is built by applying agglomerative clustering to the set of C classes. The Top-down Binary Hierarchical Classifier (TD-BHC) is built by recursively partitioning a set of classes at any internal node into two disjoint groups or meta-classes. The coupled problems of finding a good partition and of searching for a linear feature extractor that best discriminates the two resulting meta-classes are solved simultaneously at each stage of the recursive algorithm. The hierarchical, multistage classification approach taken by the BHC also helps in dealing with high-dimensional data, since simpler feature spaces are often adequate for solving the two-(meta)class problems. In addition, it leads to the discovery of useful domain knowledge such as class hierarchies or ontologies, and yields more interpretable models.
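
A minimal sketch of the top-down construction (ours; it substitutes 2-means on class means and logistic regression for the chapter's coupled partitioning and feature extraction): recursively split the class set into two meta-classes and train one binary classifier per internal node, so C classes yield C-1 internal nodes:

```python
# Sketch of a top-down binary hierarchical classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def build_bhc(X, y, classes=None):
    classes = np.unique(y) if classes is None else classes
    if len(classes) == 1:
        return classes[0]                               # leaf node: a single class
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(means)
    left, right = classes[side == 0], classes[side == 1]
    mask = np.isin(y, classes)
    meta = np.isin(y[mask], right).astype(int)          # 0 = left meta-class, 1 = right
    clf = LogisticRegression(max_iter=1000).fit(X[mask], meta)
    return (clf, build_bhc(X, y, left), build_bhc(X, y, right))

def bhc_predict_one(node, x):
    while isinstance(node, tuple):                      # descend the binary tree
        clf, left, right = node
        node = right if clf.predict(x[None])[0] else left
    return node
```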