Journal ArticleDOI

A Gaussian Mixture Model to Detect Clusters Embedded in Feature Subspace

01 Jan 2007-Communications in information and systems (International Press of Boston)-Vol. 7, Iss: 4, pp 337-352
TL;DR: A probabilistic model based on a Gaussian mixture is introduced to solve the problem of clusters embedded in different feature subspaces, where some features can be irrelevant and thus hinder clustering performance.
Abstract: The goal of unsupervised learning, i.e., clustering, is to determine the intrinsic structure of unlabeled data. Feature selection for clustering improves the performance of grouping by removing irrelevant features. Typical feature selection algorithms select a common feature subset for all the clusters. Consequently, clusters embedded in different feature subspaces cannot be identified. In this paper, we introduce a probabilistic model based on a Gaussian mixture to solve this problem. In particular, the feature relevance for an individual cluster is treated as a probability, which is represented by a localized feature saliency and estimated through the Expectation Maximization (EM) algorithm during the clustering process. In addition, the number of clusters is determined simultaneously by integrating a Minimum Message Length (MML) criterion. Experiments carried out on both synthetic and real-world datasets illustrate the performance of the proposed approach in finding clusters embedded in feature subspaces.

1. Introduction. Clustering is the unsupervised classification of data objects into different groups (clusters) such that objects in one group are similar to each other and dissimilar from those in other groups. Applications of data clustering are found in many fields, such as information discovery, text mining, web analysis, image grouping, medical diagnosis, and bioinformatics. Many clustering algorithms have been proposed in the literature (8). Basically, they can be categorized into two groups: hierarchical or partitional. A clustering algorithm typically considers all available features of the dataset in an attempt to learn as much as possible from the data. In practice, however, some features can be irrelevant and thus hinder the clustering performance. Feature selection, which chooses the "best" feature subset for clustering, can be applied to solve this problem.
Feature selection is extensively studied in the supervised learning scenario (1-3), where class labels are available for judging the performance improvement contributed by a feature selection algorithm. For unsupervised learning, feature selection is a very difficult problem due to the lack of class labels, and it has received extensive attention recently. The algorithm proposed in (4) measures feature similarity by an information compression index. In (5), the relevant features are detected using a distance-based entropy measure. (6) evaluates the cluster quality over different feature subsets by normalizing cluster separability or likelihood using a cross-projection method. In (7), feature saliency is defined as a probability and estimated by the Expectation Maximization (EM) algorithm using Gaussian mixture models. A variational Bayesian approach …
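The model described above can be sketched in a few lines: each cluster j carries a per-feature saliency rho[j, d], and a feature that is irrelevant for a cluster falls back to a common, cluster-independent density. Below is a minimal numpy sketch of the implied E-step; the function and parameter names, shapes, and the diagonal-covariance assumption are illustrative choices, not the paper's exact notation.

```python
import numpy as np

def gauss(x, mu, var):
    # Univariate Gaussian density.
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(X, weights, mu, var, rho, common_mu, common_var):
    """E-step responsibilities for a saliency-based Gaussian mixture.

    rho[j, d] is the probability that feature d is relevant for cluster j;
    an irrelevant feature follows a common (cluster-independent) density
    with parameters common_mu[d], common_var[d].
    """
    n, D = X.shape
    K = len(weights)
    log_p = np.zeros((n, K))
    for j in range(K):
        for d in range(D):
            relevant = rho[j, d] * gauss(X[:, d], mu[j, d], var[j, d])
            irrelevant = (1 - rho[j, d]) * gauss(X[:, d], common_mu[d], common_var[d])
            log_p[:, j] += np.log(relevant + irrelevant)
    log_p += np.log(weights)
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```

In a full EM loop these responsibilities would drive the M-step updates of the weights, means, variances, and saliencies, with MML pruning the number of clusters.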
Citations
More filters
Journal Article
TL;DR: In this article, a bipartite-graph-based data clustering method is proposed, in which terms and documents are simultaneously grouped into semantically meaningful co-clusters.
Abstract: Bipartite Graph Partitioning and Data Clustering*. Hongyuan Zha, Xiaofeng He (Dept. of Comp. Sci. & Eng., Penn State Univ., State College, PA 16802, {zha,xhe}@cse.psu.edu); Chris Ding, Horst Simon (NERSC Division, Berkeley National Lab., Berkeley, CA 94720, {chqding,hdsimon}@lbl.gov); Ming Gu (Dept. of Math., U.C. Berkeley, Berkeley, CA 94720, mgu@math.berkeley.edu). ABSTRACT: Many data types arising from data mining applications can be modeled as bipartite graphs; examples include terms and documents in a text corpus, customers and purchased items in market basket analysis, and reviewers and movies in a movie recommender system. In this paper, we propose a new data clustering method based on partitioning the underlying bipartite graph. The partition is constructed by minimizing a normalized sum of edge weights between unmatched pairs of vertices of the bipartite graph. We show that an approximate solution to the minimization problem can be obtained by computing a partial singular value decomposition (SVD) of the associated edge weight matrix of the bipartite graph. We point out the connection of our clustering algorithm to correspondence analysis used in multivariate analysis. We also briefly discuss the issue of assigning data objects to multiple clusters. In the experimental results, we apply our clustering algorithm to the problem of document clustering to illustrate its effectiveness and efficiency. 1. INTRODUCTION. Cluster analysis is an important tool for exploratory data mining applications arising from many diverse disciplines. Informally, cluster analysis seeks to partition a given data set into compact clusters so that data objects within a cluster are more similar than those in distinct clusters. The literature on cluster analysis is enormous, including contributions from many research communities (see [6, 9] for recent surveys of some classical approaches).
Many traditional clustering algorithms are based on the assumption that the given dataset consists of covariate information (or attributes) for each individual data object, and cluster analysis can be cast as a problem of grouping a set of n-dimensional vectors, each representing a data object in the dataset. A familiar example is document clustering using the vector space model [1]. Here each document is represented by an n-dimensional vector, and each coordinate of the vector corresponds to a term in a vocabulary of size n. This formulation leads to the so-called term-document matrix A = (a_ij) for the representation of the collection of documents, where a_ij is the so-called term frequency, i.e., the number of times term i occurs in document j. In this vector space model terms and documents are treated asymmetrically, with terms considered as the covariates or attributes of documents. It is also possible to treat both terms and documents as first-class citizens in a symmetric fashion, and consider a_ij as the frequency of co-occurrence of term i and document j, as is done, for example, in probabilistic latent semantic indexing [12]. In this paper, we follow this basic principle and propose a new approach to model terms and documents as vertices in a bipartite graph, with edges of the graph indicating the co-occurrence of terms and documents. In addition, we can optionally use edge weights to indicate the frequency of this co-occurrence. Cluster analysis for document collections in this context is based on a very intuitive notion: documents are grouped by topics; on one hand, documents on a topic tend to use the same subset of terms more heavily, and those terms form a term cluster; on the other hand, a topic is usually characterized by a subset of terms, and documents heavily using those terms tend to be about that particular topic.
It is this interplay of terms and documents that gives rise to what we call bi-clustering, by which terms and documents are simultaneously grouped into semantically meaningful clusters. Our clustering algorithm computes an approximate globally optimal solution, while probabilistic latent semantic indexing relies on the EM algorithm and therefore might be prone to local minima even with the help of some annealing process. (Keywords: document clustering, bipartite graph, graph partitioning, spectral relaxation, singular value decomposition, correspondence analysis. CIKM '01, November 5-10, 2001, Atlanta, Georgia, USA. *Part of this work was done while Xiaofeng He was a graduate research assistant at NERSC, Berkeley National Lab.)
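The partial-SVD relaxation described above can be sketched compactly: scale the term-document matrix by the square roots of term and document degrees, and read an approximate bipartition off the signs of the second singular vectors. The following is a toy two-cluster sketch under assumed conventions (a full SVD stands in for the partial SVD, degrees are assumed nonzero, and the function name is illustrative).

```python
import numpy as np

def bipartite_spectral_split(A):
    """Split terms (rows) and documents (columns) of a term-document
    matrix A into two co-clusters via the SVD of the degree-scaled matrix.

    Assumes every term and document has at least one nonzero entry.
    """
    d1 = A.sum(axis=1)                      # term degrees
    d2 = A.sum(axis=0)                      # document degrees
    An = A / np.sqrt(np.outer(d1, d2))      # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An)
    # The second left/right singular vectors give the relaxed bipartition;
    # thresholding their signs recovers the two co-clusters.
    term_labels = (U[:, 1] >= 0).astype(int)
    doc_labels = (Vt[1, :] >= 0).astype(int)
    return term_labels, doc_labels
```

For k > 2 clusters one would keep several singular vectors and run k-means on the rows of the resulting embedding, mirroring the spectral relaxation the paper describes.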

295 citations

Journal ArticleDOI
TL;DR: This paper proposes, implements, and tests a hybrid neighborhood-aware algorithm for outlier detection that considers the uneven spatial density of the users, the number of malicious users, the level of conspiracy, and the lack of accuracy and malfunctioning of sensors.
Abstract: In this paper we study the problem of sensor data verification in Participatory Sensing (PS) systems using an air quality/pollution monitoring application as a validation example. Data verification, in the context of PS, consists of the process of detecting and removing spatial outliers to properly reconstruct the variables of interest. We propose, implement, and test a hybrid neighborhood-aware algorithm for outlier detection that considers the uneven spatial density of the users, the number of malicious users, the level of conspiracy, and the lack of accuracy and malfunctioning sensors. The algorithm utilizes the Delaunay triangulation and Gaussian Mixture Models to build neighborhoods based on the spatial and non-spatial attributes of each location. This neighborhood definition allows us to demonstrate that it is not necessary to apply accurate but computationally expensive estimators to the entire dataset to obtain good results, as equally accurate but computationally cheaper methods applied to part of the data obtain good results as well. Our experimental results show that our hybrid algorithm performs as well as the best estimator while reducing the execution time considerably.
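The paper's hybrid algorithm builds neighborhoods with a Delaunay triangulation and Gaussian Mixture Models; the sketch below illustrates only the underlying idea of neighborhood-aware spatial outlier detection, using a simpler k-nearest-neighbor neighborhood and a robust median/MAD deviation rule as a stand-in. The function name and thresholds are assumptions for illustration, not the authors' method.

```python
import numpy as np

def spatial_outliers(coords, values, k=4, thresh=3.0):
    """Flag readings that deviate strongly from their spatial neighborhood.

    A reading is an outlier if it differs from the median of its k nearest
    neighbors by more than `thresh` robust standard deviations, where the
    robust scale is the median absolute deviation (MAD) of the neighborhood.
    """
    n = len(values)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # k nearest, excluding self
        med = np.median(values[nbrs])
        mad = np.median(np.abs(values[nbrs] - med)) + 1e-9
        flags[i] = abs(values[i] - med) > thresh * 1.4826 * mad
    return flags
```

A triangulation-based neighborhood (as in the paper) would replace the k-nearest-neighbor step, and a GMM over non-spatial attributes would refine which neighbors are comparable.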

22 citations

Journal ArticleDOI
TL;DR: A regularized Gaussian mixture model (GMM) for clustering that finds low-dimensional representations of the component covariance matrices, resulting in better estimation of local feature correlations.
Abstract: Finding low-dimensional representations of high-dimensional data sets is an important task in various applications. The fact that data sets often contain clusters embedded in different subspaces poses a barrier to this task. Driven by the need for methods that enable clustering and finding each cluster's intrinsic subspace simultaneously, in this paper we propose a regularized Gaussian mixture model (GMM) for clustering. Despite the advantages of GMMs, such as their probabilistic interpretation and robustness against observation noise, traditional maximum-likelihood estimation for GMMs shows disappointing performance in high-dimensional settings. The proposed regularization method finds low-dimensional representations of the component covariance matrices, resulting in better estimation of local feature correlations. The regularization problem can be incorporated into the expectation maximization algorithm for maximizing the likelihood function of a GMM, with the M-step modified to incorporate the regularization. The M-step involves a determinant maximization problem, which can be solved efficiently. The performance of the proposed method is demonstrated using several simulated data sets. We also illustrate the potential value of the proposed method in applications using four real data sets.
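The paper's regularizer solves a determinant maximization problem in the M-step; as a simpler illustration of why regularizing the component covariances helps in high dimensions, the sketch below shrinks a responsibility-weighted scatter matrix toward a scaled identity. This is a generic shrinkage stand-in with assumed names, not the proposed method.

```python
import numpy as np

def regularized_covariance(X, resp_j, mu_j, lam=0.2):
    """Weighted covariance for one mixture component, shrunk toward a
    scaled identity so the estimate stays well conditioned when the
    dimension is large relative to the effective sample size.

    resp_j holds the component's responsibilities for each data point.
    """
    w = resp_j / resp_j.sum()
    Xc = X - mu_j
    S = (w[:, None] * Xc).T @ Xc                      # weighted scatter
    target = (np.trace(S) / X.shape[1]) * np.eye(X.shape[1])
    return (1 - lam) * S + lam * target               # convex shrinkage
```

With more points than dimensions the raw scatter S may already be invertible; in the opposite regime (as in the test below, three points in five dimensions) S is rank-deficient while the shrunk estimate remains positive definite.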

21 citations


Cites background from "A Gaussian Mixture Model to Detect ..."

  • ...[33] proposed to detect clusters embedded in feature subspace with a local feature saliency measure....

    [...]

Journal ArticleDOI
TL;DR: The semi-supervised projected model-based clustering algorithm (SeSProC) includes a novel model selection approach, using a greedy forward search to estimate the final number of clusters, and outperforms three related baseline algorithms in most scenarios using synthetic and real data sets.
Abstract: We present an adaptation of model-based clustering for partially labeled data that is capable of finding hidden cluster labels. All the originally known and discoverable clusters are represented using localized feature subset selections (subspaces), yielding clusters that cannot be discovered by global feature subset selection. The semi-supervised projected model-based clustering algorithm (SeSProC) also includes a novel model selection approach, using a greedy forward search to estimate the final number of clusters. The quality of SeSProC is assessed using synthetic data, demonstrating its effectiveness, under different data conditions, not only at classifying instances with known labels but also at discovering completely hidden clusters in different subspaces. Moreover, SeSProC outperforms three related baseline algorithms in most scenarios using synthetic and real data sets.

8 citations


Additional excerpts

  • ...The extension of Law et al. (2004) to subspaces was applied in Li et al. (2007), which is the groundwork for our research....

    [...]

Dissertation
02 Apr 2013
TL;DR: In this thesis, semi-supervised subspace clustering algorithms are proposed to obtain clustering solutions whose quality can be properly assessed using either known validation indices or expert opinions.
Abstract: Machine learning techniques are used for extracting valuable knowledge from data. Nowadays, these techniques are becoming even more important due to the evolution in data acquisition and storage, which is leading to data with different characteristics that must be exploited. Therefore, advances in data collection must be accompanied by advances in machine learning techniques to solve the new challenges that might arise, in both academic and real applications. There are several machine learning techniques, depending on both data characteristics and purpose. Unsupervised classification, or clustering, is one of the best-known techniques when data lack supervision (unlabeled data) and the aim is to discover data groups (clusters) according to their similarity. On the other hand, supervised classification needs data with supervision (labeled data), and its aim is to make predictions about the labels of new data. The presence of data labels is a very important characteristic that guides not only the learning task but also other related tasks such as validation. When only some of the available data are labeled whereas the others remain unlabeled (partially labeled data), neither clustering nor supervised classification can be used. This scenario, which is becoming common nowadays because of the ignorance or cost of the labeling process, is tackled with semi-supervised learning techniques. This thesis focuses on the branch of semi-supervised learning closest to clustering, i.e., discovering clusters using available labels as support to guide and improve the clustering process. Another important data characteristic, different from the presence of data labels, is the relevance or not of data features. Data are characterized by features, but it is possible that not all of them are relevant, or equally relevant, for the learning process.
A recent clustering trend, related to data relevance and called subspace clustering, holds that different clusters might be described by different feature subsets. This differs from traditional solutions to the data relevance problem, where a single feature subset (usually the complete set of original features) is found and used to perform the clustering process. The proximity of this work to clustering leads to the first goal of this thesis. As noted above, clustering validation is a difficult task due to the absence of data labels. Although there are many indices that can be used to assess the quality of clustering solutions, these validations depend on the clustering algorithms and data characteristics. Hence, for the first goal, three well-known clustering algorithms are used to cluster data with outliers and noise, to critically study how some of the best-known validation indices behave. The main goal of this work is, however, to combine semi-supervised clustering with subspace clustering to obtain clustering solutions that can be correctly validated by using either known indices or expert opinions. Two different algorithms are proposed, from different points of view, to discover clusters characterized by different subspaces. In the first algorithm, the available data labels are used to search for subspaces first, before searching for clusters. This algorithm assigns each instance to only one cluster (hard clustering) and is based on mapping known labels to subspaces using supervised classification techniques. The subspaces are then used to find clusters using traditional clustering techniques. The second algorithm uses the available data labels to search for subspaces and clusters at the same time in an iterative process. This algorithm assigns each instance to each cluster with a membership probability (soft clustering) and is based on integrating the known labels and the search for subspaces into a model-based clustering approach.
The different proposals are tested using different real and synthetic databases, and comparisons to other methods are also included where appropriate. Finally, as an example of a real and current application, different machine learning techniques, including one of the proposals of this work (the most sophisticated one), are applied to one of the most challenging biological problems nowadays: human brain modeling. Specifically, expert neuroscientists do not agree on a neuron classification for the brain cortex, which precludes any modeling attempt and also hampers day-to-day work in the absence of a common way to name neurons. Therefore, machine learning techniques may help to reach an accepted solution to this problem, which could be an important milestone for future research in neuroscience.
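The thesis's second (soft, model-based) algorithm integrates known labels into the clustering process. One common way this is done in semi-supervised model-based clustering, sketched below under assumed conventions (label -1 marks an unlabeled instance; log_lik holds per-cluster log-likelihoods), is to clamp the E-step responsibilities of labeled instances to their known cluster while unlabeled instances receive the usual posterior. This is a generic illustration, not the thesis's exact algorithm.

```python
import numpy as np

def semi_supervised_resp(log_lik, labels):
    """E-step responsibilities with partial labels.

    A labeled instance is assigned to its known cluster with probability
    one; an unlabeled instance (label -1) gets the usual posterior over
    clusters computed from the per-cluster log-likelihoods.
    """
    n, K = log_lik.shape
    p = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))  # stabilized
    resp = p / p.sum(axis=1, keepdims=True)
    for i, y in enumerate(labels):
        if y >= 0:
            resp[i] = np.eye(K)[y]       # clamp to the known cluster
    return resp
```

The M-step then proceeds exactly as in unsupervised EM, so the known labels steer the parameter estimates without changing the overall algorithm structure.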

7 citations


Cites methods from "A Gaussian Mixture Model to Detect ..."

  • ...[166], using also minimum message length (MML) criterion to estimate the final number of clusters....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: An overview of pattern clustering methods from a statistical pattern recognition perspective is presented, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners.
Abstract: Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

14,054 citations


"A Gaussian Mixture Model to Detect ..." refers background in this paper

  • ...Particularly, the feature relevance for an individual cluster is treated as a probability, which is represented by localized feature saliency and estimated through Expectation Maximization (EM) algorithm during the clustering process....

    [...]

01 Jan 1997
TL;DR: A survey of machine learning methods for handling data sets containing large amounts of irrelevant information can be found in this article, where the authors focus on two key issues: selecting relevant features and selecting relevant examples.
Abstract: In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area. © 1997 Elsevier Science B.V.

2,947 citations

Journal ArticleDOI
TL;DR: This survey reviews work in machine learning on methods for handling data sets containing large amounts of irrelevant information and describes the advances that have been made in both empirical and theoretical work in this area.

2,869 citations


"A Gaussian Mixture Model to Detect ..." refers background in this paper

  • ...Experiments carried on both synthetic and real-world datasets illustrate the performance of the proposed approach in finding clusters embedded in feature subspace....

    [...]

Proceedings ArticleDOI
01 Jun 1998
TL;DR: CLIQUE is presented, a clustering algorithm that satisfies each of the special requirements that data mining applications place on clustering: the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records.
Abstract: Data mining applications place special requirements on clustering algorithms, including: the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for the data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.
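CLIQUE's core density test can be sketched compactly: partition each dimension of a candidate subspace into xi equal intervals and keep the grid cells (units) holding more than a fraction tau of the points; CLIQUE then joins adjacent dense units into clusters in an Apriori-style bottom-up search. The sketch below shows only this counting step for a given subspace; the parameter names echo the paper's xi and tau, but the code is an illustration, not the full algorithm.

```python
import numpy as np

def dense_units(X, dims, xi=4, tau=0.2):
    """CLIQUE-style density test for one subspace.

    Partitions each dimension listed in `dims` into xi equal intervals and
    returns the grid cells that contain more than a fraction tau of all
    points. Each cell is identified by its tuple of interval indices.
    """
    n = len(X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Map each point to its interval index along each selected dimension;
    # points exactly at the maximum fall into the last interval via clip.
    idx = np.clip(((X[:, dims] - lo[dims]) / (hi[dims] - lo[dims]) * xi).astype(int),
                  0, xi - 1)
    counts = {}
    for cell in map(tuple, idx):
        counts[cell] = counts.get(cell, 0) + 1
    return {cell for cell, c in counts.items() if c > tau * n}
```

Running this on candidate subspaces of increasing dimensionality, and pruning subspaces with no dense units, mirrors the bottom-up subspace search that makes CLIQUE scale.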

2,782 citations


"A Gaussian Mixture Model to Detect ..." refers methods in this paper

  • ...The aforementioned algorithms perform feature selection in a global sense by producing a common feature subset for all the clusters....

    [...]