Author

Leonard K. M. Poon

Bio: Leonard K. M. Poon is an academic researcher from the University of Hong Kong. He has contributed to research on the topics of latent variables and cluster analysis, has an h-index of 12, and has co-authored 36 publications receiving 353 citations. His previous affiliations include the Hong Kong University of Science and Technology and the Hong Kong Institute of Education.

Papers
Journal ArticleDOI
TL;DR: In this article, a hierarchical topic detection method is proposed in which topics are obtained by clustering documents in multiple ways: each latent variable gives a soft partition of the documents, and the document clusters in these partitions are interpreted as topics.

41 citations
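The TL;DR above describes topics as soft partitions of documents given by latent variables. A minimal sketch of that reading, assuming a TF-IDF representation and a plain scikit-learn Gaussian mixture as a stand-in for the paper's latent tree model (all data and sizes here are illustrative):

```python
# Toy sketch: a soft partition of documents, with each cluster characterized
# by its top-weighted words (read as a "topic"). Not the paper's algorithm.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = [
    "neural networks learn representations",
    "deep learning trains neural models",
    "stocks and bonds moved the market",
    "the market rallied on trade news",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(X)
soft = gmm.predict_proba(X)   # soft partition: P(cluster | document)

terms = np.array(vec.get_feature_names_out())
for k in range(2):
    profile = soft[:, k] @ X  # word usage weighted by membership in cluster k
    print(f"topic {k}:", terms[np.argsort(profile)[::-1][:3]])
```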

Posted Content
TL;DR: The latent tree variational autoencoder (LTVAE), as discussed by the authors, is a variant of the variational autoencoder (VAE) in which a superstructure of discrete latent variables sits on top of the latent features.
Abstract: We investigate a variant of variational autoencoders where there is a superstructure of discrete latent variables on top of the latent features. In general, our superstructure is a tree structure of multiple super latent variables and it is automatically learned from data. When there is only one latent variable in the superstructure, our model reduces to one that assumes the latent features to be generated from a Gaussian mixture model. We call our model the latent tree variational autoencoder (LTVAE). Whereas previous deep learning methods for clustering produce only one partition of data, LTVAE produces multiple partitions of data, each being given by one super latent variable. This is desirable because high dimensional data usually have many different natural facets and can be meaningfully partitioned in multiple ways.

37 citations
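A toy sketch of the generative structure the abstract describes, in the single-super-latent-variable case where LTVAE reduces to a Gaussian mixture over the latent features. The dimensions and the linear "decoder" below are illustrative assumptions, not the paper's network:

```python
# Generative story: discrete component y -> Gaussian latent features z -> data x.
import numpy as np

rng = np.random.default_rng(0)
K, D_Z, D_X = 3, 2, 5                    # mixture components, latent dim, data dim

pi = np.full(K, 1.0 / K)                 # p(y): uniform over components
mu = rng.normal(size=(K, D_Z))           # per-component means of the latent Gaussian
W = rng.normal(size=(D_Z, D_X))          # linear stand-in for the decoder network

def sample(n):
    y = rng.choice(K, size=n, p=pi)              # pick a cluster per sample
    z = mu[y] + rng.normal(size=(n, D_Z))        # latent features ~ N(mu_y, I)
    x = z @ W + 0.1 * rng.normal(size=(n, D_X))  # decode with observation noise
    return y, z, x

y, z, x = sample(4)
print(y, x.shape)
```

Each super latent variable y yields one partition of the data, which is how LTVAE produces multiple partitions.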

Journal ArticleDOI
TL;DR: This paper proposes a generalization of Gaussian mixture models, demonstrates its ability to automatically identify natural facets of data and to cluster the data along each of those facets simultaneously, and shows that facet determination usually leads to better clustering results than variable selection.

37 citations

Proceedings Article
21 Jun 2010
TL;DR: A generalization of the Gaussian mixture model is proposed, its ability to cluster data along multiple facets is shown, and it is demonstrated that it is often more reasonable to facilitate variable selection than to perform it.
Abstract: Variable selection for cluster analysis is a difficult problem. The difficulty originates not only from the lack of class information but also from the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multiple ways. In such a case the effort to find one subset of attributes that presumably gives the "best" clustering may be misguided. It makes more sense to facilitate variable selection by domain experts, that is, to systematically identify various facets of a data set (each being based on a subset of attributes), cluster the data along each one, and present the results to the domain experts for appraisal and selection. In this paper, we propose a generalization of the Gaussian mixture model, show its ability to cluster data along multiple facets, and demonstrate it is often more reasonable to facilitate variable selection than to perform it.

33 citations
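A hedged illustration of the multifaceted-clustering idea, using plain scikit-learn Gaussian mixtures rather than the paper's model: two disjoint attribute subsets act as facets, and clustering along each facet recovers a different, equally natural partition (the data and facet assignments are synthetic assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n = 200
a = rng.integers(0, 2, n)   # grouping visible in facet A (columns 0-1)
b = rng.integers(0, 2, n)   # independent grouping visible in facet B (columns 2-3)
X = np.column_stack([
    a[:, None] * 4 + rng.normal(size=(n, 2)),
    b[:, None] * 4 + rng.normal(size=(n, 2)),
])

facets = {"A": [0, 1], "B": [2, 3]}
for name, cols in facets.items():
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X[:, cols])
    # agreement up to label switching
    agree_a = max(np.mean(labels == a), np.mean(labels != a))
    agree_b = max(np.mean(labels == b), np.mean(labels != b))
    print(f"facet {name}: agreement with grouping A={agree_a:.2f}, B={agree_b:.2f}")
```

Each facet's mixture recovers its own grouping and is near chance on the other, which is the sense in which searching for one "best" clustering over all attributes can be misguided.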

Journal ArticleDOI
TL;DR: This paper proposes an algorithm called BI that can deal with data sets with hundreds of attributes; BI is compared empirically with EAST and other more efficient LTM learning algorithms, and its clustering results compare favorably with those of alternative methods that are not based on LTMs.
Abstract: Real-world data are often multifaceted and can be meaningfully clustered in more than one way. There is a growing interest in obtaining multiple partitions of data. In previous work we learnt from data a latent tree model (LTM) that contains multiple latent variables (Chen et al. 2012). Each latent variable represents a soft partition of data and hence multiple partitions result. The LTM approach can, through model selection, automatically determine how many partitions there should be, what attributes define each partition, and how many clusters there should be for each partition. It has been shown to yield rich and meaningful clustering results. Our previous algorithm EAST for learning LTMs is only efficient enough to handle data sets with dozens of attributes. This paper proposes an algorithm called BI that can deal with data sets with hundreds of attributes. We empirically compare BI with EAST and other more efficient LTM learning algorithms, and show that BI outperforms its competitors on data sets with hundreds of attributes. In terms of clustering results, BI compares favorably with alternative methods that are not based on LTMs.

31 citations
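The abstract notes that the LTM approach determines the number of clusters per partition through model selection. As a rough analogue (not the BI algorithm itself, which learns full latent tree models), the sketch below selects the number of clusters for a single partition by minimizing BIC over plain Gaussian mixtures:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(100, 3)) for c in (-4, 0, 4)])  # 3 true clusters

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best = min(bics, key=bics.get)               # smallest BIC wins
print("selected number of clusters:", best)  # typically 3 for this data
```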


Cited by

Journal ArticleDOI
TL;DR: A general framework is proposed for designing VAEs suitable for fitting incomplete heterogeneous data, which includes likelihood models for real-valued, positive real-valued, interval, categorical, ordinal, and count data, and allows accurate estimation of missing data.

177 citations
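One ingredient such a framework needs is a likelihood evaluated only over observed entries, so missing values require no prior imputation. A minimal sketch under that assumption; the Gaussian likelihood and shapes are chosen purely for illustration:

```python
import numpy as np

def masked_gaussian_loglik(x, x_hat, observed, sigma=1.0):
    """Sum of Gaussian log-densities over observed entries only."""
    ll = -0.5 * (((x - x_hat) / sigma) ** 2 + np.log(2 * np.pi * sigma**2))
    return np.sum(ll * observed)              # mask zeroes out missing entries

x = np.array([[1.0, np.nan, 3.0]])
observed = ~np.isnan(x)
x_filled = np.nan_to_num(x)                   # placeholder values, masked out below
x_hat = np.array([[0.9, 0.0, 2.8]])           # e.g. a decoder's reconstruction
print(masked_gaussian_loglik(x_filled, x_hat, observed))
```

Handling heterogeneous types would swap in the appropriate log-density per column (Bernoulli, categorical, count, and so on), as the TL;DR lists.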

Book
01 Jul 2019
TL;DR: In this paper, the authors frame cluster analysis and classification in terms of statistical models, thus yielding principled estimation, testing, and prediction methods, and sound answers to the central questions: How many clusters are there? Which method should I use? How should I handle outliers?
Abstract: Cluster analysis finds groups in data automatically. Most methods have been heuristic and leave open such central questions as: how many clusters are there? Which method should I use? How should I handle outliers? Classification assigns new observations to groups given previously classified observations, and also has open questions about parameter tuning, robustness and uncertainty assessment. This book frames cluster analysis and classification in terms of statistical models, thus yielding principled estimation, testing and prediction methods, and sound answers to the central questions. It builds the basic ideas in an accessible but rigorous way, with extensive data examples and R code; describes modern approaches to high-dimensional data and networks; and explains such recent advances as Bayesian regularization, non-Gaussian model-based clustering, cluster merging, variable selection, semi-supervised and robust classification, clustering of functional data, text and images, and co-clustering. Written for advanced undergraduates in data science, as well as researchers and practitioners, it assumes basic knowledge of multivariate calculus, linear algebra, probability and statistics.

134 citations
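A small Python analogue of the book's model-based view of classification (the book itself works in R): each class is fit with its own Gaussian density, and a new observation is assigned, with explicit probabilities, to the class under which it is most likely. The toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = QuadraticDiscriminantAnalysis().fit(X, y)  # one Gaussian per class
print(clf.predict([[1.5, 2.0]]))                 # assign a new observation
print(clf.predict_proba([[0.0, 0.0]]))           # with uncertainty, not just a label
```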

Journal ArticleDOI
TL;DR: Insight is provided into the progression of fear sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by textual data visualizations, and a methodological overview of two essential machine learning classification methods is provided.
Abstract: Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fueled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19’s informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning (ML) classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naive Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.

118 citations
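A toy sketch of the two classifiers the article compares; the article's Coronavirus Tweet corpus and its R-based pipeline are not reproduced here, so the tweets and labels below are made-up stand-ins:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["so scared of this virus", "stay safe everyone", "panic everywhere",
          "feeling hopeful today", "terrified to go outside", "we will get through this"]
labels = ["fear", "calm", "fear", "calm", "fear", "calm"]

for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    clf = make_pipeline(CountVectorizer(), model)   # bag-of-words + classifier
    clf.fit(tweets, labels)
    print(type(model).__name__, clf.predict(["I am so scared"]))
```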