Author

Haesun Park

Bio: Haesun Park is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Non-negative matrix factorization & Cluster analysis. The author has an h-index of 53 and has co-authored 235 publications receiving 12,188 citations. Previous affiliations of Haesun Park include University of Zagreb & Cornell University.


Papers
Proceedings ArticleDOI
20 Aug 2006
TL;DR: This work provides a new approach to evaluating the quality of clustering on words using class aggregate and multi-peak distributions, gives new rules for updating $F$, $S$, and $G$, and proves the convergence of these algorithms.
Abstract: Currently, most research on nonnegative matrix factorization (NMF) focuses on 2-factor $X=FG^T$ factorization. We provide a systematic analysis of 3-factor $X=FSG^T$ NMF. While unconstrained 3-factor NMF is equivalent to unconstrained 2-factor NMF, constrained 3-factor NMF brings new features to constrained 2-factor NMF. We study the orthogonality constraint because it leads to a rigorous clustering interpretation. We provide new rules for updating $F$, $S$, and $G$ and prove the convergence of these algorithms. Experiments on 5 datasets and a real-world case study are performed to show the capability of bi-orthogonal 3-factor NMF on simultaneously clustering rows and columns of the input data matrix. We provide a new approach to evaluating the quality of clustering on words using class aggregate distribution and multi-peak distribution. We also provide an overview of various NMF extensions and examine their relationships.
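To make the bi-orthogonal tri-factorization concrete, here is a minimal numpy sketch of multiplicative updates for $X \approx FSG^T$ in the spirit of the abstract; the specific update rules, iteration budget, and the argmax cluster readout are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def tri_nmf(X, k_rows, k_cols, n_iter=200, eps=1e-10, seed=0):
    """3-factor NMF, X (m x n) ~ F S G^T with nonnegative factors.

    A sketch of multiplicative updates for the bi-orthogonal
    tri-factorization described in the abstract.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k_rows))          # row-cluster indicator factor
    G = rng.random((n, k_cols))          # column-cluster indicator factor
    S = rng.random((k_rows, k_cols))     # cluster-association matrix

    for _ in range(n_iter):
        # G <- G * sqrt( (X^T F S) / (G G^T X^T F S) )
        XtFS = X.T @ F @ S
        G *= np.sqrt(XtFS / (G @ (G.T @ XtFS) + eps))
        # F <- F * sqrt( (X G S^T) / (F F^T X G S^T) )
        XGSt = X @ G @ S.T
        F *= np.sqrt(XGSt / (F @ (F.T @ XGSt) + eps))
        # S <- S * sqrt( (F^T X G) / (F^T F S G^T G) )
        FtXG = F.T @ X @ G
        S *= np.sqrt(FtXG / ((F.T @ F) @ S @ (G.T @ G) + eps))

    # Simultaneous clustering of rows and columns of X.
    return F, S, G, F.argmax(axis=1), G.argmax(axis=1)
```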

1,211 citations

Journal ArticleDOI
TL;DR: The experimental results illustrate that the proposed sparse NMF algorithm often achieves better clustering performance with shorter computing time compared to other existing NMF algorithms.
Abstract: Motivation: Many practical pattern recognition problems require non-negativity constraints. For example, pixels in digital images and chemical concentrations in bioinformatics are non-negative. Sparse non-negative matrix factorizations (NMFs) are useful when the degree of sparseness in the non-negative basis matrix or the non-negative coefficient matrix in an NMF needs to be controlled in approximating high-dimensional data in a lower dimensional space. Results: In this article, we introduce a novel formulation of sparse NMF and show how the new formulation leads to a convergent sparse NMF algorithm via alternating non-negativity-constrained least squares. We apply our sparse NMF algorithm to cancer-class discovery and gene expression data analysis and offer biological analysis of the results obtained. Our experimental results illustrate that the proposed sparse NMF algorithm often achieves better clustering performance with shorter computing time compared to other existing NMF algorithms. Availability: The software is available as supplementary material. Contact: hskim@cc.gatech.edu, hpark@acc.gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online.
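As a concrete illustration of alternating non-negativity-constrained least squares with a sparsity penalty, here is a hedged sketch in which the $L_1$ penalty on the coefficient matrix is folded into each NNLS subproblem by row-augmenting the system; the penalty weights `beta` and `eta` and the per-column use of scipy's `nnls` are illustrative choices for clarity, not the paper's exact formulation or an efficient implementation.

```python
import numpy as np
from scipy.optimize import nnls

def sparse_nmf_anls(A, k, beta=0.01, eta=0.01, n_iter=50, seed=0):
    """Sparse NMF via alternating nonnegativity-constrained least squares.

    Sketch: sparsity on H via an L1-style penalty, Frobenius
    regularization on W, each subproblem solved as augmented NNLS.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))

    for _ in range(n_iter):
        # Solve for H: a sqrt(beta) row of ones stacked under W makes the
        # residual penalize the (sum of entries)^2 of each column of H.
        Wa = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])
        Aa = np.vstack([A, np.zeros((1, n))])
        H = np.column_stack([nnls(Wa, Aa[:, j])[0] for j in range(n)])
        # Solve for W: sqrt(eta) * I stacked under H^T keeps ||W||_F small.
        Ha = np.vstack([H.T, np.sqrt(eta) * np.eye(k)])
        At = np.vstack([A.T, np.zeros((k, m))])
        W = np.column_stack([nnls(Ha, At[:, i])[0] for i in range(m)]).T
    return W, H
```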

813 citations

01 Jan 2006
TL;DR: In this paper, the authors introduced sparse NMFs via alternating non-negativity-constrained least squares (NNLS) for cancer class discovery and gene expression data analysis.
Abstract: Many practical pattern recognition problems require non-negativity constraints. For example, pixels in digital images and chemical concentrations in bioinformatics are non-negative. Non-negative matrix factorization (NMF) is a useful technique in approximating these high dimensional data. Sparse NMFs are also useful when we need to control the degree of sparseness in non-negative basis vectors or non-negative lower-dimensional representations. In this paper, we introduce novel sparse NMFs via alternating non-negativity-constrained least squares. We applied one of the proposed sparse NMFs to cancer class discovery and gene expression data analysis. Our experimental results illustrate that our proposed method achieves better clustering performance than NMF based on multiplicative update rules and sparse NMFs based on the gradient descent method.
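For contrast with the ANLS-based sparse NMFs above, here is a minimal sketch of the multiplicative-update baseline the abstract compares against (the classic Lee-Seung rules); the dense arrays and fixed iteration budget are simplifying assumptions.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative-update NMF baseline: A (m x n) ~ W H, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update coefficient matrix
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update basis matrix
    return W, H
```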

662 citations

Journal ArticleDOI
TL;DR: This paper introduces an algorithm for NMF based on alternating nonnegativity-constrained least squares (NMF/ANLS) and the active-set-based fast algorithm for nonnegativity-constrained least squares with multiple right-hand-side vectors, and discusses its convergence properties and a rigorous convergence criterion based on the Karush-Kuhn-Tucker (KKT) conditions.
Abstract: Nonnegative matrix factorization (NMF) determines a lower rank approximation of a matrix $A \in \mathbb{R}^{m \times n}$, $A \approx WH$, where an integer $k \ll \min(m,n)$ is given and nonnegativity is imposed on all components of the factors $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$. NMF has attracted much attention for over a decade and has been successfully applied to numerous data analysis problems. In applications where the components of the data are necessarily nonnegative, such as chemical concentrations in experimental results or pixels in digital images, NMF provides a more relevant interpretation of the results since it gives nonsubtractive combinations of nonnegative basis vectors. In this paper, we introduce an algorithm for NMF based on alternating nonnegativity constrained least squares (NMF/ANLS) and the active set-based fast algorithm for nonnegativity constrained least squares with multiple right-hand side vectors, and we discuss its convergence properties and a rigorous convergence criterion based on the Karush-Kuhn-Tucker (KKT) conditions. In addition, we describe algorithms for sparse NMFs and regularized NMF. We show how we impose a sparsity constraint on one of the factors by $L_1$-norm minimization and discuss its convergence properties. Our algorithms are compared to other commonly used NMF algorithms in the literature on several test data sets in terms of their convergence behavior.
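The KKT-based stopping rule can be made concrete: at a stationary point of the nonnegativity-constrained problem, each entry of a factor is either positive with zero partial derivative, or zero with nonnegative partial derivative. The sketch below computes a normalized projected-gradient residual in that spirit; the exact normalization is an assumption, not the paper's formula.

```python
import numpy as np

def kkt_residual(A, W, H):
    """Normalized KKT residual for min 0.5*||A - WH||_F^2 s.t. W, H >= 0.

    Measures how far (W, H) is from satisfying the complementarity
    conditions; iterate until this drops below a tolerance.
    """
    gW = W @ (H @ H.T) - A @ H.T          # gradient w.r.t. W
    gH = (W.T @ W) @ H - W.T @ A          # gradient w.r.t. H
    # Projected gradient: full gradient where the factor is positive,
    # only the negative part on the boundary (factor entry == 0).
    pgW = np.where(W > 0, gW, np.minimum(gW, 0))
    pgH = np.where(H > 0, gH, np.minimum(gH, 0))
    num = np.count_nonzero(pgW) + np.count_nonzero(pgH)
    return (np.abs(pgW).sum() + np.abs(pgH).sum()) / max(num, 1)
```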

612 citations

Journal ArticleDOI
TL;DR: Imputation methods based on the least squares formulation are proposed to estimate missing values in gene expression data; they exploit local similarity structures in the data as well as the least squares optimization process.
Abstract: Motivation: Gene expression data often contain missing expression values. Effective missing value estimation methods are needed since many algorithms for gene expression data analysis require a complete matrix of gene array values. In this paper, imputation methods based on the least squares formulation are proposed to estimate missing values in gene expression data; they exploit local similarity structures in the data as well as the least squares optimization process. Results: The proposed local least squares imputation method (LLSimpute) represents a target gene that has missing values as a linear combination of similar genes. The similar genes are chosen as the k nearest neighbors, i.e. the k coherent genes that have large absolute values of Pearson correlation coefficients. A non-parametric missing value estimation method for LLSimpute is designed by introducing an automatic k-value estimator. In our experiments, the proposed LLSimpute method shows competitive results when compared with other imputation methods for missing value estimation on various datasets and percentages of missing values in the data. Availability: The software is available at http://www.cs.umn.edu/~hskim/tools.html Contact: hpark@cs.umn.edu
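A simplified sketch of the local least squares idea: pick the k genes most correlated with the target gene on its observed samples, fit a least squares combination there, and use that combination to predict the missing entries. This sketch assumes fully observed neighbor genes and a fixed k; the paper additionally handles missing values among neighbors and estimates k automatically.

```python
import numpy as np

def lls_impute_row(X, target, k=10):
    """Impute missing entries (np.nan) of row `target` of the
    genes-x-samples matrix X by local least squares on the k genes
    most Pearson-correlated with the target over its observed samples.
    """
    y = X[target]
    obs = ~np.isnan(y)                      # observed sample positions
    complete = [i for i in range(X.shape[0])
                if i != target and not np.isnan(X[i]).any()]
    # Rank candidate genes by |Pearson correlation| on observed samples.
    corrs = [abs(np.nan_to_num(np.corrcoef(X[i, obs], y[obs])[0, 1]))
             for i in complete]
    nbrs = [complete[j] for j in np.argsort(corrs)[::-1][:k]]
    A = X[nbrs]                             # k x n_samples neighbor block
    # Least squares fit on observed columns, then predict the missing ones.
    w, *_ = np.linalg.lstsq(A[:, obs].T, y[obs], rcond=None)
    y_filled = y.copy()
    y_filled[~obs] = w @ A[:, ~obs]
    return y_filled
```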

493 citations


Cited by
Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are covered in this book, along with a discussion of combining models in the context of machine learning and classification.
Abstract: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

01 Aug 2000
TL;DR: A Bioentrepreneur course on the assessment of medical technology in the context of commercialization, addressing many issues unique to biomedical products.
Abstract: BIOE 402. Medical Technology Assessment. 2 or 3 hours. Bioentrepreneur course. Assessment of medical technology in the context of commercialization. Objectives, competition, market share, funding, pricing, manufacturing, growth, and intellectual property; many issues unique to biomedical products. Course Information: 2 undergraduate hours. 3 graduate hours. Prerequisite(s): Junior standing or above and consent of the instructor.

4,833 citations

Book
12 Aug 2008
TL;DR: This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications and provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature.
Abstract: This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications. The authors present the basic ideas of SVMs together with the latest developments and current research questions in a unified style. They identify three reasons for the success of SVMs: their ability to learn well with only a very small number of free parameters, their robustness against several types of model violations and outliers, and their computational efficiency compared to several other methods. Since their appearance in the early nineties, support vector machines and related kernel-based methods have been successfully applied in diverse fields of application such as bioinformatics, fraud detection, construction of insurance tariffs, direct marketing, and data and text mining. As a consequence, SVMs now play an important role in statistical machine learning and are used not only by statisticians, mathematicians, and computer scientists, but also by engineers and data analysts. The book provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature. The book can thus serve as both a basis for graduate courses and an introduction for statisticians, mathematicians, and computer scientists. It further provides a valuable reference for researchers working in the field. The book covers all important topics concerning support vector machines such as: loss functions and their role in the learning process; reproducing kernel Hilbert spaces and their properties; a thorough statistical analysis that uses both traditional uniform bounds and more advanced localized techniques based on Rademacher averages and Talagrand's inequality; a detailed treatment of classification and regression; a detailed robustness analysis; and a description of some of the most recent implementation techniques. To make the book self-contained, an extensive appendix is added which provides the reader with the necessary background from statistics, probability theory, functional analysis, convex analysis, and topology.

4,664 citations

01 Jan 2012

3,692 citations

Journal ArticleDOI
Yan, Xu, Zhang, Yang, Lin 
TL;DR: A new supervised dimensionality reduction algorithm called marginal Fisher analysis is proposed, in which the intrinsic graph characterizes the intraclass compactness and connects each data point with its neighboring points of the same class, while the penalty graph connects the marginal points and characterizes the interclass separability.
Abstract: A large family of algorithms, supervised or unsupervised, stemming from statistics or geometry theory, has been designed to provide different solutions to the problem of dimensionality reduction. Despite the different motivations of these algorithms, we present in this paper a general formulation known as graph embedding to unify them within a common framework. In graph embedding, each algorithm can be considered as the direct graph embedding or its linear/kernel/tensor extension of a specific intrinsic graph that describes certain desired statistical or geometric properties of a data set, with constraints from scale normalization or a penalty graph that characterizes a statistical or geometric property that should be avoided. Furthermore, the graph embedding framework can be used as a general platform for developing new dimensionality reduction algorithms. By utilizing this framework as a tool, we propose a new supervised dimensionality reduction algorithm called marginal Fisher analysis (MFA), in which the intrinsic graph characterizes the intraclass compactness and connects each data point with its neighboring points of the same class, while the penalty graph connects the marginal points and characterizes the interclass separability. We show that MFA effectively overcomes the limitations of the traditional linear discriminant analysis algorithm due to data distribution assumptions and available projection directions. Real face recognition experiments show the superiority of our proposed MFA in comparison to LDA, also for the corresponding kernel and tensor extensions.
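A rough numpy sketch of marginal Fisher analysis as a graph embedding: an intrinsic graph over same-class nearest neighbors, a penalty graph over nearest other-class neighbors, and a generalized eigenproblem trading interclass separability against intraclass compactness. The k-nearest-other-class penalty graph is a simplification of the paper's marginal-pair construction, and the regularization constant is an assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def mfa(X, y, k1=5, k2=20, d=2):
    """Marginal Fisher analysis sketch.

    X: (n_samples, n_features); y: class labels. Returns a
    (n_features x d) projection matrix.
    """
    y = np.asarray(y)
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    Wi = np.zeros((n, n))   # intrinsic graph: same-class k1-NN
    Wp = np.zeros((n, n))   # penalty graph: other-class k2-NN
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(D2[i, same])[:k1]]:
            Wi[i, j] = Wi[j, i] = 1.0
        for j in diff[np.argsort(D2[i, diff])[:k2]]:
            Wp[i, j] = Wp[j, i] = 1.0
    Li = np.diag(Wi.sum(1)) - Wi            # graph Laplacians
    Lp = np.diag(Wp.sum(1)) - Wp
    Si = X.T @ Li @ X + 1e-6 * np.eye(X.shape[1])  # intraclass scatter
    Sp = X.T @ Lp @ X                               # marginal scatter
    # Maximize w^T Sp w / w^T Si w: generalized eigenvectors of
    # Sp w = lambda Si w, keeping the d largest eigenvalues.
    vals, vecs = eigh(Sp, Si)
    return vecs[:, np.argsort(vals)[::-1][:d]]
```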

2,751 citations