
Simultaneous Feature Selection and Clustering
Using Mixture Models
Martin H.C. Law, Student Member, IEEE, Mário A.T. Figueiredo, Senior Member, IEEE, and
Anil K. Jain, Fellow, IEEE
Abstract—Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist
many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the
clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there
are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the
determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we
propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of
mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant
features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to
simultaneously estimate the feature saliencies and the number of clusters.
Index Terms—Feature selection, clustering, unsupervised learning, mixture models, minimum message length, EM algorithm.
1 INTRODUCTION
The goal of clustering is to discover a "natural" grouping in a set of patterns, points, or objects, without knowledge of any class labels. Clustering, or cluster analysis, is
prevalent in any discipline that involves analysis of multi-
variate data. It is, of course, impractical to exhaustively list
the numerous uses of clustering techniques. Image seg-
mentation, an important problem in computer vision, can
be formulated as a clustering problem [21], [28], [55].
Documents can be clustered [23] to generate topical
hierarchies for information access [53] or retrieval [5].
Clustering is also used to perform market segmentation [2],
[11] as well as in biology, e.g., to study genome data [3].
Many clustering algorithms have been proposed in
different application scenarios [25], [29]. They can be
divided roughly into two categories: hierarchical clustering,
which creates a “tree” with branches merging at different
levels, and partitional clustering, which divides the data into
different “flat” clusters. The input of clustering algorithms
can either be a proximity matrix containing the similarities/
dissimilarities between all pairs of points, or a pattern
matrix, where each item is described by a vector of
attributes, also called features. In this paper, we shall focus
on partitional clustering with a pattern matrix as input.
In principle, the more information we have about each
pattern, the better a clustering algorithm is expected to
perform. This seems to suggest that we should use as many
features as possible to represent the patterns. However, this is
not the case in practice. Some features can be just "noise," thus
not contributing to (or even degrading) the clustering
process. The task of selecting the “best” feature subset is
known as feature selection, sometimes as variable selection or
subset selection.
Feature selection is important for several reasons, the
fundamental one being arguably that noisy features can
degrade the performance of most learning algorithms (see
the example in Fig. 1). In supervised learning, it is known
that feature selection can improve the performance of
classifiers learned from limited amounts of data [49]; it
leads to more economical (both in storage and computation)
classifiers and, in many cases, it may lead to interpretable
models. Feature selection is particularly important for data
sets with large numbers of features, e.g., classification
problems in molecular biology may involve thousands of
features [3], [62], and a Web page can be represented by
thousands of different key-terms [58]. Appearance-based
image classification methods may use each pixel as a
feature [6], thus easily involving thousands of features.
Feature selection has been widely studied in the context
of supervised learning (see [7], [24], [33], [34] and references
therein), where the ultimate goal is to select features that
can achieve the highest accuracy on unseen data. Feature
selection has received comparatively very little attention in
unsupervised learning or clustering. One important reason
is that it is not at all clear how to assess the relevance of a
subset of features without resorting to class labels. The
problem is made even more challenging when the number
of clusters is unknown, since the optimal number of clusters
and the optimal feature subset are interrelated, as illu-
strated in Fig. 2 (taken from [16]). Note that methods based
on variance (such as principal components analysis) need not
select good features for clustering, as features with large
variance can be independent of the intrinsic grouping of the
data (see example in Fig. 3).
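As a concrete illustration of this point, the short sketch below (our own, not from the paper; it assumes NumPy and scikit-learn are available) builds a two-cluster data set in which the high-variance feature carries no cluster information, and shows that the first principal component is dominated by that irrelevant feature:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 10.0, size=2 * n)              # large variance, no cluster structure
x2 = np.concatenate([rng.normal(-1.0, 0.3, n),      # small variance, but it is the feature
                     rng.normal(+1.0, 0.3, n)])     # that actually separates the two clusters
X = np.column_stack([x1, x2])

pca = PCA(n_components=1).fit(X)
print(pca.components_[0])   # essentially aligned with x1, the cluster-irrelevant feature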
Most feature selection algorithms (such as [9], [33], [47])
involve a combinatorial search through the space of all
feature subsets. Usually, heuristic (nonexhaustive) methods
have to be adopted, because the size of this space is
. M.H.C. Law and A.K. Jain are with the Department of Computer Science
and Engineering, Michigan State University, 3115 Engineering Building,
East Lansing, Michigan 48824-1226. E-mail: {lawhiu, jain}@cse.msu.edu.
. M.A.T. Figueiredo is with the Instituto de Telecomunicações, Instituto
Superior Técnico, Torre Norte, Piso 10, Av. Rovisco Pais, 1049-001 Lisboa,
Portugal. E-mail: mtf@lx.it.pt.
Manuscript received 15 May 2003; accepted 27 Feb. 2004.
Recommended for acceptance by B.J. Frey.

exponential in the number of features. In this case, one
generally loses any guarantee of optimality of the selected
feature subset.
In this paper, we propose a solution to the feature selection
problem in unsupervised learning by casting it as an
estimation problem, thus avoiding any combinatorial search.
Instead of selecting a subset of features, we estimate a set of
real-valued (actually in [0, 1]) quantities (one for each feature)
which we call the feature saliencies. This estimation is carried
out by an EM algorithm derived for the task. Since we are in
the presence of a model-selection-type problem, it is
necessary to avoid the situation where all the saliencies take
the maximum possible value. This is achieved by adopting a
minimum message length (MML, [60], [61]) penalty, as was
done in [18] to select the number of clusters. The MML
criterion encourages the saliencies of the irrelevant features to
go to zero, allowing us to prune the feature set. Finally, we
integrate the process of feature saliency estimation into the
algorithm proposed in [18], thus obtaining a method which is
able to simultaneously perform feature selection and deter-
mine the number of clusters. Although the algorithm is
presented with respect to Gaussian mixture-based clustering,
one can extend it to other types of model-based clustering as
well. The algorithm first appears in [38].
The remainder of this paper is organized as follows: In
Section 2, we review approaches for feature selection and
previous attempts to solve the feature selection problem in
unsupervised learning. The details of our approach are
presented in Section 3. Experimental results are reported in
Section 4, followed by comments on the proposed algorithm
in Section 5. Finally, we conclude in Section 6 and outline
some future work directions.
2 RELATED WORK
Most of the literature on feature selection pertains to
supervised learning (both classification [24] and regression
[40]). Feature selection algorithms can be broadly divided
into two categories [7], [33]: filters and wrappers. The filter
approaches evaluate the relevance of each feature (subset)
using the data set alone, regardless of the subsequent learning
algorithm. RELIEF [32] and its enhancement [36] are
representatives of this class, where the basic idea is to assign
feature weights based on the consistency of the feature value
in the k nearest neighbors of every data point. Information-
theoretic methods are also used to evaluate features: the
mutual information between a relevant feature and the class
labels should be high [4]. Nonparametric methods can be
used to compute mutual information involving continuous
features [37]. A feature can be regarded as irrelevant if it is
conditionally independent of the class labels given other
features. The concept of Markov blanket is used to formalize
this notion of irrelevancy in [34].
On the other hand, wrapper approaches [33] invoke the
learning algorithm to evaluate the quality of each feature
(subset). Specifically, a learning algorithm (e.g., a nearest
neighbor classifier, a decision tree, a naive Bayes method) is
run on a feature subset and the feature subset is assessed by
some estimate of the classification accuracy. Wrappers are
usually more computationally demanding, but they can be
superior in accuracy when compared with filters, which
ignore the properties of the learning task at hand [33].
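To make the wrapper idea concrete, here is a minimal sketch of a wrapper-style subset evaluation (our own illustration, not an algorithm from the paper), assuming scikit-learn's nearest-neighbor classifier and cross-validation utilities: each candidate feature subset is scored by the cross-validated accuracy of the learner that will eventually be used.

from itertools import combinations
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, subset, k=3, folds=5):
    """Estimate classification accuracy using only the given feature subset."""
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X[:, list(subset)], y, cv=folds).mean()

def best_subset_of_size(X, y, size):
    """Exhaustive wrapper search over subsets of a fixed size (illustrative only;
    practical wrappers use sequential, floating, beam, or genetic heuristics instead)."""
    return max(combinations(range(X.shape[1]), size),
               key=lambda s: wrapper_score(X, y, s))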
Both approaches, filters and wrappers, usually involve
combinatorial searches through the space of possible
feature subsets; for this task, different types of heuristics,
such as sequential forward or backward searches, floating
search, beam search, bidirectional search, and genetic
search have been suggested [9], [33], [47], [63]. It is also
possible to construct a set of weak (in the boosting sense
[20]) classifiers, with each one using only one feature, and
then apply boosting, which effectively performs feature selection [59]. It has also been proposed to approach feature selection using rough set theory [35].

Fig. 1. A uniformly distributed irrelevant feature ($x_2$) makes it difficult for the Gaussian mixture learning algorithm in [18] to recover the two underlying clusters. If only feature $x_1$ is used, however, the two clusters are easily identified. The curves along the horizontal and vertical axes of the figure indicate the marginal distributions of $x_1$ and $x_2$, respectively.

Fig. 2. The number of clusters is interrelated with the feature subset used. The optimal feature subsets for identifying three, two, and one clusters in this data set are $\{x_1, x_2\}$, $\{x_1\}$, and $\{x_2\}$, respectively. On the other hand, the optimal numbers of clusters for the feature subsets $\{x_1, x_2\}$, $\{x_1\}$, and $\{x_2\}$ are also three, two, and one, respectively.

Fig. 3. Feature $x_1$, although explaining more data variance than feature $x_2$, is spurious for the identification of the two clusters in this data set.
All of the approaches mentioned above are concerned
with feature selection in the presence of class labels.
Comparatively, not much work has been done for feature
selection in unsupervised learning. Of course, any method
conceived for supervised learning that does not use the
class labels could be used for unsupervised learning; it is
the case for methods that measure feature similarity to
detect redundant features, using, e.g., mutual information
[53] or a maximum information compression index [42]. In
[16], [17], the normalized log-likelihood and cluster separ-
ability are used to evaluate the quality of clusters obtained
with different feature subsets. Different feature subsets and
numbers of clusters, for multinomial model-based cluster-
ing, are evaluated using marginal likelihood and cross-
validated likelihood in [58]. The algorithm described in [52]
uses automatic relevance determination priors to select
features when there are two clusters. In [13], the clustering
tendency of each feature is assessed by an entropy index. A
genetic algorithm is used in [31] for feature selection in
k-means clustering. In [56], feature selection for symbolic
data is addressed by assuming that irrelevant features are
uncorrelated with the relevant features. Reference [14]
describes the notion of “category utility” for feature
selection in a conceptual clustering task. The CLIQUE
algorithm [1] is popular in the data mining community and
it finds hyperrectangular shaped clusters using a subset of
attributes for a large database. The wrapper approach can
also be adopted to select features for clustering; this has
been explored in our earlier work [19], [38].
All the methods referred to above perform "hard" feature
selection (a feature is either selected or not). There are also
algorithms that assign weights to different features to
indicate their significance. In [43], weights are assigned to
different groups of features for k-means clustering based on
a score related to the Fisher discriminant. Feature weighting
for k-means clustering is also considered in [41], but the goal
there is to find the best description of the clusters after they
are identified. The method described in [46] can be
classified as learning feature weights for conditional
Gaussian networks. An EM algorithm based on Bayesian
shrinking is proposed in [22] for unsupervised learning.
3 EM ALGORITHM FOR FEATURE SALIENCY
In this section, we propose an EM algorithm for performing
mixture-based (or model-based) clustering with feature
selection. In mixture-based clustering, each data point is
modeled as having been generated by one of a set of
probabilistic models [25], [39]. Clustering is then done by
learning the parameters of these models and the associated
probabilities. Each pattern is assigned to the mixture
component that most likely generated it. Although the
derivations below refer to Gaussian mixtures, they can be
generalized to other types of mixtures.
3.1 Mixture Densities
A finite mixture density with K components is defined by

$$p(\mathbf{y}) = \sum_{j=1}^{K} \alpha_j\, p(\mathbf{y} \mid \theta_j), \qquad (1)$$

where $\alpha_j \ge 0$ for all $j$, $\sum_j \alpha_j = 1$; each $\theta_j$ is the set of parameters of the $j$th component (all components are assumed to have the same form, e.g., Gaussian); and $\boldsymbol{\theta} \equiv \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$ will denote the full parameter set. The goal of mixture estimation is to infer $\boldsymbol{\theta}$ from a set of $N$ data points $\{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, assumed to be samples of a distribution with density given by (1). Each $\mathbf{y}_i$ is a $D$-dimensional feature vector $[y_{i1}, \ldots, y_{iD}]^T$. In the sequel, we will use the indices $i$, $j$, and $l$ to run through data points (1 to $N$), mixture components (1 to $K$), and features (1 to $D$), respectively.
As is well-known, neither the maximum likelihood (ML) estimate,

$$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\theta}} \left\{ \log p(\mathcal{Y} \mid \boldsymbol{\theta}) \right\},$$

nor the maximum a posteriori (MAP) estimate (given some prior $p(\boldsymbol{\theta})$),

$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \left\{ \log p(\mathcal{Y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \right\},$$
can be found analytically. The usual choice is the EM
algorithm, which finds local maxima of these criteria [39].
This algorithm is based on a set $\mathcal{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ of $N$ missing (latent) labels, where $\mathbf{z}_i = [z_{i1}, \ldots, z_{iK}]$, with $z_{ij} = 1$ and $z_{ip} = 0$ for $p \neq j$, meaning that $\mathbf{y}_i$ is a sample of $p(\cdot \mid \theta_j)$. For brevity of notation, we sometimes write $\mathbf{z}_i = j$ for such $\mathbf{z}_i$. The complete data log-likelihood, i.e., the log-likelihood if $\mathcal{Z}$ were observed, is

$$\log p(\mathcal{Y}, \mathcal{Z} \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \log\!\left[ \alpha_j\, p(\mathbf{y}_i \mid \theta_j) \right]. \qquad (2)$$
The EM algorithm produces a sequence of estimates $\{\hat{\boldsymbol{\theta}}(t),\ t = 0, 1, 2, \ldots\}$ using two alternating steps:

. E-step: Compute $\mathcal{W} \equiv E[\mathcal{Z} \mid \mathcal{Y}, \hat{\boldsymbol{\theta}}(t)]$, the expected value of the missing data given the current parameter estimate, and plug it into $\log p(\mathcal{Y}, \mathcal{Z} \mid \boldsymbol{\theta})$, yielding the so-called Q-function $Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t)) = \log p(\mathcal{Y}, \mathcal{W} \mid \boldsymbol{\theta})$. Since the elements of $\mathcal{Z}$ are binary, we have

$$w_{ij} \equiv E\!\left[ z_{ij} \mid \mathcal{Y}, \hat{\boldsymbol{\theta}}(t) \right] = \Pr\!\left[ z_{ij} = 1 \mid \mathbf{y}_i, \hat{\boldsymbol{\theta}}(t) \right] = \frac{\hat{\alpha}_j(t)\, p(\mathbf{y}_i \mid \hat{\theta}_j(t))}{\sum_{k=1}^{K} \hat{\alpha}_k(t)\, p(\mathbf{y}_i \mid \hat{\theta}_k(t))}. \qquad (3)$$

Notice that $\alpha_j$ is the a priori probability that $z_{ij} = 1$ (i.e., that $\mathbf{y}_i$ belongs to cluster $j$), while $w_{ij}$ is the corresponding a posteriori probability, after observing $\mathbf{y}_i$.

. M-step: Update the parameter estimates,

$$\hat{\boldsymbol{\theta}}(t+1) = \arg\max_{\boldsymbol{\theta}} \left\{ Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t)) + \log p(\boldsymbol{\theta}) \right\},$$

in the case of MAP estimation, or without the $\log p(\boldsymbol{\theta})$ term in the ML case.
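For reference, the sketch below (our own, not from the paper) spells out one such EM iteration in NumPy for a Gaussian mixture with diagonal covariances, the special case used later in the paper; the M-step shown is the plain ML version, i.e., without the $\log p(\boldsymbol{\theta})$ term, and a robust implementation would work in the log domain to avoid underflow.

import numpy as np

def diag_gauss_pdf(Y, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at each row of Y."""
    d = Y - mean
    return np.exp(-0.5 * np.sum(d * d / var, axis=1)) / np.sqrt(np.prod(2.0 * np.pi * var))

def em_step_gmm(Y, alpha, means, variances):
    """One EM iteration for a K-component diagonal Gaussian mixture.
    Y: (N, D); alpha: (K,); means, variances: (K, D)."""
    N, D = Y.shape
    K = alpha.shape[0]
    # E-step: posteriors w_ij as in Eq. (3)
    like = np.stack([alpha[j] * diag_gauss_pdf(Y, means[j], variances[j]) for j in range(K)], axis=1)
    w = like / like.sum(axis=1, keepdims=True)          # (N, K)
    # M-step (ML): reestimate mixing weights, means, and per-feature variances
    Nj = w.sum(axis=0)                                   # effective counts per component
    alpha_new = Nj / N
    means_new = (w.T @ Y) / Nj[:, None]
    variances_new = (w.T @ (Y ** 2)) / Nj[:, None] - means_new ** 2
    return alpha_new, means_new, np.maximum(variances_new, 1e-9), w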
3.2 Feature Saliency
In this section, we define the concept of feature saliency and
derive an EM algorithm to estimate its value. We assume
that the features are conditionally independent given the
(hidden) component label, that is,

$$p(\mathbf{y} \mid \boldsymbol{\theta}) = \sum_{j=1}^{K} \alpha_j\, p(\mathbf{y} \mid \theta_j) = \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} p(y_l \mid \theta_{jl}), \qquad (4)$$

where $p(\cdot \mid \theta_{jl})$ is the pdf of the $l$th feature in the $j$th component. This assumption enables us to utilize the power of the EM algorithm. In the particular case of Gaussian mixtures, the conditional independence assumption is equivalent to adopting diagonal covariance matrices, which is a common choice for high-dimensional data, such as in naïve Bayes classifiers, latent class models, as well as in the emission densities of continuous hidden Markov models.
Among different definitions of feature irrelevancy (proposed for supervised learning), we adopt the one suggested in [48], [58], which is suitable for unsupervised learning: the $l$th feature is irrelevant if its distribution is independent of the class labels, i.e., if it follows a common density, denoted by $q(y_l \mid \lambda_l)$. Let $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_D)$ be a set of binary parameters, such that $\phi_l = 1$ if feature $l$ is relevant and $\phi_l = 0$ otherwise. The mixture density in (4) can then be rewritten as

$$p(\mathbf{y} \mid \boldsymbol{\phi}, \{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}) = \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \left[ p(y_l \mid \theta_{jl}) \right]^{\phi_l} \left[ q(y_l \mid \lambda_l) \right]^{1-\phi_l}. \qquad (5)$$

A related model for feature selection in supervised learning has been considered in [44], [48]. Intuitively, $\boldsymbol{\phi}$ determines which edges exist between the hidden label $z$ and the individual features $y_l$ in the graphical model illustrated in Fig. 4, for the case $D = 4$.
Our notion of feature saliency is summarized in the following steps: 1) we treat the $\phi_l$'s as missing variables and 2) we define the feature saliency as $\rho_l = P(\phi_l = 1)$, the probability that the $l$th feature is relevant. This definition makes sense, as it is difficult to know for sure that a certain feature is irrelevant in unsupervised learning. The resulting model (likelihood function) is written as (see the proof in Appendix A)

$$p(\mathbf{y} \mid \Theta) = \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \left( \rho_l\, p(y_l \mid \theta_{jl}) + (1 - \rho_l)\, q(y_l \mid \lambda_l) \right), \qquad (6)$$

where $\Theta = \{\{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}, \{\rho_l\}\}$ is the set of all the parameters of the model. An intuitive way to see how (6) is obtained is to notice that $[p(y_l \mid \theta_{jl})]^{\phi_l} [q(y_l \mid \lambda_l)]^{1-\phi_l}$ can be written as $\phi_l\, p(y_l \mid \theta_{jl}) + (1 - \phi_l)\, q(y_l \mid \lambda_l)$, because $\phi_l$ is binary.
The form of $q(\cdot \mid \cdot)$ reflects our prior knowledge about the distribution of the nonsalient features. In principle, it can be any 1D distribution (e.g., a Gaussian, a Student-t, or even a mixture). We shall limit $q(\cdot \mid \cdot)$ to be a Gaussian, since this leads to reasonable results in practice.
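For completeness, here is a sketch of the marginalization behind (6) (the full argument is in Appendix A of the paper): summing (5) over all $2^D$ configurations of the independent binary indicators, with $P(\phi_l = 1) = \rho_l$, and exchanging the sum over each $\phi_l$ with the product over $l$, gives

$$\begin{aligned} p(\mathbf{y} \mid \Theta) &= \sum_{\boldsymbol{\phi} \in \{0,1\}^D} \Big( \prod_{l=1}^{D} \rho_l^{\phi_l} (1-\rho_l)^{1-\phi_l} \Big) \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \big[ p(y_l \mid \theta_{jl}) \big]^{\phi_l} \big[ q(y_l \mid \lambda_l) \big]^{1-\phi_l} \\ &= \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \sum_{\phi_l \in \{0,1\}} \big[ \rho_l\, p(y_l \mid \theta_{jl}) \big]^{\phi_l} \big[ (1-\rho_l)\, q(y_l \mid \lambda_l) \big]^{1-\phi_l} \\ &= \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \big( \rho_l\, p(y_l \mid \theta_{jl}) + (1-\rho_l)\, q(y_l \mid \lambda_l) \big). \end{aligned}$$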
Equation (6) has a generative interpretation. As in a standard finite mixture, we first select the component label $j$ by sampling from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_K)$. Then, for each feature $l = 1, \ldots, D$, we flip a biased coin whose probability of getting a head is $\rho_l$; if we get a head, we use the mixture component $p(\cdot \mid \theta_{jl})$ to generate the $l$th feature; otherwise, the common component $q(\cdot \mid \lambda_l)$ is used. A graphical model representation of (6) is shown in Fig. 5 for the case $D = 4$.

Fig. 4. A graphical model for the probability model in (5) for the case of four features ($D = 4$) with different indicator variables. $\phi_l = 1$ corresponds to the existence of an arc from $z$ to $y_l$, and $\phi_l = 0$ corresponds to its absence. (a) $\phi_1 = 1$, $\phi_2 = 1$, $\phi_3 = 0$, $\phi_4 = 1$. (b) $\phi_1 = 0$, $\phi_2 = 1$, $\phi_3 = 1$, $\phi_4 = 0$.

Fig. 5. A graphical model showing the mixture density in (6). The variables $z, \phi_1, \phi_2, \phi_3, \phi_4$ are "hidden" and only $y_1, y_2, y_3, y_4$ are observed.
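The generative process just described translates directly into an ancestral-sampling routine. The sketch below is our own illustration (not the authors' code), assuming univariate Gaussians for both $p(\cdot \mid \theta_{jl})$ and $q(\cdot \mid \lambda_l)$ and NumPy for the random draws.

import numpy as np

def sample_from_model(alpha, mu, s2, rho, mu_q, s2_q, rng=np.random.default_rng()):
    """Draw one D-dimensional sample from the feature-saliency mixture (6).
    alpha: (K,) mixing weights; mu, s2: (K, D) per-component Gaussian parameters;
    rho: (D,) saliencies; mu_q, s2_q: (D,) common ('irrelevant') component parameters."""
    K, D = mu.shape
    j = rng.choice(K, p=alpha)                  # pick the component label z = j
    phi = rng.random(D) < rho                   # biased coin per feature: relevant or not?
    y = np.where(phi,
                 rng.normal(mu[j], np.sqrt(s2[j])),     # relevant: component-specific density
                 rng.normal(mu_q, np.sqrt(s2_q)))       # irrelevant: common density q
    return y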
3.2.1 EM Algorithm
By treating $\mathcal{Z}$ (the hidden class labels) and $\boldsymbol{\phi}$ as hidden variables, one can derive (see details in Appendix B) the following EM algorithm for parameter estimation:
. E-step: Compute the following quantities:

$$a_{ijl} = P(\phi_l = 1, y_{il} \mid z_i = j) = \rho_l\, p(y_{il} \mid \theta_{jl}), \qquad (7)$$
$$b_{ijl} = P(\phi_l = 0, y_{il} \mid z_i = j) = (1 - \rho_l)\, q(y_{il} \mid \lambda_l), \qquad (8)$$
$$c_{ijl} = P(y_{il} \mid z_i = j) = a_{ijl} + b_{ijl}, \qquad (9)$$
$$w_{ij} = P(z_i = j \mid \mathbf{y}_i) = \frac{\alpha_j \prod_l c_{ijl}}{\sum_j \alpha_j \prod_l c_{ijl}}, \qquad (10)$$
$$u_{ijl} = P(\phi_l = 1, z_i = j \mid \mathbf{y}_i) = \frac{a_{ijl}}{c_{ijl}}\, w_{ij}, \qquad (11)$$
$$v_{ijl} = P(\phi_l = 0, z_i = j \mid \mathbf{y}_i) = w_{ij} - u_{ijl}. \qquad (12)$$
. M-step: Reestimate the parameters according to the following expressions:
$$\hat{\alpha}_j = \frac{\sum_i w_{ij}}{\sum_{ij} w_{ij}} = \frac{\sum_i w_{ij}}{N}, \qquad (13)$$
$$\widehat{(\text{Mean in } \theta_{jl})} = \frac{\sum_i u_{ijl}\, y_{il}}{\sum_i u_{ijl}}, \qquad (14)$$
$$\widehat{(\text{Var in } \theta_{jl})} = \frac{\sum_i u_{ijl}\, \big(y_{il} - \widehat{(\text{Mean in } \theta_{jl})}\big)^2}{\sum_i u_{ijl}}, \qquad (15)$$
$$\widehat{(\text{Mean in } \lambda_l)} = \frac{\sum_i \big(\sum_j v_{ijl}\big)\, y_{il}}{\sum_{ij} v_{ijl}}, \qquad (16)$$
$$\widehat{(\text{Var in } \lambda_l)} = \frac{\sum_i \big(\sum_j v_{ijl}\big)\, \big(y_{il} - \widehat{(\text{Mean in } \lambda_l)}\big)^2}{\sum_{ij} v_{ijl}}, \qquad (17)$$
$$\hat{\rho}_l = \frac{\sum_{i,j} u_{ijl}}{\sum_{i,j} u_{ijl} + \sum_{i,j} v_{ijl}} = \frac{\sum_{i,j} u_{ijl}}{N}. \qquad (18)$$
In these equations, the variable $u_{ijl}$ measures how important the $i$th pattern is to the $j$th component, when the $l$th feature is used. It is thus natural that the estimates of the mean and the variance in $\theta_{jl}$ are weighted sums with weights $u_{ijl}$. A similar relationship exists between $\sum_j v_{ijl}$ and $\lambda_l$. The term $\sum_{ij} u_{ijl}$ can be interpreted as how likely it is that $\phi_l$ equals one, explaining why the estimate of $\rho_l$ is proportional to $\sum_{ij} u_{ijl}$.
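Putting the E-step (7)-(12) and the M-step (13)-(18) together, one unpenalized EM iteration can be written in a few lines of vectorized NumPy. This is our reading of the update equations with Gaussian $p$ and $q$, not the authors' code, and it omits the numerical safeguards (e.g., working in the log domain) that a robust implementation would need.

import numpy as np

def gauss(y, mean, var):
    # Univariate Gaussian density, broadcast elementwise.
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_step_saliency(Y, alpha, mu, s2, rho, mu_q, s2_q):
    """One unpenalized EM iteration for model (6).
    Y: (N, D) data; alpha: (K,); mu, s2: (K, D) parameters theta_jl;
    rho: (D,) saliencies; mu_q, s2_q: (D,) common-component parameters lambda_l."""
    N, D = Y.shape
    # E-step, Eqs. (7)-(12); arrays indexed as (i, j, l).
    a = rho * gauss(Y[:, None, :], mu[None, :, :], s2[None, :, :])        # (N, K, D)
    b = (1.0 - rho) * gauss(Y[:, None, :], mu_q, s2_q)                    # broadcasts over j
    c = a + b
    w = alpha * np.prod(c, axis=2)                                        # (N, K), unnormalized
    w /= w.sum(axis=1, keepdims=True)                                     # Eq. (10)
    u = (a / c) * w[:, :, None]                                           # Eq. (11)
    v = w[:, :, None] - u                                                 # Eq. (12)
    # M-step, Eqs. (13)-(18).
    alpha_new = w.sum(axis=0) / N                                         # Eq. (13)
    U = u.sum(axis=0)                                                     # (K, D)
    mu_new = (u * Y[:, None, :]).sum(axis=0) / U                          # Eq. (14)
    s2_new = (u * (Y[:, None, :] - mu_new) ** 2).sum(axis=0) / U          # Eq. (15)
    V = v.sum(axis=(0, 1))                                                # (D,)
    v_il = v.sum(axis=1)                                                  # (N, D)
    mu_q_new = (v_il * Y).sum(axis=0) / V                                 # Eq. (16)
    s2_q_new = (v_il * (Y - mu_q_new) ** 2).sum(axis=0) / V               # Eq. (17)
    rho_new = u.sum(axis=(0, 1)) / N                                      # Eq. (18)
    return alpha_new, mu_new, s2_new, rho_new, mu_q_new, s2_q_new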
3.3 Model Selection
Standard EM for mixtures exhibits some weaknesses which
also affect the EM algorithm introduced above: it requires
knowledge of K, and a good initialization is essential for
reaching a good local optimum. To overcome these difficul-
ties, we adopt the approach in [18], which is based on the
MML criterion [61], [60].
The MML criterion for our model (see details in Appendix C) consists of minimizing, with respect to $\Theta$, the following cost function (after discarding the order-one term):

$$-\log p(\mathcal{Y} \mid \Theta) + \frac{K + D}{2} \log N + \frac{R}{2} \sum_{l=1}^{D} \sum_{j=1}^{K} \log(N \alpha_j \rho_l) + \frac{S}{2} \sum_{l=1}^{D} \log\!\big(N (1 - \rho_l)\big), \qquad (19)$$

where $R$ and $S$ are the numbers of parameters in $\theta_{jl}$ and $\lambda_l$, respectively. If $p(\cdot \mid \cdot)$ and $q(\cdot \mid \cdot)$ are univariate Gaussians (arbitrary mean and variance), $R = S = 2$. From a parameter estimation viewpoint, (19) is equivalent to a maximum a posteriori (MAP) estimate,

$$\hat{\Theta} = \arg\max_{\Theta} \left\{ \log p(\mathcal{Y} \mid \Theta) - \frac{RD}{2} \sum_{j=1}^{K} \log \alpha_j - \frac{S}{2} \sum_{l=1}^{D} \log(1 - \rho_l) - \frac{RK}{2} \sum_{l=1}^{D} \log \rho_l \right\}, \qquad (20)$$
with the following (Dirichlet-type, but improper) priors on the $\alpha_j$'s and $\rho_l$'s:

$$p(\alpha_1, \ldots, \alpha_K) \propto \prod_{j=1}^{K} \alpha_j^{-RD/2}, \qquad p(\rho_1, \ldots, \rho_D) \propto \prod_{l=1}^{D} \rho_l^{-RK/2} (1 - \rho_l)^{-S/2}.$$
Since these priors are conjugate with respect to the complete data likelihood, the EM algorithm undergoes only a minor modification: the M-step equations (13) and (18) are replaced by

$$\hat{\alpha}_j = \frac{\max\!\big(\sum_i w_{ij} - \frac{RD}{2},\ 0\big)}{\sum_j \max\!\big(\sum_i w_{ij} - \frac{RD}{2},\ 0\big)}, \qquad (21)$$

$$\hat{\rho}_l = \frac{\max\!\big(\sum_{i,j} u_{ijl} - \frac{KR}{2},\ 0\big)}{\max\!\big(\sum_{i,j} u_{ijl} - \frac{KR}{2},\ 0\big) + \max\!\big(\sum_{i,j} v_{ijl} - \frac{S}{2},\ 0\big)}. \qquad (22)$$
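In code, the only change relative to the earlier M-step is the subtraction and clipping in (21)-(22), which is what prunes components and features. A sketch (again our own, reusing the quantities $w$, $u$, $v$ from the E-step above, and assuming at least one component and one term per feature survive the clipping):

import numpy as np

def mml_mstep_weights(w, u, v, R=2, S=2):
    """MML-penalized reestimation of mixing weights and saliencies, Eqs. (21)-(22).
    w: (N, K) from Eq. (10); u, v: (N, K, D) from Eqs. (11)-(12).
    Components / features whose clipped numerator reaches zero are pruned."""
    K = w.shape[1]
    D = u.shape[2]
    alpha_num = np.maximum(w.sum(axis=0) - R * D / 2.0, 0.0)      # Eq. (21), numerator
    alpha_new = alpha_num / alpha_num.sum()
    u_term = np.maximum(u.sum(axis=(0, 1)) - K * R / 2.0, 0.0)    # Eq. (22)
    v_term = np.maximum(v.sum(axis=(0, 1)) - S / 2.0, 0.0)
    rho_new = u_term / (u_term + v_term)
    return alpha_new, rho_new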
In addition to the log-likelihood, the other terms in (19) have simple interpretations. The term $\frac{K+D}{2} \log N$ is a standard MDL-type [50] parameter code-length corresponding to $K$ $\alpha_j$ values and $D$ $\rho_l$ values. For the $l$th feature in the $j$th component, the "effective" number of data points for estimating $\theta_{jl}$ is $N \alpha_j \rho_l$. Since there are $R$ parameters in each $\theta_{jl}$, the corresponding code-length is $\frac{R}{2} \log(N \alpha_j \rho_l)$. Similarly, for the $l$th feature in the common component, the effective number of data points is $N(1 - \rho_l)$, giving the code-length $\frac{S}{2} \log\!\big(N(1 - \rho_l)\big)$.
Fig. 6. The unsupervised feature saliency algorithm.
