
Simultaneous Feature Selection and Clustering
Using Mixture Models
Martin H.C. Law, Student Member, IEEE, Mário A.T. Figueiredo, Senior Member, IEEE, and
Anil K. Jain, Fellow, IEEE
Abstract—Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist
many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the
clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there
are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the
determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we
propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of
mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant
features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to
simultaneously estimate the feature saliencies and the number of clusters.
Index Terms—Feature selection, clustering, unsupervised learning, mixture models, minimum message length, EM algorithm.
1 INTRODUCTION
The goal of clustering is to discover a "natural" grouping in a set of patterns, points, or objects, without knowledge of any class labels. Clustering, or cluster analysis, is
prevalent in any discipline that involves analysis of multi-
variate data. It is, of course, impractical to exhaustively list
the numerous uses of clustering techniques. Image seg-
mentation, an important problem in computer vision, can
be formulated as a clustering problem [21], [28], [55].
Documents can be clustered [23] to generate topical
hierarchies for information access [53] or retrieval [5].
Clustering is also used to perform market segmentation [2],
[11] as well as in biology, e.g., to study genome data [3].
Many clustering algorithms have been proposed in
different application scenarios [25], [29]. They can be
divided roughly into two categories: hierarchical clustering,
which creates a “tree” with branches merging at different
levels, and partitional clustering, which divides the data into
different “flat” clusters. The input of clustering algorithms
can either be a proximity matrix containing the similarities/
dissimilarities between all pairs of points, or a pattern
matrix, where each item is described by a vector of
attributes, also called features. In this paper, we shall focus
on partitional clustering with a pattern matrix as input.
In principle, the more information we have about each
pattern, the better a clustering algorithm is expected to
perform. This seems to suggest that we should use as many
features as possible to represent the patterns. However, this is
not the case in practice. Some features can be just "noise," thus
not contributing to (or even degrading) the clustering
process. The task of selecting the “best” feature subset is
known as feature selection, sometimes as variable selection or
subset selection.
Feature selection is important for several reasons, the
fundamental one being arguably that noisy features can
degrade the performance of most learning algorithms (see
the example in Fig. 1). In supervised learning, it is known
that feature selection can improve the performance of
classifiers learned from limited amounts of data [49]; it
leads to more economical (both in storage and computation)
classifiers and, in many cases, it may lead to interpretable
models. Feature selection is particularly important for data
sets with large numbers of features, e.g., classification
problems in molecular biology may involve thousands of
features [3], [62], and a Web page can be represented by
thousands of different key-terms [58]. Appearance-based
image classification methods may use each pixel as a
feature [6], thus easily involving thousands of features.
Feature selection has been widely studied in the context
of supervised learning (see [7], [24], [33], [34] and references
therein), where the ultimate goal is to select features that
can achieve the highest accuracy on unseen data. Feature
selection has received comparatively very little attention in
unsupervised learning or clustering. One important reason
is that it is not at all clear how to assess the relevance of a
subset of features without resorting to class labels. The
problem is made even more challenging when the number
of clusters is unknown, since the optimal number of clusters
and the optimal feature subset are interrelated, as illu-
strated in Fig. 2 (taken from [16]). Note that methods based
on variance (such as principal components analysis) need not
select good features for clustering, as features with large
variance can be independent of the intrinsic grouping of the
data (see example in Fig. 3).
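As a concrete illustration of this point, the short sketch below (our own, not from the paper; it assumes NumPy and scikit-learn are available) builds a two-cluster data set in which the high-variance feature carries no cluster information, and shows that the first principal component is dominated by that irrelevant feature:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 10.0, size=2 * n)              # large variance, no cluster structure
x2 = np.concatenate([rng.normal(-1.0, 0.3, n),      # small variance, but it is the feature
                     rng.normal(+1.0, 0.3, n)])     # that actually separates the two clusters
X = np.column_stack([x1, x2])

pca = PCA(n_components=1).fit(X)
print(pca.components_[0])   # essentially aligned with x1, the cluster-irrelevant feature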
Most feature selection algorithms (such as [9], [33], [47])
involve a combinatorial search through the space of all
feature subsets. Usually, heuristic (nonexhaustive) methods
have to be adopted, because the size of this space is
. M.H.C. Law and A.K. Jain are with the Department of Computer Science
and Engineering, Michigan State University, 3115 Engineering Building,
East Lansing, Michigan 48824-1226. E-mail: {lawhiu, jain}@cse.msu.edu.
. M.A.T. Figueiredo is with the Instituto de Telecomunicações, Instituto
Superior Técnico, Torre Norte, Piso 10, Av. Rovisco Pais, 1049-001 Lisboa,
Portugal. E-mail: mtf@lx.it.pt.
Manuscript received 15 May 2003; accepted 27 Feb. 2004.
Recommended for acceptance by B.J. Frey.

exponential in the number of features. In this case, one
generally loses any guarantee of optimality of the selected
feature subset.
In this paper, we propose a solution to the feature selection
problem in unsupervised learning by casting it as an
estimation problem, thus avoiding any combinatorial search.
Instead of selecting a subset of features, we estimate a set of
real-valued (actually in [0, 1]) quantities (one for each feature)
which we call the feature saliencies. This estimation is carried
out by an EM algorithm derived for the task. Since we are in
the presence of a model-selection-type problem, it is
necessary to avoid the situation where all the saliencies take
the maximum possible value. This is achieved by adopting a
minimum message length (MML, [60], [61]) penalty, as was
done in [18] to select the number of clusters. The MML
criterion encourages the saliencies of the irrelevant features to
go to zero, allowing us to prune the feature set. Finally, we
integrate the process of feature saliency estimation into the
algorithm proposed in [18], thus obtaining a method which is
able to simultaneously perform feature selection and deter-
mine the number of clusters. Although the algorithm is
presented with respect to Gaussian mixture-based clustering,
one can extend it to other types of model-based clustering as
well. The algorithm first appears in [38].
The remainder of this paper is organized as follows: In
Section 2, we review approaches for feature selection and
previous attempts to solve the feature selection problem in
unsupervised learning. The details of our approach are
presented in Section 3. Experimental results are reported in
Section 4, followed by comments on the proposed algorithm
in Section 5. Finally, we conclude in Section 6 and outline
some future work directions.
2 RELATED WORK
Most of the literature on feature selection pertains to
supervised learning (both classification [24] and regression
[40]). Feature selection algorithms can be broadly divided
into two categories [7], [33]: filters and wrappers. The filter
approaches evaluate the relevance of each feature (subset)
using the data set alone, regardless of the subsequent learning
algorithm. RELIEF [32] and its enhancement [36] are
representatives of this class, where the basic idea is to assign
feature weights based on the consistency of the feature value
in the k nearest neighbors of every data point. Information-
theoretic methods are also used to evaluate features: the
mutual information between a relevant feature and the class
labels should be high [4]. Nonparametric methods can be
used to compute mutual information involving continuous
features [37]. A feature can be regarded as irrelevant if it is
conditionally independent of the class labels given other
features. The concept of Markov blanket is used to formalize
this notion of irrelevancy in [34].
On the other hand, wrapper approaches [33] invoke the
learning algorithm to evaluate the quality of each feature
(subset). Specifically, a learning algorithm (e.g., a nearest
neighbor classifier, a decision tree, a naive Bayes method) is
run on a feature subset and the feature subset is assessed by
some estimate of the classification accuracy. Wrappers are
usually more computationally demanding, but they can be
superior in accuracy when compared with filters, which
ignore the properties of the learning task at hand [33].
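To make the wrapper idea concrete, here is a minimal sketch of a wrapper-style subset evaluation (our own illustration, not an algorithm from the paper), assuming scikit-learn's nearest-neighbor classifier and cross-validation utilities: each candidate feature subset is scored by the cross-validated accuracy of the learner that will eventually be used.

from itertools import combinations
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, subset, k=3, folds=5):
    """Estimate classification accuracy using only the given feature subset."""
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X[:, list(subset)], y, cv=folds).mean()

def best_subset_of_size(X, y, size):
    """Exhaustive wrapper search over subsets of a fixed size (illustrative only;
    practical wrappers use sequential, floating, beam, or genetic heuristics instead)."""
    return max(combinations(range(X.shape[1]), size),
               key=lambda s: wrapper_score(X, y, s))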
Both approaches, filters and wrappers, usually involve
combinatorial searches through the space of possible
feature subsets; for this task, different types of heuristics,
such as sequential forward or backward searches, floating
search, beam search, bidirectional search, and genetic
search have been suggested [9], [33], [47], [63]. It is also
possible to construct a set of weak (in the boosting sense
[20]) classifiers, with each one using only one feature, and
then apply boosting, which effectively performs feature selection [59]. It has also been proposed to approach feature selection using rough set theory [35].

Fig. 1. A uniformly distributed irrelevant feature ($x_2$) makes it difficult for the Gaussian mixture learning algorithm in [18] to recover the two underlying clusters. If only feature $x_1$ is used, however, the two clusters are easily identified. The curves along the horizontal and vertical axes of the figure indicate the marginal distributions of $x_1$ and $x_2$, respectively.

Fig. 2. The number of clusters is interrelated with the feature subset used. The optimal feature subsets for identifying three, two, and one clusters in this data set are $\{x_1, x_2\}$, $\{x_1\}$, and $\{x_2\}$, respectively. On the other hand, the optimal numbers of clusters for the feature subsets $\{x_1, x_2\}$, $\{x_1\}$, and $\{x_2\}$ are also three, two, and one, respectively.

Fig. 3. Feature $x_1$, although explaining more data variance than feature $x_2$, is spurious for the identification of the two clusters in this data set.
All of the approaches mentioned above are concerned
with feature selection in the presence of class labels.
Comparatively, not much work has been done for feature
selection in unsupervised learning. Of course, any method
conceived for supervised learning that does not use the
class labels could be used for unsupervised learning; it is
the case for methods that measure feature similarity to
detect redundant features, using, e.g., mutual information
[53] or a maximum information compression index [42]. In
[16], [17], the normalized log-likelihood and cluster separ-
ability are used to evaluate the quality of clusters obtained
with different feature subsets. Different feature subsets and
numbers of clusters, for multinomial model-based cluster-
ing, are evaluated using marginal likelihood and cross-
validated likelihood in [58]. The algorithm described in [52]
uses automatic relevance determination priors to select
features when there are two clusters. In [13], the clustering
tendency of each feature is assessed by an entropy index. A
genetic algorithm is used in [31] for feature selection in
k-means clustering. In [56], feature selection for symbolic
data is addressed by assuming that irrelevant features are
uncorrelated with the relevant features. Reference [14]
describes the notion of “category utility” for feature
selection in a conceptual clustering task. The CLIQUE
algorithm [1] is popular in the data mining community and
it finds hyperrectangular shaped clusters using a subset of
attributes for a large database. The wrapper approach can
also be adopted to select features for clustering; this has
been explored in our earlier work [19], [38].
All the methods referred to above perform "hard" feature
selection (a feature is either selected or not). There are also
algorithms that assign weights to different features to
indicate their significance. In [43], weights are assigned to
different groups of features for k-means clustering based on
a score related to the Fisher discriminant. Feature weighting
for k-means clustering is also considered in [41], but the goal
there is to find the best description of the clusters after they
are identified. The method described in [46] can be
classified as learning feature weights for conditional
Gaussian networks. An EM algorithm based on Bayesian
shrinking is proposed in [22] for unsupervised learning.
3 EM ALGORITHM FOR FEATURE SALIENCY
In this section, we propose an EM algorithm for performing
mixture-based (or model-based) clustering with feature
selection. In mixture-based clustering, each data point is
modeled as having been generated by one of a set of
probabilistic models [25], [39]. Clustering is then done by
learning the parameters of these models and the associated
probabilities. Each pattern is assigned to the mixture
component that most likely generated it. Although the
derivations below refer to Gaussian mixtures, they can be
generalized to other types of mixtures.
3.1 Mixture Densities
A finite mixture density with K components is defined by

$$p(\mathbf{y}) = \sum_{j=1}^{K} \alpha_j\, p(\mathbf{y} \mid \theta_j), \qquad (1)$$

where $\alpha_j \ge 0$ for all $j$, $\sum_j \alpha_j = 1$; each $\theta_j$ is the set of parameters of the $j$th component (all components are assumed to have the same form, e.g., Gaussian); and $\boldsymbol{\theta} \equiv \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$ will denote the full parameter set. The goal of mixture estimation is to infer $\boldsymbol{\theta}$ from a set of $N$ data points $\{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, assumed to be samples of a distribution with density given by (1). Each $\mathbf{y}_i$ is a $D$-dimensional feature vector $[y_{i1}, \ldots, y_{iD}]^T$. In the sequel, we will use the indices $i$, $j$, and $l$ to run through data points (1 to $N$), mixture components (1 to $K$), and features (1 to $D$), respectively.
As is well-known, neither the maximum likelihood (ML) estimate,

$$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\theta}} \left\{ \log p(\mathcal{Y} \mid \boldsymbol{\theta}) \right\},$$

nor the maximum a posteriori (MAP) estimate (given some prior $p(\boldsymbol{\theta})$),

$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \left\{ \log p(\mathcal{Y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \right\},$$
can be found analytically. The usual choice is the EM
algorithm, which finds local maxima of these criteria [39].
This algorithm is based on a set $\mathcal{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ of $N$ missing (latent) labels, where $\mathbf{z}_i = [z_{i1}, \ldots, z_{iK}]$, with $z_{ij} = 1$ and $z_{ip} = 0$ for $p \neq j$, meaning that $\mathbf{y}_i$ is a sample of $p(\cdot \mid \theta_j)$. For brevity of notation, we sometimes write $\mathbf{z}_i = j$ for such $\mathbf{z}_i$. The complete data log-likelihood, i.e., the log-likelihood if $\mathcal{Z}$ were observed, is

$$\log p(\mathcal{Y}, \mathcal{Z} \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \log\!\left[ \alpha_j\, p(\mathbf{y}_i \mid \theta_j) \right]. \qquad (2)$$
The EM algorithm produces a sequence of estimates $\{\hat{\boldsymbol{\theta}}(t),\ t = 0, 1, 2, \ldots\}$ using two alternating steps:

. E-step: Compute $\mathcal{W} \equiv E[\mathcal{Z} \mid \mathcal{Y}, \hat{\boldsymbol{\theta}}(t)]$, the expected value of the missing data given the current parameter estimate, and plug it into $\log p(\mathcal{Y}, \mathcal{Z} \mid \boldsymbol{\theta})$, yielding the so-called Q-function $Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t)) = \log p(\mathcal{Y}, \mathcal{W} \mid \boldsymbol{\theta})$. Since the elements of $\mathcal{Z}$ are binary, we have

$$w_{ij} \equiv E\!\left[ z_{ij} \mid \mathcal{Y}, \hat{\boldsymbol{\theta}}(t) \right] = \Pr\!\left[ z_{ij} = 1 \mid \mathbf{y}_i, \hat{\boldsymbol{\theta}}(t) \right] = \frac{\hat{\alpha}_j(t)\, p(\mathbf{y}_i \mid \hat{\theta}_j(t))}{\sum_{k=1}^{K} \hat{\alpha}_k(t)\, p(\mathbf{y}_i \mid \hat{\theta}_k(t))}. \qquad (3)$$

Notice that $\alpha_j$ is the a priori probability that $z_{ij} = 1$ (i.e., that $\mathbf{y}_i$ belongs to cluster $j$), while $w_{ij}$ is the corresponding a posteriori probability, after observing $\mathbf{y}_i$.

. M-step: Update the parameter estimates,

$$\hat{\boldsymbol{\theta}}(t+1) = \arg\max_{\boldsymbol{\theta}} \left\{ Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t)) + \log p(\boldsymbol{\theta}) \right\},$$

in the case of MAP estimation, or without the $\log p(\boldsymbol{\theta})$ term in the ML case.
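For reference, the sketch below (our own, not from the paper) spells out one such EM iteration in NumPy for a Gaussian mixture with diagonal covariances, the special case used later in the paper; the M-step shown is the plain ML version, i.e., without the $\log p(\boldsymbol{\theta})$ term, and a robust implementation would work in the log domain to avoid underflow.

import numpy as np

def diag_gauss_pdf(Y, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at each row of Y."""
    d = Y - mean
    return np.exp(-0.5 * np.sum(d * d / var, axis=1)) / np.sqrt(np.prod(2.0 * np.pi * var))

def em_step_gmm(Y, alpha, means, variances):
    """One EM iteration for a K-component diagonal Gaussian mixture.
    Y: (N, D); alpha: (K,); means, variances: (K, D)."""
    N, D = Y.shape
    K = alpha.shape[0]
    # E-step: posteriors w_ij as in Eq. (3)
    like = np.stack([alpha[j] * diag_gauss_pdf(Y, means[j], variances[j]) for j in range(K)], axis=1)
    w = like / like.sum(axis=1, keepdims=True)          # (N, K)
    # M-step (ML): reestimate mixing weights, means, and per-feature variances
    Nj = w.sum(axis=0)                                   # effective counts per component
    alpha_new = Nj / N
    means_new = (w.T @ Y) / Nj[:, None]
    variances_new = (w.T @ (Y ** 2)) / Nj[:, None] - means_new ** 2
    return alpha_new, means_new, np.maximum(variances_new, 1e-9), w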
3.2 Feature Saliency
In this section, we define the concept of feature saliency and
derive an EM algorithm to estimate its value. We assume
that the features are conditionally independent given the
(hidden) component label, that is,

$$p(\mathbf{y} \mid \boldsymbol{\theta}) = \sum_{j=1}^{K} \alpha_j\, p(\mathbf{y} \mid \theta_j) = \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} p(y_l \mid \theta_{jl}), \qquad (4)$$

where $p(\cdot \mid \theta_{jl})$ is the pdf of the $l$th feature in the $j$th component. This assumption enables us to utilize the power of the EM algorithm. In the particular case of Gaussian mixtures, the conditional independence assumption is equivalent to adopting diagonal covariance matrices, which is a common choice for high-dimensional data, such as in naïve Bayes classifiers, latent class models, as well as in the emission densities of continuous hidden Markov models.
Among different definitions of feature irrelevancy (proposed for supervised learning), we adopt the one suggested in [48], [58], which is suitable for unsupervised learning: the $l$th feature is irrelevant if its distribution is independent of the class labels, i.e., if it follows a common density, denoted by $q(y_l \mid \lambda_l)$. Let $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_D)$ be a set of binary parameters, such that $\phi_l = 1$ if feature $l$ is relevant and $\phi_l = 0$ otherwise. The mixture density in (4) can then be rewritten as

$$p(\mathbf{y} \mid \boldsymbol{\phi}, \{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}) = \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \left[ p(y_l \mid \theta_{jl}) \right]^{\phi_l} \left[ q(y_l \mid \lambda_l) \right]^{1-\phi_l}. \qquad (5)$$

A related model for feature selection in supervised learning has been considered in [44], [48]. Intuitively, $\boldsymbol{\phi}$ determines which edges exist between the hidden label $z$ and the individual features $y_l$ in the graphical model illustrated in Fig. 4, for the case $D = 4$.
Our notion of feature saliency is summarized in the following steps: 1) we treat the $\phi_l$'s as missing variables and 2) we define the feature saliency as $\rho_l = P(\phi_l = 1)$, the probability that the $l$th feature is relevant. This definition makes sense, as it is difficult to know for sure that a certain feature is irrelevant in unsupervised learning. The resulting model (likelihood function) is written as (see the proof in Appendix A)

$$p(\mathbf{y} \mid \Theta) = \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \left( \rho_l\, p(y_l \mid \theta_{jl}) + (1 - \rho_l)\, q(y_l \mid \lambda_l) \right), \qquad (6)$$

where $\Theta = \{\{\alpha_j\}, \{\theta_{jl}\}, \{\lambda_l\}, \{\rho_l\}\}$ is the set of all the parameters of the model. An intuitive way to see how (6) is obtained is to notice that $[p(y_l \mid \theta_{jl})]^{\phi_l} [q(y_l \mid \lambda_l)]^{1-\phi_l}$ can be written as $\phi_l\, p(y_l \mid \theta_{jl}) + (1 - \phi_l)\, q(y_l \mid \lambda_l)$, because $\phi_l$ is binary.
The form of $q(\cdot \mid \cdot)$ reflects our prior knowledge about the distribution of the nonsalient features. In principle, it can be any 1D distribution (e.g., a Gaussian, a Student-t, or even a mixture). We shall limit $q(\cdot \mid \cdot)$ to be a Gaussian, since this leads to reasonable results in practice.
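For completeness, here is a sketch of the marginalization behind (6) (the full argument is in Appendix A of the paper): summing (5) over all $2^D$ configurations of the independent binary indicators, with $P(\phi_l = 1) = \rho_l$, and exchanging the sum over each $\phi_l$ with the product over $l$, gives

$$\begin{aligned} p(\mathbf{y} \mid \Theta) &= \sum_{\boldsymbol{\phi} \in \{0,1\}^D} \Big( \prod_{l=1}^{D} \rho_l^{\phi_l} (1-\rho_l)^{1-\phi_l} \Big) \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \big[ p(y_l \mid \theta_{jl}) \big]^{\phi_l} \big[ q(y_l \mid \lambda_l) \big]^{1-\phi_l} \\ &= \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \sum_{\phi_l \in \{0,1\}} \big[ \rho_l\, p(y_l \mid \theta_{jl}) \big]^{\phi_l} \big[ (1-\rho_l)\, q(y_l \mid \lambda_l) \big]^{1-\phi_l} \\ &= \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{D} \big( \rho_l\, p(y_l \mid \theta_{jl}) + (1-\rho_l)\, q(y_l \mid \lambda_l) \big). \end{aligned}$$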
Equation (6) has a generative interpretation. As in a standard finite mixture, we first select the component label $j$ by sampling from a multinomial distribution with parameters $(\alpha_1, \ldots, \alpha_K)$. Then, for each feature $l = 1, \ldots, D$, we flip a biased coin whose probability of getting a head is $\rho_l$; if we get a head, we use the mixture component $p(\cdot \mid \theta_{jl})$ to generate the $l$th feature; otherwise, the common component $q(\cdot \mid \lambda_l)$ is used. A graphical model representation of (6) is shown in Fig. 5 for the case $D = 4$.

Fig. 4. A graphical model for the probability model in (5) for the case of four features ($D = 4$) with different indicator variables. $\phi_l = 1$ corresponds to the existence of an arc from $z$ to $y_l$, and $\phi_l = 0$ corresponds to its absence. (a) $\phi_1 = 1$, $\phi_2 = 1$, $\phi_3 = 0$, $\phi_4 = 1$. (b) $\phi_1 = 0$, $\phi_2 = 1$, $\phi_3 = 1$, $\phi_4 = 0$.

Fig. 5. A graphical model showing the mixture density in (6). The variables $z, \phi_1, \phi_2, \phi_3, \phi_4$ are "hidden" and only $y_1, y_2, y_3, y_4$ are observed.
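The generative process just described translates directly into an ancestral-sampling routine. The sketch below is our own illustration (not the authors' code), assuming univariate Gaussians for both $p(\cdot \mid \theta_{jl})$ and $q(\cdot \mid \lambda_l)$ and NumPy for the random draws.

import numpy as np

def sample_from_model(alpha, mu, s2, rho, mu_q, s2_q, rng=np.random.default_rng()):
    """Draw one D-dimensional sample from the feature-saliency mixture (6).
    alpha: (K,) mixing weights; mu, s2: (K, D) per-component Gaussian parameters;
    rho: (D,) saliencies; mu_q, s2_q: (D,) common ('irrelevant') component parameters."""
    K, D = mu.shape
    j = rng.choice(K, p=alpha)                  # pick the component label z = j
    phi = rng.random(D) < rho                   # biased coin per feature: relevant or not?
    y = np.where(phi,
                 rng.normal(mu[j], np.sqrt(s2[j])),     # relevant: component-specific density
                 rng.normal(mu_q, np.sqrt(s2_q)))       # irrelevant: common density q
    return y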
3.2.1 EM Algorithm
By treating $\mathcal{Z}$ (the hidden class labels) and $\boldsymbol{\phi}$ as hidden variables, one can derive (see details in Appendix B) the following EM algorithm for parameter estimation:
. E-step: Compute the following quantities:

$$a_{ijl} = P(\phi_l = 1, y_{il} \mid z_i = j) = \rho_l\, p(y_{il} \mid \theta_{jl}), \qquad (7)$$
$$b_{ijl} = P(\phi_l = 0, y_{il} \mid z_i = j) = (1 - \rho_l)\, q(y_{il} \mid \lambda_l), \qquad (8)$$
$$c_{ijl} = P(y_{il} \mid z_i = j) = a_{ijl} + b_{ijl}, \qquad (9)$$
$$w_{ij} = P(z_i = j \mid \mathbf{y}_i) = \frac{\alpha_j \prod_l c_{ijl}}{\sum_j \alpha_j \prod_l c_{ijl}}, \qquad (10)$$
$$u_{ijl} = P(\phi_l = 1, z_i = j \mid \mathbf{y}_i) = \frac{a_{ijl}}{c_{ijl}}\, w_{ij}, \qquad (11)$$
$$v_{ijl} = P(\phi_l = 0, z_i = j \mid \mathbf{y}_i) = w_{ij} - u_{ijl}. \qquad (12)$$
. M-step: Reestimate the parameters according to the following expressions:
$$\hat{\alpha}_j = \frac{\sum_i w_{ij}}{\sum_{ij} w_{ij}} = \frac{\sum_i w_{ij}}{N}, \qquad (13)$$
$$\widehat{(\text{Mean in } \theta_{jl})} = \frac{\sum_i u_{ijl}\, y_{il}}{\sum_i u_{ijl}}, \qquad (14)$$
$$\widehat{(\text{Var in } \theta_{jl})} = \frac{\sum_i u_{ijl}\, \big(y_{il} - \widehat{(\text{Mean in } \theta_{jl})}\big)^2}{\sum_i u_{ijl}}, \qquad (15)$$
$$\widehat{(\text{Mean in } \lambda_l)} = \frac{\sum_i \big(\sum_j v_{ijl}\big)\, y_{il}}{\sum_{ij} v_{ijl}}, \qquad (16)$$
$$\widehat{(\text{Var in } \lambda_l)} = \frac{\sum_i \big(\sum_j v_{ijl}\big)\, \big(y_{il} - \widehat{(\text{Mean in } \lambda_l)}\big)^2}{\sum_{ij} v_{ijl}}, \qquad (17)$$
$$\hat{\rho}_l = \frac{\sum_{i,j} u_{ijl}}{\sum_{i,j} u_{ijl} + \sum_{i,j} v_{ijl}} = \frac{\sum_{i,j} u_{ijl}}{N}. \qquad (18)$$
In these equations, the variable $u_{ijl}$ measures how important the $i$th pattern is to the $j$th component, when the $l$th feature is used. It is thus natural that the estimates of the mean and the variance in $\theta_{jl}$ are weighted sums with weights $u_{ijl}$. A similar relationship exists between $\sum_j v_{ijl}$ and $\lambda_l$. The term $\sum_{ij} u_{ijl}$ can be interpreted as how likely it is that $\phi_l$ equals one, explaining why the estimate of $\rho_l$ is proportional to $\sum_{ij} u_{ijl}$.
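Putting the E-step (7)-(12) and the M-step (13)-(18) together, one unpenalized EM iteration can be written in a few lines of vectorized NumPy. This is our reading of the update equations with Gaussian $p$ and $q$, not the authors' code, and it omits the numerical safeguards (e.g., working in the log domain) that a robust implementation would need.

import numpy as np

def gauss(y, mean, var):
    # Univariate Gaussian density, broadcast elementwise.
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_step_saliency(Y, alpha, mu, s2, rho, mu_q, s2_q):
    """One unpenalized EM iteration for model (6).
    Y: (N, D) data; alpha: (K,); mu, s2: (K, D) parameters theta_jl;
    rho: (D,) saliencies; mu_q, s2_q: (D,) common-component parameters lambda_l."""
    N, D = Y.shape
    # E-step, Eqs. (7)-(12); arrays indexed as (i, j, l).
    a = rho * gauss(Y[:, None, :], mu[None, :, :], s2[None, :, :])        # (N, K, D)
    b = (1.0 - rho) * gauss(Y[:, None, :], mu_q, s2_q)                    # broadcasts over j
    c = a + b
    w = alpha * np.prod(c, axis=2)                                        # (N, K), unnormalized
    w /= w.sum(axis=1, keepdims=True)                                     # Eq. (10)
    u = (a / c) * w[:, :, None]                                           # Eq. (11)
    v = w[:, :, None] - u                                                 # Eq. (12)
    # M-step, Eqs. (13)-(18).
    alpha_new = w.sum(axis=0) / N                                         # Eq. (13)
    U = u.sum(axis=0)                                                     # (K, D)
    mu_new = (u * Y[:, None, :]).sum(axis=0) / U                          # Eq. (14)
    s2_new = (u * (Y[:, None, :] - mu_new) ** 2).sum(axis=0) / U          # Eq. (15)
    V = v.sum(axis=(0, 1))                                                # (D,)
    v_il = v.sum(axis=1)                                                  # (N, D)
    mu_q_new = (v_il * Y).sum(axis=0) / V                                 # Eq. (16)
    s2_q_new = (v_il * (Y - mu_q_new) ** 2).sum(axis=0) / V               # Eq. (17)
    rho_new = u.sum(axis=(0, 1)) / N                                      # Eq. (18)
    return alpha_new, mu_new, s2_new, rho_new, mu_q_new, s2_q_new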
3.3 Model Selection
Standard EM for mixtures exhibits some weaknesses which
also affect the EM algorithm introduced above: it requires
knowledge of K, and a good initialization is essential for
reaching a good local optimum. To overcome these difficul-
ties, we adopt the approach in [18], which is based on the
MML criterion [61], [60].
The MML criterion for our model (see details in Appendix C) consists of minimizing, with respect to $\Theta$, the following cost function (after discarding the order-one term):

$$-\log p(\mathcal{Y} \mid \Theta) + \frac{K + D}{2} \log N + \frac{R}{2} \sum_{l=1}^{D} \sum_{j=1}^{K} \log(N \alpha_j \rho_l) + \frac{S}{2} \sum_{l=1}^{D} \log\!\big(N (1 - \rho_l)\big), \qquad (19)$$

where $R$ and $S$ are the numbers of parameters in $\theta_{jl}$ and $\lambda_l$, respectively. If $p(\cdot \mid \cdot)$ and $q(\cdot \mid \cdot)$ are univariate Gaussians (arbitrary mean and variance), $R = S = 2$. From a parameter estimation viewpoint, (19) is equivalent to a maximum a posteriori (MAP) estimate,

$$\hat{\Theta} = \arg\max_{\Theta} \left\{ \log p(\mathcal{Y} \mid \Theta) - \frac{RD}{2} \sum_{j=1}^{K} \log \alpha_j - \frac{S}{2} \sum_{l=1}^{D} \log(1 - \rho_l) - \frac{RK}{2} \sum_{l=1}^{D} \log \rho_l \right\}, \qquad (20)$$
with the following (Dirichlet-type, but improper) priors on the $\alpha_j$'s and $\rho_l$'s:

$$p(\alpha_1, \ldots, \alpha_K) \propto \prod_{j=1}^{K} \alpha_j^{-RD/2}, \qquad p(\rho_1, \ldots, \rho_D) \propto \prod_{l=1}^{D} \rho_l^{-RK/2} (1 - \rho_l)^{-S/2}.$$
Since these priors are conjugate with respect to the complete data likelihood, the EM algorithm undergoes only a minor modification: the M-step equations (13) and (18) are replaced by

$$\hat{\alpha}_j = \frac{\max\!\big(\sum_i w_{ij} - \frac{RD}{2},\ 0\big)}{\sum_j \max\!\big(\sum_i w_{ij} - \frac{RD}{2},\ 0\big)}, \qquad (21)$$

$$\hat{\rho}_l = \frac{\max\!\big(\sum_{i,j} u_{ijl} - \frac{KR}{2},\ 0\big)}{\max\!\big(\sum_{i,j} u_{ijl} - \frac{KR}{2},\ 0\big) + \max\!\big(\sum_{i,j} v_{ijl} - \frac{S}{2},\ 0\big)}. \qquad (22)$$
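In code, the only change relative to the earlier M-step is the subtraction and clipping in (21)-(22), which is what prunes components and features. A sketch (again our own, reusing the quantities $w$, $u$, $v$ from the E-step above, and assuming at least one component and one term per feature survive the clipping):

import numpy as np

def mml_mstep_weights(w, u, v, R=2, S=2):
    """MML-penalized reestimation of mixing weights and saliencies, Eqs. (21)-(22).
    w: (N, K) from Eq. (10); u, v: (N, K, D) from Eqs. (11)-(12).
    Components / features whose clipped numerator reaches zero are pruned."""
    K = w.shape[1]
    D = u.shape[2]
    alpha_num = np.maximum(w.sum(axis=0) - R * D / 2.0, 0.0)      # Eq. (21), numerator
    alpha_new = alpha_num / alpha_num.sum()
    u_term = np.maximum(u.sum(axis=(0, 1)) - K * R / 2.0, 0.0)    # Eq. (22)
    v_term = np.maximum(v.sum(axis=(0, 1)) - S / 2.0, 0.0)
    rho_new = u_term / (u_term + v_term)
    return alpha_new, rho_new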
In addition to the log-likelihood, the other terms in (19) have simple interpretations. The term $\frac{K+D}{2} \log N$ is a standard MDL-type [50] parameter code-length corresponding to $K$ $\alpha_j$ values and $D$ $\rho_l$ values. For the $l$th feature in the $j$th component, the "effective" number of data points for estimating $\theta_{jl}$ is $N \alpha_j \rho_l$. Since there are $R$ parameters in each $\theta_{jl}$, the corresponding code-length is $\frac{R}{2} \log(N \alpha_j \rho_l)$. Similarly, for the $l$th feature in the common component, the effective number of data points is $N(1 - \rho_l)$, giving the code-length $\frac{S}{2} \log\!\big(N(1 - \rho_l)\big)$.
Fig. 6. The unsupervised feature saliency algorithm.
