
Feature Selection and Kernel Learning for
Local Learning-Based Clustering
Hong Zeng, Member, IEEE, and Yiu-ming Cheung, Senior Member, IEEE

H. Zeng is with the School of Instrument Science and Engineering, Southeast University, China, and the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. E-mail: littlezenghong@gmail.com.
Y.M. Cheung is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. E-mail: ymc@comp.hkbu.edu.hk.
Manuscript received 17 Feb. 2009; revised 12 Dec. 2009; accepted 23 Oct. 2010; published online 29 Nov. 2010. Recommended for acceptance by M. Meila. Digital Object Identifier no. 10.1109/TPAMI.2010.215.
Abstract—The performance of most clustering algorithms relies heavily on the representation of the data in the input space or in the Hilbert space of kernel methods. This paper aims to obtain an appropriate data representation through feature selection or kernel learning within the framework of the Local Learning-Based Clustering (LLC) (Wu and Schölkopf 2006) method, which can outperform the global learning-based ones when dealing with high-dimensional data lying on a manifold. Specifically, we associate a weight with each feature or kernel and incorporate it into the built-in regularization of the LLC algorithm to take into account the relevance of each feature or kernel for the clustering. Accordingly, the weights are estimated iteratively in the clustering process. We show that the resulting weighted regularization with an additional constraint on the weights is equivalent to a known sparse-promoting penalty. Hence, the weights of irrelevant features or kernels can be shrunk toward zero. Extensive experiments show the efficacy of the proposed methods on the benchmark data sets.
Index Terms—High-dimensional data, local learning-based clustering, feature selection, kernel learning, sparse weighting.
1 INTRODUCTION
It is common to perform high-dimensional data clustering
in a variety of pattern recognition and data mining
problems in which the high-dimensional data are repre-
sented by a large number of features. However, the
discrimination among patterns is often impeded by the
abundance of features. For instance, it is quite common to
have thousands of gene expression coefficients as features
for a single sample in genomic data analysis, but only a
small fraction is capable of discriminating among different
tissue classes. Those irrelevant features involved in the
prediction may seriously degrade the performance of an
inference machine [13]. Therefore, it is desirable to develop
an effective feature selection algorithm toward identifying
those features relevant to the inference task in hand.
On the other hand, the kernel methods have been widely
applied to a variety of learning problems in the past
decades, where the data are implicitly mapped into a
nonlinear high-dimensional space by kernel function [30]. It
is known that the performance of these methods will
heavily hinge on the choice of kernel. Unfortunately, the
most suitable kernel for a particular task is often unknown
in advance. Moreover, exhaustive search on a user-defined
pool of kernels will be quite time-consuming when the size
of the pool becomes large [29]. Hence, it is crucial to learn
an appropriate kernel efficiently to make the performance
of the employed kernel-based inference method robust or
even improved.
This paper attempts to obtain an appropriate data representation for clustering in the input space or the Hilbert space (also interchangeably called feature space hereinafter) of kernel methods. Accordingly, two issues, i.e., feature selection and kernel learning, are considered. In fact, both of these issues have been extensively studied in the context of supervised learning, but are comparatively
less explored in the clustering problem. A major reason is
that feature selection or kernel learning in unsupervised
learning becomes more challenging without the presence of
ground-truth class labels that could guide the search for
relevant representations. Most recently, some research
works regarding these two issues have been done in the
unsupervised case, e.g., see [43], [12], [13], [25], [34]. A
predominant strategy among these approaches, which have achieved notably improved clustering performance, is to first relax the binary hard decision on the relevance of a feature
or kernel to a real-valued soft one, i.e., a confidence or
weight, turning the combinatorial search problem into a
continuous learning problem. Then, these approaches apply
the following two iterative steps until convergence: 1) esti-
mating the weights for features or kernels using the
intermediate clustering result, and 2) refeeding the weighted
feature or kernel into the employed clustering algorithm.
Despite the success of such a common strategy for both feature selection and kernel learning in clustering, there are still at least two problems that have not been properly addressed. One
problem is on the exploited clustering algorithm which
generates the intermediate clustering result. The feature or
kernel is evaluated by the intermediate clustering result; an
improper intermediate partition may lead to a poor weight-
ing. Some employed clustering algorithms in those methods
may be prone to such failure, especially when dealing with
high-dimensional data lying on manifold. The other problem
is the sparseness of the weights. Sparse weighting, i.e., a big

gap between the weights for informative and uninformative representations, as well as vanishing weights for the uninformative ones, is desirable so that the effect of irrelevant features
or kernels can be significantly mitigated. Moreover, it helps
to better understand the problem by focusing on only a few
dominant features or kernels that most contribute to the task.
To the best of our knowledge, few of those methods have
provided a principled and effective regularization on the
sparsity of weights.
In this paper, we shall propose two methods that
perform the feature selection and kernel learning within
the framework of the Local Learning-Based Clustering
(LLC) [3], respectively. The LLC algorithm tries to ensure that the cluster label of each data point is close to the one predicted by a local regression model, a common supervised learning method, trained with its neighboring points and their cluster labels [3]. Essentially, it finds the partition that best embodies such local configurations; it is thereby expected to be good at clustering data sets lying on a manifold, e.g., high-dimensional sparse data sets.
Furthermore, by utilizing the ridge regression in the
supervised learning to develop an unsupervised clustering
method, LLC has a built-in regularization for the model
complexity. In this paper, we modify such a built-in ridge
regularization in the local regression model to take into
account the relevance of each feature or kernel for
clustering. It is shown that the modified penalty term with
a constraint is equivalent to the existing sparse-promoting
penalty. Hence, it is guaranteed that the resulting weights
for features are sparse and then local configuration may get
refined; a better clustering result can thus be expected.
Moreover, the proposed feature selection method is ex-
tended from the observation space to the feature space,
naturally leading to the problem of learning a convex
combination of kernels for the local learning-based cluster-
ing. The main contributions of our work are two-fold:
1. A novel feature selection method and a kernel
learning method are proposed for local learning-
based clustering, respectively, whereas almost all of
the existing counterparts are developed for global
learning-based clustering.
2. The feature selection and kernel learning for clustering are addressed in a unified approach under the same regularization framework.
The remainder of this paper is organized as follows: Related
works are reviewed in Section 2. Section 3 gives an
overview of the LLC algorithm. We present the proposed
feature selection method in Section 4, and then extend it to
learn the combination of kernels in Section 5. Some
discussions are given in Section 6. In Section 7, extensive
experiments are conducted to show the performance of the
proposed methods on several benchmark data sets. Finally,
we draw a conclusion in Section 8.
2 RELATED WORKS
This section overviews the literature on the unsupervised
feature selection and kernel learning only. The reviews of
supervised feature selection and kernel learning can be
found in [5] and [27], respectively.
The approaches to unsupervised feature selection for
clustering can be generally categorized as the filter and
wrapper ones. The filter approaches [9], [40], [7], [8], [6] leave
out uninformative features before the clustering. They have
demonstrated great computational efficiency because they
do not involve clustering when evaluating the quality of
features. In general, such a method has to determine the
number of selected relevant features. Unfortunately, this
crucial issue has rarely been addressed in the literature,
thus causing difficulty in practical applications [13]. In
contrast, the wrapper approaches [10], [11], [12], [13] first
construct a candidate of feature subset on which its
goodness is then assessed by investigating the performance
of a specific clustering. These two steps are repeated until
convergence. In general, the wrapper approaches outperform the filter ones, but are more time-consuming because of the exhaustive search in the space of feature subsets. In the literature, some wrapper approaches, e.g., [10], [11], have utilized the greedy search (i.e., a nonexhaustive one), which, however, cannot guarantee to select all relevant
features. This shortcoming, as well as the issue of determin-
ing the number of selected relevant features in the filter
approaches, can be alleviated by assigning each feature a
nonnegative weight [12], [13] rather than a binary indicator
to indicate its relevance to the clustering. Further, the
combinatorial explosion of the search space can be avoided
as well by casting the feature selection as an estimation
problem. Our approach also follows this strategy. Based on
recent progress on spectral clustering, the algorithm in [13]
tries to optimize the cluster coherence measured by the sum
of squared eigenvalues of an affinity matrix, which is
constructed by aggregating weak affinity matrices built
with weighted feature vectors. The solutions to the
clustering and feature weighting are obtained by an
efficient iterative algorithm based on eigendecomposition.
Nevertheless, the clustering algorithm in [13] is essentially
the kernel k-means with a linear kernel, which is a global
learning method; thus it is difficult to deal with the data
that lie on a nonlinear manifold. In [12], feature weights are
estimated by modifying the M-step of the EM algorithm
through the Bayesian inference mechanism when there are
only two clusters. It is noteworthy that, in addition to
incorporating feature selection, there are several approaches
to learning parameterized similarity functions in the
spectral clustering for improving the clustering perfor-
mance [1], [28]. Despite the success in their application
domain, it is often nontrivial to interpret the physical
meaning of the parameters specified in these methods, e.g.,
the parameter associated with a feature having a negative
weight in [1], [28]. Also, the parameters specified for the
RBF kernel functions may increase the difficulty for the
optimization.
For kernel learning in clustering, some heuristic ap-
proaches [24], [28] directly learn the kernel parameters of
some specific kernels. Although some improvement can
often be achieved, an extension of the learning method to
other kernel functions is usually nontrivial [42]. In contrast,
a more effective framework, termed the multiple kernel
learning [26], learns a linear combination of base kernels
with different weights, which will be estimated
simultaneously in the inference process, e.g., see [34], [41],
[25]. Our proposed method, which will be described later,
also belongs to this framework. In [34], the algorithm tries
to find a maximum margin hyperplane to cluster data
(restricted to the binary-class case), accompanied by

learning a mixture of Laplacian matrices. The method in
[41] extends the kernel discriminant analysis technique to
clustering and learns a combination of kernel matrices
jointly. In [34], [41], no penalty is imposed on the kernel
weights; thus the sparsity may not be guaranteed. In [25],
clustering is phrased as a nonnegative matrix factorization problem of a fused kernel matrix, and the sparseness of
kernel weights is controlled by a heuristic entropy penalty
which, however, favors a uniform weighting.
An important application of the multiple kernel learning
is to fuse the information from heterogeneous sources as
follows [26]: Associate each source with a kernel function,
and then combine the set of prototype kernels generated
from these sources to perform the inference. In this respect,
the multiview clustering is also a related work whose goal is
to learn a consensus result from multiple representations
[39], [46]. However, it implicitly treats all the sources
equally, regardless of the clustering performance with each
source. In contrast, our proposed method is able to
determine the weight for each source automatically accord-
ing to its capability of discrimination; thus it will be more
robust from the practical viewpoint.
3 OVERVIEW OF THE LOCAL LEARNING-BASED CLUSTERING ALGORITHM
Given $n$ data points $\{\mathbf{x}_i\}_{i=1}^{n}$ ($\mathbf{x}_i \in \mathbb{R}^d$), the data set will be partitioned into $C$ clusters. The clustering result can be represented by a cluster assignment indicator matrix $\mathbf{P} = [p_{ic}] \in \{0,1\}^{n \times C}$ such that $p_{ic} = 1$ if $\mathbf{x}_i$ belongs to the $c$th cluster, and $p_{ic} = 0$ otherwise. The scaled cluster assignment indicator matrix used in this paper is defined as

$\mathbf{Y} = \mathbf{P}(\mathbf{P}^T\mathbf{P})^{-\frac{1}{2}} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_C],$

where $\mathbf{y}_c = [y_{1c}, \ldots, y_{nc}]^T \in \mathbb{R}^n$ ($1 \le c \le C$) is the $c$th column of $\mathbf{Y} \in \mathbb{R}^{n \times C}$. $y_{ic} = p_{ic}/\sqrt{n_c}$ can be regarded as the confidence that $\mathbf{x}_i$ is assigned to the $c$th cluster, where $n_c$ is the size of the $c$th cluster. It is easy to verify that

$\mathbf{Y}^T\mathbf{Y} = \mathbf{I}, \qquad (1)$

where $\mathbf{I}$ is an identity matrix.
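As a concrete illustration (our own toy example, not from the paper), the following minimal NumPy sketch builds the scaled indicator matrix from a hard partition and checks property (1):

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1, 0])     # a toy hard partition of n = 6 points into C = 3 clusters
n, C = labels.size, labels.max() + 1
P = np.eye(C)[labels]                     # indicator matrix: P[i, c] = 1 iff x_i is in cluster c
Y = P / np.sqrt(P.sum(axis=0))            # equals P (P^T P)^(-1/2), since P^T P = diag(n_1, ..., n_C)
assert np.allclose(Y.T @ Y, np.eye(C))    # property (1)
```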
The starting point of the LLC [3] is that the cluster assignments in the neighborhood of each point should be as smooth as possible. Specifically, it assumes that the cluster indicator value at each point should be well estimated by a regression model trained locally with its neighbors and their cluster indicator values. Suppose there exists an arbitrary $\mathbf{Y}$ at first; for each $\mathbf{x}_i$, the model is built with the training data $\{(\mathbf{x}_j, y_{jc})\}_{\mathbf{x}_j \in \mathcal{N}_i}$ ($1 \le c \le C$, $1 \le i, j \le n$), where $\mathcal{N}_i$ denotes the set of neighboring points of $\mathbf{x}_i$, but $\mathbf{x}_i$ itself is excluded. (The $k$-mutual neighbors are adopted in order to well describe the local structure, i.e., $\mathbf{x}_j$ is considered as a neighbor of $\mathbf{x}_i$ only if $\mathbf{x}_i$ is also one of the $k$-nearest neighbors of $\mathbf{x}_j$.)
The output of the local model is of the following form:

$f_i^c(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\beta}_i^c, \quad \forall \mathbf{x} \in \mathbb{R}^d, \qquad (2)$

where $\boldsymbol{\beta}_i^c \in \mathbb{R}^d$ is the local regression coefficient vector and $f_i^c(\cdot)$ denotes the local model learned with the training data $\{(\mathbf{x}_j, y_{jc})\}_{\mathbf{x}_j \in \mathcal{N}_i}$. Here, the bias term is ignored for simplicity provided that one of the features is always 1. In [3], the model is obtained by solving the following $\ell_2$-norm regularized least squares problem:

$\min_{\{\boldsymbol{\beta}_i^c\}} \sum_{c=1}^{C} \sum_{i=1}^{n} \Bigg[ \sum_{\mathbf{x}_j \in \mathcal{N}_i} \big(y_{jc} - \mathbf{x}_j^T \boldsymbol{\beta}_i^c\big)^2 + \lambda \big\|\boldsymbol{\beta}_i^c\big\|^2 \Bigg], \qquad (3)$
where $\lambda$ is a trade-off parameter. Let $\{\hat{\boldsymbol{\beta}}_i^c\}$ be the solution to the linear ridge regression problem (3); the predicted cluster assignment for the test data $\mathbf{x}_i$ can then be calculated by

$\hat{y}_{ic} = f_i^c(\mathbf{x}_i) = \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_i^c = \boldsymbol{\alpha}_i^T \mathbf{y}_i^c, \qquad (4)$

where

$\boldsymbol{\alpha}_i^T = \mathbf{x}_i^T \big(\mathbf{X}_i \mathbf{X}_i^T + \lambda \mathbf{I}\big)^{-1} \mathbf{X}_i, \qquad (5)$

$\mathbf{X}_i = [\mathbf{x}_{i_1}, \mathbf{x}_{i_2}, \ldots, \mathbf{x}_{i_{n_i}}]$ with $\mathbf{x}_{i_k}$ being the $k$th neighbor of $\mathbf{x}_i$, $\mathbf{y}_i^c = [y_{i_1 c}, y_{i_2 c}, \ldots, y_{i_{n_i} c}]^T$, and $n_i$ is the size of $\mathcal{N}_i$.
After all of the local predictors have been constructed, the LLC combines them together so that an optimal cluster indicator matrix $\mathbf{Y}$ is found via minimizing the following overall prediction error:

$\sum_{c=1}^{C} \sum_{i=1}^{n} (y_{ic} - \hat{y}_{ic})^2 = \sum_{c=1}^{C} \|\mathbf{y}_c - \mathbf{A}\mathbf{y}_c\|^2 = \mathrm{trace}\big[\mathbf{Y}^T(\mathbf{I} - \mathbf{A})^T(\mathbf{I} - \mathbf{A})\mathbf{Y}\big] = \mathrm{trace}\big(\mathbf{Y}^T \mathbf{M}\mathbf{Y}\big), \qquad (6)$

where $\mathbf{M} = (\mathbf{I} - \mathbf{A})^T(\mathbf{I} - \mathbf{A})$ and $\mathbf{A}$ is an $n \times n$ sparse matrix whose $(i,j)$th entry $a_{ij}$ is the corresponding element of $\boldsymbol{\alpha}_i$ given by (5) if $\mathbf{x}_j \in \mathcal{N}_i$, and 0 otherwise.
As in the spectral clustering [14], [15], Y is relaxed into
the continuous domain while keeping the property of (1) for
the problem (6). The LLC then solves the following tractable
continuous optimization problem:
$\min_{\mathbf{Y} \in \mathbb{R}^{n \times C}} \mathrm{trace}\big(\mathbf{Y}^T \mathbf{M}\mathbf{Y}\big) \quad \text{s.t.} \quad \mathbf{Y}^T\mathbf{Y} = \mathbf{I}. \qquad (7)$
A solution to Y is given by the first C eigenvectors of the
matrix M corresponding to the first C smallest eigenvalues.
Similarly to [14], [15], the final partition result is obtained by
discretizing Y via the method in [15] or by the k-means as
in [14]. Promising results have been reported in [3].
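To make the pipeline above concrete, here is a hedged NumPy/SciPy sketch of the basic LLC step as reconstructed from (5)-(7). The helper name `llc_embedding`, the mutual-kNN construction, and the parameter defaults are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

def llc_embedding(X, C, k=5, lam=0.1):
    """Sketch of basic LLC: build A from the local ridge predictors (5),
    form M = (I - A)^T (I - A) as in (6), and return the C eigenvectors of M
    with the smallest eigenvalues as the relaxed indicator matrix Y of (7)."""
    n, d = X.shape
    # mutual k-nearest neighbours under the plain Euclidean metric
    # (assumes no duplicate points, so index 0 of each sorted row is the point itself)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(D2, axis=1)[:, 1:k + 1]
    mutual = [[j for j in knn[i] if i in knn[j]] for i in range(n)]

    A = np.zeros((n, n))
    for i, Ni in enumerate(mutual):
        if not Ni:
            continue
        Xi = X[Ni].T                                                        # d x n_i, neighbours of x_i
        alpha = X[i] @ np.linalg.solve(Xi @ Xi.T + lam * np.eye(d), Xi)     # eq. (5)
        A[i, Ni] = alpha
    M = (np.eye(n) - A).T @ (np.eye(n) - A)
    _, Y = eigh(M, subset_by_index=[0, C - 1])   # C eigenvectors with smallest eigenvalues
    return Y                                     # discretize with k-means to obtain the final partition
```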
4 FEATURE SELECTION FOR LOCAL LEARNING-BASED CLUSTERING
In this section, we will integrate the feature selection into
the LLC. It should be noted that the key ingredient of the
LLC is to learn the local regression model, which is trained
only with the points in each neighborhood. However, there
may be too few data points in its neighborhood to learn a
good predictor. This can be even more difficult for a high-
dimensional data set. Furthermore, it may lead to non-
smooth predictions for points from overlapping zones of
adjacent neighborhoods as the result of independently

training the local regression model in each neighborhood.
Last but not least, the $\ell_2$-norm penalty in ridge regression is known to be less robust to the irrelevant features. In order
to overcome these limitations, a more effective training
method which can reduce the complexity of the local
regression model in each neighborhood and enforce
smoothness among the local regressors is required. Inspired
by recent works on multitask learning [16], [47] which
extract a shared representation for a group of related
training tasks, demonstrating an improved performance
compared to learning each task independently, we propose
to select a small subset of features that is good for all the
local models.
To this end, we introduce a binary feature selection vector $\boldsymbol{\tau} = [\tau_1, \tau_2, \ldots, \tau_d]^T$, $\tau_l \in \{0, 1\}$, to the local discriminant function as follows:

$f_i^c(\mathbf{x}) = \mathbf{x}^T \mathrm{diag}(\sqrt{\boldsymbol{\tau}})\, \boldsymbol{\beta}_i^c + b_i^c = \sum_{l=1}^{d} x_l \sqrt{\tau_l}\, (\boldsymbol{\beta}_i^c)_l + b_i^c, \qquad (8)$

where $\mathrm{diag}(\sqrt{\boldsymbol{\tau}}) \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $\sqrt{\boldsymbol{\tau}} \in \mathbb{R}^d$ on the diagonal, $(\boldsymbol{\beta}_i^c)_l$ is the $l$th element of $\boldsymbol{\beta}_i^c \in \mathbb{R}^d$, and $b_i^c \in \mathbb{R}$ is the bias term. In (8), the entries of $\boldsymbol{\beta}_i^c$ can be turned on and off depending on the corresponding entries of the switch variable $\boldsymbol{\tau}$. To avoid a combinatorial search for $\boldsymbol{\tau}$ later, we relax the constraint $\tau_l \in \{0, 1\}$ to $\tau_l \ge 0$ and further restrict its scale by $\sum_{l=1}^{d} \tau_l = 1$. (As will be seen later, such a simplex constraint is crucial for enforcing the sparsity of $\boldsymbol{\tau}$. Moreover, we simply set $\sum_{l=1}^{d} \tau_l = 1$ rather than $\sum_{l=1}^{d} \tau_l = s$, where $s$ is a tunable constant, in order to reduce the number of free parameters.)
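The reason the simplex constraint promotes sparsity is the classical variational identity $\min_{\tau_l \ge 0,\, \sum_l \tau_l = 1} \sum_l w_l^2/\tau_l = (\sum_l |w_l|)^2$, attained at $\tau_l \propto |w_l|$ (a Cauchy-Schwarz argument); this is the sense in which the abstract states the weighted regularization with this constraint is equivalent to a known sparse-promoting penalty. A small numeric check of the identity (our own illustration, not code from the paper):

```python
import numpy as np
rng = np.random.default_rng(4)
w = rng.normal(size=6)

def weighted_penalty(w, tau):
    # the weighted ridge penalty sum_l w_l^2 / tau_l
    return np.sum(w ** 2 / tau)

# Minimizer over the simplex: tau_l proportional to |w_l|
tau_star = np.abs(w) / np.abs(w).sum()
assert np.isclose(weighted_penalty(w, tau_star), np.abs(w).sum() ** 2)

# Any other feasible tau can only do worse, so the infimum is the squared l1 norm
for _ in range(1000):
    tau = rng.dirichlet(np.ones(w.size))
    assert weighted_penalty(w, tau) >= np.abs(w).sum() ** 2 - 1e-9
```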
Consequently, the local discriminant function will be solved by

$\min_{\{\boldsymbol{\beta}_i^c, b_i^c\},\; \sum_{l=1}^{d}\tau_l = 1,\; \tau_l \ge 0} \;\sum_{c=1}^{C} \sum_{i=1}^{n} \Bigg[ \sum_{\mathbf{x}_j \in \mathcal{N}_i} \Big(y_{jc} - \mathbf{x}_j^T \mathrm{diag}(\sqrt{\boldsymbol{\tau}})\, \boldsymbol{\beta}_i^c - b_i^c\Big)^2 + \lambda\, \boldsymbol{\beta}_i^{cT} \boldsymbol{\beta}_i^c \Bigg], \qquad (9)$

or equivalently, the following problem:

$\min_{\{\mathbf{w}_i^c, b_i^c\},\; \sum_{l=1}^{d}\tau_l = 1,\; \tau_l \ge 0} \;\sum_{c=1}^{C} \sum_{i=1}^{n} \Bigg[ \sum_{\mathbf{x}_j \in \mathcal{N}_i} \Big(y_{jc} - \mathbf{x}_j^T \mathbf{w}_i^c - b_i^c\Big)^2 + \lambda\, \mathbf{w}_i^{cT} \mathrm{diag}(\boldsymbol{\tau}^{-1})\, \mathbf{w}_i^c \Bigg], \qquad (10)$
which is obtained by applying a change of variables $\mathrm{diag}(\sqrt{\boldsymbol{\tau}})\, \boldsymbol{\beta}_i^c \to \mathbf{w}_i^c$. The local model is now tantamount to being of the following form:

$f_i^c(\mathbf{x}) = \mathbf{x}^T \mathbf{w}_i^c + b_i^c, \qquad (11)$

and the regression coefficient $\mathbf{w}_i^c$ is now regularized with a weighted $\ell_2$ norm: $\mathbf{w}_i^{cT}\mathrm{diag}(\boldsymbol{\tau}^{-1})\mathbf{w}_i^c = \sum_{l} (\mathbf{w}_i^c)_l^2 / \tau_l$, i.e., the second term in the square bracket of (10). Thus, a small value for $\tau_l$, which is expected to be associated with an irrelevant feature, will result in a large penalization on $(\mathbf{w}_i^c)_l$ by this weighted norm. Furthermore, in the extreme case of $\tau_l = 0$, we will prove later that it leads to $(\mathbf{w}_i^c)_l = 0$, $\forall i, c$. (In this paper, we use the convention that $\frac{z}{0} = 0$ if $z = 0$ and $\infty$ otherwise.) That is, the $l$th feature will be completely eliminated from the prediction; thus an improved clustering result can be expected. Subsequently, to perform the feature selection together with the LLC, we develop an alternating update algorithm to estimate the clustering captured in $\mathbf{Y}$ and the feature weight $\boldsymbol{\tau}$ as follows:
4.1 Update Y as τ Is Given
First, the nearest neighbors $\mathcal{N}_i$ should be refound according to the $\boldsymbol{\tau}$-weighted squared Euclidean distance, i.e.,

$d_{\boldsymbol{\tau}}(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_{\boldsymbol{\tau}}^2 = \sum_{l=1}^{d} \tau_l \big(x_1^{(l)} - x_2^{(l)}\big)^2. \qquad (12)$
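A minimal sketch of this neighborhood-search step (illustrative helper names, not the authors' code):

```python
import numpy as np

def weighted_sq_dist(X, tau):
    """Pairwise tau-weighted squared Euclidean distances, eq. (12)."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2 * tau).sum(-1)

def mutual_knn(D2, k):
    """k-mutual neighbours: j is kept for i only if i is also among j's k nearest."""
    knn = np.argsort(D2, axis=1)[:, 1:k + 1]
    return [[j for j in knn[i] if i in knn[j]] for i in range(D2.shape[0])]
```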
With the fixed feature weight $\boldsymbol{\tau}$, the analytic solution for problem (10) can then be easily obtained by setting the derivatives to zero. That is,

$\mathbf{w}_i^c = \Big(\mathbf{X}_i \boldsymbol{\Pi}_i \mathbf{X}_i^T + \lambda\, \mathrm{diag}(\boldsymbol{\tau}^{-1})\Big)^{-1} \mathbf{X}_i \boldsymbol{\Pi}_i \mathbf{y}_i^c, \qquad (13)$

$b_i^c = \frac{1}{n_i}\, \mathbf{e}_i^T \big(\mathbf{y}_i^c - \mathbf{X}_i^T \mathbf{w}_i^c\big), \qquad (14)$

where $\mathbf{e}_i = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n_i}$, $\boldsymbol{\Pi}_i = \mathbf{I}_i - \frac{1}{n_i}\mathbf{e}_i\mathbf{e}_i^T$ is a centering projection matrix satisfying $\boldsymbol{\Pi}_i\boldsymbol{\Pi}_i = \boldsymbol{\Pi}_i$, and $\mathbf{I}_i$ is an $n_i \times n_i$ identity matrix.
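The closed forms (13) and (14) can be sanity-checked numerically: at the returned pair $(\mathbf{w}_i^c, b_i^c)$, the gradient of the local objective in (10) should vanish. A small sketch with synthetic data (sizes and seeds are arbitrary; this is our own verification, not code from the paper):

```python
import numpy as np
rng = np.random.default_rng(0)
d, n_i, lam = 5, 8, 0.1
X = rng.normal(size=(d, n_i))            # columns are the n_i neighbours of x_i
y = rng.normal(size=n_i)                 # y_i^c restricted to the neighbourhood
tau = rng.dirichlet(np.ones(d))          # feature weights on the simplex
e = np.ones(n_i)
Pi = np.eye(n_i) - np.outer(e, e) / n_i  # centering projection, Pi @ Pi == Pi

# closed forms (13)-(14)
w = np.linalg.solve(X @ Pi @ X.T + lam * np.diag(1.0 / tau), X @ Pi @ y)
b = (e @ (y - X.T @ w)) / n_i

# gradient of sum_j (y_j - x_j^T w - b)^2 + lam * w^T diag(1/tau) w should be ~0
r = y - X.T @ w - b
grad_w = -2 * X @ r + 2 * lam * (w / tau)
grad_b = -2 * r.sum()
assert np.allclose(grad_w, 0, atol=1e-8) and np.isclose(grad_b, 0, atol=1e-8)
```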
For high-dimensional data, the computation of the matrix inversion in (13) will be quite time-consuming because the time complexity is $O(d^3)$. Fortunately, by applying Woodbury's matrix inversion lemma, we can get

$\mathbf{w}_i^c = \frac{1}{\lambda}\, \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i \boldsymbol{\Pi}_i \Big[\mathbf{I}_i - \big(\lambda \mathbf{I}_i + \boldsymbol{\Pi}_i \mathbf{X}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i \boldsymbol{\Pi}_i\big)^{-1} \boldsymbol{\Pi}_i \mathbf{X}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i \boldsymbol{\Pi}_i\Big] \mathbf{y}_i^c, \qquad (15)$

in which the time complexity of the matrix inversion in (15) is only $O(n_i^3)$. In general, we often have $n_i \ll d$; thus the computational cost can be considerably reduced. Besides, from (15), it can be seen that $(\mathbf{w}_i^c)_l$ ($\forall i, c$) goes to 0 as the feature weight $\tau_l$ vanishes.
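A quick numeric check that the Woodbury form (15), as reconstructed above, agrees with the direct solution (13) while inverting only an $n_i \times n_i$ matrix (synthetic data, illustrative sizes; the weights are kept away from zero so the comparison is well conditioned):

```python
import numpy as np
rng = np.random.default_rng(1)
d, n_i, lam = 500, 10, 0.1               # a case with n_i << d, where (15) pays off
X = rng.normal(size=(d, n_i))
y = rng.normal(size=n_i)
tau = rng.uniform(0.5, 1.5, size=d); tau /= tau.sum()
D = np.diag(tau)
e = np.ones(n_i); Pi = np.eye(n_i) - np.outer(e, e) / n_i

# (13): requires a d x d inverse, O(d^3)
w_direct = np.linalg.solve(X @ Pi @ X.T + lam * np.diag(1.0 / tau), X @ Pi @ y)

# (15): only an n_i x n_i inverse, O(n_i^3)
B = Pi @ X.T @ D @ X @ Pi
w_woodbury = (D @ X @ Pi) @ (np.eye(n_i) - np.linalg.solve(lam * np.eye(n_i) + B, B)) @ y / lam

assert np.allclose(w_direct, w_woodbury)
```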
Subsequently, the predicted cluster assignment confidence for $\mathbf{x}_i$ will be obtained as follows:

$\hat{y}_{ic} = \mathbf{x}_i^T \mathbf{w}_i^c + b_i^c = \boldsymbol{\alpha}_i^T \mathbf{y}_i^c, \qquad (16)$

with

$\boldsymbol{\alpha}_i^T = \frac{1}{\lambda}\Big(\mathbf{k}_i - \frac{1}{n_i}\mathbf{e}_i^T \mathbf{K}_i\Big) \boldsymbol{\Pi}_i \Big[\mathbf{I}_i - \big(\lambda \mathbf{I}_i + \boldsymbol{\Pi}_i \mathbf{K}_i \boldsymbol{\Pi}_i\big)^{-1} \boldsymbol{\Pi}_i \mathbf{K}_i \boldsymbol{\Pi}_i\Big] + \frac{1}{n_i}\mathbf{e}_i^T, \qquad (17)$

where $\mathbf{k}_i = \mathbf{x}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i$ and $\mathbf{K}_i = \mathbf{X}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i$.
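As a consistency check of (16) and (17) under the reconstruction above: the prediction produced through $\boldsymbol{\alpha}_i$, which only touches the weighted inner products $\mathbf{k}_i$ and $\mathbf{K}_i$, should coincide with evaluating the explicit local model from (13) and (14). A small sketch (our own verification, not from the paper):

```python
import numpy as np
rng = np.random.default_rng(2)
d, n_i, lam = 6, 7, 0.1
x_i = rng.normal(size=d)                  # the point whose label is predicted
X = rng.normal(size=(d, n_i))             # its neighbours (x_i itself excluded)
y = rng.normal(size=n_i)
tau = rng.dirichlet(np.ones(d)); D = np.diag(tau)
e = np.ones(n_i); Pi = np.eye(n_i) - np.outer(e, e) / n_i

# prediction via the explicit local model (13), (14), (16)
w = np.linalg.solve(X @ Pi @ X.T + lam * np.diag(1.0 / tau), X @ Pi @ y)
b = (e @ (y - X.T @ w)) / n_i
pred_model = x_i @ w + b

# prediction via alpha_i of (17), using only k_i and K_i
k = x_i @ D @ X
K = X.T @ D @ X
B = Pi @ K @ Pi
alpha = (k - e @ K / n_i) @ Pi @ (np.eye(n_i) - np.linalg.solve(lam * np.eye(n_i) + B, B)) / lam + e / n_i
assert np.isclose(pred_model, alpha @ y)
```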
As in the LLC, we construct the key matrix M by (17)
and (6). To solve the same optimization problem in (7), the
columns of Y are simply set at the first C eigenvectors of M
corresponding to the smallest C eigenvalues.

4.2 Update τ as Y Is Given
With the fixed $\mathbf{Y}$ and the neighborhood determined at each point, a reasonable $\boldsymbol{\tau}$ is the one that can lead to a better local regression model, which is characterized by a lower objective value at the minimum of (10). We will apply this criterion to reestimate $\boldsymbol{\tau}$. We remove the bias term by plugging (14) into (10), and we then have

$\min_{\{\mathbf{w}_i^c\}} F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big), \quad F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big) = \sum_{c=1}^{C} \sum_{i=1}^{n} \Big[ \big\| \boldsymbol{\Pi}_i \mathbf{y}_i^c - (\mathbf{X}_i \boldsymbol{\Pi}_i)^T \mathbf{w}_i^c \big\|^2 + \lambda\, \mathbf{w}_i^{cT} \mathrm{diag}(\boldsymbol{\tau}^{-1})\, \mathbf{w}_i^c \Big]. \qquad (18)$
Subsequently, the estimation of $\boldsymbol{\tau}$ is reformulated as follows:

$\min_{\boldsymbol{\tau}} P(\boldsymbol{\tau}), \quad \text{s.t.} \quad \sum_{l=1}^{d} \tau_l = 1, \; \tau_l \ge 0, \; \forall l, \qquad (19)$

where $P(\boldsymbol{\tau}) = F\big(\{\mathbf{w}_i^{c*}\}, \boldsymbol{\tau}\big)$ with $\{\mathbf{w}_i^{c*}\} = \arg\min_{\{\mathbf{w}_i^c\}} F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big)$ given in (15). Hence, the Lagrangian of (19) is

$L(\boldsymbol{\tau}, \gamma, \boldsymbol{\varepsilon}) = P(\boldsymbol{\tau}) + \gamma\Big(\sum_{l=1}^{d} \tau_l - 1\Big) - \sum_{l=1}^{d} \varepsilon_l \tau_l, \qquad (20)$

where the scalar $\gamma \ge 0$ and the vector $\boldsymbol{\varepsilon} \ge \mathbf{0}$ are Lagrangian multipliers. The derivative of $L$ with respect to $\tau_l$ ($l = 1, \ldots, d$) is computed as

$\frac{\partial L}{\partial \tau_l} = \frac{\partial P}{\partial \tau_l} + \gamma - \varepsilon_l, \qquad (21)$
where

$\frac{\partial P}{\partial \tau_l} = \frac{\partial F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big)}{\partial \tau_l}\bigg|_{\mathbf{w}_i^c = \mathbf{w}_i^{c*}} + \underbrace{\sum_{i,c} \frac{\partial (\mathbf{w}_i^c)_l}{\partial \tau_l}\, \frac{\partial F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big)}{\partial (\mathbf{w}_i^c)_l}\bigg|_{\mathbf{w}_i^c = \mathbf{w}_i^{c*}}}_{0} = -\lambda\, \frac{\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2}{\tau_l^2}. \qquad (22)$
Thus, at the optimality, we have

$\tau_l^2 = \frac{\lambda \sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2}{\gamma - \varepsilon_l}, \quad \forall l, \qquad (23)$

$\gamma \ge 0, \quad \varepsilon_l \ge 0, \quad \tau_l \ge 0, \quad \forall l, \qquad (24)$

$\sum_{l=1}^{d} \tau_l = 1, \qquad (25)$

$\varepsilon_l \tau_l = 0, \quad \forall l. \qquad (26)$
By using the Karush-Kuhn-Tucker (KKT) condition [31], i.e., (26), it is easy to verify the following two cases:

. Case 1: $\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2 = 0 \;\Rightarrow\; \tau_l = 0$;

. Case 2: $\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2 > 0 \;\Rightarrow\; \varepsilon_l = 0$ and $\tau_l = \sqrt{\lambda \sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2} \Big/ \sqrt{\gamma}.$
Together with (25), it follows that the optimal solution of $\boldsymbol{\tau}$ can be calculated in a closed form:

$\tau_l = \frac{\sqrt{\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2}}{\sum_{m=1}^{d} \sqrt{\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_m^2}}. \qquad (27)$
The intuitive interpretation of (27) is as follows: The $l$th feature weight $\tau_l$ is determined by the magnitude of the $l$th element in the regression coefficients for all of the clusters, which are solved locally at each point. If this element of the regression coefficients has negligible magnitude for all the clusters at each point, it is likely that the corresponding feature is unimportant when predicting the confidence of which cluster this point belongs to.
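In practice, the update (27) is a one-liner; the following sketch (the helper `update_tau` and the array layout are our own illustrative choices) also exhibits the sparsity behavior of Case 1, where a feature unused by every local model receives exactly zero weight:

```python
import numpy as np

def update_tau(W):
    """Closed-form update (27).  W has shape (C, n, d): W[c, i] is w_i^c."""
    col_norm = np.sqrt((W ** 2).sum(axis=(0, 1)))   # sqrt(sum_c sum_i (w_i^c)_l^2), one value per feature l
    return col_norm / col_norm.sum()

W = np.random.default_rng(3).normal(size=(3, 50, 10))
W[:, :, 7] = 0.0                                    # a feature never used by any local model ...
tau = update_tau(W)
assert np.isclose(tau.sum(), 1.0) and tau[7] == 0.0 # ... receives exactly zero weight
```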
4.3 The Complete Algorithm
The complete local learning-based clustering algorithm with feature selection (denoted as LLC-fs) is described in Algorithm 1. The loop stops when the relative variation of the trace value in (7) between two consecutive iterations falls below a threshold (we set it at $10^{-2}$ in this paper), indicating that the partitioning has almost stabilized. After convergence, $\mathbf{Y}$ is discretized to obtain the final clustering result with the k-means as in [14].
Algorithm 1. Feature selection for local learning-based clustering algorithm (LLC-fs).
Input: $\{\mathbf{x}_i\}_{i=1}^{n}$, size of the neighborhood $k$, trade-off parameter $\lambda$
Output: $\mathbf{Y}$, $\boldsymbol{\tau}$
1 Initialize $\tau_l = \frac{1}{d}$, for $l = 1, \ldots, d$;
2 while not converged do
3   Find $k$-mutual neighbors for $\{\mathbf{x}_i\}_{i=1}^{n}$, using the metric defined in (12);
4   Construct the matrix $\mathbf{M}$ in (6) with $\boldsymbol{\alpha}_i$ given in (17), and then solve the problem (7) to obtain $\mathbf{Y}$;
5   Compute $\mathbf{w}_i^c$, $\forall i, c$, by (15) and update $\boldsymbol{\tau}$ using (27);
6 end
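Putting the pieces together, a compact sketch of Algorithm 1 under the reconstructions above might look as follows. This is an illustrative reading of the pseudocode (function and variable names are ours, and nothing beyond the closed forms (15), (17), and (27) is taken from the paper), not the authors' released implementation:

```python
import numpy as np
from scipy.linalg import eigh

def llc_fs(X, C, k=5, lam=0.1, tol=1e-2, max_iter=30):
    """Sketch of LLC-fs: alternate between the relaxed indicator Y and the feature weights tau."""
    n, d = X.shape
    tau = np.full(d, 1.0 / d)                       # step 1: uniform initialization
    prev_trace = None
    for _ in range(max_iter):
        # step 3: k-mutual neighbours under the tau-weighted metric (12)
        D2 = (((X[:, None, :] - X[None, :, :]) ** 2) * tau).sum(-1)
        knn = np.argsort(D2, axis=1)[:, 1:k + 1]
        neigh = [[j for j in knn[i] if i in knn[j]] for i in range(n)]

        # step 4: build A from alpha_i in (17), then M in (6), then Y from (7)
        A = np.zeros((n, n))
        for i, Ni in enumerate(neigh):
            if not Ni:
                continue
            Xi = X[Ni].T
            ni = len(Ni)
            e = np.ones(ni)
            Pi = np.eye(ni) - np.outer(e, e) / ni
            Ki = Xi.T @ (tau[:, None] * Xi)         # X_i^T diag(tau) X_i
            ki = (X[i] * tau) @ Xi                  # x_i^T diag(tau) X_i
            B = Pi @ Ki @ Pi
            inner = np.eye(ni) - np.linalg.solve(lam * np.eye(ni) + B, B)
            A[i, Ni] = (ki - e @ Ki / ni) @ Pi @ inner / lam + e / ni
        M = (np.eye(n) - A).T @ (np.eye(n) - A)
        vals, Y = eigh(M, subset_by_index=[0, C - 1])

        # step 5: recompute local coefficients by (15) and update tau by (27)
        sq_sum = np.zeros(d)
        for i, Ni in enumerate(neigh):
            if not Ni:
                continue
            Xi = X[Ni].T
            ni = len(Ni)
            e = np.ones(ni)
            Pi = np.eye(ni) - np.outer(e, e) / ni
            Ki = Xi.T @ (tau[:, None] * Xi)
            B = Pi @ Ki @ Pi
            S = (tau[:, None] * Xi) @ Pi @ (np.eye(ni) - np.linalg.solve(lam * np.eye(ni) + B, B)) / lam
            W = S @ Y[Ni]                           # d x C: the C vectors w_i^c stacked as columns
            sq_sum += (W ** 2).sum(axis=1)
        tau = np.sqrt(sq_sum) / np.sqrt(sq_sum).sum()

        # stopping rule: relative variation of trace(Y^T M Y) between iterations
        trace = vals.sum()
        if prev_trace is not None and abs(prev_trace - trace) <= tol * max(abs(prev_trace), 1e-12):
            break
        prev_trace = trace
    return Y, tau                                   # discretize Y with k-means for the final labels
```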
5 MULTIPLE KERNEL LEARNING FOR LOCAL LEARNING-BASED CLUSTERING
To deal with some complex data sets, the LLC algorithm
can be kernelized as in [3] by replacing the linear ridge
regression with the kernel ridge regression. Under such
circumstances, selecting a suitable kernel function will
become a crucial issue. In this section, we extend the
method presented in Section 4 to learn a proper linear
combination of several precomputed kernel matrices under
the multiple kernel learning framework [26].
In the kernel methods, the symmetric positive semidefinite kernel function $\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implicitly maps the original input space into a high-dimensional (possibly infinite-dimensional) Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, which is equipped with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, via a nonlinear mapping $\phi: \mathcal{X} \to \mathcal{H}$, i.e., $\mathcal{K}(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_{\mathcal{H}}$. Suppose there are altogether $L$ different kernel functions $\{\mathcal{K}^{(l)}\}_{l=1}^{L}$ available for the clustering task in hand. Accordingly, there are $L$ different associated feature spaces,
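Although the excerpt is cut off here, the multiple kernel learning setup it introduces combines precomputed base Gram matrices with nonnegative weights on the simplex. A minimal sketch of forming such a convex combination (the weight vector, here called `mu`, is the quantity the method would learn; the base kernels and values below are placeholders of our own):

```python
import numpy as np

def combined_kernel(kernels, mu):
    """Convex combination sum_l mu_l K^(l) of precomputed n x n PSD Gram matrices."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)
    return sum(m * K for m, K in zip(mu, kernels))

# e.g., an RBF and a linear kernel on the same data, fused with weights (0.7, 0.3)
X = np.random.default_rng(5).normal(size=(20, 4))
D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
kernels = [np.exp(-D2 / (2 * np.median(D2))), X @ X.T]
K = combined_kernel(kernels, [0.7, 0.3])
```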
1536 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 8, AUGUST 2011

References
. I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Machine Learning Research, 2003.
. T.R. Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, 1999.
. A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems, 2001.
. B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
. M. Yuan and Y. Lin, "Model Selection and Estimation in Regression with Grouped Variables," J. Royal Statistical Soc. B, 2006.