
Feature Selection and Kernel Learning for
Local Learning-Based Clustering
Hong Zeng, Member, IEEE, and Yiu-ming Cheung, Senior Member, IEEE

H. Zeng is with the School of Instrument Science and Engineering, Southeast University, China, and the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. E-mail: littlezenghong@gmail.com.
Y.M. Cheung is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. E-mail: ymc@comp.hkbu.edu.hk.
Manuscript received 17 Feb. 2009; revised 12 Dec. 2009; accepted 23 Oct. 2010; published online 29 Nov. 2010. Recommended for acceptance by M. Meila. Digital Object Identifier no. 10.1109/TPAMI.2010.215.
Abstract—The performance of most clustering algorithms relies heavily on the representation of the data in the input space or in the Hilbert space of kernel methods. This paper aims to obtain an appropriate data representation through feature selection or kernel learning within the framework of the Local Learning-Based Clustering (LLC) (Wu and Schölkopf 2006) method, which can outperform the global learning-based ones when dealing with high-dimensional data lying on a manifold. Specifically, we associate a weight with each feature or kernel and incorporate it into the built-in regularization of the LLC algorithm to take into account the relevance of each feature or kernel for the clustering. Accordingly, the weights are estimated iteratively in the clustering process. We show that the resulting weighted regularization with an additional constraint on the weights is equivalent to a known sparse-promoting penalty. Hence, the weights of irrelevant features or kernels can be shrunk toward zero. Extensive experiments show the efficacy of the proposed methods on the benchmark data sets.
Index Terms—High-dimensional data, local learning-based clustering, feature selection, kernel learning, sparse weighting.
1 INTRODUCTION
It is common to perform high-dimensional data clustering
in a variety of pattern recognition and data mining
problems in which the high-dimensional data are repre-
sented by a large number of features. However, the
discrimination among patterns is often impeded by the
abundance of features. For instance, it is quite common to
have thousands of gene expression coefficients as features
for a single sample in genomic data analysis, but only a
small fraction is capable of discriminating among different
tissue classes. Those irrelevant features involved in the
prediction may seriously degrade the performance of an
inference machine [13]. Therefore, it is desirable to develop
an effective feature selection algorithm toward identifying
those features relevant to the inference task in hand.
On the other hand, the kernel methods have been widely
applied to a variety of learning problems in the past
decades, where the data are implicitly mapped into a
nonlinear high-dimensional space by kernel function [30]. It
is known that the performance of these methods will
heavily hinge on the choice of kernel. Unfortunately, the
most suitable kernel for a particular task is often unknown
in advance. Moreover, exhaustive search on a user-defined
pool of kernels will be quite time-consuming when the size
of the pool becomes large [29]. Hence, it is crucial to learn
an appropriate kernel efficiently to make the performance
of the employed kernel-based inference method robust or
even improved.
This paper attempts to obtain an appropriate data representation for clustering in the input space or the Hilbert space (also interchangeably called feature space hereinafter) of kernel methods. Accordingly, two issues, i.e., feature selection and kernel learning, are considered. In fact, both of these issues have been extensively studied in the context of supervised learning, but are comparatively
less explored in the clustering problem. A major reason is
that feature selection or kernel learning in unsupervised
learning becomes more challenging without the presence of
ground-truth class labels that could guide the search for
relevant representations. Most recently, some research
works regarding these two issues have been done in the
unsupervised case, e.g., see [43], [12], [13], [25], [34]. A
predominant strategy among these approaches, which have achieved notably improved clustering performance, is to first relax the binary hard decision on the relevance of a feature
or kernel to a real-valued soft one, i.e., a confidence or
weight, turning the combinatorial search problem into a
continuous learning problem. Then, these approaches apply
the following two iterative steps until convergence: 1) esti-
mating the weights for features or kernels using the
intermediate clustering result, and 2) refeeding the weighted
feature or kernel into the employed clustering algorithm.
Despite the success of such a common strategy for both feature selection and kernel learning in clustering, there are still at least two problems that have not been properly addressed. One
problem is on the exploited clustering algorithm which
generates the intermediate clustering result. The feature or
kernel is evaluated by the intermediate clustering result; an
improper intermediate partition may lead to a poor weight-
ing. Some employed clustering algorithms in those methods
may be prone to such failure, especially when dealing with
high-dimensional data lying on manifold. The other problem
is the sparseness of the weights. Sparse weighting, i.e., a big

gap between the weights for informative and uninformative representations, as well as vanishing weights for the uninformative ones, is desirable so that the effect of irrelevant features
or kernels can be significantly mitigated. Moreover, it helps
to better understand the problem by focusing on only a few
dominant features or kernels that most contribute to the task.
To the best of our knowledge, few of those methods have
provided a principled and effective regularization on the
sparsity of weights.
In this paper, we shall propose two methods that
perform the feature selection and kernel learning within
the framework of the Local Learning-Based Clustering
(LLC) [3], respectively. The LLC algorithm tries to ensure that the cluster label of each data point is close to the one predicted by a local regression model, a common supervised learning method, trained with its neighboring points and their cluster labels [3]. Essentially, it finds the partition that best embodies such local configurations; it is thereby expected to be good at clustering data sets lying on a manifold, e.g., high-dimensional sparse data sets.
Furthermore, by utilizing the ridge regression in the
supervised learning to develop an unsupervised clustering
method, LLC has a built-in regularization for the model
complexity. In this paper, we modify such a built-in ridge
regularization in the local regression model to take into
account the relevance of each feature or kernel for
clustering. It is shown that the modified penalty term with
a constraint is equivalent to the existing sparse-promoting
penalty. Hence, it is guaranteed that the resulting weights
for features are sparse and then local configuration may get
refined; a better clustering result can thus be expected.
Moreover, the proposed feature selection method is ex-
tended from the observation space to the feature space,
naturally leading to the problem of learning a convex
combination of kernels for the local learning-based cluster-
ing. The main contributions of our work are two-fold:
1. A novel feature selection method and a kernel
learning method are proposed for local learning-
based clustering, respectively, whereas almost all of
the existing counterparts are developed for global
learning-based clustering.
2. The feature selection and kernel learning for clustering are addressed in a unified approach under the same regularization framework.
The remainder of this paper is organized as follows: Related
works are reviewed in Section 2. Section 3 gives an
overview of the LLC algorithm. We present the proposed
feature selection method in Section 4, and then extend it to
learn the combination of kernels in Section 5. Some
discussions are given in Section 6. In Section 7, extensive
experiments are conducted to show the performance of the
proposed methods on several benchmark data sets. Finally,
we draw a conclusion in Section 8.
2 RELATED WORKS
This section overviews the literature on the unsupervised
feature selection and kernel learning only. The reviews of
supervised feature selection and kernel learning can be
found in [5] and [27], respectively.
The approaches to unsupervised feature selection for
clustering can be generally categorized as the filter and
wrapper ones. The filter approaches [9], [40], [7], [8], [6] leave
out uninformative features before the clustering. They have
demonstrated great computational efficiency because they
do not involve clustering when evaluating the quality of
features. In general, such a method has to determine the
number of selected relevant features. Unfortunately, this
crucial issue has rarely been addressed in the literature,
thus causing difficulty in practical applications [13]. In
contrast, the wrapper approaches [10], [11], [12], [13] first
construct a candidate of feature subset on which its
goodness is then assessed by investigating the performance
of a specific clustering. These two steps are repeated until
convergence. In general, the wrapper approaches outperform the filter ones, but are more time-consuming because of the exhaustive search in the space of feature subsets. In the literature, some wrapper approaches, e.g., [10], [11], have utilized the greedy search (i.e., a nonexhaustive one), which, however, cannot guarantee to select all relevant
features. This shortcoming, as well as the issue of determin-
ing the number of selected relevant features in the filter
approaches, can be alleviated by assigning each feature a
nonnegative weight [12], [13] rather than a binary indicator
to indicate its relevance to the clustering. Further, the
combinatorial explosion of the search space can be avoided
as well by casting the feature selection as an estimation
problem. Our approach also follows this strategy. Based on
recent progress on spectral clustering, the algorithm in [13]
tries to optimize the cluster coherence measured by the sum
of squared eigenvalues of an affinity matrix, which is
constructed by aggregating weak affinity matrices built
with weighted feature vectors. The solutions to the
clustering and feature weighting are obtained by an
efficient iterative algorithm based on eigendecomposition.
Nevertheless, the clustering algorithm in [13] is essentially
the kernel k-means with a linear kernel, which is a global
learning method; thus it is difficult to deal with the data
that lie on a nonlinear manifold. In [12], feature weights are
estimated by modifying the M-step of the EM algorithm
through the Bayesian inference mechanism when there are
only two clusters. It is noteworthy that, in addition to
incorporating feature selection, there are several approaches
to learning parameterized similarity functions in the
spectral clustering for improving the clustering perfor-
mance [1], [28]. Despite the success in their application
domain, it is often nontrivial to interpret the physical
meaning of the parameters specified in these methods, e.g.,
the parameter associated with a feature having a negative
weight in [1], [28]. Also, the parameters specified for the
RBF kernel functions may increase the difficulty for the
optimization.
For kernel learning in clustering, some heuristic ap-
proaches [24], [28] directly learn the kernel parameters of
some specific kernels. Although some improvement can
often be achieved, an extension of the learning method to
other kernel functions is usually nontrivial [42]. In contrast,
a more effective framework, termed the multiple kernel
learning [26], learns a linear combination of base kernels
with different weights, which will be estimated
simultaneously in the inference process, e.g., see [34], [41],
[25]. Our proposed method, which will be described later,
also belongs to this framework. In [34], the algorithm tries
to find a maximum margin hyperplane to cluster data
(restricted to the binary-class case), accompanied by

learning a mixture of Laplacian matrices. The method in
[41] extends the kernel discriminant analysis technique to
clustering and learns a combination of kernel matrices
jointly. In [34], [41], no penalty is imposed on the kernel
weights; thus the sparsity may not be guaranteed. In [25],
clustering is phrased as a nonnegative matrix factorization problem of a fused kernel matrix, and the sparseness of
kernel weights is controlled by a heuristic entropy penalty
which, however, favors a uniform weighting.
An important application of the multiple kernel learning
is to fuse the information from heterogeneous sources as
follows [26]: Associate each source with a kernel function,
and then combine the set of prototype kernels generated
from these sources to perform the inference. In this respect,
the multiview clustering is also a related work whose goal is
to learn a consensus result from multiple representations
[39], [46]. However, it implicitly treats all the sources
equally, regardless of the clustering performance with each
source. In contrast, our proposed method is able to
determine the weight for each source automatically accord-
ing to its capability of discrimination; thus it will be more
robust from the practical viewpoint.
3 OVERVIEW OF THE LOCAL LEARNING-BASED CLUSTERING ALGORITHM
Given $n$ data points $\{\mathbf{x}_i\}_{i=1}^{n}$ ($\mathbf{x}_i \in \mathbb{R}^d$), the data set will be partitioned into $C$ clusters. The clustering result can be represented by a cluster assignment indicator matrix $\mathbf{P} = [p_{ic}] \in \{0,1\}^{n \times C}$ such that $p_{ic} = 1$ if $\mathbf{x}_i$ belongs to the $c$th cluster, and $p_{ic} = 0$ otherwise. The scaled cluster assignment indicator matrix used in this paper is defined as

$\mathbf{Y} = \mathbf{P}(\mathbf{P}^T\mathbf{P})^{-\frac{1}{2}} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_C],$

where $\mathbf{y}_c = [y_{1c}, \ldots, y_{nc}]^T \in \mathbb{R}^n$ ($1 \le c \le C$) is the $c$th column of $\mathbf{Y} \in \mathbb{R}^{n \times C}$. $y_{ic} = p_{ic}/\sqrt{n_c}$ can be regarded as the confidence that $\mathbf{x}_i$ is assigned to the $c$th cluster, where $n_c$ is the size of the $c$th cluster. It is easy to verify that

$\mathbf{Y}^T\mathbf{Y} = \mathbf{I}, \qquad (1)$

where $\mathbf{I}$ is an identity matrix.
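As a concrete illustration (our own toy example, not from the paper), the following minimal NumPy sketch builds the scaled indicator matrix from a hard partition and checks property (1):

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1, 0])     # a toy hard partition of n = 6 points into C = 3 clusters
n, C = labels.size, labels.max() + 1
P = np.eye(C)[labels]                     # indicator matrix: P[i, c] = 1 iff x_i is in cluster c
Y = P / np.sqrt(P.sum(axis=0))            # equals P (P^T P)^(-1/2), since P^T P = diag(n_1, ..., n_C)
assert np.allclose(Y.T @ Y, np.eye(C))    # property (1)
```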
The starting point of the LLC [3] is that the cluster assignments in the neighborhood of each point should be as smooth as possible. Specifically, it assumes that the cluster indicator value at each point should be well estimated by a regression model trained locally with its neighbors and their cluster indicator values. Suppose there exists an arbitrary $\mathbf{Y}$ at first; for each $\mathbf{x}_i$, the model is built with the training data $\{(\mathbf{x}_j, y_{jc})\}_{\mathbf{x}_j \in \mathcal{N}_i}$ ($1 \le c \le C$, $1 \le i, j \le n$), where $\mathcal{N}_i$ denotes the set of neighboring points of $\mathbf{x}_i$, but $\mathbf{x}_i$ itself is excluded. (The $k$-mutual neighbors are adopted in order to well describe the local structure, i.e., $\mathbf{x}_j$ is considered as a neighbor of $\mathbf{x}_i$ only if $\mathbf{x}_i$ is also one of the $k$-nearest neighbors of $\mathbf{x}_j$.)
The output of the local model is of the following form:

$f_i^c(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\beta}_i^c, \quad \forall \mathbf{x} \in \mathbb{R}^d, \qquad (2)$

where $\boldsymbol{\beta}_i^c \in \mathbb{R}^d$ is the local regression coefficient vector and $f_i^c(\cdot)$ denotes the local model learned with the training data $\{(\mathbf{x}_j, y_{jc})\}_{\mathbf{x}_j \in \mathcal{N}_i}$. Here, the bias term is ignored for simplicity provided that one of the features is always 1. In [3], the model is obtained by solving the following $\ell_2$-norm regularized least squares problem:

$\min_{\{\boldsymbol{\beta}_i^c\}} \sum_{c=1}^{C} \sum_{i=1}^{n} \Bigg[ \sum_{\mathbf{x}_j \in \mathcal{N}_i} \big(y_{jc} - \mathbf{x}_j^T \boldsymbol{\beta}_i^c\big)^2 + \lambda \big\|\boldsymbol{\beta}_i^c\big\|^2 \Bigg], \qquad (3)$
where $\lambda$ is a trade-off parameter. Let $\{\hat{\boldsymbol{\beta}}_i^c\}$ be the solution to the linear ridge regression problem (3); the predicted cluster assignment for the test data $\mathbf{x}_i$ can then be calculated by

$\hat{y}_{ic} = f_i^c(\mathbf{x}_i) = \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_i^c = \boldsymbol{\alpha}_i^T \mathbf{y}_i^c, \qquad (4)$

where

$\boldsymbol{\alpha}_i^T = \mathbf{x}_i^T \big(\mathbf{X}_i \mathbf{X}_i^T + \lambda \mathbf{I}\big)^{-1} \mathbf{X}_i, \qquad (5)$

$\mathbf{X}_i = [\mathbf{x}_{i_1}, \mathbf{x}_{i_2}, \ldots, \mathbf{x}_{i_{n_i}}]$ with $\mathbf{x}_{i_k}$ being the $k$th neighbor of $\mathbf{x}_i$, $\mathbf{y}_i^c = [y_{i_1 c}, y_{i_2 c}, \ldots, y_{i_{n_i} c}]^T$, and $n_i$ is the size of $\mathcal{N}_i$.
After all of the local predictors have been constructed, the LLC combines them together so that an optimal cluster indicator matrix $\mathbf{Y}$ is found via minimizing the following overall prediction error:

$\sum_{c=1}^{C} \sum_{i=1}^{n} (y_{ic} - \hat{y}_{ic})^2 = \sum_{c=1}^{C} \|\mathbf{y}_c - \mathbf{A}\mathbf{y}_c\|^2 = \mathrm{trace}\big[\mathbf{Y}^T(\mathbf{I} - \mathbf{A})^T(\mathbf{I} - \mathbf{A})\mathbf{Y}\big] = \mathrm{trace}\big(\mathbf{Y}^T \mathbf{M}\mathbf{Y}\big), \qquad (6)$

where $\mathbf{M} = (\mathbf{I} - \mathbf{A})^T(\mathbf{I} - \mathbf{A})$ and $\mathbf{A}$ is an $n \times n$ sparse matrix whose $(i,j)$th entry $a_{ij}$ is the corresponding element of $\boldsymbol{\alpha}_i$ given by (5) if $\mathbf{x}_j \in \mathcal{N}_i$, and 0 otherwise.
As in the spectral clustering [14], [15], Y is relaxed into
the continuous domain while keeping the property of (1) for
the problem (6). The LLC then solves the following tractable
continuous optimization problem:
$\min_{\mathbf{Y} \in \mathbb{R}^{n \times C}} \mathrm{trace}\big(\mathbf{Y}^T \mathbf{M}\mathbf{Y}\big) \quad \text{s.t.} \quad \mathbf{Y}^T\mathbf{Y} = \mathbf{I}. \qquad (7)$
A solution to Y is given by the first C eigenvectors of the
matrix M corresponding to the first C smallest eigenvalues.
Similarly to [14], [15], the final partition result is obtained by
discretizing Y via the method in [15] or by the k-means as
in [14]. Promising results have been reported in [3].
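To make the pipeline above concrete, here is a hedged NumPy/SciPy sketch of the basic LLC step as reconstructed from (5)-(7). The helper name `llc_embedding`, the mutual-kNN construction, and the parameter defaults are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

def llc_embedding(X, C, k=5, lam=0.1):
    """Sketch of basic LLC: build A from the local ridge predictors (5),
    form M = (I - A)^T (I - A) as in (6), and return the C eigenvectors of M
    with the smallest eigenvalues as the relaxed indicator matrix Y of (7)."""
    n, d = X.shape
    # mutual k-nearest neighbours under the plain Euclidean metric
    # (assumes no duplicate points, so index 0 of each sorted row is the point itself)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(D2, axis=1)[:, 1:k + 1]
    mutual = [[j for j in knn[i] if i in knn[j]] for i in range(n)]

    A = np.zeros((n, n))
    for i, Ni in enumerate(mutual):
        if not Ni:
            continue
        Xi = X[Ni].T                                                        # d x n_i, neighbours of x_i
        alpha = X[i] @ np.linalg.solve(Xi @ Xi.T + lam * np.eye(d), Xi)     # eq. (5)
        A[i, Ni] = alpha
    M = (np.eye(n) - A).T @ (np.eye(n) - A)
    _, Y = eigh(M, subset_by_index=[0, C - 1])   # C eigenvectors with smallest eigenvalues
    return Y                                     # discretize with k-means to obtain the final partition
```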
4 FEATURE SELECTION FOR LOCAL LEARNING-BASED CLUSTERING
In this section, we will integrate the feature selection into
the LLC. It should be noted that the key ingredient of the
LLC is to learn the local regression model, which is trained
only with the points in each neighborhood. However, there
may be too few data points in its neighborhood to learn a
good predictor. This can be even more difficult for a high-
dimensional data set. Furthermore, it may lead to non-
smooth predictions for points from overlapping zones of
adjacent neighborhoods as the result of independently

training the local regression model in each neighborhood.
Last but not least, the $\ell_2$-norm penalty in ridge regression is known to be less robust to the irrelevant features. In order
to overcome these limitations, a more effective training
method which can reduce the complexity of the local
regression model in each neighborhood and enforce
smoothness among the local regressors is required. Inspired
by recent works on multitask learning [16], [47] which
extract a shared representation for a group of related
training tasks, demonstrating an improved performance
compared to learning each task independently, we propose
to select a small subset of features that is good for all the
local models.
To this end, we introduce a binary feature selection vector $\boldsymbol{\tau} = [\tau_1, \tau_2, \ldots, \tau_d]^T$, $\tau_l \in \{0, 1\}$, to the local discriminant function as follows:

$f_i^c(\mathbf{x}) = \mathbf{x}^T \mathrm{diag}(\sqrt{\boldsymbol{\tau}})\, \boldsymbol{\beta}_i^c + b_i^c = \sum_{l=1}^{d} x_l \sqrt{\tau_l}\, (\boldsymbol{\beta}_i^c)_l + b_i^c, \qquad (8)$

where $\mathrm{diag}(\sqrt{\boldsymbol{\tau}}) \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $\sqrt{\boldsymbol{\tau}} \in \mathbb{R}^d$ on the diagonal, $(\boldsymbol{\beta}_i^c)_l$ is the $l$th element of $\boldsymbol{\beta}_i^c \in \mathbb{R}^d$, and $b_i^c \in \mathbb{R}$ is the bias term. In (8), the entries of $\boldsymbol{\beta}_i^c$ can be turned on and off depending on the corresponding entries of the switch variable $\boldsymbol{\tau}$. To avoid a combinatorial search for $\boldsymbol{\tau}$ later, we relax the constraint $\tau_l \in \{0, 1\}$ to $\tau_l \ge 0$ and further restrict its scale by $\sum_{l=1}^{d} \tau_l = 1$. (As will be seen later, such a simplex constraint is crucial for enforcing the sparsity of $\boldsymbol{\tau}$. Moreover, we simply set $\sum_{l=1}^{d} \tau_l = 1$ rather than $\sum_{l=1}^{d} \tau_l = s$, where $s$ is a tunable constant, in order to reduce the number of free parameters.)
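The reason the simplex constraint promotes sparsity is the classical variational identity $\min_{\tau_l \ge 0,\, \sum_l \tau_l = 1} \sum_l w_l^2/\tau_l = (\sum_l |w_l|)^2$, attained at $\tau_l \propto |w_l|$ (a Cauchy-Schwarz argument); this is the sense in which the abstract states the weighted regularization with this constraint is equivalent to a known sparse-promoting penalty. A small numeric check of the identity (our own illustration, not code from the paper):

```python
import numpy as np
rng = np.random.default_rng(4)
w = rng.normal(size=6)

def weighted_penalty(w, tau):
    # the weighted ridge penalty sum_l w_l^2 / tau_l
    return np.sum(w ** 2 / tau)

# Minimizer over the simplex: tau_l proportional to |w_l|
tau_star = np.abs(w) / np.abs(w).sum()
assert np.isclose(weighted_penalty(w, tau_star), np.abs(w).sum() ** 2)

# Any other feasible tau can only do worse, so the infimum is the squared l1 norm
for _ in range(1000):
    tau = rng.dirichlet(np.ones(w.size))
    assert weighted_penalty(w, tau) >= np.abs(w).sum() ** 2 - 1e-9
```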
Consequently, the local discriminant function will be solved by

$\min_{\{\boldsymbol{\beta}_i^c, b_i^c\},\; \sum_{l=1}^{d}\tau_l = 1,\; \tau_l \ge 0} \;\sum_{c=1}^{C} \sum_{i=1}^{n} \Bigg[ \sum_{\mathbf{x}_j \in \mathcal{N}_i} \Big(y_{jc} - \mathbf{x}_j^T \mathrm{diag}(\sqrt{\boldsymbol{\tau}})\, \boldsymbol{\beta}_i^c - b_i^c\Big)^2 + \lambda\, \boldsymbol{\beta}_i^{cT} \boldsymbol{\beta}_i^c \Bigg], \qquad (9)$

or equivalently, the following problem:

$\min_{\{\mathbf{w}_i^c, b_i^c\},\; \sum_{l=1}^{d}\tau_l = 1,\; \tau_l \ge 0} \;\sum_{c=1}^{C} \sum_{i=1}^{n} \Bigg[ \sum_{\mathbf{x}_j \in \mathcal{N}_i} \Big(y_{jc} - \mathbf{x}_j^T \mathbf{w}_i^c - b_i^c\Big)^2 + \lambda\, \mathbf{w}_i^{cT} \mathrm{diag}(\boldsymbol{\tau}^{-1})\, \mathbf{w}_i^c \Bigg], \qquad (10)$
which is obtained by applying a change of variables $\mathrm{diag}(\sqrt{\boldsymbol{\tau}})\, \boldsymbol{\beta}_i^c \to \mathbf{w}_i^c$. The local model is now tantamount to being of the following form:

$f_i^c(\mathbf{x}) = \mathbf{x}^T \mathbf{w}_i^c + b_i^c, \qquad (11)$

and the regression coefficient $\mathbf{w}_i^c$ is now regularized with a weighted $\ell_2$ norm: $\mathbf{w}_i^{cT}\mathrm{diag}(\boldsymbol{\tau}^{-1})\mathbf{w}_i^c = \sum_{l} (\mathbf{w}_i^c)_l^2 / \tau_l$, i.e., the second term in the square bracket of (10). Thus, a small value for $\tau_l$, which is expected to be associated with an irrelevant feature, will result in a large penalization on $(\mathbf{w}_i^c)_l$ by this weighted norm. Furthermore, in the extreme case of $\tau_l = 0$, we will prove later that it leads to $(\mathbf{w}_i^c)_l = 0$, $\forall i, c$. (In this paper, we use the convention that $\frac{z}{0} = 0$ if $z = 0$ and $\infty$ otherwise.) That is, the $l$th feature will be completely eliminated from the prediction; thus an improved clustering result can be expected. Subsequently, to perform the feature selection together with the LLC, we develop an alternating update algorithm to estimate the clustering captured in $\mathbf{Y}$ and the feature weight $\boldsymbol{\tau}$ as follows:
4.1 Update Y as τ Is Given
First, the nearest neighbors $\mathcal{N}_i$ should be refound according to the $\boldsymbol{\tau}$-weighted squared Euclidean distance, i.e.,

$d_{\boldsymbol{\tau}}(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_{\boldsymbol{\tau}}^2 = \sum_{l=1}^{d} \tau_l \big(x_1^{(l)} - x_2^{(l)}\big)^2. \qquad (12)$
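A minimal sketch of this neighborhood-search step (illustrative helper names, not the authors' code):

```python
import numpy as np

def weighted_sq_dist(X, tau):
    """Pairwise tau-weighted squared Euclidean distances, eq. (12)."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2 * tau).sum(-1)

def mutual_knn(D2, k):
    """k-mutual neighbours: j is kept for i only if i is also among j's k nearest."""
    knn = np.argsort(D2, axis=1)[:, 1:k + 1]
    return [[j for j in knn[i] if i in knn[j]] for i in range(D2.shape[0])]
```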
With the fixed feature weight $\boldsymbol{\tau}$, the analytic solution for problem (10) can then be easily obtained by setting the derivatives to zero. That is,

$\mathbf{w}_i^c = \Big(\mathbf{X}_i \boldsymbol{\Pi}_i \mathbf{X}_i^T + \lambda\, \mathrm{diag}(\boldsymbol{\tau}^{-1})\Big)^{-1} \mathbf{X}_i \boldsymbol{\Pi}_i \mathbf{y}_i^c, \qquad (13)$

$b_i^c = \frac{1}{n_i}\, \mathbf{e}_i^T \big(\mathbf{y}_i^c - \mathbf{X}_i^T \mathbf{w}_i^c\big), \qquad (14)$

where $\mathbf{e}_i = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n_i}$, $\boldsymbol{\Pi}_i = \mathbf{I}_i - \frac{1}{n_i}\mathbf{e}_i\mathbf{e}_i^T$ is a centering projection matrix satisfying $\boldsymbol{\Pi}_i\boldsymbol{\Pi}_i = \boldsymbol{\Pi}_i$, and $\mathbf{I}_i$ is an $n_i \times n_i$ identity matrix.
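The closed forms (13) and (14) can be sanity-checked numerically: at the returned pair $(\mathbf{w}_i^c, b_i^c)$, the gradient of the local objective in (10) should vanish. A small sketch with synthetic data (sizes and seeds are arbitrary; this is our own verification, not code from the paper):

```python
import numpy as np
rng = np.random.default_rng(0)
d, n_i, lam = 5, 8, 0.1
X = rng.normal(size=(d, n_i))            # columns are the n_i neighbours of x_i
y = rng.normal(size=n_i)                 # y_i^c restricted to the neighbourhood
tau = rng.dirichlet(np.ones(d))          # feature weights on the simplex
e = np.ones(n_i)
Pi = np.eye(n_i) - np.outer(e, e) / n_i  # centering projection, Pi @ Pi == Pi

# closed forms (13)-(14)
w = np.linalg.solve(X @ Pi @ X.T + lam * np.diag(1.0 / tau), X @ Pi @ y)
b = (e @ (y - X.T @ w)) / n_i

# gradient of sum_j (y_j - x_j^T w - b)^2 + lam * w^T diag(1/tau) w should be ~0
r = y - X.T @ w - b
grad_w = -2 * X @ r + 2 * lam * (w / tau)
grad_b = -2 * r.sum()
assert np.allclose(grad_w, 0, atol=1e-8) and np.isclose(grad_b, 0, atol=1e-8)
```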
For high-dimensional data, the computation of the matrix inversion in (13) will be quite time-consuming because the time complexity is $O(d^3)$. Fortunately, by applying Woodbury's matrix inversion lemma, we can get

$\mathbf{w}_i^c = \frac{1}{\lambda}\, \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i \boldsymbol{\Pi}_i \Big[\mathbf{I}_i - \big(\lambda \mathbf{I}_i + \boldsymbol{\Pi}_i \mathbf{X}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i \boldsymbol{\Pi}_i\big)^{-1} \boldsymbol{\Pi}_i \mathbf{X}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i \boldsymbol{\Pi}_i\Big] \mathbf{y}_i^c, \qquad (15)$

in which the time complexity of the matrix inversion in (15) is only $O(n_i^3)$. In general, we often have $n_i \ll d$; thus the computational cost can be considerably reduced. Besides, from (15), it can be seen that $(\mathbf{w}_i^c)_l$ ($\forall i, c$) goes to 0 as the feature weight $\tau_l$ vanishes.
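A quick numeric check that the Woodbury form (15), as reconstructed above, agrees with the direct solution (13) while inverting only an $n_i \times n_i$ matrix (synthetic data, illustrative sizes; the weights are kept away from zero so the comparison is well conditioned):

```python
import numpy as np
rng = np.random.default_rng(1)
d, n_i, lam = 500, 10, 0.1               # a case with n_i << d, where (15) pays off
X = rng.normal(size=(d, n_i))
y = rng.normal(size=n_i)
tau = rng.uniform(0.5, 1.5, size=d); tau /= tau.sum()
D = np.diag(tau)
e = np.ones(n_i); Pi = np.eye(n_i) - np.outer(e, e) / n_i

# (13): requires a d x d inverse, O(d^3)
w_direct = np.linalg.solve(X @ Pi @ X.T + lam * np.diag(1.0 / tau), X @ Pi @ y)

# (15): only an n_i x n_i inverse, O(n_i^3)
B = Pi @ X.T @ D @ X @ Pi
w_woodbury = (D @ X @ Pi) @ (np.eye(n_i) - np.linalg.solve(lam * np.eye(n_i) + B, B)) @ y / lam

assert np.allclose(w_direct, w_woodbury)
```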
Subsequently, the predicted cluster assignment confidence for $\mathbf{x}_i$ will be obtained as follows:

$\hat{y}_{ic} = \mathbf{x}_i^T \mathbf{w}_i^c + b_i^c = \boldsymbol{\alpha}_i^T \mathbf{y}_i^c, \qquad (16)$

with

$\boldsymbol{\alpha}_i^T = \frac{1}{\lambda}\Big(\mathbf{k}_i - \frac{1}{n_i}\mathbf{e}_i^T \mathbf{K}_i\Big) \boldsymbol{\Pi}_i \Big[\mathbf{I}_i - \big(\lambda \mathbf{I}_i + \boldsymbol{\Pi}_i \mathbf{K}_i \boldsymbol{\Pi}_i\big)^{-1} \boldsymbol{\Pi}_i \mathbf{K}_i \boldsymbol{\Pi}_i\Big] + \frac{1}{n_i}\mathbf{e}_i^T, \qquad (17)$

where $\mathbf{k}_i = \mathbf{x}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i$ and $\mathbf{K}_i = \mathbf{X}_i^T \mathrm{diag}(\boldsymbol{\tau})\, \mathbf{X}_i$.
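As a consistency check of (16) and (17) under the reconstruction above: the prediction produced through $\boldsymbol{\alpha}_i$, which only touches the weighted inner products $\mathbf{k}_i$ and $\mathbf{K}_i$, should coincide with evaluating the explicit local model from (13) and (14). A small sketch (our own verification, not from the paper):

```python
import numpy as np
rng = np.random.default_rng(2)
d, n_i, lam = 6, 7, 0.1
x_i = rng.normal(size=d)                  # the point whose label is predicted
X = rng.normal(size=(d, n_i))             # its neighbours (x_i itself excluded)
y = rng.normal(size=n_i)
tau = rng.dirichlet(np.ones(d)); D = np.diag(tau)
e = np.ones(n_i); Pi = np.eye(n_i) - np.outer(e, e) / n_i

# prediction via the explicit local model (13), (14), (16)
w = np.linalg.solve(X @ Pi @ X.T + lam * np.diag(1.0 / tau), X @ Pi @ y)
b = (e @ (y - X.T @ w)) / n_i
pred_model = x_i @ w + b

# prediction via alpha_i of (17), using only k_i and K_i
k = x_i @ D @ X
K = X.T @ D @ X
B = Pi @ K @ Pi
alpha = (k - e @ K / n_i) @ Pi @ (np.eye(n_i) - np.linalg.solve(lam * np.eye(n_i) + B, B)) / lam + e / n_i
assert np.isclose(pred_model, alpha @ y)
```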
As in the LLC, we construct the key matrix M by (17)
and (6). To solve the same optimization problem in (7), the
columns of Y are simply set at the first C eigenvectors of M
corresponding to the smallest C eigenvalues.

4.2 Update τ as Y Is Given
With the fixed $\mathbf{Y}$ and the neighborhood determined at each point, a reasonable $\boldsymbol{\tau}$ is the one that can lead to a better local regression model, which is characterized by a lower objective value at the minimum of (10). We will apply this criterion to reestimate $\boldsymbol{\tau}$. We remove the bias term by plugging (14) into (10), and we then have

$\min_{\{\mathbf{w}_i^c\}} F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big), \quad F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big) = \sum_{c=1}^{C} \sum_{i=1}^{n} \Big[ \big\| \boldsymbol{\Pi}_i \mathbf{y}_i^c - (\mathbf{X}_i \boldsymbol{\Pi}_i)^T \mathbf{w}_i^c \big\|^2 + \lambda\, \mathbf{w}_i^{cT} \mathrm{diag}(\boldsymbol{\tau}^{-1})\, \mathbf{w}_i^c \Big]. \qquad (18)$
Subsequently, the estimation of $\boldsymbol{\tau}$ is reformulated as follows:

$\min_{\boldsymbol{\tau}} P(\boldsymbol{\tau}), \quad \text{s.t.} \quad \sum_{l=1}^{d} \tau_l = 1, \; \tau_l \ge 0, \; \forall l, \qquad (19)$

where $P(\boldsymbol{\tau}) = F\big(\{\mathbf{w}_i^{c*}\}, \boldsymbol{\tau}\big)$ with $\{\mathbf{w}_i^{c*}\} = \arg\min_{\{\mathbf{w}_i^c\}} F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big)$ given in (15). Hence, the Lagrangian of (19) is

$L(\boldsymbol{\tau}, \gamma, \boldsymbol{\varepsilon}) = P(\boldsymbol{\tau}) + \gamma\Big(\sum_{l=1}^{d} \tau_l - 1\Big) - \sum_{l=1}^{d} \varepsilon_l \tau_l, \qquad (20)$

where the scalar $\gamma \ge 0$ and the vector $\boldsymbol{\varepsilon} \ge \mathbf{0}$ are Lagrangian multipliers. The derivative of $L$ with respect to $\tau_l$ ($l = 1, \ldots, d$) is computed as

$\frac{\partial L}{\partial \tau_l} = \frac{\partial P}{\partial \tau_l} + \gamma - \varepsilon_l, \qquad (21)$
where

$\frac{\partial P}{\partial \tau_l} = \frac{\partial F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big)}{\partial \tau_l}\bigg|_{\mathbf{w}_i^c = \mathbf{w}_i^{c*}} + \underbrace{\sum_{i,c} \frac{\partial (\mathbf{w}_i^c)_l}{\partial \tau_l}\, \frac{\partial F\big(\{\mathbf{w}_i^c\}, \boldsymbol{\tau}\big)}{\partial (\mathbf{w}_i^c)_l}\bigg|_{\mathbf{w}_i^c = \mathbf{w}_i^{c*}}}_{0} = -\lambda\, \frac{\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2}{\tau_l^2}. \qquad (22)$
Thus, at the optimality, we have

$\tau_l^2 = \frac{\lambda \sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2}{\gamma - \varepsilon_l}, \quad \forall l, \qquad (23)$

$\gamma \ge 0, \quad \varepsilon_l \ge 0, \quad \tau_l \ge 0, \quad \forall l, \qquad (24)$

$\sum_{l=1}^{d} \tau_l = 1, \qquad (25)$

$\varepsilon_l \tau_l = 0, \quad \forall l. \qquad (26)$
By using the Karush-Kuhn-Tucker (KKT) condition [31], i.e., (26), it is easy to verify the following two cases:

. Case 1: $\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2 = 0 \;\Rightarrow\; \tau_l = 0$;

. Case 2: $\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2 > 0 \;\Rightarrow\; \varepsilon_l = 0$ and $\tau_l = \sqrt{\lambda \sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2} \Big/ \sqrt{\gamma}.$
Together with (25), it follows that the optimal solution of $\boldsymbol{\tau}$ can be calculated in a closed form:

$\tau_l = \frac{\sqrt{\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_l^2}}{\sum_{m=1}^{d} \sqrt{\sum_{c=1}^{C}\sum_{i=1}^{n} \big(\mathbf{w}_i^{c*}\big)_m^2}}. \qquad (27)$
The intuitive interpretation of (27) is as follows: The $l$th feature weight $\tau_l$ is determined by the magnitude of the $l$th element in the regression coefficients for all of the clusters, which are solved locally at each point. If this element of the regression coefficients has negligible magnitude for all the clusters at each point, it is likely that the corresponding feature is unimportant when predicting the confidence of which cluster this point belongs to.
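In practice, the update (27) is a one-liner; the following sketch (the helper `update_tau` and the array layout are our own illustrative choices) also exhibits the sparsity behavior of Case 1, where a feature unused by every local model receives exactly zero weight:

```python
import numpy as np

def update_tau(W):
    """Closed-form update (27).  W has shape (C, n, d): W[c, i] is w_i^c."""
    col_norm = np.sqrt((W ** 2).sum(axis=(0, 1)))   # sqrt(sum_c sum_i (w_i^c)_l^2), one value per feature l
    return col_norm / col_norm.sum()

W = np.random.default_rng(3).normal(size=(3, 50, 10))
W[:, :, 7] = 0.0                                    # a feature never used by any local model ...
tau = update_tau(W)
assert np.isclose(tau.sum(), 1.0) and tau[7] == 0.0 # ... receives exactly zero weight
```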
4.3 The Complete Algorithm
The complete local learning-based clustering algorithm with feature selection (denoted as LLC-fs) is described in Algorithm 1. The loop stops when the relative variation of the trace value in (7) between two consecutive iterations falls below a threshold (we set it at $10^{-2}$ in this paper), indicating that the partitioning has almost stabilized. After convergence, $\mathbf{Y}$ is discretized to obtain the final clustering result with the k-means as in [14].
Algorithm 1. Feature selection for local learning-based clustering algorithm (LLC-fs).
Input: $\{\mathbf{x}_i\}_{i=1}^{n}$, size of the neighborhood $k$, trade-off parameter $\lambda$
Output: $\mathbf{Y}$, $\boldsymbol{\tau}$
1 Initialize $\tau_l = \frac{1}{d}$, for $l = 1, \ldots, d$;
2 while not converged do
3   Find $k$-mutual neighbors for $\{\mathbf{x}_i\}_{i=1}^{n}$, using the metric defined in (12);
4   Construct the matrix $\mathbf{M}$ in (6) with $\boldsymbol{\alpha}_i$ given in (17), and then solve the problem (7) to obtain $\mathbf{Y}$;
5   Compute $\mathbf{w}_i^c$, $\forall i, c$, by (15) and update $\boldsymbol{\tau}$ using (27);
6 end
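Putting the pieces together, a compact sketch of Algorithm 1 under the reconstructions above might look as follows. This is an illustrative reading of the pseudocode (function and variable names are ours, and nothing beyond the closed forms (15), (17), and (27) is taken from the paper), not the authors' released implementation:

```python
import numpy as np
from scipy.linalg import eigh

def llc_fs(X, C, k=5, lam=0.1, tol=1e-2, max_iter=30):
    """Sketch of LLC-fs: alternate between the relaxed indicator Y and the feature weights tau."""
    n, d = X.shape
    tau = np.full(d, 1.0 / d)                       # step 1: uniform initialization
    prev_trace = None
    for _ in range(max_iter):
        # step 3: k-mutual neighbours under the tau-weighted metric (12)
        D2 = (((X[:, None, :] - X[None, :, :]) ** 2) * tau).sum(-1)
        knn = np.argsort(D2, axis=1)[:, 1:k + 1]
        neigh = [[j for j in knn[i] if i in knn[j]] for i in range(n)]

        # step 4: build A from alpha_i in (17), then M in (6), then Y from (7)
        A = np.zeros((n, n))
        for i, Ni in enumerate(neigh):
            if not Ni:
                continue
            Xi = X[Ni].T
            ni = len(Ni)
            e = np.ones(ni)
            Pi = np.eye(ni) - np.outer(e, e) / ni
            Ki = Xi.T @ (tau[:, None] * Xi)         # X_i^T diag(tau) X_i
            ki = (X[i] * tau) @ Xi                  # x_i^T diag(tau) X_i
            B = Pi @ Ki @ Pi
            inner = np.eye(ni) - np.linalg.solve(lam * np.eye(ni) + B, B)
            A[i, Ni] = (ki - e @ Ki / ni) @ Pi @ inner / lam + e / ni
        M = (np.eye(n) - A).T @ (np.eye(n) - A)
        vals, Y = eigh(M, subset_by_index=[0, C - 1])

        # step 5: recompute local coefficients by (15) and update tau by (27)
        sq_sum = np.zeros(d)
        for i, Ni in enumerate(neigh):
            if not Ni:
                continue
            Xi = X[Ni].T
            ni = len(Ni)
            e = np.ones(ni)
            Pi = np.eye(ni) - np.outer(e, e) / ni
            Ki = Xi.T @ (tau[:, None] * Xi)
            B = Pi @ Ki @ Pi
            S = (tau[:, None] * Xi) @ Pi @ (np.eye(ni) - np.linalg.solve(lam * np.eye(ni) + B, B)) / lam
            W = S @ Y[Ni]                           # d x C: the C vectors w_i^c stacked as columns
            sq_sum += (W ** 2).sum(axis=1)
        tau = np.sqrt(sq_sum) / np.sqrt(sq_sum).sum()

        # stopping rule: relative variation of trace(Y^T M Y) between iterations
        trace = vals.sum()
        if prev_trace is not None and abs(prev_trace - trace) <= tol * max(abs(prev_trace), 1e-12):
            break
        prev_trace = trace
    return Y, tau                                   # discretize Y with k-means for the final labels
```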
5 MULTIPLE KERNEL LEARNING FOR LOCAL LEARNING-BASED CLUSTERING
To deal with some complex data sets, the LLC algorithm
can be kernelized as in [3] by replacing the linear ridge
regression with the kernel ridge regression. Under such
circumstances, selecting a suitable kernel function will
become a crucial issue. In this section, we extend the
method presented in Section 4 to learn a proper linear
combination of several precomputed kernel matrices under
the multiple kernel learning framework [26].
In the kernel methods, the symmetric positive semidefinite kernel function $\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implicitly maps the original input space into a high-dimensional (possibly infinite-dimensional) Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, which is equipped with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, via a nonlinear mapping $\phi: \mathcal{X} \to \mathcal{H}$, i.e., $\mathcal{K}(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_{\mathcal{H}}$. Suppose there are altogether $L$ different kernel functions $\{\mathcal{K}^{(l)}\}_{l=1}^{L}$ available for the clustering task in hand. Accordingly, there are $L$ different associated feature spaces,
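Although the excerpt is cut off here, the multiple kernel learning setup it introduces combines precomputed base Gram matrices with nonnegative weights on the simplex. A minimal sketch of forming such a convex combination (the weight vector, here called `mu`, is the quantity the method would learn; the base kernels and values below are placeholders of our own):

```python
import numpy as np

def combined_kernel(kernels, mu):
    """Convex combination sum_l mu_l K^(l) of precomputed n x n PSD Gram matrices."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)
    return sum(m * K for m, K in zip(mu, kernels))

# e.g., an RBF and a linear kernel on the same data, fused with weights (0.7, 0.3)
X = np.random.default_rng(5).normal(size=(20, 4))
D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
kernels = [np.exp(-D2 / (2 * np.median(D2))), X @ X.T]
K = combined_kernel(kernels, [0.7, 0.3])
```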
1536 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 8, AUGUST 2011

References
. I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Machine Learning Research, 2003.
. T.R. Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, 1999.
. A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems, 2001.
. B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
. M. Yuan and Y. Lin, "Model Selection and Estimation in Regression with Grouped Variables," J. Royal Statistical Soc. B, 2006.