Proceedings ArticleDOI

Information-theoretic metric learning

20 Jun 2007-pp 209-216
TL;DR: An information-theoretic approach to learning a Mahalanobis distance function that can handle a wide variety of constraints and can optionally incorporate a prior on the distance function; an online version with regret bounds is also derived.
Abstract: In this paper, we present an information-theoretic approach to learning a Mahalanobis distance function. We formulate the problem as that of minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. We express this problem as a particular Bregman optimization problem---that of minimizing the LogDet divergence subject to linear constraints. Our resulting algorithm has several advantages over existing methods. First, our method can handle a wide variety of constraints and can optionally incorporate a prior on the distance function. Second, it is fast and scalable. Unlike most existing methods, no eigenvalue computations or semi-definite programming are required. We also present an online version and derive regret bounds for the resulting algorithm. Finally, we evaluate our method on a recent error reporting system for software called Clarify, in the context of metric learning for nearest neighbor classification, as well as on standard data sets.

Summary (2 min read)

1 Introduction

  • The authors propose a new formulation for learning a Mahalanobis distance under constraints.
  • The authors model the problem in an information-theoretic setting by leveraging an equivalence between the multivariate Gaussian distribution and the Mahalanobis distance.
  • To solve their problem, the authors show an interesting connection to a recently proposed low-rank kernel learning problem [6].
  • It was shown that this problem can be optimized using an iterative optimization procedure with cost O(cd^2) per iteration, where c is the number of distance constraints and d is the dimensionality of the data.
  • In particular, this method does not require costly eigenvalue computations, unlike many other metric learning algorithms [4, 10, 11].

2 Problem Formulation

  • Two points are similar if the Mahalanobis distance between them is smaller than a given upper bound, d_A(x_i, x_j) ≤ u for a relatively small value of u.
  • The authors' problem is to learn a matrix A which parameterizes a Mahalanobis distance that satisfies a given set of constraints.
  • To quantify this more formally, the authors propose the following information-theoretic framework.
  • Given a Mahalanobis distance parameterized by A, the authors express its corresponding multivariate Gaussian as p(x; m, A) = (1/Z) exp(−d_A(x, m)), where Z is a normalizing constant.

3 Algorithm

  • The authors demonstrate how to solve the information-theoretic metric learning problem (2) by proving its equivalence to a low-rank kernel learning problem.
  • Using this equivalence, the authors appeal to the algorithm developed in [6] to solve their problem.

3.1 Equivalence to Low-Rank Kernel Learning

  • It can be shown that the Burg divergence between two matrices is finite if and only if their range spaces are the same [6].
  • The authors now state a surprising equivalence between problems (2) and (3).
  • Here, the two mean vectors are the same, so their Mahalanobis distance is zero.
  • Thus, the relative entropy, KL(p(x; m, A) ‖ p(x; m, I)), is proportional to the Burg matrix divergence from A to I.
  • This lemma confirms that if the authors have a feasible kernel matrix K satisfying the constraints of (3), the corresponding Mahalanobis distance parameterized by A satisfies the constraints of (2).

3.2 Metric Learning Algorithm

  • Given the connection stated above, the authors can use the methods in [6] to solve (3).
  • Since the output of the low-rank kernel learning algorithm is W, and the authors prefer A in its factored form W^T W for most applications, no additional work is required beyond running the low-rank kernel learning algorithm.
  • The authors' metric learning algorithm is given as Algorithm 1; each constraint projection costs O(d^2) per iteration and requires no eigendecomposition.
  • Thus, an iteration of the algorithm (i.e., looping through all c constraints) requires O(cd^2) time.
  • By employing the Sherman-Morrison-Woodbury inverse formula appropriately, this projection—which generally has no closed-form solution—can be computed analytically.

4 Discussion

  • In this work the authors formulate the Mahalanobis metric learning problem in an information-theoretic setting and provide an explicit connection to low-rank kernel learning.
  • The authors now briefly discuss extensions to the basic framework, and they contrast their approach with other work on metric learning.
  • The authors consider finding the Mahalanobis distance closest to the baseline Euclidean distance as measured by differential relative entropy.
  • The authors approach can be adapted to handle this setting.
  • A simple extension to their framework can incorporate slack variables on the distance constraints to handle such infeasible cases.


Information-Theoretic Metric Learning
Jason Davis, Brian Kulis, Suvrit Sra and Inderjit Dhillon
Dept. of Computer Science
University of Texas at Austin
Austin, TX 78712
Abstract
We formulate the metric learning problem as that of minimizing the differential relative entropy between two multivariate Gaussians under constraints on the Mahalanobis distance function. Via a surprising equivalence, we show that this problem can be solved as a low-rank kernel learning problem. Specifically, we minimize the Burg divergence of a low-rank kernel to an input kernel, subject to pairwise distance constraints. Our approach has several advantages over existing methods. First, we present a natural information-theoretic formulation for the problem. Second, the algorithm utilizes the methods developed by Kulis et al. [6], which do not involve any eigenvector computation; in particular, the running time of our method is faster than most existing techniques. Third, the formulation offers insights into connections between metric learning and kernel learning.
1 Introduction
We propose a new formulation for learning a Mahalanobis distance under constraints. We model the problem in an information-theoretic setting by leveraging an equivalence between the multivariate Gaussian distribution and the Mahalanobis distance. We show that the problem of learning an optimal Mahalanobis distance translates to learning the optimal Gaussian with respect to an entropic objective. Thus, our problem can be thought of as maximizing the entropy of a multivariate Gaussian subject to pairwise constraints on the associated Mahalanobis distance.

To solve our problem, we show an interesting connection to a recently proposed low-rank kernel learning problem [6]. Here, a low-rank kernel K is learned that satisfies a set of given distance constraints as well as minimizes the Burg matrix divergence to the given kernel K_0. It was shown that this problem can be optimized using an iterative optimization procedure with cost O(cd^2) per iteration, where c is the number of distance constraints, and d is the dimensionality of the data. In particular, this method does not require costly eigenvalue computations, unlike many other metric learning algorithms [4, 10, 11].
2 Problem Formulation
Given a set of n points {x_1, ..., x_n} in ℝ^d, we seek a positive definite matrix A which parameterizes the Mahalanobis distance:

    d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j).

We assume that some prior knowledge about the distances between these points is known. Specifically, we consider relationships constraining the similarity or dissimilarity between pairs of points. Two points are similar if the Mahalanobis distance between them is smaller than a given upper bound, d_A(x_i, x_j) ≤ u for a relatively small value of u. Similarly, two points are dissimilar if d_A(x_i, x_j) ≥ l for sufficiently large l.
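For concreteness, here is a minimal NumPy sketch of the squared Mahalanobis distance above and of how the two kinds of pairwise constraints are checked; the points, the matrix A, and the thresholds u and l below are made up for illustration.

```python
import numpy as np

def mahalanobis_dist(A, x_i, x_j):
    # d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j)
    diff = x_i - x_j
    return float(diff @ A @ diff)

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
A = np.eye(3)                 # identity => ordinary squared Euclidean distance
u, l = 1.0, 5.0               # illustrative thresholds

d = mahalanobis_dist(A, x_i, x_j)
print(d, d <= u, d >= l)      # similarity / dissimilarity constraint checks
```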

In particular, for a classification setting where class labels are known for each instance (as in Glober-
son and Roweis [4]), distances between points in the same class can be constrained to be small, and
distances between two points in different classes can be constrained to be large.
Our problem is to learn a matrix A which parameterizes a Mahalanobis distance that satisfies a
given set of constraints. Typically, this learned distance function is used for k-nearest neighbor
search, k-means clustering, etc. We note that, in the absence of prior knowledge, these algorithms
typically use the standard squared Euclidean distance, or equivalently, the Mahalanobis distance
parameterized by the identity matrix I.
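As a small illustration of the preceding paragraph, similarity and dissimilarity pairs might be generated from class labels as follows; this is only a sketch, and in practice one would typically subsample the pairs rather than enumerate all of them.

```python
from itertools import combinations

def pairs_from_labels(labels):
    # Same-class pairs are constrained to be close (set S),
    # different-class pairs to be far apart (set D).
    S = [(i, j) for i, j in combinations(range(len(labels)), 2) if labels[i] == labels[j]]
    D = [(i, j) for i, j in combinations(range(len(labels)), 2) if labels[i] != labels[j]]
    return S, D

S, D = pairs_from_labels([0, 0, 1, 1, 2])
print(S)   # [(0, 1), (2, 3)]
print(D)   # the remaining 8 cross-class pairs
```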
In general, the set of distance functions in our feasible set will be infinite (we discuss later how to
re-formulate the problem for the case when the feasible set is empty). Therefore, we regularize the
problem by choosing the Mahalanobis matrix A that is as close as possible to the identity matrix I
(which parameterizes the baseline Euclidean distance function). To quantify this more formally, we
propose the following information-theoretic framework.
There exists a simple bijection between the set of Mahalanobis distances and the set of multivariate Gaussians with fixed mean m. Given a Mahalanobis distance parameterized by A, we express its corresponding multivariate Gaussian as p(x; m, A) = (1/Z) exp(−d_A(x, m)), where Z is a normalizing constant. Using this bijection, we define the distance between two Mahalanobis distance functions parametrized by A_1 and A_2 as the (differential) relative entropy between their corresponding multivariate Gaussians:

    KL(p(x; m, A_1) ‖ p(x; m, A_2)) = ∫ p(x; m, A_1) log [ p(x; m, A_1) / p(x; m, A_2) ] dx.    (1)

Given a set of pairs of similar points S and pairs of dissimilar points D, our distance metric learning problem is

    min          KL(p(x; m, A) ‖ p(x; m, I))
    subject to   d_A(x_i, x_j) ≤ u   (i, j) ∈ S,
                 d_A(x_i, x_j) ≥ l   (i, j) ∈ D.    (2)
Note that m is an arbitrary fixed vector.
3 Algorithm
In this section, we demonstrate how to solve the information-theoretic metric learning problem (2)
by proving its equivalence to a low-rank kernel learning problem. Using this equivalence, we appeal
to the algorithm developed in [6] to solve our problem.
3.1 Equivalence to Low-Rank Kernel Learning
Let X = [x_1 x_2 ... x_n], and the Gram matrix over the input points be K_0 = X^T X. Consider the following kernel learning problem, to be solved for K:

    min          D_Burg(K, K_0)
    subject to   K_ii + K_jj − 2K_ij ≤ u   (i, j) ∈ S,
                 K_ii + K_jj − 2K_ij ≥ l   (i, j) ∈ D,
                 K ⪰ 0.    (3)

The Burg matrix divergence is a Bregman matrix divergence generated by the convex function φ(X) = −log det X over the cone of semi-definite matrices, and it is defined as

    D_Burg(K, K_0) = tr(K K_0^{-1}) − log det(K K_0^{-1}) − n.    (4)

Formulation (3) attempts to find the nearest kernel matrix in Burg divergence to the input Gram matrix, subject to linear inequality constraints. It can be shown that the Burg divergence between two matrices is finite if and only if their range spaces are the same [6]. This fact allows us to conclude that the range spaces of K and K_0 are the same if the problem has a feasible solution. Furthermore, the learned matrix K can be written as a rank-d kernel K = X^T W^T W X, for some (d × d) full-rank matrix W.
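Equation (4) translates directly into code. The dense version below is only an illustration of the definition; the actual algorithm of [6] works with a low-rank factorization rather than explicit n × n inverses.

```python
import numpy as np

def burg_divergence(K, K0):
    # D_Burg(K, K0) = tr(K K0^{-1}) - log det(K K0^{-1}) - n, cf. equation (4)
    n = K.shape[0]
    M = K @ np.linalg.inv(K0)
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        return np.inf          # the divergence is infinite when K K0^{-1} is singular
    return float(np.trace(M) - logdet - n)

# Sanity check: the divergence of a matrix to itself is zero.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
K0 = B @ B.T + 4.0 * np.eye(4)
print(burg_divergence(K0, K0))   # ~0.0
```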
We now state a surprising equivalence between problems (2) and (3). By solving (3) for K = X^T W^T W X, the optimal A for (2) can be easily constructed via A = W^T W. We will not provide a detailed proof of this result; however, we present the two key lemmas.
Lemma 1: D_Burg(K, K_0) = 2 KL(p(x; m, A) ‖ p(x; m, I)) + c, where c is a constant.
Lemma 1 establishes that the objectives for information-theoretic metric learning and low-rank kernel learning are essentially the same. It was recently shown [3] that the differential relative entropy between two multivariate Gaussians can be expressed as the convex combination of a Mahalanobis distance between mean vectors and the Burg matrix divergence between the covariance matrices. Here, the two mean vectors are the same, so their Mahalanobis distance is zero. Thus, the relative entropy, KL(p(x; m, A) ‖ p(x; m, I)), is proportional to the Burg matrix divergence from A to I. Therefore, the proof of Lemma 1 reduces to showing that D_Burg(K, K_0) and D_Burg(A, I) differ by only a constant. Interestingly, the dimensions of the matrices in these two divergences are different: K and K_0 are (n × n), while A and I are (d × d).
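The decomposition cited from [3] can be checked numerically. The sketch below states it for Gaussians given directly by their means and covariances (the mapping between A and the covariance, and hence the exact constants, depends on the normalization used in p(x; m, A)): the KL divergence is one half the Mahalanobis distance between the means, measured by the second inverse covariance, plus one half the Burg divergence between the covariances.

```python
import numpy as np

def kl_gaussians(m1, S1, m2, S2):
    # Standard closed form for KL( N(m1, S1) || N(m2, S2) ).
    d = len(m1)
    S2inv = np.linalg.inv(S2)
    diff = m2 - m1
    return 0.5 * (np.trace(S2inv @ S1) + diff @ S2inv @ diff - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def burg_divergence(S1, S2):
    d = S1.shape[0]
    M = S1 @ np.linalg.inv(S2)
    return np.trace(M) - np.log(np.linalg.det(M)) - d

rng = np.random.default_rng(1)
d = 4
B1, B2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S1, S2 = B1 @ B1.T + np.eye(d), B2 @ B2.T + np.eye(d)
m1, m2 = rng.normal(size=d), rng.normal(size=d)

lhs = kl_gaussians(m1, S1, m2, S2)
rhs = 0.5 * (m1 - m2) @ np.linalg.inv(S2) @ (m1 - m2) + 0.5 * burg_divergence(S1, S2)
print(np.isclose(lhs, rhs))   # True; with equal means only the Burg term remains
```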
Lemma 2: Given K = X^T A X, A is feasible for (2) if and only if K is feasible for (3).

This lemma confirms that if we have a feasible kernel matrix K satisfying the constraints of (3), the corresponding Mahalanobis distance parameterized by A satisfies the constraints of (2). Note that by associating the kernel matrix with the Mahalanobis distance, we can generalize to unseen data points, thus circumventing a problem often associated with kernel learning.
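Lemma 2 rests on the identity K_ii + K_jj − 2K_ij = (x_i − x_j)^T A (x_i − x_j) when K = X^T A X, so the constraint sets of (2) and (3) coincide. A quick numerical check on random data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 6
X = rng.normal(size=(d, n))              # columns are the points x_1, ..., x_n
B = rng.normal(size=(d, d))
A = B @ B.T + np.eye(d)                  # a positive definite Mahalanobis matrix

K = X.T @ A @ X                          # the kernel of Lemma 2
i, j = 1, 4
lhs = K[i, i] + K[j, j] - 2 * K[i, j]    # constraint expression in problem (3)
diff = X[:, i] - X[:, j]
rhs = diff @ A @ diff                    # d_A(x_i, x_j) in problem (2)
print(np.isclose(lhs, rhs))              # True
```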
3.2 Metric Learning Algorithm
Given the connection stated above, we can use the methods in [6] to solve (3). Since the output of the low-rank kernel learning algorithm is W, and we prefer A in its factored form W^T W for most applications, no additional work is required beyond running the low-rank kernel learning algorithm. Our metric learning algorithm is given as Algorithm 1; each constraint projection costs O(d^2) per iteration and requires no eigendecomposition. Thus, an iteration of the algorithm (i.e., looping through all c constraints) requires O(cd^2) time. Note that a naive implementation would cost O(cd^3) time per iteration (because of the multiplication W L), but the Cholesky factorization can be combined with the matrix multiplication into a single O(d^2) routine, leading to the more efficient O(cd^2) per-iteration running time.

The low-rank kernel learning algorithm which forms the basis for Algorithm 1 repeatedly computes Bregman projections, which project the current solution onto a single constraint. By employing the Sherman-Morrison-Woodbury inverse formula appropriately, this projection, which generally has no closed-form solution, can be computed analytically. Furthermore, it can be computed efficiently on a low-rank factorization of the kernel matrix.
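Each Bregman projection here is a rank-one update, which is what makes the Sherman-Morrison-Woodbury identity applicable. The sketch below only illustrates the rank-one special case of that identity; how it is combined with the low-rank factorization is spelled out in [6], not here.

```python
import numpy as np

def rank_one_update_inv(M_inv, beta, w):
    # (M + beta w w^T)^{-1} = M^{-1} - beta (M^{-1} w)(M^{-1} w)^T / (1 + beta w^T M^{-1} w)
    Mw = M_inv @ w
    return M_inv - beta * np.outer(Mw, Mw) / (1.0 + beta * w @ Mw)

rng = np.random.default_rng(3)
d = 5
B = rng.normal(size=(d, d))
M = B @ B.T + np.eye(d)
w = rng.normal(size=d)
beta = 0.3

direct = np.linalg.inv(M + beta * np.outer(w, w))
print(np.allclose(direct, rank_one_update_inv(np.linalg.inv(M), beta, w)))   # True
```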
4 Discussion
In this work we formulate the Mahalanobis metric learning problem in an information-theoretic
setting and provide an explicit connection to low-rank kernel learning. We now briefly discuss
extensions to the basic framework, and we contrast our approach with other work on metric learning.
We consider finding the Mahalanobis distance closest to the baseline Euclidean distance as measured by differential relative entropy. In some applications, it may be more appropriate to consider finding a Mahalanobis distance closest to some other baseline; for example, one could use the Mahalanobis distance parametrized by the sample covariance matrix S as a baseline, in which case the resulting Burg divergence problem becomes a minimization of D_Burg(A, S). We note that extensions of this sort can be solved by variants of our proposed framework.
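A hypothetical sketch of the alternative baseline mentioned above: replace the identity with the sample covariance S, so the regularizer becomes D_Burg(A, S). The data and the candidate A below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 50))             # d x n data matrix, columns are points
S = np.cov(X)                            # sample covariance as the baseline
A = np.eye(3)                            # any candidate Mahalanobis matrix

M = A @ np.linalg.inv(S)
d_burg = np.trace(M) - np.log(np.linalg.det(M)) - A.shape[0]
print(d_burg)                            # D_Burg(A, S) for this candidate
```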

ALGORITHM 1: Algorithm for information-theoretic metric learning

ITMETRICLEARN(X, S, D, u, l)
Input: X: input d × n matrix, S: set of similar pairs, D: set of dissimilar pairs, u, l: distance thresholds
Output: W: output factor matrix, where W^T W = A

1. Set W = I_d and λ_ij = 0 for all i, j
2. Repeat until convergence:
     Pick a constraint (i, j) ∈ S or (i, j) ∈ D
     Let v^T be row i of X minus row j of X
     Set the following variables:
       1. w = Wv
       2. if (similarity constraint)
            γ = min(λ_ij, 1/‖w‖_2^2 − 1/u)
            β = γ/(1 − γ‖w‖_2^2)
          else if (dissimilarity constraint)
            γ = min(λ_ij, 1/l − 1/‖w‖_2^2)
            β = −γ/(1 + γ‖w‖_2^2)
       3. λ_ij = λ_ij − γ
     Compute the Cholesky factorization LL^T = I + βww^T
     Set W ← L^T W
3. Return W
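A hedged Python transcription of Algorithm 1 is given below. The PDF extraction dropped several operators, so the signs of γ and β were reconstructed from the derivation in Section 3 and from the analogous update in the authors' later ICML formulation of ITML; treat them, and the simple convergence test, as assumptions rather than the authors' exact code. Columns of X are taken to be the data points, matching the d × n convention in the algorithm's input description.

```python
import numpy as np

def it_metric_learn(X, S, D, u, l, max_sweeps=1000, tol=1e-8):
    """Sketch of ITMETRICLEARN: returns W with A = W^T W.

    X : (d, n) array whose columns are the points x_1, ..., x_n.
    S : list of (i, j) index pairs constrained to d_A(x_i, x_j) <= u.
    D : list of (i, j) index pairs constrained to d_A(x_i, x_j) >= l.
    """
    d = X.shape[0]
    W = np.eye(d)
    lam = {}   # dual variables lambda_ij, initialised to 0

    # delta = +1 encodes a similarity (upper-bound) constraint, -1 a dissimilarity one.
    constraints = [(i, j, +1, u) for (i, j) in S] + [(i, j, -1, l) for (i, j) in D]

    for _ in range(max_sweeps):
        biggest_step = 0.0
        for (i, j, delta, bound) in constraints:
            v = X[:, i] - X[:, j]
            w = W @ v
            p = float(w @ w)                       # ||w||_2^2 = d_A(x_i, x_j)
            if p < 1e-12:                          # identical points; nothing to project
                continue
            gamma = min(lam.get((i, j), 0.0), delta * (1.0 / p - 1.0 / bound))
            beta = delta * gamma / (1.0 - delta * gamma * p)
            lam[(i, j)] = lam.get((i, j), 0.0) - gamma
            # Cholesky factorisation L L^T = I + beta w w^T, then W <- L^T W,
            # which realises A <- A + beta A v v^T A in factored form.
            L = np.linalg.cholesky(np.eye(d) + beta * np.outer(w, w))
            W = L.T @ W
            biggest_step = max(biggest_step, abs(gamma))
        if biggest_step < tol:                     # crude stand-in for "until convergence"
            break
    return W
```

For example, W = it_metric_learn(X, S, D, u=1.0, l=5.0) followed by A = W.T @ W recovers a learned Mahalanobis matrix in the factored form discussed in Section 3.2.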
We consider simple distance constraints for similar and dissimilar points, though it is straightforward
to incorporate other constraints. For example, Schutz and Joachims [8] consider a formulation where
the distance metric is learned subject to relative nearness constraints on the input points (as in,
the distance between i and j is closer than the distance between i and k). Our approach can be
adapted to handle this setting. In fact, it is possible to incorporate arbitrary linear constraints into
our framework.
Finally, our basic formulation assumes that there exists a feasible point that satisfies all of the dis-
tance constraints, but in practice, this may fail to hold. A simple extension to our framework can
incorporate slack variables on the distance constraints to handle such infeasible cases.
4.1 Related Work
Xing et al. [11] use a semidefinite programming formulation for learning a Mahalanobis distance
metric. Their algorithm aims to minimize the sum of squared distances between input points that are
“similar”, while at the same time aiming to separate the “dissimilar” points by a specified minimum
amount. Our formulation differs from theirs in two respects. First, we minimize a Burg-divergence,
and second, instead of considering the sum of distortions over dissimilar points, we consider pairs
of constrained points.
Weinberger et al. [10] formulate the metric learning problem in a large margin setting, with a focus
on kNN classification. They formulate the problem as a semidefinite programming problem and
consequently solve it using a combination of sub-gradient descent and alternating projections. Our
formulation does not solely have kNN as a focal point, and differs significantly in the algorithmic
machinery used.
The paper of Globerson and Roweis [4] proceeds to learn a Mahalanobis metric by essentially
shrinking the distance between similar points to zero, and expanding the distance between dissimilar
points to infinity. They formulate a convex optimization problem which they propose to solve by
a projected-gradient method. Our approach allows more refined interpoint constraints than just a
zero/one approach.

Chopra et al. [1] presented a discriminative method based on pairs of convolutional neural networks.
Their method aims to learn a distance metric, wherein the interpoint constraints are approximately
enforced by penalizing large distances between similar points or small distances between dissim-
ilar points. Our method is solved more efficiently, and the constraints are enforced incrementally.
Furthermore, as discussed above, by including slacks on our constraints, we can accommodate “soft-
margin” constraints.
Shalev-Shwartz et al. [9] consider an online metric learning setting, where the interpoint constraints
are similar to ours. They also provide a margin interpretation, similar to that of [10]. Their formula-
tion considers distances between all pairs of similar and dissimilar points, whereas we consider only
a fixed set of input pairwise constrained points.
Other notable work includes the articles [2, 5, 7, 8]. Crammer et al. [2] apply boosting to kernel learning; for a connection between our method and kernel learning, see Section 3. Lanckriet et al. [7] study
the problem of kernel learning via semidefinite programming. Goldberger et al. [5] proposed neigh-
borhood component analysis to explicitly aid kNN; however, the formulation is non-convex and can
lead to local optima.
Acknowledgements This research was supported by NSF grant CCF-0431257, NSF Career
Award ACI-0093404, and NSF-ITR award IIS-0325116.
References
[1] S. Chopra, R. Hadsell, and Y. LeCun. Learning a Similarity Metric Discriminatively, with
Application to Face Verification. In CVPR, 2005.
[2] K. Crammer, J. Keshet, and Y. Singer. Kernel Design Using Boosting. In NIPS, 2002.
[3] J. V. Davis and I. S. Dhillon. Differential Entropic Clustering of Multivariate Gaussians. In
NIPS, 2006.
[4] A. Globerson and S. Roweis. Metric Learning by Collapsing Classes. In NIPS, 2005.
[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood Component Anal-
ysis. In NIPS, 2004.
[6] B. Kulis, M. Sustik, and I. S. Dhillon. Learning Low-rank Kernels. In ICML, 2006.
[7] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the
Kernel Matrix with Semidefinite Programming. In JMLR, 2004.
[8] M. Schutz and T. Joachims. Learning a Distance Metric from Relative Comparisons. In NIPS,
2003.
[9] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and Batch Learning of Pseudo-Metrics. In
ICML, 2004.
[10] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance Metric Learning for Large Margin
Nearest Neighbor Classification. In NIPS, 2005.
[11] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to
clustering with side-information. In NIPS, volume 14, 2002.
Citations
Proceedings ArticleDOI
07 Dec 2015
TL;DR: As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.
Abstract: This paper contributes a new high quality dataset for person re-identification, named "Market-1501". Generally, current datasets: 1) are limited in scale, 2) consist of hand-drawn bboxes, which are unavailable under realistic settings, 3) have only one ground truth and one query image for each identity (close environment). To tackle these problems, the proposed Market-1501 dataset is featured in three aspects. First, it contains over 32,000 annotated bboxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in Market-1501 dataset are produced using the Deformable Part Model (DPM) as pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiment, we show that the proposed descriptor yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.

3,564 citations

Journal ArticleDOI
TL;DR: This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications applied to transfer learning, which can be applied to big data environments.
Abstract: Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications applied to transfer learning. Lastly, there is information listed on software downloads for various transfer learning solutions and a discussion of possible future research work. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.

2,900 citations


Cites methods from "Information-theoretic metric learni..."

  • ...Davis J, Kulis B, Jain P, Sra S, Dhillon I. Information theoretic metric learning....


  • ...Baseline approaches tested include k-nearest neighbors, SVM, metric learning proposed by Davis [23], feature augmentation proposed by Daumé [22], and a cross domain metric learning method proposed by Saenko [100]....


Book ChapterDOI
05 Sep 2010
TL;DR: This paper introduces a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation that minimizes the effect of domain-induced changes in the feature distribution.
Abstract: Domain adaptation is an important emerging topic in computer vision. In this paper, we present one of the first studies of domain shift in the context of object recognition. We introduce a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation that minimizes the effect of domain-induced changes in the feature distribution. The transformation is learned in a supervised manner and can be applied to categories for which there are no labeled examples in the new domain. While we focus our evaluation on object recognition tasks, the transform-based adaptation technique we develop is general and could be applied to nonimage data. Another contribution is a new multi-domain object database, freely available for download. We experimentally demonstrate the ability of our method to improve recognition on categories with few or no target domain labels and moderate to large changes in the imaging conditions.

2,624 citations


Cites background or methods from "Information-theoretic metric learni..."

  • ...Furthermore, the learned kernel function may be computed over arbitrary points, and the method may be scaled for very large data sets; see [8,14] for details....


  • ...In the following, we compare k-NN classifiers that use the proposed crossdomain transformation to the following baselines: 1) k-NN classifiers that operate in the original feature space using a Euclidean distance, and 2) k-NN classifiers that use traditional supervised metric learning, implemented using the ITML [8] method, trained using all available labels in both domains....


  • ...This regularizer is a special case of the LogDet divergence, which has many properties desirable for metric learning such as scale and rotation invariance [8]....


  • ...We follow the approach given in [8] to find the optimal W for (3)....


  • ...Learning W using ITML....


Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel filter pairing neural network (FPNN) to jointly handle misalignment, photometric and geometric transforms, occlusions and background clutter is proposed and significantly outperforms state-of-the-art methods on this dataset.
Abstract: Person re-identification is to match pedestrian images from disjoint camera views detected by pedestrian detectors. Challenges are presented in the form of complex variations of lightings, poses, viewpoints, blurring effects, image resolutions, camera settings, occlusions and background clutter across camera views. In addition, misalignment introduced by the pedestrian detector will affect most existing person re-identification methods that use manually cropped pedestrian images and assume perfect detection. In this paper, we propose a novel filter pairing neural network (FPNN) to jointly handle misalignment, photometric and geometric transforms, occlusions and background clutter. All the key components are jointly optimized to maximize the strength of each component when cooperating with others. In contrast to existing works that use handcrafted features, our method automatically learns features optimal for the re-identification task from data. The learned filter pairs encode photometric transforms. Its deep architecture makes it possible to model a mixture of complex photometric and geometric transforms. We build the largest benchmark re-id dataset with 13,164 images of 1,360 pedestrians. Unlike existing datasets, which only provide manually cropped pedestrian images, our dataset provides automatically detected bounding boxes for evaluation close to practical applications. Our neural network significantly outperforms state-of-the-art methods on this dataset.

2,417 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This paper proposes an effective feature representation called Local Maximal Occurrence (LOMO), and a subspace and metric learning method called Cross-view Quadratic Discriminant Analysis (XQDA), and presents a practical computation method for XQDA.
Abstract: Person re-identification is an important technique towards automatic search of a person's presence in a surveillance video. Two fundamental problems are critical for person re-identification, feature representation and metric learning. An effective feature representation should be robust to illumination and viewpoint changes, and a discriminant metric should be learned to match various person images. In this paper, we propose an effective feature representation called Local Maximal Occurrence (LOMO), and a subspace and metric learning method called Cross-view Quadratic Discriminant Analysis (XQDA). The LOMO feature analyzes the horizontal occurrence of local features, and maximizes the occurrence to make a stable representation against viewpoint changes. Besides, to handle illumination variations, we apply the Retinex transform and a scale invariant texture operator. To learn a discriminant metric, we propose to learn a discriminant low dimensional subspace by cross-view quadratic discriminant analysis, and simultaneously, a QDA metric is learned on the derived subspace. We also present a practical computation method for XQDA, as well as its regularization. Experiments on four challenging person re-identification databases, VIPeR, QMUL GRID, CUHK Campus, and CUHK03, show that the proposed method improves the state-of-the-art rank-1 identification rates by 2.2%, 4.88%, 28.91%, and 31.55% on the four databases, respectively.

2,209 citations


Cites methods from "Information-theoretic metric learni..."

  • ...Besides robust features, metric learning has been widely applied for person re-identification [43, 4, 11, 5, 49, 18, 14, 24]....


  • ...In practice, many previous metric learning methods [43, 4, 5, 14, 18, 38] show a two-stage processing for metric learning, that is, the Principle Component Analysis (PCA) is first applied for dimension reduction, then metric learning is performed on the PCA subspace....


References
Proceedings Article
05 Dec 2005
TL;DR: In this article, a Mahanalobis distance metric for k-NN classification is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin.
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.

4,433 citations

Book
01 Jan 1950
TL;DR: This book presents the theory of point estimation, covering unbiasedness, equivariance, average risk optimality, minimaxity and admissibility, and asymptotic optimality.
Abstract: Preface to the Second Edition.- Preface to the First Edition.- List of Tables.- List of Figures.- List of Examples.- Table of Notation.- Preparations.- Unbiasedness.- Equivariance.- Average Risk Optimality.- Minimaxity and Admissibility.- Asymptotic Optimality.- References.- Author Index.- Subject Index.

4,382 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: The idea is to learn a function that maps input patterns into a target space such that the L_1 norm in the target space approximates the "semantic" distance in the input space.
Abstract: We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the L_1 norm in the target space approximates the "semantic" distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.

3,870 citations


"Information-theoretic metric learni..." refers methods in this paper

  • ...[1] presented a discriminative method based on pairs of convolutional neural networks....


  • ...methods include neighborhood component analysis (NCA) (Goldberger et al., 2004) that learns a distance metric specifically for nearest-neighbor based classification; convolutional neural net based methods of (Chopra et al., 2005); and a general Riemannian metric learning method (Lebanon, 2006)....


Proceedings Article
01 Jan 2002
TL;DR: This paper presents an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝ^n, learns a distance metric over ℝ^n that respects these relationships.
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many "plausible" ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider "similar." For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝn, learns a distance metric over ℝn that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.

3,176 citations


"Information-theoretic metric learni..." refers background or methods or result in this paper

  • ...Consistent with existing work (Globerson & Roweis, 2005), we found the method of (Xing et al., 2002) to be very slow and inaccurate....

  • ...Earlier work by (Xing et al., 2002) uses a semidefinite programming formulation under similarity and dissimilarity constraints....

  • ...To this end, there have been several recent approaches that attempt to learn distance functions, e.g., (Weinberger et al., 2005; Xing et al., 2002; Globerson & Roweis, 2005; Shalev-Shwartz et al., 2004)....

Book ChapterDOI
01 Jan 1992
TL;DR: In this paper, the authors consider the problem of finding the best unbiased estimator of a linear function of the mean of a set of observed random variables. And they show that for large samples the maximum likelihood estimator approximately minimizes the mean squared error when compared with other reasonable estimators.
Abstract: It has long been customary to measure the adequacy of an estimator by the smallness of its mean squared error. The least squares estimators were studied by Gauss and by other authors later in the nineteenth century. A proof that the best unbiased estimator of a linear function of the means of a set of observed random variables is the least squares estimator was given by Markov [12], a modified version of whose proof is given by David and Neyman [4]. A slightly more general theorem is given by Aitken [1]. Fisher [5] indicated that for large samples the maximum likelihood estimator approximately minimizes the mean squared error when compared with other reasonable estimators. This paper will be concerned with optimum properties or failure of optimum properties of the natural estimator in certain special problems with the risk usually measured by the mean squared error or, in the case of several parameters, by a quadratic function of the estimators. We shall first mention some recent papers on this subject and then give some results, mostly unpublished, in greater detail.

2,651 citations


"Information-theoretic metric learni..." refers background in this paper

  • ...The LogDet divergence is also known as Stein's loss, having originated in the work of (James & Stein, 1961)....