Information-theoretic metric learning
Summary (2 min read)
1 Introduction
- The authors propose a new formulation for learning a Mahalanobis distance under constraints.
- The authors model the problem in an information-theoretic setting by leveraging an equivalence between the multivariate Gaussian distribution and the Mahalanobis distance.
- To solve their problem, the authors show an interesting connection to a recently proposed low-rank kernel learning problem [6].
- It was shown that this problem can be optimized using an iterative optimization procedure with cost O(cd²) per iteration, where c is the number of distance constraints and d is the dimensionality of the data.
- In particular, this method does not require costly eigenvalue computations, unlike many other metric learning algorithms [4, 10, 11].
2 Problem Formulation
- Two points are similar if the Mahalanobis distance between them is smaller than a given upper bound, i.e., d_A(x_i, x_j) ≤ u for a relatively small value of u.
- The authors' problem is to learn a matrix A which parameterizes a Mahalanobis distance that satisfies a given set of constraints.
- To quantify this more formally, the authors propose the following information-theoretic framework.
- Given a Mahalanobis distance parameterized by A, the authors express its corresponding multivariate Gaussian as p(x; m, A) = (1/Z) exp(−d_A(x, m)), where Z is a normalizing constant; the distance and the resulting formulation are sketched after this list.
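For concreteness, the distance and the formulation the summary refers to as problem (2) can be written out as below. This is a sketch in the notation of the bullets above; taking the baseline to be the identity matrix (the Euclidean distance mentioned in the Discussion) and adding a lower bound ℓ for dissimilar pairs are assumptions of this sketch, not statements quoted from the summary.

```latex
% Mahalanobis distance parameterized by a positive semidefinite matrix A
d_A(x, y) = (x - y)^\top A \, (x - y)

% Problem (2), sketched with the identity matrix as the baseline distance
\min_{A \succeq 0} \; \mathrm{KL}\!\left( p(x; m, A) \,\|\, p(x; m, I) \right)
\quad \text{s.t.} \quad
d_A(x_i, x_j) \le u \ \ \text{(similar pairs)}, \qquad
d_A(x_i, x_j) \ge \ell \ \ \text{(dissimilar pairs)}
```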
3 Algorithm
- The authors demonstrate how to solve the information-theoretic metric learning problem (2) by proving its equivalence to a low-rank kernel learning problem.
- Using this equivalence, the authors appeal to the algorithm developed in [6] to solve their problem.
3.1 Equivalence to Low-Rank Kernel Learning
- It can be shown that the Burg divergence between two matrices is finite if and only if their range spaces are the same [6].
- The authors now state a surprising equivalence between problems (2) and (3).
- Here, the two mean vectors are the same, so their Mahalanobis distance is zero.
- Thus, the relative entropy KL(p(x; m, A) ‖ p(x; m, I)) is proportional to the Burg matrix divergence from A to I (written out after this list).
- This lemma confirms that if the authors have a feasible kernel matrix K satisfying the constraints of (3), the corresponding Mahalanobis distance parameterized by A satisfies the constraints of (2).
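For reference, the Burg matrix divergence (also known as the LogDet divergence) mentioned above has the following standard form for positive definite matrices, where d is the dimensionality:

```latex
D_{\mathrm{Burg}}(A, B) = \operatorname{tr}\!\left(A B^{-1}\right) - \log\det\!\left(A B^{-1}\right) - d
```

With B = I this reduces to tr(A) − log det(A) − d, which is finite only when A has full rank, consistent with the range-space condition noted above.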
3.2 Metric Learning Algorithm
- Given the connection stated above, the authors can use the methods in [6] to solve (3).
- Since the output of the low-rank kernel learning algorithm is W, and the authors prefer A in its factored form W^T W for most applications, no additional work is required beyond running the low-rank kernel learning algorithm.
- The authors' metric learning algorithm is given as Algorithm 1; each constraint projection costs O(d²) and requires no eigendecomposition.
- Thus, an iteration of the algorithm (i.e., looping through all c constraints) requires O(cd²) time.
- By employing the Sherman-Morrison-Woodbury inverse formula, this projection, which in general Bregman settings has no closed-form solution, can here be computed analytically; a minimal sketch of one such update follows this list.
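The sketch below shows one such rank-one projection for a similarity constraint d_A(x_i, x_j) ≤ u in Python. The function name is illustrative, and the dual-variable bookkeeping that a full cyclic Bregman projection method maintains is omitted; the point is only to show why each projection costs O(d²) and needs no eigendecomposition.

```python
import numpy as np

def project_similarity_constraint(A, x_i, x_j, u):
    """Project A onto the single constraint d_A(x_i, x_j) <= u via a
    rank-one (Sherman-Morrison-style) update. Illustrative sketch only:
    the full algorithm also tracks per-constraint dual corrections."""
    v = x_i - x_j
    p = float(v @ A @ v)               # current distance d_A(x_i, x_j)
    if p <= u:                         # constraint already satisfied
        return A
    beta = (u - p) / (p * p)           # chosen so the updated distance equals u
    Av = A @ v
    return A + beta * np.outer(Av, Av) # A + beta * A v v^T A: O(d^2) work

# Tiny usage example on random data
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=5), rng.normal(size=5)
A = project_similarity_constraint(np.eye(5), x_i, x_j, u=1.0)
```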
4 Discussion
- In this work the authors formulate the Mahalanobis metric learning problem in an information-theoretic setting and provide an explicit connection to low-rank kernel learning.
- The authors now briefly discuss extensions to the basic framework and contrast their approach with other work on metric learning.
- The authors consider finding the Mahalanobis distance closest to the baseline Euclidean distance, as measured by differential relative entropy.
- The authors' approach can be adapted to handle this setting.
- A simple extension to their framework can incorporate slack variables on the distance constraints to handle cases where the constraint set is infeasible; a rough sketch of this relaxation follows.
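As a rough illustration of what such a slack-based relaxation might look like (the trade-off parameter γ and the per-constraint slacks ξ_ij are illustrative and not taken from the summary):

```latex
\min_{A \succeq 0,\ \xi \ge 0} \;
\mathrm{KL}\!\left( p(x; m, A) \,\|\, p(x; m, I) \right)
+ \gamma \sum_{(i,j)} \xi_{ij}
\quad \text{s.t.} \quad
d_A(x_i, x_j) \le u + \xi_{ij} \ \ \text{(similar pairs)}
```

with an analogous relaxed lower bound for dissimilar pairs.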