MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching
Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, Alexander C. Berg
University of North Carolina at Chapel Hill / Google Research
xufeng@cs.unc.edu  {leungt,jiayq,sukthankar}@google.com  aberg@cs.unc.edu
Abstract

Motivated by recent successes on learning feature representations and on learning feature comparison functions, we propose a unified approach to combining both for training a patch matching system. Our system, dubbed MatchNet, consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features. To ensure experimental repeatability, we train MatchNet on standard datasets and employ an input sampler to augment the training set with synthetic exemplar pairs that reduce overfitting. Once trained, we achieve better computational efficiency during matching by disassembling MatchNet and separately applying the feature computation and similarity networks in two sequential stages. We perform a comprehensive set of experiments on standard datasets to carefully study the contributions of each aspect of MatchNet, with direct comparisons to established methods. Our results confirm that our unified approach improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors. We make pre-trained MatchNet publicly available.¹
1. Introduction

Patch-based image matching is used extensively in computer vision. Finding accurate correspondences between patches is instrumental in a broad variety of applications including wide-baseline stereo (e.g., [14]), object instance recognition (e.g., [13]), fine-grained classification (e.g., [36]), multi-view reconstruction (e.g., [20]), image stitching (e.g., [4]), and structure from motion (e.g., [17]). Since the advent of the influential SIFT descriptor [13] in 1999, research on patch-based matching has attempted to improve both accuracy and speed. Early efforts focused on identifying better affine region detectors [16], engineering more robust local descriptors [7, 15], and exploring improvements in descriptor matching using alternate distance metrics [8, 9].

¹http://www.cs.unc.edu/~xufeng/matchnet
Early efforts at unsupervised data-driven learning of local descriptors (e.g., [11]) were typically outperformed by modern engineered descriptors such as SURF [1] and ORB [18]. However, the greater availability of labeled training data and increased computational resources have recently reversed this trend, leading to a new generation of learned descriptors [3, 22, 27, 28] and comparison metrics [9]. These approaches typically train a nonlinear model discriminatively using large datasets of patches with known ground-truth matches, and they serve as motivation for our work.

Concurrently, approaches based on deep convolutional neural networks have recently made dramatic progress on a range of difficult computer vision problems, including image classification [12], object detection [6], human pose estimation [26], and action recognition in video [10, 23]. This line of research highlights the benefits of jointly learning a feature representation and a classifier (or distance metric), which to our knowledge has not been adequately explored in patch-based matching.
In this paper, we propose a unified approach that jointly learns a deep network for patch representation as well as a network for robust feature comparison. In our system, dubbed MatchNet, each patch passes through a convolutional network to generate a fixed-dimensional representation reminiscent of SIFT. However, unlike in SIFT, where two descriptors are compared in feature space using the Euclidean distance, in MatchNet the representations are compared using a learned distance metric, implemented as a set of fully connected layers.
Our contributions include: 1) a new state-of-the-art system for patch-based matching using deep convolutional networks that significantly improves on previous results; 2) improved performance over the previous state of the art [22] using smaller descriptors (with fewer bits); 3) a careful set of experiments using standard datasets to study the relative contributions of different parts of the system, showing that MatchNet improves over both hand-crafted and learned descriptors plus comparison functions; and 4) a public release of MatchNet trained using our own large collection of patches.
The remainder of this paper is organized as follows. Section 2 discusses related work, focusing on learned descriptors and metric learning. Section 3 details the network architecture behind MatchNet. Section 4 explains how the joint training and the two-stage evaluation pipeline are performed. Section 5 presents the experimental methodology and results on a suite of standard datasets. We conclude with a summary and ideas for future work.
2. Related work

Much previous work considers improving some components in the detector-descriptor-similarity pipeline for matching patches. Here we address the most related work that considers learning descriptors or similarities, organized by goal and the types of non-linearity used.

Feature learning methods such as [3], [28] and [22] encode non-linearity into the procedure for mapping intensity patches to descriptors. Their goal is to learn descriptors whose similarity with respect to a chosen distance metric matches the ground truth. For [3] and [22], the procedure includes multiple parameterized blocks of gradient computation, spatial pooling, feature normalization and dimension reduction. [28] uses boosting with weak learners consisting of a family of functions parameterized by gradient orientations and spatial location. Each weak learner represents the result of feature normalization, orientation pooling and thresholding in its +1/−1 output. Weighting and combining multiple weak learners builds a highly non-linear mapping from gradients to robust descriptors. Different types of learning algorithms are proposed to find the optimal parameters: Powell minimization, boosting and convex optimization for [3], [28] and [22], respectively. In [3] and [22] the similarity functions are simply the Euclidean distance. [28] uses a Mahalanobis distance and jointly learns the descriptors and the metric. In comparison, our proposed feature extraction uses a deep convolutional network with multiple convolutional and spatial pooling layers plus an optional bottleneck layer to obtain feature vectors, followed by a similarity measure also based on neural nets.
Metric learning methods such as [8] and [9] learn a similarity function between descriptors that approximates a ground-truth notion of which patches should be similar, and achieve results that improve on simple similarity functions, most often the Euclidean distance. Jain et al. [8] introduce non-linearity with a predefined kernel on patches; a Mahalanobis metric is learned on top of that similarity. Jia et al. [9] use a parametric distance based on a heavy-tailed Gamma-Compound-Laplace distribution, which approximates the empirical distribution of elements in the difference of matching SIFT descriptors. The parameters for this distance are estimated using the training data. In comparison, we use a two-layer fully connected neural network to learn the pairwise similarity, which has the potential to embrace more complex similarity functions beyond distance metrics such as the Euclidean distance.
Semantic hashing or embedding learning methods learn non-linear mappings to generate low-dimensional representations whose similarity in some easy-to-compute distance metric (e.g., Hamming distance) correlates with semantic similarity. This can be done using neural networks, e.g., [5] and [19] with a two-tower structure and recently [33], which samples triplets for training. Spectral hashing [34] or boosting [21, 27] can also be used to learn the mapping. This approach can be applied to raw image input [19] as well as to local feature descriptors [25]. In comparison, although we do not map input to an intermediate embedding space explicitly, the representation provided by our feature extraction network naturally serves the purpose, and its dimensionality can be controlled depending on the accuracy vs. storage and computation tradeoff. We explore and analyze such effects in Section 5.
Our network structure is similar to the recent preprint [37] for stereo matching, with the notable difference that we use pooling layers to learn compact representations from patches. Our approach, MatchNet, is designed for general wide-baseline viewpoint-invariant matching, a significantly different problem than the more local matching problem in stereo. As one example, for wide-baseline matching, scale estimation from the keypoint descriptor may not be accurate; the pooling layers increase the robustness of the network to such variation. MatchNet has several other architectural differences compared to [37]: an additional convolutional layer, two fewer fully connected layers, and various differences in filter supports and layer complexity. We evaluate some architectural variations in Section 5.
3. Network architecture

MatchNet is a deep-network architecture (Fig. 1 C) for jointly learning a feature network that maps a patch to a feature representation (Fig. 1 A) and a metric network that maps pairs of features to a similarity (Fig. 1 B). It consists of several types of layers commonly used in deep networks for computer vision. We show details of these layers in Table 1, and discuss some of the high-level architectural choices in this section.

The feature network: The feature network is influenced by AlexNet [12], which achieved good object recognition performance. We use many fewer parameters and do not use Local Response Normalization or Dropout. We use Rectified Linear Units (ReLU) as the non-linearity for the convolution layers.
The metric network: We model the similarity between features using three fully-connected layers with ReLU non-linearity; FC3 also uses Softmax. Input to the network is the concatenation of a pair of features. We output two values in [0, 1] from the two units of FC3. These are non-negative, sum to one, and can be interpreted as the network's estimates of the probabilities that the two patches match and do not match, respectively.
Figure 1. The MatchNet architecture. A: The feature network used for feature encoding, with an optional bottleneck layer to reduce feature dimension. B: The metric network used for feature comparison. C: In training, the feature network is applied as two "towers" on pairs of patches with shared parameters. Output from the two towers is concatenated as the metric network's input. The entire network is jointly trained on labeled patch pairs generated from the sampler to minimize the cross-entropy loss. In prediction, the two sub-networks (A and B) are conveniently used in a two-stage pipeline (see Section 4.2).
Two-tower structure with tied parameters: The patch-based matching task usually assumes that patches go through the same feature encoding before computing a similarity, so we need just one feature network. During training, this can be realized by employing two feature networks (or "towers") that connect to a comparison network, with the constraint that the two towers share the same parameters: updates for either tower are applied to the shared coefficients.

This approach is related to the Siamese network [2, 5], which also uses two towers, but with carefully designed loss functions instead of a learned metric network.
A recent preprint on learning a network for stereo matching has also used the two-tower-plus-fully-connected comparison-network approach [37]. In contrast, MatchNet includes max-pooling layers to deal with scale changes that are not present in stereo reconstruction problems, and it also has more convolutional layers compared to [37].

Table 1. Layer parameters of MatchNet. The output dimension is given by (height × width × depth). PS: patch size for convolution and pooling layers; S: stride. Layer types: C: convolution, MP: max-pooling, FC: fully-connected. We always pad the convolution and pooling layers so the output height and width are those of the input divided by the stride. For FC layers, the sizes B and F are chosen from B ∈ {64, 128, 256, 512} and F ∈ {128, 256, 512, 1024}. All convolution and FC layers use ReLU activation except for FC3, whose output is normalized with Softmax (Equation 2).

Name        Type  Output Dim.    PS     S
Conv0       C     64 × 64 × 24   7 × 7  1
Pool0       MP    32 × 32 × 24   3 × 3  2
Conv1       C     32 × 32 × 64   5 × 5  1
Pool1       MP    16 × 16 × 64   3 × 3  2
Conv2       C     16 × 16 × 96   3 × 3  1
Conv3       C     16 × 16 × 96   3 × 3  1
Conv4       C     16 × 16 × 64   3 × 3  1
Pool4       MP    8 × 8 × 64     3 × 3  2
Bottleneck  FC    B              -      -
FC1         FC    F              -      -
FC2         FC    F              -      -
FC3         FC    2              -      -
In other settings, where similarity is defined over patches from two significantly different domains, the MatchNet framework can be generalized to have two towers that share fewer layers, or towers with different structures.
The bottleneck layer: The bottleneck layer can be used to reduce the dimension of the feature representation and to control overfitting of the network. It is a fully-connected layer of size B between the 4096 (8 × 8 × 64) nodes in the output of Pool4 and the final output of the feature network. We evaluate how B affects matching performance in Section 5 and plot results in Figure 4.
The preprocessing layer: Following a previous convention, for each pixel in the input grayscale patch we normalize its intensity value x (in [0, 255]) to (x − 128)/160.
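To make the architecture concrete, the following is a minimal PyTorch sketch of the two sub-networks as we read Table 1 and this section. The class names, the padding values, and the placement of the preprocessing step are our assumptions for illustration; this is not the authors' released model.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Feature tower following Table 1 (Conv0..Pool4, optional bottleneck)."""
    def __init__(self, bottleneck=None):   # bottleneck B in {64,128,256,512} or None
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 24, 7, stride=1, padding=3), nn.ReLU(),   # Conv0: 64x64x24
            nn.MaxPool2d(3, stride=2, padding=1),                  # Pool0: 32x32x24
            nn.Conv2d(24, 64, 5, stride=1, padding=2), nn.ReLU(),  # Conv1: 32x32x64
            nn.MaxPool2d(3, stride=2, padding=1),                  # Pool1: 16x16x64
            nn.Conv2d(64, 96, 3, stride=1, padding=1), nn.ReLU(),  # Conv2: 16x16x96
            nn.Conv2d(96, 96, 3, stride=1, padding=1), nn.ReLU(),  # Conv3: 16x16x96
            nn.Conv2d(96, 64, 3, stride=1, padding=1), nn.ReLU(),  # Conv4: 16x16x64
            nn.MaxPool2d(3, stride=2, padding=1),                  # Pool4: 8x8x64
        )
        self.bottleneck = nn.Linear(8 * 8 * 64, bottleneck) if bottleneck else None

    def forward(self, x):                   # x: (N, 1, 64, 64), grayscale in [0, 255]
        x = (x - 128.0) / 160.0             # the preprocessing layer
        x = self.features(x).flatten(1)     # -> (N, 4096)
        return self.bottleneck(x) if self.bottleneck else x

class MetricNet(nn.Module):
    """Three FC layers on a concatenated feature pair; FC3 has two softmax units."""
    def __init__(self, in_dim, f=512):      # F in {128, 256, 512, 1024}
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * in_dim, f), nn.ReLU(),   # FC1
            nn.Linear(f, f), nn.ReLU(),            # FC2
            nn.Linear(f, 2),                       # FC3 (Softmax applied in the loss)
        )

    def forward(self, feat_a, feat_b):
        return self.fc(torch.cat([feat_a, feat_b], dim=1))
```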
4. Training and prediction

The feature and metric networks are trained jointly in a supervised setting using the two-tower structure illustrated in Figure 1-C. We minimize the cross-entropy error

    E = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]    (1)

over a training set of n patch pairs using stochastic gradient descent (SGD) with a batch size of 32. Here y_i is the 0/1 label for input pair x_i (1 indicates a match); \hat{y}_i and 1 - \hat{y}_i are the Softmax activations computed on the values of the two nodes in FC3, v_0(x_i) and v_1(x_i), as follows:

    \hat{y}_i = \frac{e^{v_1(x_i)}}{e^{v_0(x_i)} + e^{v_1(x_i)}}    (2)

\hat{y}_i is used as the probability estimate for label 1 in Equation 1.

Figure 2. All 24 of the 7 × 7 filters learned in Conv0 from the Liberty dataset. The pseudo-colors represent intensity.
We experimented with different learning rates and momentum values and found that plain SGD with a learning rate of 0.01 yields better validation accuracy than larger learning rates and/or momentum, even though convergence in the latter settings is faster. Depending on the network architecture, it takes between 18 hours and 1 week to train the full network. Using a learning rate schedule can speed up the training significantly.
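A single training step under this recipe might look like the following sketch, reusing the hypothetical FeatureNet/MetricNet classes from the Section 3 sketch. The two-tower weight sharing falls out of calling the same feature module on both patches, and softmax-plus-cross-entropy over the two FC3 units matches Equations 1-2.

```python
import torch
import torch.nn.functional as F

feature_net = FeatureNet(bottleneck=None)   # one set of weights, used as both towers
metric_net = MetricNet(in_dim=4096, f=512)
params = list(feature_net.parameters()) + list(metric_net.parameters())
opt = torch.optim.SGD(params, lr=0.01)      # plain SGD, no momentum

def train_step(patch_a, patch_b, labels):
    """patch_a, patch_b: (32, 1, 64, 64) tensors; labels: (32,) long tensor, 1 = match."""
    logits = metric_net(feature_net(patch_a), feature_net(patch_b))
    # cross_entropy applies softmax over the two FC3 units, i.e., Equations 1-2
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```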
Figure 2 visualizes the Conv0 filters MatchNet learned on the Liberty dataset. Figure 5 visualizes the network's response to an example patch at different layers in the feature network.
4.1. Sampling in training

Sampling is important in training, as the matching (+) and non-matching (−) pairs are highly unbalanced. We use a sampler to generate an equal number of positives and negatives in each mini-batch so that the network is not overly biased towards negative decisions. The sampler also enforces variety to prevent overfitting to a limited negative set.

Specifically, in our setting the training set has already been grouped into matching patches; e.g., the UBC patch dataset has an average group size around 3. The learner streams through the training set by reading one group at a time. For positive sampling, we randomly pick two patches from the group; for negative sampling, we use a reservoir sampler [32] with a buffer size of R patches. At any time T the buffer maintains R patches as if uniformly sampled from the patch stream up to T, allowing a variety of non-matching pairs to be generated efficiently. The buffer size controls the trade-off between memory and negative variety: in our experiments, R = 128 was too small and led to severe overfitting, while R = 16384 has worked consistently. This procedure is detailed in Algorithm 1.
Algorithm 1 Generate a batch of 2S pairs with a sampler.

for b = 0 ... S − 1 do
    Extract all patches p_1 ... p_k from the next group;
    Randomly choose p_i and p_j, i ≠ j, i, j ∈ {1 ... k};
    Sample(2b) ← (1, p_i, p_j);
    for m = 0 ... k do
        Consider adding p_m to the reservoir;¹
    end for
    repeat at most 1000 times
        Randomly draw p_u and p_v from the reservoir;
    until p_u and p_v are from different groups;²
    if negative sampling is successful then
        Sample(2b + 1) ← (0, p_u, p_v);
    else
        Sample(2b + 1) ← (1, p_i, p_j);
    end if
end for
return Sample;

¹Following [32], if the sampler's reservoir is not full, the candidate is always added; otherwise, for the T-th candidate, with probability R/T it is added and replaces a random element in the reservoir, and with probability 1 − R/T it is rejected. R is the reservoir size.

²We store metadata along with the patches in the buffer so it is efficient to check whether two patches match or not.
For instance, if the batch size is 32, in each training iteration we feed SGD 16 positives and 16 negatives. The positives are obtained by reading the next 16 groups from the database and randomly picking one pair in each group. Since we go through the whole dataset many times, even though we only pick one positive pair from each group in each pass, the network still gets good positive coverage, especially when the average group size is small. The 16 negatives are obtained by sampling pairs of patches from different groups out of the reservoir buffer, which stores previously loaded patches. In the first few iterations the buffer may be empty or contain only matching patches; in that case we simply fill the slot in the batch with the most recent positive pair.
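A compact Python sketch of this sampling scheme, under our reading of Algorithm 1 (the class and function names are ours; groups are assumed to arrive as lists of (patch, group_id) tuples with at least two members):

```python
import random

class ReservoirSampler:
    """Reservoir sampling [32]: keeps R items as if uniformly drawn from a stream."""
    def __init__(self, size=16384):
        self.size, self.buffer, self.seen = size, [], 0

    def add(self, item):
        self.seen += 1
        if len(self.buffer) < self.size:
            self.buffer.append(item)          # reservoir not full: always add
        elif random.random() < self.size / self.seen:
            self.buffer[random.randrange(self.size)] = item   # replace w.p. R/T

def next_batch(group_stream, reservoir, pairs=16):
    """Returns 2*pairs samples of the form (label, patch_a, patch_b)."""
    batch = []
    for _ in range(pairs):
        group = next(group_stream)
        (p_i, _), (p_j, _) = random.sample(group, 2)
        batch.append((1, p_i, p_j))           # positive pair from the same group
        for item in group:
            reservoir.add(item)               # stream the group into the reservoir
        negative = None
        for _ in range(1000):                 # retry cap, as in Algorithm 1
            (p_u, gu) = random.choice(reservoir.buffer)
            (p_v, gv) = random.choice(reservoir.buffer)
            if gu != gv:                      # metadata check: different 3D points
                negative = (0, p_u, p_v)
                break
        # early on the reservoir may hold only one group: reuse the positive pair
        batch.append(negative if negative is not None else (1, p_i, p_j))
    return batch
```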
4.2. A two-stage prediction pipeline

A common scenario for patch-based matching is that there are two sets of patches, each extracted from one of two images, and the goal is to compute an N_1 × N_2 matrix of pairwise matching scores, where N_1 and N_2 are the numbers of patches from each image. Pushing each pair through the full network is not efficient, because the feature tower would run on the same patch multiple times. Instead, we can use the feature tower and the metric network separately and in two stages (Figure 3).

...
...
...
...
Patch set 1
Patch set 2
Feature set 1
Feature set 2
n
1
n
2
Pairwise
matching
scores
Trained feature network
Trained metric network
64
64
B
B
n
1
n
2
n
1
n
2
n
1
x n
2
Feature pairs
2B
Figure 3. MatchNet is disassembled during prediction. The feature
network and the metric network run in a pipeline.
First we generate feature encodings for all patches. Then we pair the features and push them through the metric network to get the scores. In our experiments, on one NVIDIA K40 GPU, after tuning the batch size, the feature network without a bottleneck runs at 3.56K patches/sec and the metric network (B=128, F=512) runs at 416.6K pairs/sec. The computation can be further pipelined and distributed for large-scale applications.
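In code, the two-stage pipeline might look like the following sketch (the function name is ours, and it reuses the hypothetical FeatureNet/MetricNet classes from the Section 3 sketch):

```python
import torch

@torch.no_grad()
def pairwise_scores(patches1, patches2, feature_net, metric_net):
    """Stage 1: encode each patch exactly once. Stage 2: score all n1 x n2 pairs.
    patches1: (n1, 1, 64, 64); patches2: (n2, 1, 64, 64). Returns (n1, n2) scores."""
    f1 = feature_net(patches1)                 # (n1, B)
    f2 = feature_net(patches2)                 # (n2, B)
    n1, n2 = f1.size(0), f2.size(0)
    pairs_a = f1.repeat_interleave(n2, dim=0)  # row i*n2+j holds f1[i]
    pairs_b = f2.repeat(n1, 1)                 # row i*n2+j holds f2[j]
    logits = metric_net(pairs_a, pairs_b)      # (n1*n2, 2)
    match_prob = logits.softmax(dim=1)[:, 1]   # Equation 2
    return match_prob.view(n1, n2)
```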
5. Experiments

Dataset. The UBC patch dataset [30] (UBC) was collected by Winder et al. [35] for learning descriptors. The patches were extracted around real interest points from several internet photo collections published in [24]. The dataset includes three subsets with a total of more than 1.5 million patches. It is suitable for discriminative descriptor or metric learning, and has been used as a standard benchmark by many [3, 9, 22, 27, 28]. The dataset comes with patches extracted using either the Difference of Gaussians (DoG) interest point detector or the multi-scale Harris corner detector. We use the DoG set.

There are three subsets in UBC: Liberty, Notredame and Yosemite. Each comes with pre-generated labeled pair sets of 100k, 200k and 500k pairs, all with 50% matches. Each also provides all unique patches and their corresponding 3D point IDs. The number of unique patches is 450k for Liberty, 468k for Notredame and 634k for Yosemite.
Evaluation protocol. Following the standard protocol established in [3], models are trained on one subset and tested on the other two. Although any of the pre-generated pair sets or the grouped patches in the training subset may be used for training and validation, testing is done on the 100k labeled pairs in the test subset. The commonly used evaluation metric is the false positive rate at 95% recall (Error@95%); the lower, the better.
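For concreteness, a small NumPy sketch of this metric under our interpretation (the function name is ours):

```python
import numpy as np

def error_at_95(scores, labels):
    """False positive rate at 95% recall; lower is better.
    scores: higher means more likely to match; labels: 1 = match, 0 = non-match."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    labels = labels[np.argsort(-scores)]        # sort pairs by decreasing score
    recall = np.cumsum(labels) / labels.sum()   # recall if we cut after each pair
    k = np.searchsorted(recall, 0.95)           # smallest cutoff reaching 95% recall
    false_pos = np.sum(labels[: k + 1] == 0)
    return false_pos / np.sum(labels == 0)
```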
SIFT baselines. We use VLFeat [31]'s vl_sift() with default parameters and custom frame input to extract SIFT descriptors on patches. The frame center is the center of the patch at (32.5, 32.5). The scale is set to 16/3, where 3 is the default magnifying coefficient, so that the bin size for the descriptor is 16; with 4 bins along each side, the descriptor footprint covers the entire patch. In preliminary experiments we found that normalized SIFT (nSIFT), which is raw SIFT scaled so its L2-norm is 1, gives slightly better performance than SIFT, so nSIFT is used for all our baseline experiments.
For a pair of nSIFT descriptors, we compute similarity using the L2 distance, a linear SVM on 128-d element-wise squared-difference features (Squared diff.), and a two-layer fully-connected neural network on the 256-d concatenation of the two descriptors (Concat.). For nSIFT Squared diff. + linear SVM, we use Liblinear [29] to train the SVM and search the regularization parameter C among {10^−4, 10^−3, ..., 10^4} using 10% of the training set for validation. For nSIFT Concat. + NNet, the network has the same structure (with F=512) as the metric network in MatchNet (Figure 1-B) and is trained using plain SGD with learning rate 0.01, batch size 128 and 150k iterations.
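The baseline input features are simple to state in code; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def nsift(sift):
    """nSIFT: raw 128-d SIFT rescaled to unit L2 norm."""
    d = np.asarray(sift, dtype=np.float64)
    return d / (np.linalg.norm(d) + 1e-12)    # epsilon guards degenerate descriptors

def squared_diff(d1, d2):
    """128-d element-wise squared difference, the linear-SVM baseline's input."""
    return (nsift(d1) - nsift(d2)) ** 2

def concat(d1, d2):
    """256-d concatenation, the two-layer NNet baseline's input."""
    return np.concatenate([nsift(d1), nsift(d2)])
```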
MatchNet. We train MatchNet using the techniques described in Section 4 and evaluate the performance under different (F, B) combinations, where F and B are the dimensions of the fully-connected layers (FC1 and FC2) and the bottleneck layer, respectively, with F ∈ {128, 256, 512, 1024} and B ∈ {64, 128, 256, 512}. We also evaluate the feature network without the bottleneck layer.
MatchNet with quantized features. We evaluate the performance of MatchNet with quantized features. The output features of the bottleneck layer in the feature tower (Figure 1-A) are represented as floating point numbers. They are the outputs of ReLU units, so the values are always non-negative. We quantize these feature values in a simplistic way. For a trained network, we compute the maximum value M of the features across all dimensions on a set of random patches in the training set. Then each element v in the feature is quantized as q(v) = min(2^n − 1, ⌊(2^n − 1)v/M⌋), where n is the number of bits we quantize the feature to. When the feature is fed to the metric network, v is restored as q(v)·M/(2^n − 1). We evaluate the performance using different quantization levels.
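A NumPy sketch of this quantization scheme (function names are ours):

```python
import numpy as np

def quantize(v, M, n=6):
    """q(v) = min(2^n - 1, floor((2^n - 1) v / M)); v are non-negative ReLU outputs,
    M is the max feature value observed on random training patches."""
    levels = 2 ** n - 1
    return np.minimum(levels, np.floor(levels * np.asarray(v) / M)).astype(np.uint8)

def dequantize(q, M, n=6):
    """Restore v ~= q(v) * M / (2^n - 1) before feeding the metric network."""
    return q.astype(np.float32) * M / (2 ** n - 1)
```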
The quantized features give us a very compact representation. The ReLU output of the bottleneck layer is not dense; for example, for the (B=64, F=1024) network, the average density over all the UBC data is 67.9%. Using a naive representation with 1 bit to encode whether each value is zero or not, quantizing the features to 6 bits yields a representation of 64 + 6 × 64 × 0.679 = 324.7 bits on average.

Frequently Asked Questions

Q1. What have the authors contributed in "MatchNet: Unifying feature and metric learning for patch-based matching"?

Motivated by recent successes on learning feature representations and on learning feature comparison functions, the authors propose a unified approach to combining both for training a patch matching system. They perform a comprehensive set of experiments on standard datasets to carefully study the contributions of each aspect of MatchNet, with direct comparisons to established methods. Their results confirm that the unified approach improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors.

The authors also evaluate a suite of architectural variations to study the tradeoff between accuracy and storage/computation. Their best model is trained without a bottleneck and learns a high-dimensional patch representation coupled with a discriminatively trained metric. With a bottleneck of 64d, their 64-1024×1024 model achieves a 10.94% average error rate vs. [22]'s 10.75% using features of about the same dimension; with a 512d bottleneck and quantization, MatchNet still outperforms [22]'s PR (<640d) results in 4 out of 6 train-test pairs, with up to 7% improvement in absolute error rate. Without discriminative projection, at around 1500d the error rate is still above 9%, more than twice MatchNet's error rate (3.87%) with the 4096d patch representation.

This work demonstrates that deep convolutional neural networks can be effective for general wide-baseline patch matching, and suggests that deep learning approaches, together with more advanced quantization, can yield even more significant improvements in the accuracy/feature-size trade-off.