MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching
Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, Alexander C. Berg
University of North Carolina at Chapel Hill / Google Research
xufeng@cs.unc.edu  {leungt,jiayq,sukthankar}@google.com  aberg@cs.unc.edu
Abstract

Motivated by recent successes on learning feature representations and on learning feature comparison functions, we propose a unified approach to combining both for training a patch matching system. Our system, dubbed MatchNet, consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features. To ensure experimental repeatability, we train MatchNet on standard datasets and employ an input sampler to augment the training set with synthetic exemplar pairs that reduce overfitting. Once trained, we achieve better computational efficiency during matching by disassembling MatchNet and separately applying the feature computation and similarity networks in two sequential stages. We perform a comprehensive set of experiments on standard datasets to carefully study the contributions of each aspect of MatchNet, with direct comparisons to established methods. Our results confirm that our unified approach improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors. We make pre-trained MatchNet publicly available.¹
1. Introduction

Patch-based image matching is used extensively in computer vision. Finding accurate correspondences between patches is instrumental in a broad variety of applications including wide-baseline stereo (e.g., [14]), object instance recognition (e.g., [13]), fine-grained classification (e.g., [36]), multi-view reconstruction (e.g., [20]), image stitching (e.g., [4]), and structure from motion (e.g., [17]). Since the advent of the influential SIFT descriptor [13] in 1999, research on patch-based matching has attempted to improve both accuracy and speed. Early efforts focused on identifying better affine region detectors [16], engineering more robust local descriptors [7, 15], and exploring improvements in descriptor matching using alternate distance metrics [8, 9].

¹http://www.cs.unc.edu/~xufeng/matchnet
Early efforts at unsupervised data-driven learning of local descriptors (e.g., [11]) were typically outperformed by modern engineered descriptors such as SURF [1] and ORB [18]. However, the greater availability of labeled training data and increased computational resources have recently reversed this trend, leading to a new generation of learned descriptors [3, 22, 27, 28] and comparison metrics [9]. These approaches typically train a nonlinear model discriminatively using large datasets of patches with known ground-truth matches, and they serve as motivation for our work.

Concurrently, approaches based on deep convolutional neural networks have recently made dramatic progress on a range of difficult computer vision problems, including image classification [12], object detection [6], human pose estimation [26], and action recognition in video [10, 23]. This line of research highlights the benefits of jointly learning a feature representation and a classifier (or distance metric), which to our knowledge has not been adequately explored in patch-based matching.
In this paper, we propose a unified approach that jointly learns a deep network for patch representation as well as a network for robust feature comparison. In our system, dubbed MatchNet, each patch passes through a convolutional network to generate a fixed-dimensional representation reminiscent of SIFT. However, unlike in SIFT, where two descriptors are compared in feature space using the Euclidean distance, in MatchNet the representations are compared using a learned distance metric, implemented as a set of fully connected layers.
Our contributions include: 1) a new state-of-the-art system for patch-based matching using deep convolutional networks that significantly improves on previous results; 2) improved performance over the previous state of the art [22] using smaller descriptors (with fewer bits); 3) a careful set of experiments using standard datasets to study the relative contributions of different parts of the system, showing that MatchNet improves over both hand-crafted and learned descriptors plus comparison functions; and 4) a public release of MatchNet trained using our own large collection of patches.
The remainder of this paper is organized as follows. Section 2 discusses related work, focusing on learned descriptors and metric learning. Section 3 details the network architecture behind MatchNet. Section 4 explains how the joint training and the two-stage evaluation pipeline are performed. Section 5 presents the experimental methodology and results on a suite of standard datasets. We conclude with a summary and ideas for future work.
2. Related work

Much previous work considers improving some components in the detector-descriptor-similarity pipeline for matching patches. Here we address the most related work that considers learning descriptors or similarities, organized by goal and the types of non-linearity used.

Feature learning methods such as [3], [28] and [22] encode non-linearity into the procedure for mapping intensity patches to descriptors. Their goal is to learn descriptors whose similarity with respect to a chosen distance metric matches the ground truth. For [3] and [22], the procedure includes multiple parameterized blocks of gradient computation, spatial pooling, feature normalization and dimension reduction. [28] uses boosting with weak learners consisting of a family of functions parameterized by gradient orientations and spatial location. Each weak learner represents the result of feature normalization, orientation pooling and thresholding in its +1/−1 output. Weighting and combining multiple weak learners builds a highly non-linear mapping from gradients to robust descriptors. Different types of learning algorithms are proposed to find the optimal parameters: Powell minimization, boosting and convex optimization for [3], [28] and [22], respectively. In [3] and [22] the similarity functions are simply the Euclidean distance. [28] uses a Mahalanobis distance and jointly learns the descriptors and the metric. In comparison, our proposed feature extraction uses a deep convolutional network with multiple convolutional and spatial pooling layers plus an optional bottleneck layer to obtain feature vectors, followed by a similarity measure also based on neural nets.
Metric learning methods such as [8] and [9] learn a similarity function between descriptors that approximates a ground-truth notion of which patches should be similar, and achieve results that improve on simple similarity functions, most often the Euclidean distance. Jain et al. [8] introduce non-linearity with a predefined kernel on patches; a Mahalanobis metric is learned on top of that similarity. Jia et al. [9] use a parametric distance based on a heavy-tailed Gamma-Compound-Laplace distribution, which approximates the empirical distribution of elements in the difference of matching SIFT descriptors. The parameters for this distance are estimated using the training data. In comparison, we use a two-layer fully connected neural network to learn the pairwise similarity, which has the potential to embrace more complex similarity functions beyond distance metrics such as the Euclidean distance.
Semantic hashing or embedding learning methods learn non-linear mappings to generate low-dimensional representations whose similarity in some easy-to-compute distance metric (e.g., Hamming distance) correlates with semantic similarity. This can be done using neural networks, e.g., [5] and [19] with a two-tower structure and recently [33], which samples triplets for training. Spectral hashing [34] or boosting [21, 27] can also be used to learn the mapping. This approach can be applied to raw image input [19] as well as to local feature descriptors [25]. In comparison, although we do not map input to an intermediate embedding space explicitly, the representation provided by our feature extraction network naturally serves the purpose, and its dimensionality can be controlled depending on the accuracy vs. storage and computation tradeoff. We explore and analyze such effects in Section 5.
Our network structure is similar to the recent preprint [37] for stereo matching, with the notable difference that we use pooling layers to learn compact representations from patches. Our approach, MatchNet, is designed for general wide-baseline viewpoint-invariant matching, a significantly different problem than the more local matching problem in stereo. As one example, for wide-baseline matching, scale estimation from the keypoint descriptor may not be accurate; the pooling layers increase the robustness of the network to such variation. MatchNet has several other architectural differences compared to [37]: an additional convolutional layer, two fewer fully connected layers, and various differences in filter supports and layer complexity. We evaluate some architectural variations in Section 5.
3. Network architecture

MatchNet is a deep-network architecture (Fig. 1 C) for jointly learning a feature network that maps a patch to a feature representation (Fig. 1 A) and a metric network that maps pairs of features to a similarity (Fig. 1 B). It consists of several types of layers commonly used in deep networks for computer vision. We show details of these layers in Table 1, and discuss some of the high-level architectural choices in this section.

The feature network: The feature network is influenced by AlexNet [12], which achieved good object recognition performance. We use many fewer parameters and do not use Local Response Normalization or Dropout. We use Rectified Linear Units (ReLU) as the non-linearity for the convolution layers.
The metric network: We model the similarity between features using three fully-connected layers with ReLU non-linearity; FC3 also uses Softmax. Input to the network is the concatenation of a pair of features. We output two values in [0, 1] from the two units of FC3. These are non-negative, sum to one, and can be interpreted as the network's estimates of the probabilities that the two patches match and do not match, respectively.
Figure 1. The MatchNet architecture. A: The feature network used for feature encoding, with an optional bottleneck layer to reduce feature dimension. B: The metric network used for feature comparison. C: In training, the feature network is applied as two "towers" on pairs of patches with shared parameters. Output from the two towers is concatenated as the metric network's input. The entire network is jointly trained on labeled patch pairs generated from the sampler to minimize the cross-entropy loss. In prediction, the two sub-networks (A and B) are conveniently used in a two-stage pipeline (see Section 4.2).
Two-tower structure with tied parameters: The patch-based matching task usually assumes that patches go through the same feature encoding before computing a similarity, so we need just one feature network. During training, this can be realized by employing two feature networks (or "towers") that connect to a comparison network, with the constraint that the two towers share the same parameters: updates for either tower are applied to the shared coefficients.

This approach is related to the Siamese network [2, 5], which also uses two towers, but with carefully designed loss functions instead of a learned metric network.
A recent preprint on learning a network for stereo matching has also used the two-tower-plus-fully-connected comparison-network approach [37]. In contrast, MatchNet includes max-pooling layers to deal with scale changes that are not present in stereo reconstruction problems, and it also has more convolutional layers compared to [37].

Table 1. Layer parameters of MatchNet. The output dimension is given by (height × width × depth). PS: patch size for convolution and pooling layers; S: stride. Layer types: C: convolution, MP: max-pooling, FC: fully-connected. We always pad the convolution and pooling layers so the output height and width are those of the input divided by the stride. For FC layers, the sizes B and F are chosen from B ∈ {64, 128, 256, 512} and F ∈ {128, 256, 512, 1024}. All convolution and FC layers use ReLU activation except for FC3, whose output is normalized with Softmax (Equation 2).

Name        Type  Output Dim.    PS     S
Conv0       C     64 × 64 × 24   7 × 7  1
Pool0       MP    32 × 32 × 24   3 × 3  2
Conv1       C     32 × 32 × 64   5 × 5  1
Pool1       MP    16 × 16 × 64   3 × 3  2
Conv2       C     16 × 16 × 96   3 × 3  1
Conv3       C     16 × 16 × 96   3 × 3  1
Conv4       C     16 × 16 × 64   3 × 3  1
Pool4       MP    8 × 8 × 64     3 × 3  2
Bottleneck  FC    B              -      -
FC1         FC    F              -      -
FC2         FC    F              -      -
FC3         FC    2              -      -
In other settings, where similarity is defined over patches from two significantly different domains, the MatchNet framework can be generalized to have two towers that share fewer layers, or towers with different structures.
The bottleneck layer: The bottleneck layer can be used to reduce the dimension of the feature representation and to control overfitting of the network. It is a fully-connected layer of size B between the 4096 (8 × 8 × 64) nodes in the output of Pool4 and the final output of the feature network. We evaluate how B affects matching performance in Section 5 and plot results in Figure 4.
The preprocessing layer: Following a previous convention, for each pixel in the input grayscale patch we normalize its intensity value x (in [0, 255]) to (x − 128)/160.
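To make the architecture concrete, the following is a minimal PyTorch sketch of the two sub-networks as we read Table 1 and this section. The class names, the padding values, and the placement of the preprocessing step are our assumptions for illustration; this is not the authors' released model.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Feature tower following Table 1 (Conv0..Pool4, optional bottleneck)."""
    def __init__(self, bottleneck=None):   # bottleneck B in {64,128,256,512} or None
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 24, 7, stride=1, padding=3), nn.ReLU(),   # Conv0: 64x64x24
            nn.MaxPool2d(3, stride=2, padding=1),                  # Pool0: 32x32x24
            nn.Conv2d(24, 64, 5, stride=1, padding=2), nn.ReLU(),  # Conv1: 32x32x64
            nn.MaxPool2d(3, stride=2, padding=1),                  # Pool1: 16x16x64
            nn.Conv2d(64, 96, 3, stride=1, padding=1), nn.ReLU(),  # Conv2: 16x16x96
            nn.Conv2d(96, 96, 3, stride=1, padding=1), nn.ReLU(),  # Conv3: 16x16x96
            nn.Conv2d(96, 64, 3, stride=1, padding=1), nn.ReLU(),  # Conv4: 16x16x64
            nn.MaxPool2d(3, stride=2, padding=1),                  # Pool4: 8x8x64
        )
        self.bottleneck = nn.Linear(8 * 8 * 64, bottleneck) if bottleneck else None

    def forward(self, x):                   # x: (N, 1, 64, 64), grayscale in [0, 255]
        x = (x - 128.0) / 160.0             # the preprocessing layer
        x = self.features(x).flatten(1)     # -> (N, 4096)
        return self.bottleneck(x) if self.bottleneck else x

class MetricNet(nn.Module):
    """Three FC layers on a concatenated feature pair; FC3 has two softmax units."""
    def __init__(self, in_dim, f=512):      # F in {128, 256, 512, 1024}
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * in_dim, f), nn.ReLU(),   # FC1
            nn.Linear(f, f), nn.ReLU(),            # FC2
            nn.Linear(f, 2),                       # FC3 (Softmax applied in the loss)
        )

    def forward(self, feat_a, feat_b):
        return self.fc(torch.cat([feat_a, feat_b], dim=1))
```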
4. Training and prediction

The feature and metric networks are trained jointly in a supervised setting using the two-tower structure illustrated in Figure 1-C. We minimize the cross-entropy error

    E = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]    (1)

over a training set of n patch pairs using stochastic gradient descent (SGD) with a batch size of 32. Here y_i is the 0/1 label for input pair x_i (1 indicates a match); \hat{y}_i and 1 - \hat{y}_i are the Softmax activations computed on the values of the two nodes in FC3, v_0(x_i) and v_1(x_i), as follows:

    \hat{y}_i = \frac{e^{v_1(x_i)}}{e^{v_0(x_i)} + e^{v_1(x_i)}}    (2)

\hat{y}_i is used as the probability estimate for label 1 in Equation 1.

Figure 2. All 24 of the 7 × 7 filters learned in Conv0 from the Liberty dataset. The pseudo-colors represent intensity.
We experimented with different learning rates and momentum values and found that plain SGD with a learning rate of 0.01 yields better validation accuracy than larger learning rates and/or momentum, even though convergence in the latter settings is faster. Depending on the network architecture, it takes between 18 hours and 1 week to train the full network. Using a learning rate schedule can speed up the training significantly.
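A single training step under this recipe might look like the following sketch, reusing the hypothetical FeatureNet/MetricNet classes from the Section 3 sketch. The two-tower weight sharing falls out of calling the same feature module on both patches, and softmax-plus-cross-entropy over the two FC3 units matches Equations 1-2.

```python
import torch
import torch.nn.functional as F

feature_net = FeatureNet(bottleneck=None)   # one set of weights, used as both towers
metric_net = MetricNet(in_dim=4096, f=512)
params = list(feature_net.parameters()) + list(metric_net.parameters())
opt = torch.optim.SGD(params, lr=0.01)      # plain SGD, no momentum

def train_step(patch_a, patch_b, labels):
    """patch_a, patch_b: (32, 1, 64, 64) tensors; labels: (32,) long tensor, 1 = match."""
    logits = metric_net(feature_net(patch_a), feature_net(patch_b))
    # cross_entropy applies softmax over the two FC3 units, i.e., Equations 1-2
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```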
Figure 2 visualizes the Conv0 filters MatchNet learned on the Liberty dataset. Figure 5 visualizes the network's response to an example patch at different layers in the feature network.
4.1. Sampling in training

Sampling is important in training, as the matching (+) and non-matching (−) pairs are highly unbalanced. We use a sampler to generate an equal number of positives and negatives in each mini-batch so that the network is not overly biased towards negative decisions. The sampler also enforces variety to prevent overfitting to a limited negative set.

Specifically, in our setting the training set has already been grouped into matching patches; e.g., the UBC patch dataset has an average group size around 3. The learner streams through the training set by reading one group at a time. For positive sampling, we randomly pick two patches from the group; for negative sampling, we use a reservoir sampler [32] with a buffer size of R patches. At any time T the buffer maintains R patches as if uniformly sampled from the patch stream up to T, allowing a variety of non-matching pairs to be generated efficiently. The buffer size controls the trade-off between memory and negative variety: in our experiments, R = 128 was too small and led to severe overfitting, while R = 16384 has worked consistently. This procedure is detailed in Algorithm 1.
Algorithm 1 Generate a batch of 2S pairs with a sampler.

for b = 0 ... S − 1 do
    Extract all patches p_1 ... p_k from the next group;
    Randomly choose p_i and p_j, i ≠ j, i, j ∈ {1 ... k};
    Sample(2b) ← (1, p_i, p_j);
    for m = 0 ... k do
        Consider adding p_m to the reservoir;¹
    end for
    repeat at most 1000 times
        Randomly draw p_u and p_v from the reservoir;
    until p_u and p_v are from different groups;²
    if negative sampling is successful then
        Sample(2b + 1) ← (0, p_u, p_v);
    else
        Sample(2b + 1) ← (1, p_i, p_j);
    end if
end for
return Sample;

¹Following [32], if the sampler's reservoir is not full, the candidate is always added; otherwise, for the T-th candidate, with probability R/T it is added and replaces a random element in the reservoir, and with probability 1 − R/T it is rejected. R is the reservoir size.

²We store metadata along with the patches in the buffer so it is efficient to check whether two patches match or not.
For instance, if the batch size is 32, in each training iteration we feed SGD 16 positives and 16 negatives. The positives are obtained by reading the next 16 groups from the database and randomly picking one pair in each group. Since we go through the whole dataset many times, even though we only pick one positive pair from each group in each pass, the network still gets good positive coverage, especially when the average group size is small. The 16 negatives are obtained by sampling pairs of patches from different groups out of the reservoir buffer, which stores previously loaded patches. In the first few iterations the buffer may be empty or contain only matching patches; in that case we simply fill the slot in the batch with the most recent positive pair.
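A compact Python sketch of this sampling scheme, under our reading of Algorithm 1 (the class and function names are ours; groups are assumed to arrive as lists of (patch, group_id) tuples with at least two members):

```python
import random

class ReservoirSampler:
    """Reservoir sampling [32]: keeps R items as if uniformly drawn from a stream."""
    def __init__(self, size=16384):
        self.size, self.buffer, self.seen = size, [], 0

    def add(self, item):
        self.seen += 1
        if len(self.buffer) < self.size:
            self.buffer.append(item)          # reservoir not full: always add
        elif random.random() < self.size / self.seen:
            self.buffer[random.randrange(self.size)] = item   # replace w.p. R/T

def next_batch(group_stream, reservoir, pairs=16):
    """Returns 2*pairs samples of the form (label, patch_a, patch_b)."""
    batch = []
    for _ in range(pairs):
        group = next(group_stream)
        (p_i, _), (p_j, _) = random.sample(group, 2)
        batch.append((1, p_i, p_j))           # positive pair from the same group
        for item in group:
            reservoir.add(item)               # stream the group into the reservoir
        negative = None
        for _ in range(1000):                 # retry cap, as in Algorithm 1
            (p_u, gu) = random.choice(reservoir.buffer)
            (p_v, gv) = random.choice(reservoir.buffer)
            if gu != gv:                      # metadata check: different 3D points
                negative = (0, p_u, p_v)
                break
        # early on the reservoir may hold only one group: reuse the positive pair
        batch.append(negative if negative is not None else (1, p_i, p_j))
    return batch
```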
4.2. A two-stage prediction pipeline

A common scenario for patch-based matching is that there are two sets of patches, each extracted from one of two images, and the goal is to compute an N_1 × N_2 matrix of pairwise matching scores, where N_1 and N_2 are the numbers of patches from each image. Pushing each pair through the full network is not efficient, because the feature tower would run on the same patch multiple times. Instead, we can use the feature tower and the metric network separately and in two stages (Figure 3).

...
...
...
...
Patch set 1
Patch set 2
Feature set 1
Feature set 2
n
1
n
2
Pairwise
matching
scores
Trained feature network
Trained metric network
64
64
B
B
n
1
n
2
n
1
n
2
n
1
x n
2
Feature pairs
2B
Figure 3. MatchNet is disassembled during prediction. The feature
network and the metric network run in a pipeline.
First we generate feature encodings for all patches. Then we pair the features and push them through the metric network to get the scores. In our experiments, on one NVIDIA K40 GPU, after tuning the batch size, the feature network without a bottleneck runs at 3.56K patches/sec and the metric network (B=128, F=512) runs at 416.6K pairs/sec. The computation can be further pipelined and distributed for large-scale applications.
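In code, the two-stage pipeline might look like the following sketch (the function name is ours, and it reuses the hypothetical FeatureNet/MetricNet classes from the Section 3 sketch):

```python
import torch

@torch.no_grad()
def pairwise_scores(patches1, patches2, feature_net, metric_net):
    """Stage 1: encode each patch exactly once. Stage 2: score all n1 x n2 pairs.
    patches1: (n1, 1, 64, 64); patches2: (n2, 1, 64, 64). Returns (n1, n2) scores."""
    f1 = feature_net(patches1)                 # (n1, B)
    f2 = feature_net(patches2)                 # (n2, B)
    n1, n2 = f1.size(0), f2.size(0)
    pairs_a = f1.repeat_interleave(n2, dim=0)  # row i*n2+j holds f1[i]
    pairs_b = f2.repeat(n1, 1)                 # row i*n2+j holds f2[j]
    logits = metric_net(pairs_a, pairs_b)      # (n1*n2, 2)
    match_prob = logits.softmax(dim=1)[:, 1]   # Equation 2
    return match_prob.view(n1, n2)
```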
5. Experiments

Dataset. The UBC patch dataset [30] (UBC) was collected by Winder et al. [35] for learning descriptors. The patches were extracted around real interest points from several internet photo collections published in [24]. The dataset includes three subsets with a total of more than 1.5 million patches. It is suitable for discriminative descriptor or metric learning, and has been used as a standard benchmark by many [3, 9, 22, 27, 28]. The dataset comes with patches extracted using either the Difference of Gaussians (DoG) interest point detector or the multi-scale Harris corner detector. We use the DoG set.

There are three subsets in UBC: Liberty, Notredame and Yosemite. Each comes with pre-generated labeled pair sets of 100k, 200k and 500k pairs, all with 50% matches. Each also provides all unique patches and their corresponding 3D point IDs. The number of unique patches is 450k for Liberty, 468k for Notredame and 634k for Yosemite.
Evaluation protocol. Following the standard protocol established in [3], models are trained on one subset and tested on the other two. Although any of the pre-generated pair sets or the grouped patches in the training subset may be used for training and validation, testing is done on the 100k labeled pairs in the test subset. The commonly used evaluation metric is the false positive rate at 95% recall (Error@95%); the lower, the better.
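For concreteness, a small NumPy sketch of this metric under our interpretation (the function name is ours):

```python
import numpy as np

def error_at_95(scores, labels):
    """False positive rate at 95% recall; lower is better.
    scores: higher means more likely to match; labels: 1 = match, 0 = non-match."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    labels = labels[np.argsort(-scores)]        # sort pairs by decreasing score
    recall = np.cumsum(labels) / labels.sum()   # recall if we cut after each pair
    k = np.searchsorted(recall, 0.95)           # smallest cutoff reaching 95% recall
    false_pos = np.sum(labels[: k + 1] == 0)
    return false_pos / np.sum(labels == 0)
```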
SIFT baselines. We use VLFeat [31]'s vl_sift() with default parameters and custom frame input to extract SIFT descriptors on patches. The frame center is the center of the patch at (32.5, 32.5). The scale is set to 16/3, where 3 is the default magnifying coefficient, so that the bin size for the descriptor is 16; with 4 bins along each side, the descriptor footprint covers the entire patch. In preliminary experiments we found that normalized SIFT (nSIFT), which is raw SIFT scaled so its L2-norm is 1, gives slightly better performance than SIFT, so nSIFT is used for all our baseline experiments.
For a pair of nSIFT descriptors, we compute similarity using the L2 distance, a linear SVM on 128-d element-wise squared-difference features (Squared diff.), and a two-layer fully-connected neural network on the 256-d concatenation of the two descriptors (Concat.). For nSIFT Squared diff. + linear SVM, we use Liblinear [29] to train the SVM and search the regularization parameter C among {10^−4, 10^−3, ..., 10^4} using 10% of the training set for validation. For nSIFT Concat. + NNet, the network has the same structure (with F=512) as the metric network in MatchNet (Figure 1-B) and is trained using plain SGD with learning rate 0.01, batch size 128 and 150k iterations.
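The baseline input features are simple to state in code; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def nsift(sift):
    """nSIFT: raw 128-d SIFT rescaled to unit L2 norm."""
    d = np.asarray(sift, dtype=np.float64)
    return d / (np.linalg.norm(d) + 1e-12)    # epsilon guards degenerate descriptors

def squared_diff(d1, d2):
    """128-d element-wise squared difference, the linear-SVM baseline's input."""
    return (nsift(d1) - nsift(d2)) ** 2

def concat(d1, d2):
    """256-d concatenation, the two-layer NNet baseline's input."""
    return np.concatenate([nsift(d1), nsift(d2)])
```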
MatchNet. We train MatchNet using the techniques described in Section 4 and evaluate the performance under different (F, B) combinations, where F and B are the dimensions of the fully-connected layers (FC1 and FC2) and the bottleneck layer, respectively, with F ∈ {128, 256, 512, 1024} and B ∈ {64, 128, 256, 512}. We also evaluate the feature network without the bottleneck layer.
MatchNet with quantized features. We evaluate the performance of MatchNet with quantized features. The output features of the bottleneck layer in the feature tower (Figure 1-A) are represented as floating point numbers. They are the outputs of ReLU units, so the values are always non-negative. We quantize these feature values in a simplistic way. For a trained network, we compute the maximum value M of the features across all dimensions on a set of random patches in the training set. Then each element v in the feature is quantized as q(v) = min(2^n − 1, ⌊(2^n − 1)v/M⌋), where n is the number of bits we quantize the feature to. When the feature is fed to the metric network, v is restored as q(v)·M/(2^n − 1). We evaluate the performance using different quantization levels.
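A NumPy sketch of this quantization scheme (function names are ours):

```python
import numpy as np

def quantize(v, M, n=6):
    """q(v) = min(2^n - 1, floor((2^n - 1) v / M)); v are non-negative ReLU outputs,
    M is the max feature value observed on random training patches."""
    levels = 2 ** n - 1
    return np.minimum(levels, np.floor(levels * np.asarray(v) / M)).astype(np.uint8)

def dequantize(q, M, n=6):
    """Restore v ~= q(v) * M / (2^n - 1) before feeding the metric network."""
    return q.astype(np.float32) * M / (2 ** n - 1)
```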
The quantized features give us a very compact representation. The ReLU output of the bottleneck layer is not dense; for example, for the (B=64, F=1024) network, the average density over all the UBC data is 67.9%. Using a naive representation with 1 bit to encode whether each value is zero or not, quantizing the features to 6 bits yields a representation of 64 + 6 × 64 × 0.679 = 324.7 bits on average.

Frequently Asked Questions

Q1. What have the authors contributed in "MatchNet: Unifying feature and metric learning for patch-based matching"?

Motivated by recent successes on learning feature representations and on learning feature comparison functions, the authors propose a unified approach to combining both for training a patch matching system. They perform a comprehensive set of experiments on standard datasets to carefully study the contributions of each aspect of MatchNet, with direct comparisons to established methods. Their results confirm that the unified approach improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors.

The authors also evaluate a suite of architectural variations to study the tradeoff between accuracy and storage/computation. Their best model is trained without a bottleneck and learns a high-dimensional patch representation coupled with a discriminatively trained metric. With a bottleneck of 64d, their 64-1024×1024 model achieves a 10.94% average error rate vs. [22]'s 10.75% using features of about the same dimension; with a 512d bottleneck and quantization, MatchNet still outperforms [22]'s PR (<640d) results in 4 out of 6 train-test pairs, with up to 7% improvement in absolute error rate. Without discriminative projection, at around 1500d the error rate is still above 9%, more than twice MatchNet's error rate (3.87%) with the 4096d patch representation.

This work demonstrates that deep convolutional neural networks can be effective for general wide-baseline patch matching, and suggests that deep learning approaches, together with more advanced quantization, can yield even more significant improvements in the accuracy/feature-size trade-off.