Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context

Gautam Singh    Jana Košecká
George Mason University
Fairfax, VA
{gsinghc,kosecka}@cs.gmu.edu
Abstract
This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours, using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing, demonstrating the superiority of the method compared to state-of-the-art approaches.
1. Introduction
The problem of semantic labelling requires simultaneous segmentation of an image into regions and categorization of all the image pixels. The main ingredients of the problem are the choice of elementary regions (pixels, superpixels), the types of features used to characterize them, the methods for computing local label evidence and the means of integrating spatial information. Semantic segmentation has been particularly active in recent years, due to the development of methods for integrating object detection techniques with various contextual cues and top-down information, as well as advances in the inference algorithms used to compute the optimal labelling.
With the increasing complexity and size of the datasets used for the evaluation of semantic segmentation, nonparametric techniques [15, 26] combined with various context-driven retrieval strategies have demonstrated notable improvements in performance. These methods typically start with an oversegmentation of an image into superpixels, followed by the computation of a rich set of features characterizing both appearance and local geometry at the superpixel level. Due to the large number of diverse features, distance learning techniques have been shown to be effective for the retrieval of the closest neighbours.
In the proposed work, we follow a nonparametric approach and make the following contributions: (i) we forgo the use of large superpixels and complex features and tackle the problem of semantic segmentation using local patches characterized by gradient orientation, color and location features. The appeal of this representation is its simplicity and resemblance to the local patch-based methods used in the context of biologically inspired methods; (ii) we adopt an approach for learning the relevance of the individual feature channels (gradient orientation, color and location) used in k-nearest neighbour (k-NN) retrieval; and (iii) we demonstrate a novel approach for obtaining a retrieval set where a coarse semantic labelling is used to retrieve similar views and refine the likelihood estimates. The proposed approach is validated extensively on several semantic segmentation datasets, consistently showing improved performance over state-of-the-art methods.
2. Related Work
In recent years, a large number of approaches for semantic segmentation have been proposed. Due to the complex nature of the problem, the existing approaches differ in the choice of elementary regions, the choice of features to describe them, the methods for modeling spatial relationships, the means of incorporating context and the choice of optimization techniques for solving the optimal labelling problem. The most successful approaches typically use Conditional Random Field (CRF) models [7, 6, 11, 23, 13, 12]. Traditional CRF models [23] combine local appearance information with a smoothness prior that favours the same labelling for neighbouring regions. Researchers in [11] proposed the use of higher-order potentials in a hierarchical framework which allowed the integration of features at different levels (pixels and superpixels). Other works have explored object co-occurrence statistics [7, 12] and combining results from object detectors [13].
With the increasing sizes of datasets and an increasing number of labels, nonparametric approaches have shown notable progress [15, 26, 4, 31]. They are appealing as they can utilize efficient approximate nearest neighbour search techniques, e.g. k-d trees [19], and contextual cues. Context is often captured by a retrieval set of images similar to the query, together with methods for establishing matches between image regions (at the pixel or superpixel level) for labelling the image. Using the method of SIFT Flow, pixel-wise correspondences are established between images for label transfer in [15]. The authors of [26] work at the superpixel level and retrieve similar images using global image features, followed by superpixel-level matching using local features and a Markov random field (MRF) to incorporate neighbourhood context. The work of [26] was extended by [4] by training per-superpixel, per-feature weights and by incorporating superpixel-level semantic context. A set of partially similar images is used in [31] by searching for matches for each region of the query image and then using the retrieval set for label transfer. A nonparametric method which avoids the construction of a retrieval set is [8], which instead addresses the problem of semantic labelling by building a graph of patch correspondences across image sets and transfers annotations to unlabeled images using the established correspondences. However, the degree of the graph vertices is limited due to memory requirements for large datasets like SiftFlow [15].
Our work is closely related to that of [26, 4] in that we also pursue a nonparametric approach, but it differs in the choice of elementary regions, features, feature relevance learning and the method for computing the retrieval set for k-NN classification. In our case, the retrieval set is obtained in a feedback manner using a novel semantic label descriptor computed from an initial semantic segmentation. Similarly to [4], we follow the observation that a single global distance metric is often not sufficient for handling the large variations within a class, and propose to compute weights for the individual feature channels. The weights in our case are computed at test time to indicate the importance of color and gradient orientation vs. location for individual regions. The computation of feature relevance we adopt falls into the broad class of distance metric learning techniques, which have been shown to be beneficial for many problems like image classification [5], object segmentation [17] and image annotation [9]. For a comprehensive survey on distance functions, we refer the reader to [22].
3. Approach
In this section, we will describe our baseline approach, followed by the method of weight computation in Section 4 and semantic contextual retrieval in Section 5.
3.1. Problem Formulation
We formulate semantic segmentation over an image segmented into small superpixels. The output of the semantic segmentation is a labelling $L = (l_1, l_2, \ldots, l_S)$ with hidden variables assigning each superpixel $s_i$ a unique label $l_i \in \{1, 2, \ldots, n_L\}$, where $n_L$ and $S$ are the total numbers of semantic categories and superpixels respectively. The posterior probability of a labelling $L$ given the observed appearance feature vectors $A = [a_1, a_2, \ldots, a_S]$ computed for each superpixel can be expressed as:

$$P(L|A) = \frac{P(A|L)\,P(L)}{P(A)}. \quad (1)$$

We estimate the labelling $L$ as the Maximum A Posteriori (MAP) estimate,

$$\operatorname*{argmax}_L P(L|A) = \operatorname*{argmax}_L P(A|L)\,P(L). \quad (2)$$

The observation likelihood $P(A|L)$ and the joint prior $P(L)$ are described in later subsections.
3.2. Superpixels and features
For an image, we extract superpixels utilizing a segmentation method [29] where superpixel boundaries are obtained as watersheds on a negative absolute Laplacian image with LoG extrema as seeds. These blob-based superpixels are efficient to compute and naturally consistent with image boundaries. Similarly to [18], for each superpixel we compute a 133-dimensional feature vector $a_i$ comprised of a SIFT descriptor (128 dimensions), the color mean over the pixels of the superpixel in Lab color space (3 dimensions) and the location of the superpixel centroid (2 dimensions). The SIFT descriptor for a superpixel is computed at a fixed scale and orientation using publicly available code [27].
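As a concrete illustration, the per-superpixel feature can be assembled as below. This is a minimal sketch; the ordering of the channels inside the 133-dimensional vector is our assumption, chosen to match the order in which the channels are listed above.

```python
import numpy as np

def superpixel_feature(sift_desc, lab_mean, centroid):
    """Assemble the 133-d feature a_i of Section 3.2.

    sift_desc: (128,) SIFT at fixed scale/orientation,
    lab_mean:  (3,)  mean Lab color over the superpixel's pixels,
    centroid:  (2,)  location of the superpixel centroid."""
    return np.concatenate([sift_desc, lab_mean, centroid]).astype(np.float32)
```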
3.3. Appearance Likelihood
In order to compute the appearance likelihood for the entire image, we invoke the Naive Bayes assumption, yielding

$$P(A|L) \approx \prod_{i=1}^{S} P(a_i|l_i). \quad (3)$$

Such an approximation assumes independence between the appearance features of the superpixels given their labels.
The individual label likelihood $P(a_i|l_j)$ for a superpixel $s_i$ is obtained using a k-NN method. Since a superpixel is uniquely represented by its feature vector, we use the symbols $s_i$ and $a_i$ interchangeably. For each class $l_j$ and every superpixel $s_i$ of the query image, we compute a label likelihood score:

$$L(a_i, l_j) = \frac{n(l_j, N_{ik}) \,/\, n(l_j, G)}{n(\bar{l}_j, N_{ik}) \,/\, n(\bar{l}_j, G)} \quad (4)$$
where $\bar{l}_j = L \setminus l_j$ is the set of all labels excluding $l_j$; $N_{ik}$ is a neighbourhood around $a_i$ with exactly $k$ points in it; $n(l_j, N_{ik})$ is the number of superpixels of class $l_j$ inside $N_{ik}$; and $n(l_j, G)$ is the number of superpixels of class $l_j$ in the set $G$ (described later in Section 3.5).
We compute the normalized label likelihood score using the individual label likelihood:

$$P(a_i|l_j) = \frac{L(a_i, l_j)}{\sum_{l_k=1}^{n_L} L(a_i, l_k)} \quad (5)$$

A straightforward way to compute the neighbourhood $N_{ik}$ is to use the concatenated feature $a_i$ (Section 3.2) and retrieve the $k$ nearest points by computing distances to the superpixels in $G$. Such a retrieval can be performed efficiently with approximate nearest neighbour methods like k-d trees [19].
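The likelihood of Eqs. (4)-(5) reduces to counting class labels among the k nearest retrieval-set superpixels, normalized by the class frequencies in G. A minimal sketch follows; the zero-based integer labels and the small eps guards against empty classes are our additions, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def label_likelihoods(a_i, G_feats, G_labels, n_labels, k=9, eps=1e-8):
    """Normalized label likelihoods P(a_i | l_j) for one superpixel (Eqs. 4-5).

    G_feats: (N, 133) features of the retrieval-set superpixels G,
    G_labels: (N,) their integer class labels in {0, ..., n_labels-1}."""
    tree = cKDTree(G_feats)                 # in practice, build once per query image
    _, idx = tree.query(a_i, k=k)           # the neighbourhood N_ik
    neigh = G_labels[idx]
    counts_G = np.bincount(G_labels, minlength=n_labels).astype(float)
    scores = np.zeros(n_labels)
    for j in range(n_labels):
        n_in = float(np.sum(neigh == j))    # n(l_j, N_ik)
        n_out = k - n_in                    # n(l̄_j, N_ik)
        g_in = counts_G[j]                  # n(l_j, G)
        g_out = counts_G.sum() - g_in       # n(l̄_j, G)
        scores[j] = (n_in / (g_in + eps)) / (n_out / (g_out + eps) + eps)
    return scores / (scores.sum() + eps)    # Eq. (5) normalization
```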
3.4. Inference
For the joint prior $P(L)$, we adapt the approach of [18], which uses as its smoothness term $E_{smooth}$ a combination of the Potts model (using a constant penalty $\delta$) and a color-difference-based term. The maximization in Eq. (2) can be rewritten in log-space and the optimal labelling $L^*$ obtained as

$$\operatorname*{argmin}_L \sum_{i=1}^{S} E_{app} + \lambda \sum_{(i,j) \in \mathcal{E}} E_{smooth}, \quad (6)$$

where $E_{app} = -\log P(a_i|l_j)$ from Eq. (5) and the set $\mathcal{E}$ contains all neighbouring superpixel pairs. The scalar $\lambda$ is the weight for the smoothness term. We perform the inference in the MRF, i.e. a search for a MAP assignment, using an efficient and fast publicly available MAX-SUM solver [28].
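The paper solves Eq. (6) with the exact MAX-SUM solver of [28]; purely as an illustration of the energy being minimized, the sketch below runs iterated conditional modes (ICM), a simple stand-in optimizer, over the Potts part of the smoothness term only (the color-difference term is omitted).

```python
import numpy as np

def icm_inference(unary, edges, lam=0.4, delta=1.0, iters=10):
    """Approximate minimizer of Eq. (6) via ICM (stand-in for MAX-SUM [28]).

    unary: (S, nL) array of E_app = -log P(a_i | l_j),
    edges: list of (i, j) index pairs of neighbouring superpixels."""
    S, nL = unary.shape
    labels = unary.argmin(axis=1)              # initialize from the appearance term
    nbrs = [[] for _ in range(S)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(S):
            cost = unary[i].copy()
            for j in nbrs[i]:                  # Potts penalty when labels disagree
                cost += lam * delta * (np.arange(nL) != labels[j])
            labels[i] = cost.argmin()
    return labels
```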
3.5. Retrieval Set
The computation of the appearance likelihood in Section 3.3 uses images from the training set. Instead of using the entire training set in the k-NN method, it is more useful to utilize a subset of images which are similar to the query image. For example, when trying to label a seaside image, it is more helpful to search for the nearest neighbours in images of beaches and discard views of street scenes. We use overall scene appearance to find a relatively small set of training images instead of using the entire training set. This helps discard images which are dissimilar to the query image and provides a scene-level context which can help improve the labelling performance. The retrieval subset serves as the source of image annotations which will be used to label the query image. We compute three global image features for the dataset, namely: (i) GIST [21], (ii) a spatial pyramid [14] of quantized SIFT [16] and (iii) RGB color histograms with 8 bins per color channel. All the images in the training set $T$ are ranked for each individual global image feature in ascending order of the Euclidean distance from the query image. We then add the individual feature ranks and re-rank the images of the training set based on the aggregate rank. Finally, we select a subset of images $T_g$ from the training set $T$ as the retrieval set. The superpixels of the images in set $T_g$ compose the set of training instances $G$ in Eq. (5).
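This rank aggregation is straightforward to implement. Below is a minimal sketch under our own naming assumptions (the dict-of-features layout and the function name are hypothetical):

```python
import numpy as np

def retrieval_set(query_feats, train_feats, subset_size):
    """Build T_g by summing per-feature ranks (Section 3.5).

    query_feats: dict of global feature vectors for the query, keyed e.g. by
    'gist', 'sift_pyramid', 'rgb_hist'; train_feats: same keys, (N, d) arrays."""
    n_train = next(iter(train_feats.values())).shape[0]
    total_rank = np.zeros(n_train)
    for name, q in query_feats.items():
        d = np.linalg.norm(train_feats[name] - q, axis=1)  # Euclidean distances
        total_rank += d.argsort().argsort()                # rank of each image for this feature
    return total_rank.argsort()[:subset_size]              # indices of T_g
```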
This constitutes our baseline approach and is denoted UKNN-MRF in the experiments, for the uniformly weighted k-NN. Its distinguishing characteristics are the use of small patch-like superpixels, simple features and approximate nearest neighbour methods in the context of k-NN classification. In the next two sections, we describe in detail the two contributions of this work: a method for weighting different feature channels and a strategy for improving the retrieval set.
4. Weighted k-NN
The baseline k-NN approach uses the Euclidean distance to compute the neighbourhood around a point. We propose to use a weighted k-NN method to compute the neighbourhood of a query point. To compute a weighted distance between two superpixels $a_i$ and $a_j$, we split the feature vector into the three feature channels of gradient orientation, color and location and first compute distances in the individual feature spaces:

$$d_f^{ij} = [d_c^{ij}, d_s^{ij}, d_l^{ij}] \quad (7)$$

where $d_c^{ij}, d_s^{ij}, d_l^{ij}$ are the Euclidean distances between the color, SIFT and location channels of the feature vectors $a_i$ and $a_j$ of the two superpixels respectively. We now define a weighted distance between the two superpixels as

$$d_w^{ij} = \mathbf{w} \cdot d_f^{ij} \quad (8)$$

where $\mathbf{w} = [w_1, w_2, w_3] \in \mathbb{R}^3$ defines the weights for the individual feature distances. Using the weighted distance from Eq. (8), we can now obtain the neighbourhood $N_{ik}$ around a superpixel by applying it to the feature distance vector $d_f^{ij}$ between $a_i$ and $a_j \in G$ to compute the label likelihood scores in Eq. (4). We now describe an approach to compute these weights.
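A sketch of Eqs. (7)-(8), reusing the channel layout assumed in the Section 3.2 sketch (the slice boundaries are our assumption; Eq. (7) orders the distances color, SIFT, location):

```python
import numpy as np

# Assumed channel slices of the 133-d feature: SIFT, Lab mean, centroid.
SIFT, COLOR, LOC = slice(0, 128), slice(128, 131), slice(131, 133)

def weighted_distance(a_i, a_j, w):
    """Weighted inter-superpixel distance d_w (Eqs. 7-8), w = [w_1, w_2, w_3]."""
    d_f = np.array([np.linalg.norm(a_i[COLOR] - a_j[COLOR]),   # d_c
                    np.linalg.norm(a_i[SIFT] - a_j[SIFT]),     # d_s
                    np.linalg.norm(a_i[LOC] - a_j[LOC])])      # d_l
    return float(np.dot(w, d_f))
```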
Weight computation. With the varying nature of the retrieval set for individual query images, we use the locally adaptive metric approach of [3] for the weight computation. It is a query-based technique which uses a global metric to select neighbours for a test point, which are then used to refine the feature weights. In our setting, the test points are the individual superpixels of the query image.
The goal is to estimate the relevance of a feature channel $i$ by evaluating its ability to predict class posterior probabilities locally at a query point. This is done by computing the expectation of the posterior $P(l_j|\mathbf{x})$ conditioned at a test point $\mathbf{x}_0$ along feature channel $i$. The ability of feature channel $i$ to predict $P(l_j|\mathbf{z})$ at $x_i = z_i$ is defined as

$$r_i(\mathbf{z}) = \sum_{l_j=1}^{n_L} \frac{\left(P(l_j|\mathbf{z}) - \bar{P}(l_j|x_i = z_i)\right)^2}{\bar{P}(l_j|x_i = z_i)} \quad (9)$$
Intuitively, the smaller the difference between $P(l_j|\mathbf{z})$ and $\bar{P}(l_j|x_i = z_i)$, the more information feature channel $i$ provides for predicting the class posterior probabilities locally at $\mathbf{z}$. For the query point $\mathbf{x}_0$, the relevance of feature $i$ can be computed by averaging the $r_i(\mathbf{z})$ values in its neighbourhood:

$$\bar{r}_i(\mathbf{x}_0) = \frac{1}{|N(\mathbf{x}_0)|} \sum_{\mathbf{z} \in N(\mathbf{x}_0)} r_i(\mathbf{z}) \quad (10)$$
where $N(\mathbf{x}_0)$ denotes a neighbourhood centered at $\mathbf{x}_0$ (using the current feature weights) with $K_0$ points in it. The relative relevance can then be computed as

$$w_i(\mathbf{x}_0) = \frac{\exp(c\,R_i(\mathbf{x}_0))}{\sum_{p=1}^{m} \exp(c\,R_p(\mathbf{x}_0))} \quad (11)$$

where $m$ is the number of individual feature channels (three in our case), $c$ is a parameter which determines the influence of $\bar{r}_i$ (at $c = 0$, all three feature channels have equal weights) and $R_i(\mathbf{x}_0) = \max_{p=1}^{m}\{\bar{r}_p(\mathbf{x}_0)\} - \bar{r}_i(\mathbf{x}_0)$. The quantities $P(l_j|\mathbf{z})$ and $\bar{P}(l_j|x_i = z_i)$ in Eq. (9) are estimated by considering neighbourhoods centered at $\mathbf{z}$, as described in detail in [3]. In the experiments, this method, which evaluates the effect of the weight learning on the final classification, is denoted WKNN-MRF, for the weighted k-NN.
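Eqs. (9)-(11) amount to a chi-square-style deviation per channel followed by a softmax over channels. A minimal sketch (the eps guard against zero posteriors is our addition):

```python
import numpy as np

def feature_relevance(P_joint, P_cond, eps=1e-8):
    """r_i(z) of Eq. (9) for one point z and one channel i.

    P_joint: (nL,) estimates of P(l_j | z),
    P_cond:  (nL,) estimates of P̄(l_j | x_i = z_i)."""
    return float(np.sum((P_joint - P_cond) ** 2 / (P_cond + eps)))

def channel_weights(r_bar, c=1.0):
    """Relative relevance w_i of Eq. (11) from averaged relevances r̄ (Eq. 10)."""
    R = r_bar.max() - r_bar        # R_i: smaller deviation -> more relevant channel
    w = np.exp(c * R)
    return w / w.sum()             # softmax over the m channels
```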
5. Semantic Contextual Retrieval
The semantic labelling of an image, even if inaccurate, provides a strong cue about the presence and absence of different categories in the image. While the idea of using context to improve the labelling has been explored in the past for image superpixels [20, 4], here we examine the effectiveness of this idea at the stage of improving the entire retrieval set. In order to do so, we propose a global descriptor derived from the initial labelling of the image which is used to improve the retrieval set.
To summarize the semantic label information of a labeled image, we introduce the semantic label descriptor. This descriptor captures the basic underlying structure of the image and can help divide images into sets of semantically similar images. For example, streets inside a city have high-rise buildings on the side, while highways generally have trees and plants beside the road. Our proposed descriptor encodes the positional information of each category in the image and can be used for semantic contextual retrieval.
Given an image $I$ which has been labelled using the WKNN-MRF method, we consider a spatial pyramid of $n$ levels over the labelled image. At level $i$ in the pyramid, we divide $I$ into a uniform grid of $d \times d$ cells, where $d = 2^{i-1}$. Within each grid cell, we compute the distribution over the $n_L$ classes using the number of individual pixels in that grid cell which have been assigned each class. This results in an $n_L$-bin histogram for a single grid cell. The class distribution values for each cell are normalized so that they sum to one. The histograms for all the grid cells in the spatial pyramid are concatenated, resulting in an image feature $f_{seman}$ of length $n_L \times C$, where $C = \sum_{i=1}^{n} 4^{i-1}$ is the total number of cells in the spatial pyramid.
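A minimal sketch of this descriptor; zero-based integer labels in the label map are our assumption:

```python
import numpy as np

def semantic_label_descriptor(label_map, n_classes, n_levels=2):
    """Spatial-pyramid semantic label descriptor f_seman (Section 5).

    label_map: (H, W) integer class labels of an image; returns a vector of
    length n_classes * sum(4**(i-1) for i = 1..n_levels)."""
    H, W = label_map.shape
    chunks = []
    for level in range(1, n_levels + 1):
        d = 2 ** (level - 1)                       # d x d grid at this level
        for r in range(d):
            for c in range(d):
                cell = label_map[r * H // d:(r + 1) * H // d,
                                 c * W // d:(c + 1) * W // d]
                hist = np.bincount(cell.ravel(), minlength=n_classes).astype(float)
                chunks.append(hist / max(hist.sum(), 1.0))  # per-cell normalization
    return np.concatenate(chunks)
```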
A higher value of $n$ captures the details of the layout more precisely but is more prone to classification errors, while a lower value of $n$ is less sensitive to errors in the labelling but does not encode the spatial positions of the semantic categories as well. This approach of computing a semantic label-based descriptor is similar to [10]. However, our method differs in that we use a spatial pyramid over the labelled image instead of a single grid to encode the semantic label information, and we do not include additional appearance information in the descriptor, because it has already been captured through the other global image features (Section 3.5). Our method also differs from [4], who compute a superpixel-level semantic context descriptor as a normalized label histogram of neighbouring regions.
5.1. Semantic Retrieval Set
Global image features (GIST, color histograms and spatial pyramid over SIFT) were used to build the retrieval set $T_g$ in Section 3.5. We now use the semantic label descriptor $f_{seman}$ introduced above to refine the quality of the retrieval set by exploiting the semantic context.

For each image $I_k$ in the training set, we perform leave-one-out classification on the image using the WKNN-MRF approach. From the resultant semantic image labelling, we generate its corresponding semantic label descriptor $f_{seman}^k$. Similarly, for the query view $I_q$, we label it using the WKNN-MRF method and compute the corresponding semantic label descriptor. We generate a new ranking of the images in the training set $T$ based on the distance between their semantic label descriptors and that of the query image. The ranking is computed in ascending order of the semantic label descriptor distances. We can now use this ranking in isolation or combine it with the rankings for the other global image feature types, as was done in Section 3.5, to obtain the semantic retrieval set $T_s$. Using the new retrieval set $T_s$, we once again perform semantic labelling of the image by the process described in Sections 3.3-3.4. This method is denoted WLKNN-MRF in our experimental results; WLKNN refers to a weighted k-NN using a retrieval set built from the label descriptor only. We also experiment with using the semantic layout descriptor together with the other three global image features to build the retrieval set, and denote this method WAKNN-MRF.
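Continuing the earlier sketches, building $T_s$ for WAKNN-MRF reduces to adding the label descriptor as a fourth ranked feature. All names here (retrieval_set, semantic_label_descriptor and the surrounding variables) are the hypothetical helpers introduced above, not the authors' code:

```python
import numpy as np

# Hypothetical continuation: rank by the semantic descriptor alongside the
# three global features and reuse the rank-sum retrieval of Section 3.5.
query_feats['seman'] = semantic_label_descriptor(initial_label_map, n_classes)
train_feats['seman'] = np.stack([semantic_label_descriptor(m, n_classes)
                                 for m in train_label_maps])
T_s = retrieval_set(query_feats, train_feats, subset_size=75)  # WAKNN-MRF
```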
6. Experiments
For evaluating the performance of our method, we tested and compared it with several state-of-the-art techniques on four different datasets: SiftFlow [15], SUN09 [1], Google Street View [30] and Stanford Background [6]. The evaluation criteria are the per-pixel accuracy (percentage of pixels correctly labelled) and the per-class accuracy (the average of the semantic category accuracies).
For the Stanford Background and Google Street View datasets, we selected 10% of the training images as the size of our retrieval set. For the other two datasets, we used a retrieval set of 75 images. For all our experiments, we set $k = 9$ in Eq. (4) and $\lambda = 0.4$ in Eq. (6). We obtained these parameters by selecting a small subset of the training images as a validation set. Computation of the feature weights required an average of four minutes for a single query image. To help speed up the computation of the weights, we approximate the neighbourhood construction of [3] through k-d trees [19]. For the query view, we index the individual features from the retrieval set in k-d trees, constructing one k-d tree per feature channel. The neighbourhood computation is then approximated using the set union of the k-NN from the different feature channels. We carry out 5 iterations of the weight computation step in Eq. (11), adaptively changing the nearest neighbours in the weighted neighbourhood space. While this approximates the weight computation, it affected our performance only slightly (a maximum decrease of 0.4% in per-pixel accuracy across the three datasets) and reduced the time for weight computation for an image to 20 seconds. For an image, feature computation, k-NN likelihood computation and MRF inference took 1 second, 13 seconds and 0.5 seconds respectively.
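The per-channel k-d tree approximation can be sketched as below, reusing the channel slices assumed earlier; the union of per-channel neighbours is then re-scored with the weighted distance:

```python
import numpy as np
from scipy.spatial import cKDTree

def approx_neighbourhood(a_q, G_feats, k=9):
    """Approximate the weighted neighbourhood (Section 6): one k-d tree per
    feature channel, candidates = union of the per-channel k-NN."""
    cand = set()
    for sl in (SIFT, COLOR, LOC):        # channel slices from the Section 4 sketch
        tree = cKDTree(G_feats[:, sl])   # in practice, build once per retrieval set
        _, idx = tree.query(a_q[sl], k=k)
        cand.update(np.atleast_1d(idx).tolist())
    return np.fromiter(cand, dtype=int)  # re-score these with weighted_distance
```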
When reporting the performance, we used the following variants of our approach:

- UKNN-MRF: uniform weights for the features, with the retrieval set obtained by global image features
- WKNN-MRF: computed weights for the features, with the retrieval set obtained by global image features
- WLKNN-MRF: computed weights, with the retrieval set built using the semantic layout descriptor only
- WAKNN-MRF: computed weights, with the retrieval set built using a union of the semantic layout descriptor and the three other global image features.
SiftFlow. SiftFlow is a large dataset of 2688 images with 33 semantic categories; [15] split the dataset into 2488 training images and 200 test images. Table 1 reports our performance on this dataset. Our weighted k-NN MRF performs at a comparable level on per-pixel accuracy with the top methods, but it still trails [4] on per-class accuracy. When we incorporate semantic context to obtain a refined retrieval set, our system achieves the best performance on both per-pixel and per-class accuracies. The categories which saw an increase of more than 10% after the use of semantic context include field, car, river, plant, sidewalk, bridge, door and crosswalk. These are categories which do not occur very frequently but achieved improved labelling with the context. For example, identifying roads and highways helps label cars, sidewalks and crosswalks.
System                 Per-Pixel   Per-Class
Liu et al. [15]        76.7        -
Tighe et al. [26]      76.9        29.4
Eigen et al. [4]       77.1        32.5
UKNN-MRF               75.6        27.9
WKNN-MRF               77.2        29.3
WLKNN-MRF              78.5        32.0
WAKNN-MRF              79.2        33.8
WKNN-MRF (with HOG)    76.7        27.4

Table 1. Semantic labelling performance (%) on the SiftFlow dataset.
We also experimented with replacing the SIFT feature for the superpixel with a HOG feature [2]. This feature was computed using a 4 × 4 spatial grid of 4-pixel HOG cells with the grid centered at the superpixel's center. The individual HOG cell descriptors were averaged to compute the superpixel feature. The last row in Table 1 contains the performance of this method. Classes which significantly improved with the use of HOG instead of SIFT include tree, mountain and car, while the accuracy dropped for road, sea, grass and sidewalk.
SUN09. The SUN09 dataset [1] has fully labelled per-pixel ground truth for a set of 107 semantic categories. In the experiments, the dataset was split into 4352 training images and 4310 test images. Table 2 reports the performance of our method on this dataset. Using the semantic context helped obtain an improvement of 3.6% compared to the WKNN-MRF method. In comparison to [25], we perform better on per-pixel accuracy but trail on per-class accuracy. We observed that the per-pixel labelling accuracy on outdoor scenes was more than 11% better than on indoor scenes, highlighting the challenge of labelling indoor views.
Google Street View. The Google Street View dataset contains 320 images selected from a set of 10,000 images

Citations

- Towards unified depth and semantic prediction from a single image. TL;DR: This work proposes a unified framework for joint depth and semantic prediction that effectively leverages the advantages of both tasks and provides state-of-the-art results.
- Learning to segment under various forms of weak supervision. TL;DR: This work proposes a unified approach that incorporates various forms of weak supervision (image-level tags, bounding boxes and partial labels) to produce a pixel-wise labeling on the challenging SiftFlow dataset.
- ReSeg: A Recurrent Neural Network-Based Model for Semantic Segmentation. TL;DR: The authors propose a structured prediction architecture which exploits the local generic features extracted by convolutional neural networks and the capacity of recurrent neural networks (RNNs) to retrieve distant dependencies.
- Context Driven Scene Parsing with Attention to Rare Classes. TL;DR: This paper focuses on rare object classes, which play an important role in achieving richer semantic understanding of visual scenes compared to common background classes, and makes two novel contributions: rare class expansion and semantic context description.
- DAG-Recurrent Neural Networks for Scene Labeling. TL;DR: Directed acyclic graph RNNs are proposed to process DAG-structured images, which enables the network to model long-range semantic dependencies among image units, together with a novel class weighting function that attends to rare classes and boosts the recognition accuracy for non-frequent classes.
References

- Distinctive Image Features from Scale-Invariant Keypoints. TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene, and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
- Histograms of Oriented Gradients for Human Detection. TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
- Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.
- Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization, and that modeling a holistic representation of the scene informs about its probable semantic category.