International Journal of Computer Vision 67(2), 159–188, 2006
© 2006 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
DOI: 10.1007/s11263-005-3964-7
Simultaneous Object Recognition and Segmentation from Single or Multiple
Model Views
VITTORIO FERRARI
Computer Vision Group (BIWI), ETH Zuerich, Switzerland
ferrari@vision.ee.ethz.ch
TINNE TUYTELAARS
ESAT-PSI, University of Leuven, Belgium
Tinne.Tuytelaars@esat.kuleuven.ac.be
LUC VAN GOOL
Computer Vision Group (BIWI), ETH Zuerich, Switzerland; ESAT-PSI, University of Leuven, Belgium
vangool@vision.ee.ethz.ch
Received September 21, 2004; Revised April 4, 2005; Accepted May 3, 2005
First online version published in January, 2006
Abstract. We present a novel Object Recognition approach based on affine invariant regions. It actively counters the problems related to the limited repeatability of the region detectors, and the difficulty of matching, in the presence of large amounts of background clutter and particularly challenging viewing conditions. After producing an initial set of matches, the method gradually explores the surrounding image areas, recursively constructing more and more matching regions, increasingly farther from the initial ones. This process covers the object with matches, and simultaneously separates the correct matches from the wrong ones. Hence, recognition and segmentation are achieved at the same time. The approach includes a mechanism for capturing the relationships between multiple model views, and for exploiting these to integrate the contributions of the views at recognition time. This is based on an efficient algorithm for partitioning a set of region matches into groups lying on smooth surfaces. Integration is achieved by measuring the consistency of configurations of groups arising from different model views. Experimental results demonstrate the power of the approach in dealing with extensive clutter, dominant occlusion, and large scale and viewpoint changes. Non-rigid deformations are explicitly taken into account, and the approximate contours of the object are produced. All presented techniques can extend any viewpoint-invariant feature extractor.
1. Introduction

(This research was supported by EC project VIBES, the Fund for Scientific Research Flanders, and the IST Network of Excellence PASCAL.)

The modern trend in Object Recognition has abandoned model-based approaches (e.g. Bebis et al., 1995), which require a 3D model of the object as input, in favor of appearance-based ones, where some example images suffice. Two kinds of appearance-based methods exist: global and local. Global methods build an object representation by integrating information over an entire image (e.g. Cyr and Kimia, 2001; Murase and Nayar, 1995; Swain and Ballard, 1991), and are therefore very sensitive to background clutter
and partial occlusion. Hence, global methods only con-
sider test images without background, or necessitate a
prior segmentation, a task which has proven extremely
difficult. Additionally, robustness to large viewpoint
changes is hard to achieve, because the global object
appearance varies in a complex and unpredictable way
(the object’s geometry is unknown). Local methods
counter problems due to clutter and occlusion by rep-
resenting images as a collection of features extracted
based on local information only (e.g. Selinger and
Nelson, 1999). After the influential work of Schmid
(1996), who proposed the use of rotation-invariant fea-
tures, there has been substantial progress. Feature ex-
tractors have appeared (Lowe, 2004; Mikolajczyk and
Schmid, 2001) which are invariant also under scale
changes, and more recently recognition under gen-
eral viewpoint changes has become possible, thanks
to extractors adapting the complete affine shape of the
feature to the viewing conditions (Baumberg, 2000;
Matas et al., 2002; Mikolajczyk and Schmid, 2002;
Schaffalitzky and Zisserman, 2002; Tuytelaars et al.,
1999; Tuytelaars and Van-Gool, 2000). These affine
invariant features are particularly significant: even
though the global appearance variation of 3D objects
is very complex under viewpoint changes, it can be
approximated by simple affine transformations on a
local scale, where each feature is approximately planar
(a region). Local invariant features are used in many
recent works, and provide the currently most success-
ful paradigm for Object Recognition (e.g. Lowe, 2004;
Mikolajczyk and Schmid, 2002; Obrdzalek and Matas,
2002; Rothganger et al., 2005; Tuytelaars and Van-
Gool, 2000). In the basic common scheme a number
of features are extracted independently from both a
model and a test image, then characterized by invari-
ant descriptors and finally matched.
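As an aside for concreteness, the basic common scheme just described can be sketched in a few lines of Python. The extractor and descriptor are deliberately left abstract (any of the cited affine invariant extractors fits); the Euclidean metric and the threshold below are illustrative assumptions, not the choice of any particular cited system.

```python
import numpy as np

def match_descriptors(model_desc, test_desc, max_dist=0.4):
    """Nearest-neighbour matching between two sets of invariant
    descriptors (one row per region). Returns (model, test) index pairs."""
    matches = []
    for t_idx, d in enumerate(test_desc):
        dists = np.linalg.norm(model_desc - d, axis=1)  # Euclidean; a sketch
        m_idx = int(np.argmin(dists))
        if dists[m_idx] < max_dist:                     # illustrative threshold
            matches.append((m_idx, t_idx))
    return matches
```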
In spite of their success, the robustness and general-
ity of these approaches are limited by the repeatability
of the feature extraction, and the difficulty of matching
correctly, in the presence of large amounts of clutter
and challenging viewing conditions. Indeed, large scale
or viewpoint changes considerably lower the proba-
bility that any given model feature is re-extracted in
the test image. Simultaneously, occlusion reduces the
number of visible model features. The combined effect
is that only a small fraction of model features has a cor-
respondence in the test image. This fraction represents
the maximal number of features that can be correctly
matched. Unfortunately, at the same time extensive
clutter gives rise to a large number of non-object fea-
tures, which disturb the matching process. As a final
outcome of these combined difficulties, only a few, if
any, correct matches are produced. Because these of-
ten come together with many mismatches, recognition
tends to fail.
Even in easier cases, to suit the needs for repeata-
bility in spite of viewpoint changes, only a sparse set
of distinguished features (Matas et al., 2002) are ex-
tracted. As a result, only a small portion of the object
is typically covered with matches. Densely covering
the visible part of the object is desirable, as it increases
the evidence for its presence, which results in higher
detection power. Moreover, it would allow finding the
contours of the object, rather than just its location.
Simultaneous recognition and segmentation. In the
first part of the paper we tackle these problems with a
new, powerful technique to match a model view to the
test image, which no longer relies solely on matching
viewpoint invariant features. We start by producing an
initial large set of unreliable region correspondences,
so as to maximize the number of correct matches, at
the cost of introducing many mismatches. Addition-
ally, we generate a grid of regions densely covering the
model image. The core of the method then iteratively
alternates between expansion phases and contraction
phases. Each expansion phase tries to construct re-
gions corresponding to the coverage ones, based on the
geometric transformation of nearby existing matches.
Contraction phases try to remove incorrect matches,
using filters that tolerate non-rigid deformations.
This scheme anchors on the initial matches and then
looks around them trying to construct more. As new
matches arise, they are exploited to construct even
more, in a process which gradually explores the test im-
age, recursively constructing more and more matches,
increasingly farther from the initial ones. At each iter-
ation, the presence of the new matches helps the filter
take better removal decisions. In turn, the cleaner
set of matches makes the next expansion more effec-
tive. As a result, the number, percentage and extent
of correct matches grow with every iteration. The two
closely cooperating processes of expansion and con-
traction gather more evidence about the presence of
the object and separate correct matches from wrong
ones at the same time. Hence, they achieve simultane-
ous recognition and segmentation of the object.
By constructing matches for the coverage regions,
the system succeeds in covering even image areas that
are not interesting to the feature extractor, or not
discriminative enough to be correctly matched by tra-
ditional techniques. During the expansion phases, the
shape of each new region is adapted to the local sur-
face orientation, allowing the exploration process to
follow curved surfaces and deformations (e.g. a folded
magazine).
The basic advantage of our approach is that each sin-
gle correct initial match can expand to cover a smooth
surface with many correct matches, even when start-
ing from a large number of mismatches. This leads to
filling the visible portion of the object with matches.
Several direct advantages derive from it. First,
robustness to scale, viewpoint, occlusion and clutter
are greatly enhanced, because most cases where tradi-
tional approaches generate only a few correct matches
are now solvable. Secondly, discriminative power is in-
creased, because decisions about the object’s identity
are based on information densely distributed over the
entire portion of the object visible in the test image.
Thirdly, the approximate boundary of the object in the
test image is suggested by the final set of matches.
Fourthly, non-rigid deformations are explicitly taken
into account.
Integrating multiple model views. When multiple
model views are available, there usually are signifi-
cant overlaps between the object parts seen by different
views. In the second part of the paper, we extend our
method to capture the relationships between the model
views, and to exploit these for integrating the contri-
butions of the views during recognition. The main in-
gredient is the novel concept of a group of aggregated
matches (GAM). A GAM is a set of region matches be-
tween two images, which are distributed over a smooth
surface of the object. A set of matches, including an
arbitrary amount of mismatches, can be partitioned
into GAMs. The more matches there are in a GAM,
the more likely it is that they are correct. Moreover,
the matches in a GAM are most often all correct, or all
incorrect. When evaluating the correctness and inter-
relations of sets of matches, it is convenient to reason at
the higher perceptual grouping level that GAMs offer:
we no longer consider unrelated region matches, but the
collection of GAMs instead. Hence, GAMs become
the atomic unit, with their size carrying precious infor-
mation. Moreover, the computational complexity of a
problem can be reduced, because there are consider-
ably fewer relevant GAMs than region matches.
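As a rough illustration only (the paper's actual partitioning algorithm is presented in the second part), the idea of grouping matches that lie on a smooth surface can be approximated by linking matches whose regions are spatially close and whose local affine transformations agree, then taking connected components. All thresholds and the transformation-distance measure below are assumptions of this sketch.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import connected_components

def partition_into_gams(centers, affines, max_px=50.0, max_affine=0.5):
    """centers: (n,2) region centers in the test image;
    affines: (n,2,2) local affine maps of the n matches.
    Returns a list of index arrays, one per GAM-like group."""
    n = len(centers)
    adj = lil_matrix((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(centers[i] - centers[j]) < max_px
            similar = np.linalg.norm(affines[i] - affines[j]) < max_affine
            if close and similar:       # smoothly varying transformation
                adj[i, j] = 1
    n_groups, labels = connected_components(adj, directed=False)
    return [np.flatnonzero(labels == g) for g in range(n_groups)]
```

The connected-components view also makes the size of each group directly available as a cue, matching the intuition above that larger GAMs are more likely to be correct.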
Concretely, multiple-view integration is achieved as
follows. During modeling, the model views are con-
nected by a number of region-tracks. At recognition
time, each model view is matched to the test image,
and the resulting matches are partitioned into GAMs.
The coherence of a configuration of GAMs, possibly
originating from different model views, is evaluated
using the region tracks that span the model views. We
search for the most consistent configuration, covering
the object as completely as possible, and define a confi-
dence score which strongly increases in the presence of
compatible GAMs. In this fashion, the detection power
improves over the simple approach of considering the
contribution of each model view independently. More-
over, incorrect GAMs are discovered because they do
not belong to the best configuration, thus improving
the segmentation.
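As a purely illustrative sketch (the actual consistency measure and confidence score are defined in Part II), a configuration of GAMs could be scored by rewarding total coverage plus every pair of GAMs found compatible via the model-view tracks; the `compatible` predicate and the bonus term below are placeholders, not the paper's definition.

```python
from itertools import combinations

def configuration_score(gams, compatible, bonus=1.0):
    """gams: list of sets of region matches; compatible: predicate that
    checks two GAMs for consistency via the model-view tracks."""
    score = sum(len(g) for g in gams)           # evidence from each GAM
    for g1, g2 in combinations(gams, 2):
        if compatible(g1, g2):                  # consistent configurations
            score += bonus * min(len(g1), len(g2))
    return score
```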
Paper structure. Sections 2 to 8 cover the first part:
the image-exploration technique to match a model view
to the test image. The integration of multiple model
views is described in the second part, Sections 9 to 12.
A discussion of related work can be found in Section
14, while experimental results are given in Section 13.
Finally, Section 15 closes the paper with conclusions
and possible directions for future research. Preliminary
versions of this work have appeared in Ferrari et al.
(2004a, b).
2. Overview of Part I: Simultaneous Recognition
and Segmentation
Figure 2(a) shows a challenging example, which is
used as case-study throughout the first part of the paper.
There is a large scale change (factor 3.3), out-of-plane
rotation, extensive clutter and partial occlusion. All
these factors make feature extraction and matching
particularly difficult.
A scheme of the approach is illustrated in Fig. 1
(caption: Phases of the image-exploration technique).
We build upon a multi-scale extension of the extrac-
tor of Tuytelaars and Van-Gool (2000). However, the
method works in conjunction with any affine invariant
region extractor (Baumberg, 2000; Matas et al., 2002;
Mikolajczyk and Schmid, 2002). In the first phase (soft
matching), we form a large set of initial region corre-
spondences. The goal is to obtain some correct matches
also in difficult cases, even at the price of including a
large majority of mismatches. Next, a grid of circular
regions covering the model image is generated (coined
coverage regions). The early expansion phase tries to
propagate these coverage regions based on the geomet-
ric transformation of nearby initial matches. By propa-
gating a region, we mean constructing the correspond-
ing one in the test image. The propagated matches and
the initial ones are then passed through a novel local fil-
ter, during the early contraction phase, which removes
some of the mismatches. The processing continues by
alternating faster expansion phases (main expansion),
where coverage regions are propagated over a larger
area, with contraction phases based on a global filter
(main contraction). This filter exploits both topological
arrangements and appearance information, and toler-
ates non-rigid deformations.
The ‘early’ phases differ from the ‘main’ phases in
that they are specialized to deal with the extremely
low percentage of correct matches given by the initial
matcher in particularly difficult cases.
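In code form, the alternation of phases in Fig. 1 can be restated schematically as follows. Every stage function is a named placeholder for the corresponding component described in this paper (soft matching, coverage grid, early/main expansion and contraction), passed in rather than implemented here.

```python
def explore_image(model_img, test_img, phases, n_iters=10):
    """Schematic of Fig. 1. `phases` bundles the stage implementations:
    soft_match, make_coverage_grid, early_expand, early_contract,
    main_expand, main_contract -- all placeholders in this sketch."""
    matches = phases.soft_match(model_img, test_img)        # Section 3
    coverage = phases.make_coverage_grid(model_img)         # coverage regions
    matches = phases.early_expand(matches, coverage, test_img)
    matches = phases.early_contract(matches)                # local filter
    for _ in range(n_iters):                                # main iterations
        matches = phases.main_expand(matches, coverage, test_img)
        matches = phases.main_contract(matches)             # global filter
    return matches  # final matches support recognition and segmentation
```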
3. Soft Matching
The first stage is to compute an initial set of region
matches between a model image I_m and a test image I_t.
The region extraction algorithm (Tuytelaars and
Van-Gool, 2000) is applied to both images independently,
producing two sets of regions, one per image, together
with a vector of invariants describing each region
(Tuytelaars and Van-Gool, 2000). Test regions are
matched to model regions in two steps, explained in the
next two subsections. The matching procedure allows for
soft matches, i.e. more than one model region may be
matched to the same test region, or vice versa.
3.1. Tentative Matches
For each test region T, we first compute the Mahalanobis
distance of its descriptor to those of all model regions M.
Next, the following appearance similarity measure is
computed between T and each of the 10 closest model
regions:

sim(M, T) = NCC(M, T) + (1 - d_RGB(M, T) / 100)    (1)
where NCC is the normalized cross-correlation between
the regions' greylevel patterns, while d_RGB is the
average pixel-wise Euclidean distance in RGB
color-space after independent normalization of the
3 colorbands (necessary to achieve photometric in-
variance). Before computation, the two regions are
aligned by the affine transformation mapping T to
M. This mixed measure is more discriminative than
NCC alone, which is the most common choice in the
literature (Obrdzalek and Matas, 2002; Mikolajczyk
and Schmid, 2002; Tuytelaars and Van-Gool, 2000).
NCC mostly looks at the pattern structure, and dis-
cards valuable color information. A green disc on a
red background, and a bright blue disc on a dark blue
background would be very similar under NCC. d_RGB
captures complementary properties. As it focuses on
color correspondence, it would correctly score low the
previous disc example. However, it would confuse a
green disc on a bright green background with a green
cross on a bright green background, a difference which
NCC would spot. By summing these two measures, we
obtain a more robust one which alleviates their com-
plementary shortcomings.
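A direct transcription of Eq. (1) might look as follows, assuming both regions have already been rectified onto a common pixel grid by the affine alignment mentioned above. Since the exact per-band normalization is not spelled out here, an equal-mean-brightness rescaling is used as an assumption.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two aligned greylevel patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sim(m_grey, t_grey, m_rgb, t_rgb):
    """Eq. (1): NCC plus a colour term. Patches are float arrays with
    RGB values in [0, 255], already affinely aligned."""
    def norm_bands(x):
        # Independent normalization of the 3 colour bands; the exact
        # scheme is an assumption (equal mean brightness per band).
        return x * (128.0 / (x.mean(axis=(0, 1)) + 1e-9))
    d_rgb = np.linalg.norm(norm_bands(m_rgb) - norm_bands(t_rgb), axis=2).mean()
    return ncc(m_grey, t_grey) + (1.0 - d_rgb / 100.0)
```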
Each of the 3 model regions most similar to T, provided
the similarity exceeds a low threshold t_1, is considered
tentatively matched to T. Repeating this operation for all
test regions yields a first set of tentative matches. At
this point, every test region can be matched to none, 1, 2
or 3 model regions.
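Putting the two steps together, a hedged sketch of the tentative matcher: shortlist the 10 closest model regions by Mahalanobis distance on the descriptors, score them with Eq. (1) after alignment, and keep at most the 3 best above t_1. The shared covariance matrix and the `similarity` callback are assumptions of this sketch.

```python
import numpy as np

def tentative_matches(model_desc, test_desc, similarity, cov_inv, t1):
    """similarity(m_idx, t_idx) stands for Eq. (1) after affine alignment;
    cov_inv is the inverse covariance used for the Mahalanobis distance."""
    all_matches = []
    for t_idx in range(len(test_desc)):
        diff = model_desc - test_desc[t_idx]
        maha = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared dists
        shortlist = np.argsort(maha)[:10]                     # 10 closest
        scored = sorted(((similarity(int(m), t_idx), int(m))
                         for m in shortlist), reverse=True)
        for s, m_idx in scored[:3]:                           # up to 3 kept
            if s > t1:
                all_matches.append((m_idx, t_idx, s))
    return all_matches
```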
3.2. Refinement and Re-Thresholding
Since all regions are independently extracted from the
two images, the geometric registration of a correct
match is often not optimal. Two matching regions often
do not cover exactly the same physical surface, which
lowers their similarity. The registration of the tentative
matches is now refined using our algorithm (Ferrari
et al., 2003), which efficiently searches for the affine
transformation that maximizes the similarity. This results
in adjusting the region’s location and shape in one of
the images. Besides raising the similarity of correct
matches, this improves the quality of the forthcoming
expansion stage, where new matches are constructed
based on the affine transformation of the initial ones.
After refinement, the similarity is re-evaluated and
only matches scoring above a second, higher threshold
t_2 are kept. Refinement tends to raise the similarity of
correct matches much more than that of mismatches.
The increased separation between the similarity
distributions makes the second thresholding more
effective. At this point, about 1/3 to 1/2 of the
tentative matches are left.
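In sketch form, the refine-then-rethreshold step reduces to the following, where `refine_alignment` stands in for the refinement algorithm of Ferrari et al. (2003) and is assumed rather than implemented:

```python
def refine_and_rethreshold(matches, refine_alignment, t2):
    """matches: (model_idx, test_idx, sim) triples from tentative matching.
    refine_alignment returns the refined similarity and region shape."""
    kept = []
    for m_idx, t_idx, _ in matches:
        new_sim, new_shape = refine_alignment(m_idx, t_idx)
        if new_sim > t2:                    # second, higher threshold
            kept.append((m_idx, t_idx, new_sim, new_shape))
    return kept  # typically 1/3 to 1/2 of the tentative matches survive
```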

References (as shown on the publisher page; partial list)

Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision.
Mikolajczyk, K. and Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision.
Swain, M.J. and Ballard, D.H. (1991). Color indexing. International Journal of Computer Vision.
Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide-baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference.