International Journal of Computer Vision 67(2), 159–188, 2006
© 2006 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
DOI: 10.1007/s11263-005-3964-7
Simultaneous Object Recognition and Segmentation from Single or Multiple
Model Views
VITTORIO FERRARI
Computer Vision Group (BIWI), ETH Zuerich, Switzerland
ferrari@vision.ee.ethz.ch
TINNE TUYTELAARS
ESAT-PSI, University of Leuven, Belgium
Tinne.Tuytelaars@esat.kuleuven.ac.be
LUC VAN GOOL
Computer Vision Group (BIWI), ETH Zuerich, Switzerland; ESAT-PSI, University of Leuven, Belgium
vangool@vision.ee.ethz.ch
Received September 21, 2004; Revised April 4, 2005; Accepted May 3, 2005
First online version published in January, 2006
Abstract. We present a novel Object Recognition approach based on affine invariant regions. It actively counters the problems related to the limited repeatability of the region detectors, and the difficulty of matching, in the presence of large amounts of background clutter and particularly challenging viewing conditions. After producing an initial set of matches, the method gradually explores the surrounding image areas, recursively constructing more and more matching regions, increasingly farther from the initial ones. This process covers the object with matches, and simultaneously separates the correct matches from the wrong ones. Hence, recognition and segmentation are achieved at the same time. The approach includes a mechanism for capturing the relationships between multiple model views, and for exploiting these to integrate the contributions of the views at recognition time. This is based on an efficient algorithm for partitioning a set of region matches into groups lying on smooth surfaces. Integration is achieved by measuring the consistency of configurations of groups arising from different model views. Experimental results demonstrate the power of the approach in dealing with extensive clutter, dominant occlusion, and large scale and viewpoint changes. Non-rigid deformations are explicitly taken into account, and the approximate contours of the object are produced. All presented techniques can extend any viewpoint-invariant feature extractor.
1. Introduction

(This research was supported by EC project VIBES, the Fund for Scientific Research Flanders, and the IST Network of Excellence PASCAL.)

The modern trend in Object Recognition has abandoned model-based approaches (e.g. Bebis et al., 1995), which require a 3D model of the object as input, in favor of appearance-based ones, where some example images suffice. Two kinds of appearance-based methods exist: global and local. Global methods build an object representation by integrating information over an entire image (e.g. Cyr and Kimia, 2001; Murase and Nayar, 1995; Swain and Ballard, 1991), and are therefore very sensitive to background clutter
and partial occlusion. Hence, global methods only con-
sider test images without background, or necessitate a
prior segmentation, a task which has proven extremely
difficult. Additionally, robustness to large viewpoint
changes is hard to achieve, because the global object
appearance varies in a complex and unpredictable way
(the object’s geometry is unknown). Local methods
counter problems due to clutter and occlusion by rep-
resenting images as a collection of features extracted
based on local information only (e.g. Selinger and
Nelson, 1999). After the influential work of Schmid
(1996), who proposed the use of rotation-invariant fea-
tures, there has been substantial progress. Feature ex-
tractors have appeared (Lowe, 2004; Mikolajczyk and
Schmid, 2001) which are invariant also under scale
changes, and more recently recognition under gen-
eral viewpoint changes has become possible, thanks
to extractors adapting the complete affine shape of the
feature to the viewing conditions (Baumberg, 2000;
Matas et al., 2002; Mikolajczyk and Schmid, 2002;
Schaffalitzky and Zisserman, 2002; Tuytelaars et al.,
1999; Tuytelaars and Van-Gool, 2000). These affine
invariant features are particularly significant: even
though the global appearance variation of 3D objects
is very complex under viewpoint changes, it can be
approximated by simple affine transformations on a
local scale, where each feature is approximately planar
(a region). Local invariant features are used in many
recent works, and provide the currently most success-
ful paradigm for Object Recognition (e.g. Lowe, 2004;
Mikolajczyk and Schmid, 2002; Obrdzalek and Matas,
2002; Rothganger et al., 2005; Tuytelaars and Van-
Gool, 2000). In the basic common scheme a number
of features are extracted independently from both a
model and a test image, then characterized by invari-
ant descriptors and finally matched.
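As an aside for concreteness, the basic common scheme just described can be sketched in a few lines of Python. The extractor and descriptor are deliberately left abstract (any of the cited affine invariant extractors fits); the Euclidean metric and the threshold below are illustrative assumptions, not the choice of any particular cited system.

```python
import numpy as np

def match_descriptors(model_desc, test_desc, max_dist=0.4):
    """Nearest-neighbour matching between two sets of invariant
    descriptors (one row per region). Returns (model, test) index pairs."""
    matches = []
    for t_idx, d in enumerate(test_desc):
        dists = np.linalg.norm(model_desc - d, axis=1)  # Euclidean; a sketch
        m_idx = int(np.argmin(dists))
        if dists[m_idx] < max_dist:                     # illustrative threshold
            matches.append((m_idx, t_idx))
    return matches
```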
In spite of their success, the robustness and general-
ity of these approaches are limited by the repeatability
of the feature extraction, and the difficulty of matching
correctly, in the presence of large amounts of clutter
and challenging viewing conditions. Indeed, large scale
or viewpoint changes considerably lower the proba-
bility that any given model feature is re-extracted in
the test image. Simultaneously, occlusion reduces the
number of visible model features. The combined effect
is that only a small fraction of model features has a cor-
respondence in the test image. This fraction represents
the maximal number of features that can be correctly
matched. Unfortunately, at the same time extensive
clutter gives rise to a large number of non-object fea-
tures, which disturb the matching process. As a final
outcome of these combined difficulties, only a few, if
any, correct matches are produced. Because these of-
ten come together with many mismatches, recognition
tends to fail.
Even in easier cases, to suit the needs for repeata-
bility in spite of viewpoint changes, only a sparse set
of distinguished features (Matas et al., 2002) are ex-
tracted. As a result, only a small portion of the object
is typically covered with matches. Densely covering
the visible part of the object is desirable, as it increases
the evidence for its presence, which results in higher
detection power. Moreover, it would allow finding the
contours of the object, rather than just its location.
Simultaneous recognition and segmentation. In the
first part of the paper we tackle these problems with a
new, powerful technique to match a model view to the
test image, which no longer relies solely on matching
viewpoint invariant features. We start by producing an
initial large set of unreliable region correspondences,
so as to maximize the number of correct matches, at
the cost of introducing many mismatches. Addition-
ally, we generate a grid of regions densely covering the
model image. The core of the method then iteratively
alternates between expansion phases and contraction
phases. Each expansion phase tries to construct re-
gions corresponding to the coverage ones, based on the
geometric transformation of nearby existing matches.
Contraction phases try to remove incorrect matches,
using filters that tolerate non-rigid deformations.
This scheme anchors on the initial matches and then
looks around them trying to construct more. As new
matches arise, they are exploited to construct even
more, in a process which gradually explores the test im-
age, recursively constructing more and more matches,
increasingly farther from the initial ones. At each iter-
ation, the presence of the new matches helps the filter
take better removal decisions. In turn, the cleaner
set of matches makes the next expansion more effec-
tive. As a result, the number, percentage and extent
of correct matches grow with every iteration. The two
closely cooperating processes of expansion and con-
traction gather more evidence about the presence of
the object and separate correct matches from wrong
ones at the same time. Hence, they achieve simultane-
ous recognition and segmentation of the object.
By constructing matches for the coverage regions,
the system succeeds in covering even image areas that
are not interesting to the feature extractor, or not
discriminative enough to be correctly matched by tra-
ditional techniques. During the expansion phases, the
shape of each new region is adapted to the local sur-
face orientation, allowing the exploration process to
follow curved surfaces and deformations (e.g. a folded
magazine).
The basic advantage of our approach is that each sin-
gle correct initial match can expand to cover a smooth
surface with many correct matches, even when start-
ing from a large number of mismatches. This leads to
filling the visible portion of the object with matches.
Several direct advantages derive from it. First,
robustness to scale, viewpoint, occlusion and clutter
are greatly enhanced, because most cases where tradi-
tional approaches generate only a few correct matches
are now solvable. Secondly, discriminative power is in-
creased, because decisions about the object’s identity
are based on information densely distributed over the
entire portion of the object visible in the test image.
Thirdly, the approximate boundary of the object in the
test image is suggested by the final set of matches.
Fourthly, non-rigid deformations are explicitly taken
into account.
Integrating multiple model views. When multiple
model views are available, there usually are signifi-
cant overlaps between the object parts seen by different
views. In the second part of the paper, we extend our
method to capture the relationships between the model
views, and to exploit these for integrating the contri-
butions of the views during recognition. The main in-
gredient is the novel concept of a group of aggregated
matches (GAM). A GAM is a set of region matches be-
tween two images, which are distributed over a smooth
surface of the object. A set of matches, including an
arbitrary amount of mismatches, can be partitioned
into GAMs. The more matches there are in a GAM,
the more likely it is that they are correct. Moreover,
the matches in a GAM are most often all correct, or all
incorrect. When evaluating the correctness and inter-
relations of sets of matches, it is convenient to reason at
the higher perceptual grouping level that GAMs offer:
we no longer consider unrelated region matches, but the
collection of GAMs instead. Hence, GAMs become
the atomic unit, with their size carrying precious infor-
mation. Moreover, the computational complexity of a
problem can be reduced, because there are consider-
ably fewer relevant GAMs than region matches.
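As a rough illustration only (the paper's actual partitioning algorithm is presented in the second part), the idea of grouping matches that lie on a smooth surface can be approximated by linking matches whose regions are spatially close and whose local affine transformations agree, then taking connected components. All thresholds and the transformation-distance measure below are assumptions of this sketch.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import connected_components

def partition_into_gams(centers, affines, max_px=50.0, max_affine=0.5):
    """centers: (n,2) region centers in the test image;
    affines: (n,2,2) local affine maps of the n matches.
    Returns a list of index arrays, one per GAM-like group."""
    n = len(centers)
    adj = lil_matrix((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(centers[i] - centers[j]) < max_px
            similar = np.linalg.norm(affines[i] - affines[j]) < max_affine
            if close and similar:       # smoothly varying transformation
                adj[i, j] = 1
    n_groups, labels = connected_components(adj, directed=False)
    return [np.flatnonzero(labels == g) for g in range(n_groups)]
```

The connected-components view also makes the size of each group directly available as a cue, matching the intuition above that larger GAMs are more likely to be correct.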
Concretely, multiple-view integration is achieved as
follows. During modeling, the model views are con-
nected by a number of region-tracks. At recognition
time, each model view is matched to the test image,
and the resulting matches are partitioned into GAMs.
The coherence of a configuration of GAMs, possibly
originating from different model views, is evaluated
using the region tracks that span the model views. We
search for the most consistent configuration, covering
the object as completely as possible, and define a confi-
dence score which strongly increases in the presence of
compatible GAMs. In this fashion, the detection power
improves over the simple approach of considering the
contribution of each model view independently. More-
over, incorrect GAMs are discovered because they do
not belong to the best configuration, thus improving
the segmentation.
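As a purely illustrative sketch (the actual consistency measure and confidence score are defined in Part II), a configuration of GAMs could be scored by rewarding total coverage plus every pair of GAMs found compatible via the model-view tracks; the `compatible` predicate and the bonus term below are placeholders, not the paper's definition.

```python
from itertools import combinations

def configuration_score(gams, compatible, bonus=1.0):
    """gams: list of sets of region matches; compatible: predicate that
    checks two GAMs for consistency via the model-view tracks."""
    score = sum(len(g) for g in gams)           # evidence from each GAM
    for g1, g2 in combinations(gams, 2):
        if compatible(g1, g2):                  # consistent configurations
            score += bonus * min(len(g1), len(g2))
    return score
```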
Paper structure. Sections 2 to 8 cover the first part:
the image-exploration technique to match a model view
to the test image. The integration of multiple model
views is described in the second part, Sections 9 to 12.
A discussion of related work can be found in Section
14, while experimental results are given in Section 13.
Finally, Section 15 closes the paper with conclusions
and possible directions for future research. Preliminary
versions of this work have appeared in Ferrari et al.
(2004a, b).
2. Overview of Part I: Simultaneous Recognition
and Segmentation
Figure 2(a) shows a challenging example, which is
used as case-study throughout the first part of the paper.
There is a large scale change (factor 3.3), out-of-plane
rotation, extensive clutter and partial occlusion. All
these factors make feature extraction and matching
particularly difficult.
A scheme of the approach is illustrated in Fig. 1
(caption: Phases of the image-exploration technique).
We build upon a multi-scale extension of the extrac-
tor of Tuytelaars and Van-Gool (2000). However, the
method works in conjunction with any affine invariant
region extractor (Baumberg, 2000; Matas et al., 2002;
Mikolajczyk and Schmid, 2002). In the first phase (soft
matching), we form a large set of initial region corre-
spondences. The goal is to obtain some correct matches
also in difficult cases, even at the price of including a
large majority of mismatches. Next, a grid of circular
regions covering the model image is generated (coined
coverage regions). The early expansion phase tries to
propagate these coverage regions based on the geomet-
ric transformation of nearby initial matches. By propa-
gating a region, we mean constructing the correspond-
ing one in the test image. The propagated matches and
the initial ones are then passed through a novel local fil-
ter, during the early contraction phase, which removes
some of the mismatches. The processing continues by
alternating faster expansion phases (main expansion),
where coverage regions are propagated over a larger
area, with contraction phases based on a global filter
(main contraction). This filter exploits both topological
arrangements and appearance information, and toler-
ates non-rigid deformations.
The ‘early’ phases differ from the ‘main’ phases in
that they are specialized to deal with the extremely
low percentage of correct matches given by the initial
matcher in particularly difficult cases.
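In code form, the alternation of phases in Fig. 1 can be restated schematically as follows. Every stage function is a named placeholder for the corresponding component described in this paper (soft matching, coverage grid, early/main expansion and contraction), passed in rather than implemented here.

```python
def explore_image(model_img, test_img, phases, n_iters=10):
    """Schematic of Fig. 1. `phases` bundles the stage implementations:
    soft_match, make_coverage_grid, early_expand, early_contract,
    main_expand, main_contract -- all placeholders in this sketch."""
    matches = phases.soft_match(model_img, test_img)        # Section 3
    coverage = phases.make_coverage_grid(model_img)         # coverage regions
    matches = phases.early_expand(matches, coverage, test_img)
    matches = phases.early_contract(matches)                # local filter
    for _ in range(n_iters):                                # main iterations
        matches = phases.main_expand(matches, coverage, test_img)
        matches = phases.main_contract(matches)             # global filter
    return matches  # final matches support recognition and segmentation
```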
3. Soft Matching
The first stage is to compute an initial set of region
matches between a model image I_m and a test image I_t.
The region extraction algorithm (Tuytelaars and
Van-Gool, 2000) is applied to both images independently,
producing two sets of regions, one per image, together
with a vector of invariants describing each region
(Tuytelaars and Van-Gool, 2000). Test regions are
matched to model regions in two steps, explained in the
next two subsections. The matching procedure allows for
soft matches, i.e. more than one model region may be
matched to the same test region, or vice versa.
3.1. Tentative Matches
For each test region T, we first compute the Mahalanobis
distance of its descriptor to those of all model regions M.
Next, the following appearance similarity measure is
computed between T and each of the 10 closest model
regions:

sim(M, T) = NCC(M, T) + (1 - d_RGB(M, T) / 100)    (1)
where NCC is the normalized cross-correlation between
the regions' greylevel patterns, while d_RGB is the
average pixel-wise Euclidean distance in RGB
color-space after independent normalization of the
3 colorbands (necessary to achieve photometric in-
variance). Before computation, the two regions are
aligned by the affine transformation mapping T to
M. This mixed measure is more discriminative than
NCC alone, which is the most common choice in the
literature (Obrdzalek and Matas, 2002; Mikolajczyk
and Schmid, 2002; Tuytelaars and Van-Gool, 2000).
NCC mostly looks at the pattern structure, and dis-
cards valuable color information. A green disc on a
red background, and a bright blue disc on a dark blue
background would be very similar under NCC. d_RGB
captures complementary properties. As it focuses on
color correspondence, it would correctly score low the
previous disc example. However, it would confuse a
green disc on a bright green background with a green
cross on a bright green background, a difference which
NCC would spot. By summing these two measures, we
obtain a more robust one which alleviates their com-
plementary shortcomings.
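A direct transcription of Eq. (1) might look as follows, assuming both regions have already been rectified onto a common pixel grid by the affine alignment mentioned above. Since the exact per-band normalization is not spelled out here, an equal-mean-brightness rescaling is used as an assumption.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two aligned greylevel patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sim(m_grey, t_grey, m_rgb, t_rgb):
    """Eq. (1): NCC plus a colour term. Patches are float arrays with
    RGB values in [0, 255], already affinely aligned."""
    def norm_bands(x):
        # Independent normalization of the 3 colour bands; the exact
        # scheme is an assumption (equal mean brightness per band).
        return x * (128.0 / (x.mean(axis=(0, 1)) + 1e-9))
    d_rgb = np.linalg.norm(norm_bands(m_rgb) - norm_bands(t_rgb), axis=2).mean()
    return ncc(m_grey, t_grey) + (1.0 - d_rgb / 100.0)
```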
Each of the 3 model regions most similar to T, provided
the similarity exceeds a low threshold t_1, is considered
tentatively matched to T. Repeating this operation for all
test regions yields a first set of tentative matches. At
this point, every test region can be matched to none, 1, 2
or 3 model regions.
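Putting the two steps together, a hedged sketch of the tentative matcher: shortlist the 10 closest model regions by Mahalanobis distance on the descriptors, score them with Eq. (1) after alignment, and keep at most the 3 best above t_1. The shared covariance matrix and the `similarity` callback are assumptions of this sketch.

```python
import numpy as np

def tentative_matches(model_desc, test_desc, similarity, cov_inv, t1):
    """similarity(m_idx, t_idx) stands for Eq. (1) after affine alignment;
    cov_inv is the inverse covariance used for the Mahalanobis distance."""
    all_matches = []
    for t_idx in range(len(test_desc)):
        diff = model_desc - test_desc[t_idx]
        maha = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared dists
        shortlist = np.argsort(maha)[:10]                     # 10 closest
        scored = sorted(((similarity(int(m), t_idx), int(m))
                         for m in shortlist), reverse=True)
        for s, m_idx in scored[:3]:                           # up to 3 kept
            if s > t1:
                all_matches.append((m_idx, t_idx, s))
    return all_matches
```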
3.2. Refinement and Re-Thresholding
Since all regions are independently extracted from the
two images, the geometric registration of a correct
match is often not optimal. Two matching regions often
do not cover exactly the same physical surface, which
lowers their similarity. The registration of the tentative
matches is now refined using our algorithm (Ferrari
et al., 2003), which efficiently searches for the affine
transformation that maximizes the similarity. This results
in adjusting the region’s location and shape in one of
the images. Besides raising the similarity of correct
matches, this improves the quality of the forthcoming
expansion stage, where new matches are constructed
based on the affine transformation of the initial ones.
After refinement, the similarity is re-evaluated and
only matches scoring above a second, higher threshold
t_2 are kept. Refinement tends to raise the similarity of
correct matches much more than that of mismatches.
The increased separation between the similarity
distributions makes the second thresholding more
effective. At this point, about 1/3 to 1/2 of the
tentative matches are left.
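In sketch form, the refine-then-rethreshold step reduces to the following, where `refine_alignment` stands in for the refinement algorithm of Ferrari et al. (2003) and is assumed rather than implemented:

```python
def refine_and_rethreshold(matches, refine_alignment, t2):
    """matches: (model_idx, test_idx, sim) triples from tentative matching.
    refine_alignment returns the refined similarity and region shape."""
    kept = []
    for m_idx, t_idx, _ in matches:
        new_sim, new_shape = refine_alignment(m_idx, t_idx)
        if new_sim > t2:                    # second, higher threshold
            kept.append((m_idx, t_idx, new_sim, new_shape))
    return kept  # typically 1/3 to 1/2 of the tentative matches survive
```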

References (as shown on the publisher page; partial list)

Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision.
Mikolajczyk, K. and Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision.
Swain, M.J. and Ballard, D.H. (1991). Color indexing. International Journal of Computer Vision.
Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide-baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference.