Automatic Panoramic Image Stitching using Invariant Features
Matthew Brown and David G. Lowe
{mbrown|lowe}@cs.ubc.ca
Department of Computer Science,
University of British Columbia,
Vancouver, Canada.
Abstract
This paper concerns the problem of fully automated
panoramic image stitching. Though the 1D problem (single
axis of rotation) is well studied, 2D or multi-row stitching is
more difficult. Previous approaches have used human input
or restrictions on the image sequence in order to establish
matching images. In this work, we formulate stitching as a
multi-image matching problem, and use invariant local fea-
tures to find matches between all of the images. Because of
this our method is insensitive to the ordering, orientation,
scale and illumination of the input images. It is also insen-
sitive to noise images that are not part of a panorama, and
can recognise multiple panoramas in an unordered image
dataset. In addition to providing more detail, this paper ex-
tends our previous work in the area [BL03] by introducing
gain compensation and automatic straightening steps.
1 Introduction
Panoramic image stitching has an extensive research lit-
erature [Sze04, Mil75, BL03] and several commercial ap-
plications [Che95, REA, MSF]. The basic geometry of
the problem is well understood, and consists of estimat-
ing a 3 × 3 camera matrix or homography for each image
[HZ04, SS97]. This estimation process needs an initialisa-
tion, which is typically provided by user input to approxi-
mately align the images, or a fixed image ordering. For ex-
ample, the PhotoStitch software bundled with Canon digital
cameras requires a horizontal or vertical sweep, or a square
matrix of images. REALVIZ Stitcher version 4 [REA] has a
user interface to roughly position the images with a mouse,
before automatic registration proceeds. Our work is novel
in that we require no such initialisation to be provided.
In the research literature methods for automatic image
alignment and stitching fall broadly into two categories:
direct [SK95, IA99, SK99, SS00] and feature based
[ZFD97, CZ98, MJ02]. Direct methods have the advan-
tage that they use all of the available image data and hence
can provide very accurate registration, but they require a
close initialisation. Feature based registration does not re-
quire initialisation, but traditional feature matching meth-
ods (e.g., correlation of image patches around Harris cor-
ners [Har92, ST94]) lack the invariance properties needed
to enable reliable matching of arbitrary panoramic image
sequences.
In this paper we describe an invariant feature based ap-
proach to fully automatic panoramic image stitching. This
has several advantages over previous approaches. Firstly,
our use of invariant features enables reliable matching of
panoramic image sequences despite rotation, zoom and illu-
mination change in the input images. Secondly, by viewing
image stitching as a multi-image matching problem, we can
automatically discover the matching relationships between
the images, and recognise panoramas in unordered datasets.
Thirdly, we generate high-quality results using multi-band
blending to render seamless output panoramas. This paper
extends our earlier work in the area [BL03] by introducing
gain compensation and automatic straightening steps. We
also describe an efficient bundle adjustment implementation
and show how to perform multi-band blending for multiple
overlapping images with any number of bands.
The remainder of the paper is structured as follows. Sec-
tion 2 develops the geometry of the problem and motivates
our choice of invariant features. Section 3 describes our im-
age matching methodology (RANSAC) and a probabilistic
model for image match verification. In section 4 we de-
scribe our image alignment algorithm (bundle adjustment)
which jointly optimises the parameters of each camera. Sec-
tions 5 - 7 describe the rendering pipeline including au-
tomatic straightening, gain compensation and multi-band
blending. In section 9 we present conclusions and ideas for
future work.
2 Feature Matching
The first step in the panoramic recognition algorithm is
to extract and match SIFT [Low04] features between all of
the images. SIFT features are located at scale-space max-
ima/minima of a difference of Gaussian function. At each
feature location, a characteristic scale and orientation is es-
tablished. This gives a similarity-invariant frame in which
to make measurements. Although simply sampling inten-
sity values in this frame would be similarity invariant, the
invariant descriptor is actually computed by accumulating
local gradients in orientation histograms. This allows edges
to shift slightly without altering the descriptor vector, giving
some robustness to affine change. This spatial accumulation
is also important for shift invariance, since the interest point
locations are typically only accurate in the 0-3 pixel range
[BSW05, SZ03]. Illumination invariance is achieved by us-
ing gradients (which eliminates bias) and normalising the
descriptor vector (which eliminates gain).
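
As a concrete illustration, the following minimal sketch extracts SIFT keypoints and descriptors using OpenCV (an assumed modern API, not the implementation used in this paper); each returned keypoint carries the characteristic scale and orientation described above, and the descriptors are the normalised gradient histograms.

```python
import cv2

def extract_sift(image_path):
    # Work in greyscale: SIFT descriptors are built from local gradients,
    # which gives the bias invariance described in the text.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # Each keypoint has a location, characteristic scale and orientation;
    # each descriptor is a 128-D normalised orientation-histogram vector.
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return keypoints, descriptors
```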
Since SIFT features are invariant under rotation and scale
changes, our system can handle images with varying orien-
tation and zoom (see figure 8). Note that this would not be
possible using traditional feature matching techniques such
as correlation of image patches around Harris corners. Or-
dinary (translational) correlation is not invariant under ro-
tation, and Harris corners are not invariant to changes in
scale.
Assuming that the camera rotates about its optical cen-
tre, the group of transformations the images may undergo
is a special group of homographies. We parameterise each
camera by a rotation vector $\boldsymbol{\theta} = [\theta_1, \theta_2, \theta_3]$ and focal length $f$.
This gives pairwise homographies $\tilde{\mathbf{u}}_i = \mathbf{H}_{ij} \tilde{\mathbf{u}}_j$ where

$$\mathbf{H}_{ij} = \mathbf{K}_i \mathbf{R}_i \mathbf{R}_j^T \mathbf{K}_j^{-1} \qquad (1)$$

and $\tilde{\mathbf{u}}_i$, $\tilde{\mathbf{u}}_j$ are the homogeneous image positions ($\tilde{\mathbf{u}}_i = s_i [\mathbf{u}_i, 1]$, where $\mathbf{u}_i$ is the 2-dimensional image position). The 4 parameter camera model is defined by

$$\mathbf{K}_i = \begin{bmatrix} f_i & 0 & 0 \\ 0 & f_i & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$

and (using the exponential representation for rotations)

$$\mathbf{R}_i = e^{[\boldsymbol{\theta}_i]_\times}, \quad [\boldsymbol{\theta}_i]_\times = \begin{bmatrix} 0 & -\theta_{i3} & \theta_{i2} \\ \theta_{i3} & 0 & -\theta_{i1} \\ -\theta_{i2} & \theta_{i1} & 0 \end{bmatrix}. \qquad (3)$$
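
As a sketch of this camera model, the pairwise homography of equation (1) can be composed directly from the rotation vectors and focal lengths; the function names below are ours, and the matrix exponential of equation (3) is evaluated with scipy.

```python
import numpy as np
from scipy.linalg import expm

def cross_matrix(theta):
    """The skew-symmetric matrix [theta]_x of equation (3)."""
    t1, t2, t3 = theta
    return np.array([[0.0, -t3,  t2],
                     [t3,  0.0, -t1],
                     [-t2,  t1, 0.0]])

def homography(theta_i, f_i, theta_j, f_j):
    """H_ij = K_i R_i R_j^T K_j^{-1} (equation (1)) for the 4-parameter
    rotation-only camera model."""
    K_i = np.diag([f_i, f_i, 1.0])
    K_j = np.diag([f_j, f_j, 1.0])
    R_i = expm(cross_matrix(theta_i))
    R_j = expm(cross_matrix(theta_j))
    return K_i @ R_i @ R_j.T @ np.linalg.inv(K_j)
```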
Ideally one would use image features that are invariant
under this group of transformations. However, for small
changes in image position

$$\mathbf{u}_i = \mathbf{u}_{i0} + \left. \frac{\partial \mathbf{u}_i}{\partial \mathbf{u}_j} \right|_{\mathbf{u}_{i0}} \Delta \mathbf{u}_j \qquad (4)$$

or equivalently $\tilde{\mathbf{u}}_i = \mathbf{A}_{ij} \tilde{\mathbf{u}}_j$, where

$$\mathbf{A}_{ij} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & 0 & 1 \end{bmatrix} \qquad (5)$$

is an affine transformation obtained by linearising the homography about $\mathbf{u}_{i0}$. This implies that each small image patch undergoes an affine transformation, and justifies the use of SIFT features which are partially invariant under affine change.
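
This linearisation can be checked numerically: the sketch below (names ours) builds the local affine map $\mathbf{A}_{ij}$ of equation (5) from a finite-difference Jacobian of the homography about the expansion point.

```python
import numpy as np

def apply_homography(H, u):
    """Map a 2-D point through H and dehomogenise."""
    v = H @ np.array([u[0], u[1], 1.0])
    return v[:2] / v[2]

def local_affine(H, u0, eps=1e-6):
    """Affine map A_ij of equation (5): first-order expansion of the
    homography about the point u0."""
    u0 = np.asarray(u0, dtype=float)
    f0 = apply_homography(H, u0)
    J = np.zeros((2, 2))
    for d in range(2):           # finite-difference Jacobian
        u = u0.copy()
        u[d] += eps
        J[:, d] = (apply_homography(H, u) - f0) / eps
    A = np.eye(3)
    A[:2, :2] = J
    A[:2, 2] = f0 - J @ u0       # first-order Taylor: f(u) ~ f0 + J (u - u0)
    return A
```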
Once features have been extracted from all n images (lin-
ear time), they must be matched. Since multiple images
may overlap a single ray, each feature is matched to its k
nearest neighbours in feature space (we use k = 4). This
can be done in O(n log n) time by using a k-d tree to find
approximate nearest neighbours [BL97]. A k-d tree is an
axis aligned binary space partition, which recursively par-
titions the feature space at the mean in the dimension with
highest variance.
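
As a sketch of this matching step, the code below queries a k-d tree for the k = 4 nearest neighbours of every descriptor, keeping only matches that cross images. Names are ours, and scipy's exact k-d tree stands in for the approximate tree of [BL97].

```python
import numpy as np
from scipy.spatial import cKDTree

def match_features(descriptors_per_image, k=4):
    """descriptors_per_image: list of (n_i x 128) arrays, one per image."""
    all_desc = np.vstack(descriptors_per_image)
    image_id = np.concatenate([np.full(len(d), i)
                               for i, d in enumerate(descriptors_per_image)])
    tree = cKDTree(all_desc)
    # Query k+1 neighbours since each descriptor's nearest neighbour is itself.
    _, idx = tree.query(all_desc, k=k + 1)
    matches = []
    for q, neighbours in enumerate(idx):
        for n in neighbours[1:]:
            if image_id[n] != image_id[q]:   # discard same-image matches
                matches.append((q, int(n)))
    return matches
```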
3 Image Matching
At this stage the objective is to find all matching (i.e.
overlapping) images. Connected sets of image matches will
later become panoramas. Since each image could poten-
tially match every other one, this problem appears at first to
be quadratic in the number of images. However, it is only
necessary to match each image to a small number of over-
lapping images in order to get a good solution for the image
geometry.
From the feature matching step, we have identified im-
ages that have a large number of matches between them. We
consider a constant number m images, that have the greatest
number of feature matches to the current image, as potential
image matches (we use m = 6). First, we use RANSAC to
select a set of inliers that are compatible with a homography
between the images. Next we apply a probabilistic model to
verify the match.
3.1 Robust Homography Estimation using
RANSAC
RANSAC (random sample consensus) [FB81] is a robust
estimation procedure that uses a minimal set of randomly
sampled correspondences to estimate image transformation
parameters, and finds a solution that has the best consensus
with the data. In the case of panoramas we select sets of
r = 4 feature correspondences and compute the homogra-
phy H between them using the direct linear transformation
(DLT) method [HZ04]. We repeat this with n = 500 tri-
als and select the solution that has the maximum number
of inliers (whose projections are consistent with $\mathbf{H}$ within a tolerance $\epsilon$ pixels). Given the probability that a feature match is correct between a pair of matching images (the inlier probability) is $p_i$, the probability of finding the correct transformation after $n$ trials is

$$p(\mathbf{H} \text{ is correct}) = 1 - \left(1 - (p_i)^r\right)^n. \qquad (6)$$

After a large number of trials the probability of finding the correct homography is very high. For example, for an inlier probability $p_i = 0.5$, the probability that the correct homography is not found after 500 trials is approximately $1 \times 10^{-14}$.
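
A minimal sketch of the trial loop follows; parameter names and the tolerance value are ours, with OpenCV's DLT solver standing in for the 4-point homography fit, and the final comment checks the failure probability quoted above.

```python
import numpy as np
import cv2

def ransac_homography(pts_i, pts_j, n_trials=500, tol=3.0):
    """pts_i, pts_j: corresponding points as (N x 2) arrays, N >= 4."""
    pts_i = np.asarray(pts_i, dtype=np.float32)
    pts_j = np.asarray(pts_j, dtype=np.float32)
    best_H, best_inliers = None, 0
    for _ in range(n_trials):
        idx = np.random.choice(len(pts_j), 4, replace=False)  # r = 4 samples
        H = cv2.getPerspectiveTransform(pts_j[idx], pts_i[idx])
        proj = cv2.perspectiveTransform(pts_j.reshape(-1, 1, 2), H)
        errors = np.linalg.norm(proj.reshape(-1, 2) - pts_i, axis=1)
        n_inliers = int(np.sum(errors < tol))   # consistent within tolerance
        if n_inliers > best_inliers:
            best_H, best_inliers = H, n_inliers
    return best_H, best_inliers

# Equation (6) with p_i = 0.5, r = 4: the failure probability after 500
# trials is (1 - 0.5**4) ** 500, roughly 1e-14.
```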
RANSAC is essentially a sampling approach to estimat-
ing H. If instead of maximising the number of inliers one
maximises the sum of the log likelihoods, the result is max-
imum likelihood estimation (MLE). Furthermore, if priors
on the transformation parameters are available, one can
compute a maximum a posteriori estimate (MAP). These
algorithms are known as MLESAC and MAPSAC respec-
tively [Tor02].
3.2 Probabilistic Model for Image Match Verification
For each pair of potentially matching images we have
a set of feature matches that are geometrically consistent
(RANSAC inliers) and a set of features that are inside the
area of overlap but not consistent (RANSAC outliers). The
idea of our verification model is to compare the probabilities
that this set of inliers/outliers was generated by a correct
image match or by a false image match.
For a given image we denote the total number of features in the area of overlap $n_f$ and the number of inliers $n_i$. The event that this image matches correctly/incorrectly is represented by the binary variable $m \in \{0, 1\}$. The event that the $i^{th}$ feature match $f^{(i)} \in \{0, 1\}$ is an inlier/outlier is assumed to be independent Bernoulli, so that the total number of inliers is Binomial

$$p(f^{(1:n_f)} \mid m = 1) = B(n_i; n_f, p_1) \qquad (7)$$

$$p(f^{(1:n_f)} \mid m = 0) = B(n_i; n_f, p_0) \qquad (8)$$

where $p_1$ is the probability a feature is an inlier given a correct image match, and $p_0$ is the probability a feature is an inlier given a false image match. The set of feature match variables $\{f^{(i)}, i = 1, 2, \ldots, n_f\}$ is denoted $f^{(1:n_f)}$. The number of inliers is $n_i = \sum_{i=1}^{n_f} f^{(i)}$ and $B(\cdot)$ is the Binomial distribution

$$B(x; n, p) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}. \qquad (9)$$

We choose values $p_1 = 0.6$ and $p_0 = 0.1$. We can now evaluate the posterior probability that an image match is correct using Bayes' Rule

$$p(m = 1 \mid f^{(1:n_f)}) = \frac{p(f^{(1:n_f)} \mid m = 1)\, p(m = 1)}{p(f^{(1:n_f)})} \qquad (10)$$

$$= \frac{1}{1 + \frac{p(f^{(1:n_f)} \mid m = 0)\, p(m = 0)}{p(f^{(1:n_f)} \mid m = 1)\, p(m = 1)}}. \qquad (11)$$

We accept an image match if $p(m = 1 \mid f^{(1:n_f)}) > p_{\min}$

$$\frac{B(n_i; n_f, p_1)\, p(m = 1)}{B(n_i; n_f, p_0)\, p(m = 0)} \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \frac{1}{\frac{1}{p_{\min}} - 1}. \qquad (12)$$

Choosing values $p(m = 1) = 10^{-6}$ and $p_{\min} = 0.999$ gives the condition

$$n_i > \alpha + \beta n_f \qquad (13)$$

for a correct image match, where $\alpha = 8.0$ and $\beta = 0.3$. Though in practice we have chosen values for $p_0$, $p_1$, $p(m = 0)$, $p(m = 1)$ and $p_{\min}$, they could in principle be learnt from the data. For example, $p_1$ could be estimated by computing the fraction of matches consistent with correct homographies over a large dataset.
Once pairwise matches have been established between
images, we can find panoramic sequences as connected sets
of matching images. This allows us to recognise multiple
panoramas in a set of images, and reject noise images which
match to no other images (see figure 2).
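
A sketch of this grouping step using union-find over the verified pairwise matches (names ours); singleton components are exactly the noise images mentioned above.

```python
def panorama_components(n_images, image_matches):
    """image_matches: list of (i, j) image-index pairs that passed
    verification; returns one list of image indices per component."""
    parent = list(range(n_images))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in image_matches:
        parent[find(i)] = find(j)           # union the two components
    groups = {}
    for i in range(n_images):
        groups.setdefault(find(i), []).append(i)
    # Components of size 1 are noise images; larger ones become panoramas.
    return list(groups.values())
```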
4 Bundle Adjustment
Given a set of geometrically consistent matches between
the images, we use bundle adjustment [TMHF99] to solve
for all of the camera parameters jointly. This is an essen-
tial step as concatenation of pairwise homographies would
cause accumulated errors and disregard multiple constraints
between images, e.g., that the ends of a panorama should
join up. Images are added to the bundle adjuster one by
one, with the best matching image (maximum number of
consistent matches) being added at each step. The new im-
age is initialised with the same rotation and focal length as
the image to which it best matches. Then the parameters are
updated using Levenberg-Marquardt.
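
As a sketch of this joint optimisation (building on the homography() sketch of section 2; all names are ours, and scipy's Huber loss stands in for the paper's robust error function), the residual vector stacks the projection error of every correspondence and is minimised over all camera parameters at once.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, correspondences):
    """params holds [theta1, theta2, theta3, f] per camera; correspondences
    is a list of (i, j, u_i, u_j) tuples of matched feature positions."""
    res = []
    for i, j, u_i, u_j in correspondences:
        theta_i, f_i = params[4*i:4*i + 3], params[4*i + 3]
        theta_j, f_j = params[4*j:4*j + 3], params[4*j + 3]
        H = homography(theta_i, f_i, theta_j, f_j)   # from the earlier sketch
        v = H @ np.array([u_j[0], u_j[1], 1.0])
        res.extend(np.asarray(u_i) - v[:2] / v[2])   # projection error
    return np.array(res)

# result = least_squares(residuals, params0, args=(correspondences,),
#                        method="trf", loss="huber")  # robustified LM-style fit
```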
The objective function we use is a robustified sum
squared projection error. That is, each feature is projected
into all the images in which it matches, and the sum of
squared image distances is minimised with respect to the
camera parameters¹. Given a correspondence $\mathbf{u}_i^k \leftrightarrow \mathbf{u}_j^l$ ($\mathbf{u}_i^k$ denotes the position of the $k$th feature in image $i$), …

¹Note that it would also be possible (and in fact statistically optimal) to represent the unknown ray directions $\mathbf{X}$ explicitly, and to estimate them jointly with the camera parameters. This would not increase the complexity of the algorithm if a sparse bundle adjustment method was used [TMHF99].

(a) Image 1 (b) Image 2
(c) SIFT matches 1 (d) SIFT matches 2
(e) RANSAC inliers 1 (f) RANSAC inliers 2
(g) Images aligned according to a homography
Figure 1. SIFT features are extracted from all of the images. After matching all of the features using a k-d tree, the m
images with the greatest number of feature matches to a given image are checked for an image match. First RANSAC
is performed to compute the homography, then a probabilistic model is invoked to verify the image match based on the
number of inliers. In this example the input images are 517 × 374 pixels and there are 247 correct feature matches.

(a) Image matches
(b) Connected components of image matches
(c) Output panoramas
Figure 2. Recognising panoramas. Given a noisy set of feature matches, we use RANSAC and a probabilistic verification
procedure to find consistent image matches (a). Each arrow between a pair of images indicates that a consistent set of
feature matches was found between that pair. Connected components of image matches are detected (b) and stitched
into panoramas (c). Note that the algorithm is insensitive to noise images that do not belong to a panorama (connected
components of size 1 image).

References

[FB81] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 24(6):381–395, 1981.

[HZ04] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004.

[Low04] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[ST94] J. Shi and C. Tomasi. Good Features to Track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.