IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, Vol. 34, No. 1, February 2004, pp. 430–439
Arbitrary Viewpoint Video Synthesis
From Multiple Uncalibrated Cameras
Satoshi Yaguchi and Hideo Saito
Abstract—We propose a method for arbitrary view synthesis from an uncalibrated multiple-camera system, targeting large spaces such as soccer stadiums. In Projective Grid Space (PGS), a three-dimensional space defined by the epipolar geometry between two basis cameras in the camera system, we reconstruct three-dimensional shape models from silhouette images. Using the three-dimensional shape models reconstructed in the PGS, we obtain a dense map of point correspondences between reference images. The obtained correspondences are used to synthesize images of arbitrary views between the reference images. We also propose a method for merging the synthesized images with a virtual background scene in the PGS. We apply the proposed methods to image sequences taken by a multiple-camera system installed in a large concert hall. The synthesized virtual-camera image sequences have sufficient quality to demonstrate the effectiveness of the proposed method.
Index Terms—Fundamental matrix, projective geometry, projective grid space, shape from multiple cameras, view interpolation, virtual view synthesis.
I. INTRODUCTION
THE synthesis of new views from images can enhance the
visual-entertainment effect of a movie or broadcast for a
television viewer. One way of enhancing the visual effect is
through virtual movement of viewpoint, which makes viewers
virtually feel that they are in the target scene. Recent applica-
tions of this effect can be found in the futuristic movie “The
Matrix,” and the SuperBowl XXXV broadcast by CBS in 2001
which used the EyeVision system. Virtualized Reality [8], a pi-
oneering project in this field, has achieved virtual viewpoint
movement for dynamic scenes by using computer vision tech-
nology. Whereas “The Matrix” and EyeVision use the switching
effect of real images taken by multiple cameras, computer-vi-
sion-based technology can synthesize arbitrary viewpoint im-
ages to create a virtual viewpoint movement effect.
We aim to apply virtualized reality technology to actual
sporting events. New-view images are generated by rendering
pixel values of input images in accordance with the geometry of
the new view and a three-dimensional (3-D) structure model of
the scene, which is reconstructed from multiple-view images.
The 3-D shape reconstruction from multiple views requires
camera calibration, which is carried out in order to relate the
camera geometry to the object space geometry. For camera cal-
ibration, the 3-D positions of several points in Euclidean space
Manuscript received February 19, 2002; revised July 2, 2002. This paper was
recommended by Associate Editor I. Gu.
The authors are with the Department of Information and Com-
puter Science, Keio University, Yokohama 223-8522, Japan (e-mail:
yagu@ozawa.ics.keio.ac.jp; saito@ozawa.ics.keio.ac.jp).
Digital Object Identifier 10.1109/TSMCB.2003.817108
and 2-D positions of those points on each view image must
be measured precisely. For this reason, when there are many
cameras involved in the production of an event, a lot of effort
must be expended to perform the calibration. This is especially
true in the case of a large space, such as a stadium, where it is
very difficult to set many calibration points whose positions
have to be precisely measured for the entire area. We have
already proposed a new framework for shape reconstruction
from multiple uncalibrated cameras in a projective grid space
(PGS) [15], in which coordinates between cameras are defined
by using epipolar geometry instead of calibration.
In this paper, we present a method for generating arbitrary
views from image sequences taken from multiple uncalibrated
cameras. The shape-from-silhouette (SS) [2], [14] method is ap-
plied to reconstruct the shape model in the PGS. Then, the dense
corresponding relation between the images derived from the
shape model is used to synthesize intermediate appearance view
images. We demonstrate the proposed framework by showing
several virtual image sequences generated from corrected mul-
tiple-camera image sequences captured in a large space.
II. RELATED WORKS
View synthesis from stereo images has long been a topic of
study [17]. Once the disparity between a pair of stereo images
is obtained, it can be modified to obtain intermediate images.
However, a hole, where no pixel value can be assigned from
the original stereo pair, generally appears in synthesized view
images because of occluded regions.
One method devised for removing such a hole caused by an
occlusion is to use a completely closed 3-D shape model of the
object, which can be obtained by using shape scanning tech-
nology [4], [24] or recovered from multiple-view images [6],
[19], [23]. Such a framework for generating new views from
the recovered 3-D model of an object and its texture map on the
3-D model surface is generally called model-based rendering
(MBR). MBR can handle the occlusion problem, but registra-
tion errors in the texture map on the constructed 3-D model may
cause blurring of the synthesized virtual images.
Alternatively, image-based rendering (IBR) [1], [3], [5], [7],
[10], [11], [18] has recently been developed for generating
new-view images from multiple-view images without using
a 3-D shape model of the object. Because IBR is essentially
based on 2-D image processing (cut, warp, paste, etc.), the
errors in 3-D shape reconstruction do not affect the quality of
the generated images as much as they do for the model-based
rendering method. This implies that the quality of the input

images can be well preserved in the new-view images; however, the occlusion effects have to be ignored.
Appearance-based virtual-view synthesis [16] combines the advantages of MBR and IBR. A 3-D shape model recovered from multiple images provides the information required for the IBR process, such as the correspondence map and the occluded areas of the input images. The precise and dense
correspondences make it possible to generate virtual views at
arbitrary viewpoints without losing pixels even in partially oc-
cluded regions. Image-based Visual Hull (IBVH) [12] is another
virtual view synthesis method that has the advantages of MBR
and IBR. In IBVH, the hull shape of the object is represented
by the intersection of silhouettes on the epipolar lines of one
base camera. Such image-based representation contributes to
high-speed rendering with conventional image rendering hard-
ware. With IBVH, however, it is difficult to manipulate reconstructed objects and virtual objects in the virtual space, because no explicit 3-D shape model is represented. The concept of a visual hull
was originally proposed by Laurentini [9]. Although the visual
hull reconstructed from silhouette images cannot represent an
actual 3-D shape, the visual hull can be used as an approxima-
tion of the actual 3-D shape in some cases, such as IBVH [12]
and the method presented in this paper.
The method presented in this paper extends the appear-
ance-based virtual-view synthesis to the projective reconstruc-
tion framework in the PGS. By applying the PGS to this virtual-view synthesis technique, the strong camera calibration required in conventional work can be avoided.
III. PROJECTIVE GRID SPACE
Reconstructing a 3-D shape model from multiple-view im-
ages requires a relationship between the 3-D coordinate of the
object scene and the 2-D coordinate of the camera-image plane.
Projection matrices that represent this relationship are estimated
from measurements of 3-D/2-D correspondences obtained at a
set of sample points. Since the 3-D coordinates are defined in-
dependently from the camera geometries, the 3-D positions of
the sample points must be measured independently from each
camera geometry. This procedure is called camera calibration
[22]. Calibrating every camera in a multiple-camera system requires a lot of work [8], [23].
In our method, a 3-D point is related to a 2-D image point
without estimating the projection matrices in a PGS [15], which
is determined by using only the fundamental matrices [25] rep-
resenting the epipolar geometry between two basis cameras. Be-
cause the 3-D coordinates in a PGS are defined directly in terms of the camera-image coordinates, the 3-D positions of the sample points do not have to be measured. Therefore, the PGS en-
ables 3-D reconstruction from multiple images without the need
to estimate the projection matrices of each camera.
Fig. 1 shows the PGS scheme. The PGS is defined by the image coordinates of the two basis cameras. Each pixel point (p, q) in the first basis camera image defines one grid line in the space. On that grid line, grid-node points are indexed by the horizontal position r in the second basis image. Thus, the P and Q coordinates of the PGS are given by the horizontal and vertical coordinates of the first basis image, and the R coordinate of the PGS is given by the horizontal coordinate of the second basis image. Since the fundamental matrix F_{12} restricts the position in the second basis view to the epipolar line of (p, q), the horizontal coordinate r alone is sufficient for defining the grid point. In this way, the projective grid space is defined by the two basis view images, and its node points are represented as A(p, q, r).
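As a concrete illustration of this mapping, the sketch below (our own, not from the paper; the convention that F_{12} maps points of basis image 1 to epipolar lines in basis image 2 is an assumption) recovers the two basis-image projections a1 = (p, q) and a2 = (r, s) of a grid node A(p, q, r).

```python
import numpy as np

def pgs_point_in_basis_views(p, q, r, F12):
    """Project a projective-grid point A(p, q, r) onto the two basis images.

    Illustrative sketch (not the authors' code).  The convention assumed here
    is that F12 maps a point of basis image 1 to its epipolar line in basis
    image 2, i.e. x2^T F12 x1 = 0.
    """
    a1 = np.array([p, q, 1.0])
    a, b, c = F12 @ a1                 # epipolar line a*u + b*v + c = 0 in image 2
    if abs(b) < 1e-9:
        # Nearly vertical epipolar line: the vertical coordinate s is ill-defined.
        raise ValueError("cannot recover s from a (nearly) vertical epipolar line")
    s = -(a * r + c) / b               # point on the line with horizontal coordinate r
    return (p, q), (r, s)              # a1 in basis image 1, a2 in basis image 2
```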
We should note here the potential problem in this PGS frame-
work. If the epipolar lines are nearly parallel, the epipolar-line transfer scheme fails to determine accurate intersection
points. In such a case, we cannot recover a correct 3-D shape
model and synthesize intermediate view images based on this
PGS framework. This situation can be avoided by distributing
the camera system so that the epipolar lines are not parallel
between cameras.
IV. MODEL RECONSTRUCTION
Under the PGS framework, we reconstruct a 3-D shape model
of the dynamic object by using the SS method. (We assume
that the silhouette has been previously extracted by background
subtraction.)
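The paper does not state how the silhouettes are obtained; the following minimal sketch, a simple per-pixel difference against a background image, is one hypothetical way such masks could be produced.

```python
import numpy as np

def silhouette_by_background_subtraction(frame, background, threshold=30):
    """Minimal background-subtraction sketch (illustrative; the paper does not
    specify how the silhouettes are extracted).

    frame, background: HxWx3 uint8 images from the same camera.
    Returns a boolean HxW silhouette mask.
    """
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff.max(axis=2) > threshold
```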
In the conventional SS method, each voxel in a certain Eu-
clidean space is projected onto every silhouette image with pro-
jection matrices (which are calculated by accurately calibrating
every camera [2], [14]) to check whether it is included in the ob-
ject region. In applying the SS method in the PGS, every point
in the PGS must be projected onto each silhouette image. As de-
scribed in the previous section, the PGS is defined by two basis
views, and a point in the PGS is represented as A(p, q, r). Point A is projected onto a1(p, q) and a2(r, s) in the first basis image and the second basis image, respectively. Point a1 is projected as the epipolar line l on the second basis view. Point a2(r, s) on the projected line l (Fig. 1) is expressed as

[r  s  1] F_{12} [p  q  1]^T = 0    (1)

where F_{12} represents the fundamental matrix between the first and second basis images.
The projected point in an arbitrary i-th real image is determined from the two fundamental matrices F_{1i} and F_{2i} between the two basis images and the i-th image. Since A(p, q, r) is projected onto a1(p, q) in the first basis image, the projected point in the i-th image must be on the epipolar line l_{1i} of a1(p, q), which is derived by F_{1i} as

l_{1i} = F_{1i} [p  q  1]^T.    (2)

In the same way, the projected point in the i-th image must be on the epipolar line l_{2i} of a2(r, s) in the second basis image, which is derived by F_{2i} as

l_{2i} = F_{2i} [r  s  1]^T.    (3)

Fig. 1. Definition of projective grid. Point A(p, q, r) on the projective grid space is projected to a1(p, q) and a2(r, s) on the first and second basis images.
Fig. 2. Projection of point in space onto an image. Point A(p, q, r) on the projective grid space is projected to the intersection of two epipolar lines in the image of other view i.
The point where the epipolar lines l_{1i} and l_{2i} intersect is the projected point of A(p, q, r) onto the i-th image (Fig. 2). In this way,
every projective grid point is projected onto every image, where
the relationship can be represented by only the fundamental ma-
trices between the image and two basis images.
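The sketch below (our own illustration, assuming the convention x_i^T F x_basis = 0 for the fundamental matrices) computes this projection by intersecting the two epipolar lines of (2) and (3); it also flags the near-parallel case discussed at the end of Section III. Using the cross product of the homogeneous line coefficients avoids an explicit linear solve and makes the degenerate case easy to detect.

```python
import numpy as np

def project_to_camera_i(a1, a2, F1i, F2i, eps=1e-9):
    """Project a PGS point onto an arbitrary camera i (illustrative sketch).

    a1 = (p, q): projection of the point in the first basis image.
    a2 = (r, s): projection of the point in the second basis image.
    F1i, F2i:   fundamental matrices assumed to map points of basis image 1 / 2
                to epipolar lines in image i (x_i^T F x_basis = 0).
    """
    l1 = F1i @ np.array([a1[0], a1[1], 1.0])   # epipolar line of a1 in image i, eq. (2)
    l2 = F2i @ np.array([a2[0], a2[1], 1.0])   # epipolar line of a2 in image i, eq. (3)
    x = np.cross(l1, l2)                       # homogeneous intersection of the two lines
    if abs(x[2]) < eps:
        # Nearly parallel epipolar lines: the intersection is ill-conditioned,
        # the failure case noted at the end of Section III.
        raise ValueError("epipolar lines are nearly parallel")
    return x[:2] / x[2]
```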
The process for reconstructing the 3-D shape model is outlined as follows.
First, two cameras are selected as basis cameras, and then the
coordinate of the PGS is determined. Every voxel in a certain re-
gion is projected onto each silhouette image with the proposed
scheme, as shown in Figs. 1 and 2. The voxel that is projected
onto the object silhouette for all images is considered an ex-
istent voxel, while others are considered nonexistent. Thus the
volume of the object can be determined in the voxel represented
in the PGS. In this process, the order in which the existence of
a voxel is checked is important for reducing the computational
complexity, because the cost of computing the projection of a
voxel onto an image is not the same for all the images in the
proposed scheme. Since the vertical and horizontal coordinate
of the first basis-view image are equivalent to P and Q coordi-
nates in the PGS, projecting each voxel onto the first basis view
image requires no calculation involving a fundamental matrix.
In the second basis view image, the projected point is decided
by calculating only one multiplication of a fundamental matrix
to determine the epipolar line. This implies that the cost of projection onto the second basis view is about half of that for the other images. Therefore, the order in
which the existence of a voxel is checked should be Basis view
1, Basis view 2, and so on.
After the voxel existence determination, the implicit surface
of the voxel representation of the object is extracted by using
the Marching Cubes algorithm. Finally, the object model is re-
constructed as a surface representation in the PGS.
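A compact sketch of the whole reconstruction loop is given below. It is not the authors' implementation; the data layout, the assumption that grid nodes coincide with pixel positions, and the use of scikit-image's Marching Cubes are ours, but it follows the carving order recommended above (basis view 1, then basis view 2, then the remaining views). The early exits mean most voxels are rejected by the two cheap basis-view tests before any further fundamental-matrix products are computed.

```python
import numpy as np
from skimage import measure   # Marching Cubes; any iso-surface extractor would do

def carve_in_pgs(silhouettes, F12, F1, F2, grid_shape):
    """Shape-from-silhouette in the PGS (illustrative sketch, not the authors' code).

    silhouettes: list of HxW boolean masks; index 0 and 1 are the basis cameras.
    F12:         fundamental matrix from basis image 1 to basis image 2.
    F1[i], F2[i]: fundamental matrices from basis image 1 / 2 to camera i (i >= 2).
    grid_shape:  (P, Q, R) numbers of grid nodes; nodes are assumed to sit at
                 integer pixel positions of the basis images.
    """
    occupancy = np.zeros(grid_shape, dtype=bool)

    def inside(mask, u, v):
        u, v = int(round(u)), int(round(v))
        return 0 <= v < mask.shape[0] and 0 <= u < mask.shape[1] and mask[v, u]

    for p in range(grid_shape[0]):
        for q in range(grid_shape[1]):
            if not inside(silhouettes[0], p, q):          # basis view 1: outside object
                continue
            a, b, c = F12 @ np.array([p, q, 1.0])         # epipolar line in basis view 2
            if abs(b) < 1e-9:
                continue
            for r in range(grid_shape[2]):
                s = -(a * r + c) / b                      # projection onto basis view 2
                if not inside(silhouettes[1], r, s):
                    continue
                exists = True
                for i in range(2, len(silhouettes)):      # remaining views: line intersection
                    l1 = F1[i] @ np.array([p, q, 1.0])
                    l2 = F2[i] @ np.array([r, s, 1.0])
                    x = np.cross(l1, l2)
                    if abs(x[2]) < 1e-9 or not inside(silhouettes[i], x[0] / x[2], x[1] / x[2]):
                        exists = False
                        break
                occupancy[p, q, r] = exists

    # Implicit surface of the voxel model, as in the Marching Cubes step above.
    verts, faces, _, _ = measure.marching_cubes(occupancy.astype(np.float32), level=0.5)
    return occupancy, verts, faces
```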
V. VIRTUAL VIEW SYNTHESIS
An arbitrary view image from a 3-D shape model can be gen-
erated by texture mapping onto the 3-D shape model [8], [23]
or by morphing from the point correspondence of some refer-
ence images calculated using the model [1], [3], [16], [18]. In
the former, the textures of the images are projected onto the 3-D shape model and then re-projected onto the new view image. In this pro-
cedure, however, the generated images are likely to suffer from
rendering artifacts caused by the inaccuracy of the 3-D shape.
Therefore, we apply the latter procedure to generate arbitrary
view images.
A. Arbitrary View Synthesis
Arbitrary view images are synthesized as intermediate images
of two or three real neighboring reference images. If two refer-
ence images are selected, a virtual viewpoint can be taken on
the line between the two real reference viewpoints. If three are
selected, the virtual viewpoint can be taken from the inside of
the triangle formed by the three real viewpoints. Therefore, if a
number of cameras are mounted on the surface of a hemisphere
enclosing the target space and any three of them form a triangle
effectively, the virtual viewpoint can be moved freely all around
the half sphere.
For the synthesis of arbitrary view images, intermediate im-
ages are synthesized by interpolating two or three reference im-
ages. The interpolation is based on the related concepts of view
interpolation [3]. First, an image depicting the depth (a depth
image) of the 3-D model is rendered on each reference image.
To render the depth image, the 3-D positions of all the vertices
on the surface representation of the 3-D shape model in the
PGS are projected onto each reference viewpoint by applying
the smallest depth value to the points projected onto the depth
image. The depth d of a surface point from the reference viewpoint can be calculated by the following equation:

d = || P_A - P_V ||    (4)

where P_A and P_V represent the 3-D position of the surface point in the PGS and the viewpoint of the reference image, respectively. The 3-D position of the viewpoint can be determined
by using the epipolar geometry of the cameras in the following
procedure.
As shown in Fig. 3, the viewpoints of the two basis cameras
and the other cameras are indicated by V_1, V_2, and V_i, respectively. Since the first basis camera viewpoint V_1 would be projected onto everywhere in the first basis image (Image 1), the P and Q components of V_1 cannot be determined uniquely. Thus, we take the center point (p_c, q_c) of the first basis image. The R component of V_1 is the horizontal component of the projection of V_1 onto the second basis image; thus the epipole of the first basis camera in the second basis image determines the R component of V_1. Therefore, the 3-D position of the viewpoint of the first basis camera is defined as V_1 = (p_c, q_c, r_{e1}), where r_{e1} is the horizontal coordinate of that epipole.

Fig. 3. Position of the viewpoint of each camera in PGS.
Fig. 4. Synthesis of desired view from three neighboring view images.

In the same way, if (u_{e2}, v_{e2}) is the epipole of the second basis camera in the first basis image, then the P and Q components of the 3-D position of the second basis camera viewpoint V_2 are represented as u_{e2} and v_{e2}, respectively. We also define the R component of V_2 by the center position r_c of the second basis image. Then the 3-D position of the viewpoint of the second basis camera is defined as V_2 = (u_{e2}, v_{e2}, r_c). On the other hand, the viewpoint V_i of any other camera i can be represented in the same manner by using the epipoles of camera i in the two basis images.
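The sketch below illustrates these two ingredients: epipoles obtained as null vectors of the fundamental matrices (assuming the convention x2^T F12 x1 = 0), and a depth value taken as the Euclidean distance between the PGS coordinates of the surface point and the viewpoint, which is our reading of (4). The function names are ours.

```python
import numpy as np

def epipole(F, left=False):
    """Epipole of a fundamental matrix F (convention x2^T F x1 = 0 assumed).

    The right null vector is the epipole in image 1 (the image of camera 2's
    centre); the left null vector is the epipole in image 2 (the image of
    camera 1's centre).
    """
    _, _, Vt = np.linalg.svd(F.T if left else F)
    e = Vt[-1]
    return e[:2] / e[2]

def depth_in_pgs(P_A, P_V):
    """Depth used in the visibility test; taken here (an assumption consistent
    with (4) as reconstructed above) as the Euclidean distance between the PGS
    coordinates of the surface point P_A and the viewpoint P_V."""
    return np.linalg.norm(np.asarray(P_A, float) - np.asarray(P_V, float))

def basis1_viewpoint(F12, image1_shape):
    """Viewpoint of the first basis camera in the PGS: image centre for (P, Q)
    and the horizontal coordinate of its epipole in basis image 2 for R."""
    h, w = image1_shape[:2]
    e_in_image2 = epipole(F12, left=True)   # image of camera 1's centre in image 2
    return np.array([w / 2.0, h / 2.0, e_in_image2[0]])
```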
After rendering these depth images of the reference view-
points, an intermediate viewpoint image is synthesized as fol-
lows. Let w_1, w_2, and w_3 represent the weights of the interpolation of the reference view images. First, each vertex on the 3-D surface model is projected onto all the reference viewpoints, and the projected points are denoted x_1, x_2, and x_3, as shown in Fig. 4. Then, the pixel position x_v of the vertex on the interpolated view image is calculated by the following equation:

x_v = w_1 x_1 + w_2 x_2 + w_3 x_3.    (5)
Fig. 5. Extracting the correspondence points between the two basis view images.

Next, the visibility of the vertex from each reference viewpoint is checked by comparing the depth from the reference viewpoint to the vertex with the depth value stored in the depth image at the reference viewpoint. If they are not equal, the vertex is regarded as invisible from that reference viewpoint. Let v_1, v_2, and v_3 represent the visibility (1: visible, 0: invisible) for the reference viewpoints. If the vertex is visible from at least one reference viewpoint, the color value is determined as

c_v = (v_1 w_1 c_1 + v_2 w_2 c_2 + v_3 w_3 c_3) / (v_1 w_1 + v_2 w_2 + v_3 w_3)    (6)

where c_1, c_2, and c_3 are the colors of the projected points and c_v is the interpolated color.
By changing the weighting ratio, the virtual viewpoint can be
moved inside of the triangle.
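A per-vertex sketch of (5) and (6) follows; it is illustrative only, and the normalization over the visible views mirrors the form of (6) given above.

```python
import numpy as np

def blend_vertex(points, colors, visible, weights):
    """Blend one surface vertex into the virtual view (illustrative sketch of
    (5) and (6); the normalization in (6) follows the reconstruction above).

    points:  projections x_1, x_2, x_3 of the vertex in the reference images (2-D).
    colors:  RGB values c_1, c_2, c_3 sampled at those projections.
    visible: visibility flags v_1, v_2, v_3 (1: visible, 0: invisible).
    weights: interpolation weights w_1, w_2, w_3 (assumed to sum to 1); changing
             them moves the virtual viewpoint inside the reference triangle.
    """
    w = np.asarray(weights, dtype=float)
    x = np.asarray(points, dtype=float)
    c = np.asarray(colors, dtype=float)
    v = np.asarray(visible, dtype=float)

    position = (w[:, None] * x).sum(axis=0)            # eq. (5)

    vw = v * w
    if vw.sum() < 1e-12:
        return position, None                          # hidden in all reference views
    color = (vw[:, None] * c).sum(axis=0) / vw.sum()   # eq. (6)
    return position, color
```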
The interpolation strategy presented here can synthesize geo-
metrically correct intermediate images only if the viewing direc-
tions of the reference cameras are parallel to each other. Actu-
ally, they cannot be parallel because we assume all the cameras
are directed at the common objective space. Thus, the interpo-
lation strategy relies on the geometric approximation that the reference cameras are parallel. In this paper, we assume that
the distortion caused by the approximation is not obvious in the
camera placement shown in Fig. 9. If we need to synthesize a
geometrically correct interpolation between the reference view-
points, we need to take into account the homographic transfor-
mation of the image plane that occurs among the reference cam-
eras and the interpolated viewpoint, as Seitz
et al. proposed in
[18].
B. Synthesizing the Floor Plane
We also propose here a method to synthesize a floor plane within this projective framework. Since the floor plane is removed at the
step where the silhouette image is made, the SS method only
provides the 3-D shape of an object without its background. We
generate more realistic images by synthesizing a floor plane
image.
Since the coordinate axes of the PGS are defined by two basis
cameras, a line and a plane cannot be represented in the PGS by the same form of equation as in Euclidean space. Therefore,
we need to represent the floor plane by using basis views. We
synthesize the floor plane from more than three points extracted
from the real background image. In the following, we explain
the details of the procedure.
Several correspondence points on the floor region between
the two basis view images are extracted as shown in Fig. 5.
Those points are picked out manually in our experiment. From

Fig. 6. Projecting the vertices of the Delaunay triangles onto the other view image using fundamental matrices.
Fig. 7. Synthesizing the floor plane on the arbitrary view image.
Fig. 8. Event hall at B-con Plaza.
the definition of the PGS, if a correspondence point has image coordinates (u, v) on the first basis view and (u', v') on the second, then its coordinate in the PGS becomes (u, v, u'). Since the coordinate of a point in the PGS is
fixed, the point can be projected onto every input view image
with the fundamental matrices in the same way as stated before.
Fig. 9. Camera placement in our system.
Fig. 10. Feature point extraction for estimating fundamental matrices.
The points on the floor are triangulated so that the floor plane
can be represented by a Delaunay triangulation in the first basis
view image. The vertices of the triangle mesh are projected onto
two interpolating background images using fundamental ma-
trices as shown in Fig. 6.
Synthesizing the background image of an arbitrary viewpoint
is done using the same interpolation strategy described in the
previous section. The background images require the correspon-
dence of all the points between the two references. According to
the correspondence of the vertices of each triangle mesh, affine
transforms between the two background images are calculated
for each triangle mesh. Since the affine transforms of all the tri-
angle meshes provide pixel-wise correspondence between the
two reference background images, all the pixel positions and values are determined in accordance with (5) and (6), as shown in Fig. 7.
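The sketch below outlines this floor interpolation (our own illustration using SciPy's Delaunay triangulation; the paper triangulates in the first basis view, which is approximated here by triangulating the points in the first reference image): the floor points are triangulated, the vertex positions are blended as in (5), and a per-triangle affine map is derived for warping the pixels.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangle_affine(src_tri, dst_tri):
    """2x3 affine matrix mapping three source vertices onto three target vertices."""
    A = np.hstack([src_tri, np.ones((3, 1))])   # rows [x, y, 1]
    return np.linalg.solve(A, dst_tri).T        # solve A @ M^T = dst for the 2x3 M

def interpolate_floor(pts_ref1, pts_ref2, w):
    """Floor-plane interpolation sketch (illustrative; not the authors' code).

    pts_ref1, pts_ref2: Nx2 arrays of corresponding floor points in the two
                        reference background images.
    w:                  blending weight in [0, 1] from reference 1 to reference 2.
    """
    pts_ref1 = np.asarray(pts_ref1, float)
    pts_ref2 = np.asarray(pts_ref2, float)
    tri = Delaunay(pts_ref1)                              # Delaunay triangulation
    pts_interp = (1.0 - w) * pts_ref1 + w * pts_ref2      # vertex positions, as in (5)
    affines = [triangle_affine(pts_ref1[simplex], pts_interp[simplex])
               for simplex in tri.simplices]              # per-triangle warp
    return tri, pts_interp, affines
```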
Although the use of affine transforms is not perspectively cor-
rect, we ignore such perspective errors because the distance be-
tween the object and the camera is relatively large in the present
experiment. In the case of such an approximation that cannot

References

R. Y. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses."
M. Levoy and P. Hanrahan, "Light field rendering."
B. Curless and M. Levoy, "A volumetric method for building complex models from range images."
S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, "The lumigraph."
Frequently Asked Questions
Q1. How long does it take to render the intermediate images?

Once the 3-D model is generated, it takes about 0.16 s to render the intermediate images from the reference images, which includes 0.05 s to access the file and 0.04 s to display the image. 

Since the PGS is defined by the basis cameras, the geometrical settings of the basis cameras affect the results obtained by the proposed method.

The authors select two basis cameras so that the angle between the viewing directions of the basis cameras is close to 90 degrees, to make the P, Q, and R axes almost perpendicular to each other.

Since the field of view of the cameras used in this experiment is less than 10 degrees, the voxel density in the PGS of the objective area is roughly homogeneous.

In this section, the authors propose a method for reconstructing precise 3-D shape models in a large target space, which involves dividing the target space into several small subcells and reconstructing a 3-D shape model for every cell.

The fundamental matrices between the cameras are obtained by putting a checkerboard pattern at various heights, as depicted in Fig. 10, so that the image feature points can be distributed in the objective space. 

From such images, about 50 image feature points are extracted, and the same feature points extracted in the other cameras are then matched manually.
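Given such manually matched points, the fundamental matrices could be estimated as in the sketch below (our illustration using OpenCV's estimator; the paper does not state which estimation algorithm was used).

```python
import numpy as np
import cv2

def estimate_F(pts_a, pts_b):
    """Estimate a fundamental matrix from manually matched points.

    pts_a, pts_b: Nx2 arrays of corresponding image points (N ~ 50 in the
    experiment described above).  RANSAC is used here for robustness; the
    paper does not specify the estimator.
    """
    pts_a = np.asarray(pts_a, dtype=np.float32)
    pts_b = np.asarray(pts_b, dtype=np.float32)
    F, inlier_mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC)
    return F, inlier_mask
```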

Even when the target space is very large, e.g., a soccer field or an American football field, the authors can synthesize arbitrary view images by dividing the whole target space into several cells and reconstructing a 3-D shape model in each cell separately.