

A Case Study Evaluation: Perceptually Accurate Textured Surface Models
Greg Ward, Dolby Canada, greg.ward@acm.org
Mashhuda Glencross, University of Manchester, mashhuda@manchester.ac.uk
Figure 1. Left is the depth hallucination method of Glencross et al. [2008]; Right is our improved, three-flash method; Center is a photograph.
ABSTRACT
This paper evaluates a new method for capturing surfaces with
variations in albedo, height, and local orientation using a standard
digital camera with three flash units. Similar to other approaches,
captured areas are assumed to be globally flat and largely diffuse.
Fortunately, this encompasses a wide array of interesting surfaces,
including most materials found in the built environment, e.g.,
masonry, fabrics, floor coverings, and textured paints. We present
a case study of naïve subjects who found that surfaces captured
with our method, when rendered under novel lighting and view
conditions, were statistically indistinguishable from photographs.
This is a significant improvement over previous methods, to
which our results are also compared.
Index Terms—Lighting, shading and textures, Perceptual
validation, Computer vision, Texture.
1 INTRODUCTION
Photographic textures have been applied to geometric models
to enhance realism for decades, and are an integral part of every
modern rendering engine. However, two-dimensional textures
have a tendency to resemble wallpaper at oblique angles, and are
unable to produce realistic silhouettes or change appearance under
different lighting. Displacement mapping or relief mapping
methods [Oliveira 2000] can overcome these limitations, but full
reflectance and geometry model data are difficult to capture from
real surfaces, requiring expensive scanning equipment and
subsequent manual alignment with photographically acquired
textures [Rushmeier et al. 2003; Lensch et al. 2003], a large set of
data, and/or complicated rigs [Dana et al. 1999; Marschner et al.
1999]. Games companies often employ skilled artists to create
texture model data for displacement mapping using 3D modeling
packages, which is a laborious process. Glencross et al.
introduced a simple and inexpensive shape-from-shading
technique for “hallucinating” depth information from a pair of
photographs taken from the same viewpoint, one captured with
diffuse lighting and another taken with a flash [Glencross et al.
2008].
Since the method captures albedo simultaneously, no alignment
steps are needed. Although the authors do not claim absolute
accuracy in terms of reproducing depth values, user studies
showed that subjects found it difficult to distinguish the
plausibility of hallucinated depth relative to ground truth data,
adequately demonstrating the technique’s value for realistic
computer graphics.
In this paper, we ask the question “what level of additional
captured model accuracy will result in synthetic images that are
indistinguishable from photographs?” To answer this question, we
extend the depth hallucination method to include photometrically
measured surface orientation [Woodham 1980]. By adding two additional
flash units to the one employed by Glencross et al. [2008], we are able
to derive accurate surface orientations at most pixels in our
captures. Our validation studies demonstrate that the addition of
measured surface orientation results in no statistically significant
differences in perception between photographs and captured, re-
rendered images.
The entire process has been automated, with capture taking a
few seconds and model extraction less than a minute.
Since the focus of this work is the evaluation of captured
model fidelity and its impact on the visual accuracy of the results,
we begin by first briefly discussing related work, and then give an
overview of the photometric method used. We evaluate the visual
impact of measured surface orientation on computer-generated
imagery through an experimental study. Finally, we conclude by
discussing the limitations and suggesting future directions.
2 RELATED WORK
Besides the aforementioned work of Glencross et al. [2008],
our method is closely related to that of Rushmeier and Bernardini,
who used a comparable multi-source arrangement to recover
surface normal information [Rushmeier and Bernardini 1999].
This is similarly built on the photometric stereo technique of
Woodham [1980]. Rushmeier and Bernardini also employ a
separate shape camera with a structured light source to obtain
large-scale geometry, which they went to considerable effort to
align with the captured texture information. Their system
employed 5 tungsten-halogen sources, so they could dismiss up to
2 lights that were shadowed or caused specular reflection and still
have enough information to recover the surface normal at a pixel.
Ours is not so much an improvement on their method, as a
simplified approach for a different application. Since our goal is
local depth and surface normal variations, we do not require the 3-
D geometry capture equipment or registration software, and our
single-perspective diffuse plus flash images are sufficient for us to
hallucinate depth at each pixel. To avoid specular highlights, we
employ crossed polarizers as suggested by [Glencross et al. 2008],
and interpolate normals over pixels that are shadowed in one or
more captures.
Our technique also bears close resemblance to the material
capture work of Paterson et al. [2005]. Using photometric stereo
in combination with surface normal integration and multiple view
captures, these researchers were able to recover displacement
maps plus inhomogeneous BRDFs over nearly planar sample
surfaces using a simple flash plus camera arrangement. Their
method incorporates a physical calibration frame around the
captured surface to recover camera pose and flash calibration
data. In contrast, our method uses only single-view capture, and
flash/lens calibration is performed in advance, thus avoiding any
restrictions of surface dimensions. Since we do not rely on
surface normal integration to derive height information, our
method is more robust to flash shadowing and irregular or spiky
terrain. Similar to their technique, we assume a nearly planar
surface with primarily diffuse reflection, and capture under
ambient conditions. However, we make no attempt to recover
specular characteristics in our method, which would be difficult
from a single view.
Figure 2. Three-flash capture system mounted on a tripod with a digital
SLR camera.
Multiple flashes have also been used to produce non-
photorealistic imagery. Specifically, Raskar et al. developed a
method for enhancing photographic illustrations exploiting the
shadows cast by multiple flashes [Raskar et al. 2004]. Toler-
Franklin et al. employed photometric stereo to capture surface
normals, then applied these to enhance and annotate photo-based
renderings [Toler-Franklin et al. 2007]. With the additional depth
information our technique provides from the same data, it could
be applied in a similar way to the problem of non-photorealistic
rendering, though that is not our focus.
3 METHOD
Our technique borrows from and improves upon previous
methods by employing a digital camera with three external flash
units. We build on the flash/no-flash depth hallucination method
of Glencross et al. [2008] by capturing two additional flash
images to derive surface normal information and overcome
limitations in their original albedo estimation. Employing three
flashes virtually guarantees that every point on the surface will be
illuminated in at least one image, and for points lit by all three
flashes, we can accurately measure the surface normal as well.
This normal map is used to correct the albedo estimate and further
enhance re-rendering under different lighting conditions.
We begin by describing our three-flash capture system,
followed by a description of the capture process and how the
images are processed into a detailed surface model.
Figure 3. Circuit diagram for our three-flash controller.
3.1 Three-Flash Controller
To automatically sequence each flash, we built the simple
controller circuit shown in Figure 3 to fire each flash in sequence,
followed by a no-flash capture. In our configuration, we cycle the
power to a shoe-mounted flash to force the camera into ambient
exposure mode for the no-flash capture. This avoids having to
touch the camera or control it via a USB tether – a tripod and a
remote release cable are the only additional equipment required.
The hot-shoe flash sync is controlled by the camera, so it fires
while it has power. Therefore, some additional image processing
is required for this set-up, which we explain in Section 3.3, below.
A full cycle is achieved after 4 shutter releases. The first
shutter release fires Flash 1 mounted on the hot-shoe only. The
second shutter release fires Flash 2 as well, and the third shutter
release fires Flashes 1 and 3. After three firings, power is turned
off to Flash 1 mounted on the hot-shoe, thus putting the camera
into ambient exposure mode, and none of the flashes fire. Once
this final no-flash image has been captured, the cycle repeats.
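For concreteness, the firing cycle can be summarized by the following
minimal Python sketch (the names are ours for illustration; the actual
controller is the analog circuit of Figure 3):

    # Minimal sketch of the controller's 4-state firing cycle.
    FIRING_SEQUENCE = [
        {1},      # release 1: hot-shoe Flash 1 only
        {1, 2},   # release 2: Flashes 1 and 2
        {1, 3},   # release 3: Flashes 1 and 3
        set(),    # release 4: Flash 1 powered off -> ambient (no-flash) capture
    ]

    def flashes_for_release(release_count):
        """Return which flash units fire on the given shutter release (1-based)."""
        return FIRING_SEQUENCE[(release_count - 1) % len(FIRING_SEQUENCE)]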
Figure 2 shows our capture system mounted on a tripod. An
amber LED indicates the controller is powered in its initial state,
ready to begin a capture sequence. Linear polarizers are placed
over each flash unit and aligned 90° out-of-phase with a polarizer
filter mounted on the lens in order to reduce specular reflections
as suggested in [Glencross et al. 2008].

3.2 Capture Process
The hot-shoe mounted flash is set to half its maximum output
in manual mode, while the other two flashes are set to maximum.
Since the hot-shoe flash fires every time, setting its output to half
prevents it from drowning out the other flashes when they fire.
Sufficient time is allowed between shutter releases for the flashes
to fully recharge, ensuring that they produce roughly the same
output each time. A cable release is used to avoid any camera
movement, which would make subsequent image processing more
difficult. After the full sequence of 4 images is captured and the
histograms are checked to ensure a good set of exposures, the
capture process is complete.
Figure 4. Diagram of RAW capture processing with dark subtraction
used to obtain three separate flash images.
Figure 5. Our three separate flash images with the no-flash image in the
lower right, all after RAW processing.
3.3 Image Processing
The first stage of our image-processing pipeline converts RAW
captures to 16-bit/channel linear encoded TIFF. Taking
advantage of the dark subtraction feature of dcraw [Coffin], we
eliminate the effect of ambient lighting on our Flash 1 capture by
subtracting the no-flash capture after applying the appropriate
scale factor to account for differences in exposure time. We use
this same trick to separate flash images by subtracting the Flash1-
only capture from the Flash 1+2 and Flash 1+3 captures. Since
Flash 1 also includes the ambient lighting, this takes care of the
whole process for Flashes 2 and 3. This conversion is illustrated
in Figure 4, with results shown in Figure 5.
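The separation amounts to simple arithmetic on the linearized
captures. A minimal NumPy sketch, assuming the four captures have been
loaded as linear floating-point arrays (function and variable names
are ours):

    import numpy as np

    def separate_flashes(flash1, flash12, flash13, noflash, t_flash, t_noflash):
        """Recover per-flash images by dark subtraction.
        t_flash and t_noflash are the respective exposure times."""
        ambient = noflash * (t_flash / t_noflash)   # scale for exposure difference
        f1 = np.clip(flash1 - ambient, 0.0, None)   # Flash 1 alone
        # The Flash 1 capture includes ambient light, so subtracting it
        # from the combined captures removes both at once.
        f2 = np.clip(flash12 - flash1, 0.0, None)   # Flash 2 alone
        f3 = np.clip(flash13 - flash1, 0.0, None)   # Flash 3 alone
        return f1, f2, f3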
The second stage of our image processing applies a calibration
to the flash images to correct for vignetting and other uniformity
issues. Since this correction varies with distance, aperture, lens
and focal length, we capture a set of 50 to 100 reference flash
images of a white, diffuse wall, then interpolate these calibration
images to obtain a more accurate result. This interpolation process
pulls out the six nearest flash triplets from our set and applies a
weighted average to these. We then divide each flash image by its
interpolated calibration image as in [Glencross et al. 2008] in
preparation for the next processing stage.
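A sketch of this correction, assuming inverse-distance weights over
the capture parameters (the weighting scheme is our assumption; the
names are illustrative):

    import numpy as np

    def uniformity_correct(flash_img, refs, query, k=6):
        """Divide a flash image by a calibration image interpolated
        from the k nearest reference captures. `refs` is a list of
        (params, calib_image) pairs, where params encodes distance,
        aperture, and focal length; `query` holds the same parameters
        for the current capture."""
        d = np.array([np.linalg.norm(p - query) for p, _ in refs])
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-6)   # assumed inverse-distance weighting
        w /= w.sum()
        calib = sum(wi * refs[j][1] for wi, j in zip(w, nearest))
        return flash_img / np.clip(calib, 1e-6, None)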
In the third image processing stage, we simultaneously obtain
local surface orientation (normals) and albedo (reflectance) by
solving the following 3x3 matrix equation at each pixel
illuminated by all three flashes [Rushmeier and Bernardini 1999]:
$V \vec{n} = \vec{i}$    (1)

where:
$V$ = illumination direction matrix
$\vec{n}$ = normal vector times albedo
$\vec{i}$ = adjusted flash pixel values
The adjusted flash pixel values are the corrected luminance values
for each flash capture, multiplied again by the cosine of the incident
angle, which was undone by our flash calibration. We compute the
illumination direction matrix $V$ by subtracting the estimated 3-D
pixel positions given by our lens focal length and focus distance
(recorded in the image metadata) from the known flash positions. We
normalize each of these vectors, thus our measured pixels in $\vec{i}$
are proportional to the dot product of the illumination vectors with
the surface normal, times albedo. Solving for $\vec{n}$ at each pixel,
we take this vector length as our local variation in albedo. In shadow
regions where only two flashes illuminate the surface, a technique
such as [Hernández et al. 2008] could be used to resolve normals via
an integrability constraint. We found that a simple hole-filling
algorithm that averaged the four closest neighbors worked well enough
in shadow regions, thanks to the masking from texture complexity that
hides small artifacts.
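The per-pixel solve can be written compactly with NumPy's batched
linear solver. A minimal sketch of Equation 1 (array names are ours):

    import numpy as np

    def photometric_stereo(V, i1, i2, i3):
        """Solve V n = i (Eq. 1) at every pixel.
        V       : (H, W, 3, 3) illumination direction matrix per pixel
        i1..i3  : (H, W) adjusted flash pixel values
        Returns unit normals (H, W, 3) and albedo (H, W)."""
        i_vec = np.stack([i1, i2, i3], axis=-1)          # (H, W, 3)
        n = np.linalg.solve(V, i_vec[..., None])[..., 0] # (H, W, 3)
        albedo = np.linalg.norm(n, axis=-1)              # vector length = albedo
        normals = n / np.clip(albedo[..., None], 1e-8, None)
        return normals, albedo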
A global scale factor may be applied to ensure an expected
range of albedo values as a final step if necessary. Similarly, we
found that applying a global flattening of the derived surface
normals improves later rendering. This is accomplished by
subtracting a low-frequency (blurred) version of the normal map
from the high-resolution original, providing local detail while
suppressing systematic errors due to imperfect calibration.
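This flattening can be sketched as follows, assuming SciPy's Gaussian
filter; the blur radius and the re-added plane normal are our
assumptions, as the text leaves them unspecified:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def flatten_normals(normals, sigma=50.0):
        """Subtract a low-frequency (blurred) copy of the normal map,
        keeping local detail while suppressing systematic error."""
        low = np.stack([gaussian_filter(normals[..., c], sigma)
                        for c in range(3)], axis=-1)
        flat = normals - low
        flat[..., 2] += 1.0   # re-add the global plane normal (assumption)
        norm = np.linalg.norm(flat, axis=-1, keepdims=True)
        return flat / np.clip(norm, 1e-8, None)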
The fourth and final stage exactly follows the method laid out
by [Glencross et al. 2008] to hallucinate depth using a multi-scale
model based on the no-flash image divided by the albedo image.
The important differences here are that we have a better estimate
of albedo based on our knowledge of local surface orientation,
and our multiple flashes avoid areas of complete shadow.
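For reference, the input to this final stage is simply the ratio
image (a one-line sketch; the multi-scale depth model itself follows
[Glencross et al. 2008] and is omitted here):

    import numpy as np

    def shading_image(noflash_luminance, albedo):
        """No-flash capture divided by albedo: the shading signal
        that drives the multi-scale depth hallucination."""
        return noflash_luminance / np.clip(albedo, 1e-6, None)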

Figure 6. Left image contains depth hallucinated from a single flash/no-flash pair. Right image shows results of 3-flash system. Center is a photograph.
Figure 7. Comparison of hallucination and rendering methods showing the original diffuse photo, single-flash re-rendered result, three-flash depth result, and
finally the three-flash result with derived normals.
4 RESULTS
4.1 Comparison to Single-flash Method
Figure 7 shows a side-by-side comparison between our three-
flash method and the previous method of Glencross et al. [2008].
The upper-left image shows the original no-flash (diffusely lit)
photograph. The upper-right image shows a rendering under
simulated daylight using depth hallucinated with a single flash
image and this diffuse photo. The lower-left image shows the same
rendering using depth hallucinated from all three flashes, but
without taking advantage of the derived surface normals. The final
image on the lower-right shows the same improved depth map with
derived normal information.
While we expected some slight improvements to the depth
hallucination using three flashes, we found that most of the visible
differences in the result came when we applied the derived surface
normals.

References
[Glencross et al. 2008] A perceptually validated model for surface depth hallucination.
[Rushmeier et al. 2003] Design and Use of an In-Museum System for Artifact Capture.
[Toler-Franklin et al. 2007] Illustration of complex real-world objects using images with normals.