
Algorithms and Design for a Second-Order Automatic Differentiation Module*

Jason Abate†
Texas Institute for Computational and Applied Mathematics
University of Texas at Austin
abate@ticam.utexas.edu
http://www.ticam.utexas.edu/~abate

Christian Bischof and Lucas Roh
Mathematics and Computer Science Division
Argonne National Laboratory
{bischof,roh}@mcs.anl.gov

Alan Carle
Center for Research on Parallel Computation
Rice University
carle@cs.rice.edu
http://www.cs.rice.edu/~carle

*This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38; and by the National Science Foundation, through the Center for Research on Parallel Computation, under Cooperative Agreement No. CCR-9120008.

†This work was partially performed while the author was a research associate at Argonne National Laboratory.
Abstract

This article describes approaches to computing second-order derivatives with automatic differentiation (AD) based on the forward mode and the propagation of univariate Taylor series. Performance results are given that show the speedup possible with these techniques relative to existing approaches. We also describe a new source transformation AD module for computing second-order derivatives of C and Fortran codes and the underlying infrastructure used to create a language-independent translation tool.
1 Introduction

Automatic differentiation (AD) provides an efficient and accurate method to obtain derivatives for use in sensitivity analysis, parameter identification, and optimization. Current tools are targeted primarily at computing first-order derivatives, namely gradients and Jacobians. Prior to AD, derivative values were obtained through divided difference methods, symbolic manipulation, or hand-coding, all of which have drawbacks when compared with AD (see [3] for a discussion). Accurate second-order derivatives are even harder to obtain than first-order ones; it is possible to end up with no accurate digits in the derivative value when using a divided-difference scheme.
One can repeatedly apply first-order derivative tools to obtain higher-order derivatives, but this approach is complicated and ignores structural information about higher-order derivatives such as symmetry. Additionally, in cases where a full Hessian, $H$, is not required, such as with Hessian-vector products ($H \cdot V$) and projected Hessians ($V^T \cdot H \cdot W$), where $V$ and $W$ are matrices with many fewer columns than rows, it is possible to compute the desired values much more efficiently than with the repeated differentiation approach.

There is no "best" approach to computing Hessians; the most efficient approach to computing second-order derivatives depends on the specifics of the code and, to a lesser extent, the target platform on which the code will be run [4, 8]. In all cases, however, derivative values computed by AD are computed to machine precision, without the roundoff errors inherent in divided difference techniques.
AD via source transformation provides great flexibility in implementing sophisticated algorithms that exploit the associativity of the chain rule of calculus (see [6] for a discussion). Unfortunately, the development of robust source transformation tools is a substantial effort. ADIFOR [3] and ADIC [6], source transformation tools for Fortran and C respectively, both implement relatively simple algorithms for propagating derivatives. Most of the development time so far has concentrated on producing tools that handle the full range of the language, rather than on developing more efficient algorithms to propagate derivatives.

To make it easier to experiment with algorithmic techniques, we have developed AIF, the Automatic Differentiation Intermediate Form. AIF acts as the glue layer between a language-specific front-end and a largely language-independent transformation module that implements AD transformations at a high level of abstraction.

We have implemented an AIF-based module for computing second-order derivatives. The Hessian module, as we call it, implements several different algorithms and selectively chooses among them in a fashion determined by the code presented to it. However, this context-sensitive logic, which is based on machine-specific performance models, is transparent to the AD front-end. The Hessian module currently interfaces with ADIFOR and ADIC. First experimental results show that the resulting codes outperform the recursive application of first-order tools by a factor of two when computing full, dense Hessians and are able to compute full, sparse Hessians and partial Hessians at significantly reduced expense.

Section 2 outlines the two derivative propagation strategies that we have explored for Hessians, including cost estimates for computing various types of Hessians. Section 3 shows the performance of the various approaches for a sample code, and Section 4 describes the infrastructure that was used to develop the Hessian augmentation tool. Lastly, we summarize our results and discuss future work.
2 Strategies for Computing Second Derivatives

2.1 Forward Mode Strategies

The standard forward mode of automatic differentiation can easily be expanded to second order to compute Hessians. For $z = f(x, y)$, we can compute $\nabla z$ and $\nabla^2 z$, the gradient and Hessian of $z$ respectively, as

$$\nabla z = \frac{\partial z}{\partial x}\,\nabla x + \frac{\partial z}{\partial y}\,\nabla y \qquad (1)$$

$$\nabla^2 z = \frac{\partial z}{\partial x}\,\nabla^2 x + \frac{\partial z}{\partial y}\,\nabla^2 y
+ \frac{\partial^2 z}{\partial x^2}\left(\nabla x \cdot \nabla x^T\right)
+ \frac{\partial^2 z}{\partial y^2}\left(\nabla y \cdot \nabla y^T\right)
+ \frac{\partial^2 z}{\partial x\,\partial y}\left(\nabla x \cdot \nabla y^T + \nabla y \cdot \nabla x^T\right) \qquad (2)$$
This approach is conceptually simple and produces efficient results for small numbers of independent variables. For $n$ independent variables, gradients are stored in arrays of length $n$ and Hessians, because of their symmetric nature, are stored using the LAPACK [1] packed symmetric scheme, which reduces the storage requirements from $n^2$ to $\frac{1}{2}n(n+1)$. The cost of computing a full Hessian using the forward mode is $O(n^2)$ relative to the cost of computing the original function.
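To make Equations (1) and (2) and the packed symmetric storage concrete, the following C sketch propagates the value, gradient, and packed Hessian through a single multiplication z = x*y. The deriv2 struct, packed_index helper, and mul_deriv2 routine are illustrative names under an assumed representation; this is a minimal sketch, not the code generated by ADIFOR, ADIC, or the Hessian module.

    #include <stdlib.h>

    /* Second-order forward-mode propagation for one binary operation,
     * following Equations (1)-(2).  Gradients have length n; Hessians use
     * LAPACK-style packed symmetric storage of the upper triangle, so the
     * (i,j) entry with i <= j lives at index i + j*(j+1)/2 (0-based),
     * giving n*(n+1)/2 doubles instead of n*n.                            */

    typedef struct {
        double  val;   /* function value                 */
        double *grad;  /* gradient, length n             */
        double *hess;  /* packed Hessian, n*(n+1)/2      */
    } deriv2;

    static size_t packed_index(size_t i, size_t j)   /* assumes i <= j */
    {
        return i + j * (j + 1) / 2;
    }

    /* z = x * y:
     *   grad z = y*grad x + x*grad y
     *   hess z = y*hess x + x*hess y + grad x*grad y^T + grad y*grad x^T */
    static void mul_deriv2(deriv2 *z, const deriv2 *x, const deriv2 *y, size_t n)
    {
        size_t i, j;

        z->val = x->val * y->val;

        for (i = 0; i < n; i++)
            z->grad[i] = y->val * x->grad[i] + x->val * y->grad[i];

        for (j = 0; j < n; j++)
            for (i = 0; i <= j; i++) {
                size_t k = packed_index(i, j);
                z->hess[k] = y->val * x->hess[k] + x->val * y->hess[k]
                           + x->grad[i] * y->grad[j] + y->grad[i] * x->grad[j];
            }
    }

Because only the upper triangle is updated, the symmetry of the Hessian is exploited directly, and the loop structure makes the $O(n^2)$ per-operation cost visible.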
Many algorithms do not need full knowledge of the Hessian but require only a Hessian-vector product, $H \cdot V$, or a projected Hessian, $V^T \cdot H \cdot W$, where $V$ and $W$ are matrices with $n_V$ and $n_W$ columns, respectively. Rather than computing the full Hessian at a cost of $O(n^2)$ followed by one or two matrix multiplications, we can multiply Equation (2) on the left and/or right by $V^T$ and $W$, respectively, to produce new propagation rules. By modifying the derivative objects that get propagated, we can perform the required computations at a much lower cost. These costs are summarized in Table 1.¹ In the case of large Hessians and relatively small values of $n_V$ or $n_W$, the savings can be significant. Additionally, the coloring techniques that have been applied to structured Jacobians [2] can be applied to Hessians for a significant savings.
Table 1: Summary of Hessian costs using the forward mode relative to the cost of computing the original function. $n$ is the number of independent variables; $n_V$ and $n_W$ are the number of columns of $V$ and $W$, respectively.

¹The costs of the symmetric and unsymmetric projected Hessians ($V^T \cdot \nabla^2 f \cdot V$ and $V^T \cdot \nabla^2 f \cdot W$) are of the same order, but due to symmetry, the storage and computation costs of $V^T \cdot \nabla^2 f \cdot V$ are roughly half of the costs of $V^T \cdot \nabla^2 f \cdot W$.
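To see where the savings in Table 1 come from, consider the Hessian-vector product case. Multiplying Equation (2) on the right by $V$ gives a propagation rule of the form (a sketch of the idea; the exact rules used by the module may differ in detail)

$$\nabla^2 z \cdot V = \frac{\partial z}{\partial x}\,(\nabla^2 x \cdot V) + \frac{\partial z}{\partial y}\,(\nabla^2 y \cdot V)
+ \frac{\partial^2 z}{\partial x^2}\,\nabla x\,(\nabla x^T V)
+ \frac{\partial^2 z}{\partial y^2}\,\nabla y\,(\nabla y^T V)
+ \frac{\partial^2 z}{\partial x\,\partial y}\left(\nabla x\,(\nabla y^T V) + \nabla y\,(\nabla x^T V)\right),$$

so each variable carries only its gradient of length $n$ and an $n \times n_V$ object $\nabla^2 x \cdot V$, and the work per operation drops from $O(n^2)$ to $O(n\,n_V)$.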
2.2 Taylor Series Strategies

As an alternative to the forward mode propagation of gradients and Hessians, we can propagate two-term univariate Taylor series expansions about each of the nonzero directions in the Hessian [4]. To compute derivatives at a point $x_0$ in the direction $u$, we consider $f$ as a scalar function $f(x_0 + tu)$ of $t$. Its Taylor series, up to second order, is

$$f(x_0 + tu) = f + f_t\,t + f_{tt}\,t^2, \qquad (3)$$

where $f_t$ and $f_{tt}$ are the first and second Taylor coefficients. The uniqueness of the Taylor series implies that for $u = e_i$, the $i$th basis vector, we obtain

$$f_t = \frac{\partial f}{\partial x_i}, \qquad (4)$$

$$f_{tt} = \frac{1}{2}\,\frac{\partial^2 f}{\partial x_i^2}. \qquad (5)$$

That is, we have computed a scaled version of the $i$th diagonal element of the Hessian. Similarly, to compute the $(i,j)$ off-diagonal entry of the Hessian, we set $u = e_i + e_j$. The uniqueness of the Taylor expansion implies

$$f_t = \frac{\partial f}{\partial x_i} + \frac{\partial f}{\partial x_j}, \qquad (6)$$

$$f_{tt} = \frac{1}{2}\left(\frac{\partial^2 f}{\partial x_i^2} + 2\,\frac{\partial^2 f}{\partial x_i\,\partial x_j} + \frac{\partial^2 f}{\partial x_j^2}\right). \qquad (7)$$

If Taylor expansions are also computed for the $i$ and $j$ diagonal elements, the off-diagonal Hessian entries can be recovered by interpolation. As with the forward mode, simple rules specify the propagation of the expansions for all arithmetic and intrinsic operators [10, 13]. To compute a full gradient of length $n$ and $k$ Hessian entries above the diagonal, the cost of the Taylor series mode is $O(n + k)$. If the full gradient is not needed, this cost can be reduced somewhat.
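To make the propagation and interpolation steps concrete, the following C sketch propagates two-term Taylor coefficients through addition and multiplication and recovers an off-diagonal Hessian entry from the directions $e_i$, $e_j$, and $e_i + e_j$, as in Equations (4)-(7). The taylor2 struct and the example function f(x, y) = x*y + x are illustrative assumptions, not taken from the original code.

    #include <stdio.h>

    typedef struct { double v, t, tt; } taylor2;   /* f + f_t*t + f_tt*t^2 */

    /* Seed an independent variable x_k for the expansion direction u:
     * the t-coefficient is the k-th component of u, the t^2-term is zero. */
    static taylor2 seed(double xk, double uk)
    { taylor2 a = { xk, uk, 0.0 }; return a; }

    static taylor2 add(taylor2 a, taylor2 b)
    { taylor2 c = { a.v + b.v, a.t + b.t, a.tt + b.tt }; return c; }

    static taylor2 mul(taylor2 a, taylor2 b)   /* product truncated at t^2 */
    { taylor2 c = { a.v * b.v,
                    a.v * b.t + a.t * b.v,
                    a.v * b.tt + a.t * b.t + a.tt * b.v }; return c; }

    /* Example function f(x, y) = x*y + x, expanded in a chosen direction. */
    static taylor2 f(taylor2 x, taylor2 y) { return add(mul(x, y), x); }

    int main(void)
    {
        double x0 = 3.0, y0 = 2.0;

        /* f_tt in the directions e_1, e_2, and e_1 + e_2 (Eqs. (5), (7)). */
        double d11 = f(seed(x0, 1.0), seed(y0, 0.0)).tt;  /* H_11 / 2            */
        double d22 = f(seed(x0, 0.0), seed(y0, 1.0)).tt;  /* H_22 / 2            */
        double d12 = f(seed(x0, 1.0), seed(y0, 1.0)).tt;  /* (H_11+2H_12+H_22)/2 */

        /* Interpolation: H_12 = d12 - d11 - d22; for this f it equals 1. */
        printf("H_12 = %g\n", d12 - d11 - d22);
        return 0;
    }

Each expansion direction is handled independently of the others, which is what makes the stripmined evaluation described below possible.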
The Taylor series approach can compute any set of Hessian entries without computing the entire Hessian. This technique is ideal for sparse Hessians when the sparsity pattern is known in advance and for situations where only certain elements (such as the diagonal entries) are desired. Additionally, each Taylor series expansion is independent. This allows very large Hessians, which can easily overwhelm the available memory, to be computed in a stripmined fashion by partitioning the expansion directions and computing them independently with multiple sweeps through the code, in a fashion similar to the stripmining technique described in [5].
2.3 Preaccumulation

The associativity of the chain rule allows derivative propagation to be performed at arbitrary levels of abstraction. At the simplest, the forward mode works at the scope of a single binary operation. By expanding the scope to a higher level, such as an assignment statement, a loop body, or a subroutine, it is possible to decrease the amount of work necessary to propagate derivatives, as shown in [7, 9].

A preaccumulation technique we employ in our work computes the gradient and Hessian of the variable on the left side of the assignment statement in two steps. Assume that for the statement $z = f(x_1, x_2, \ldots, x_N)$ we have $\nabla x_i$ and $\nabla^2 x_i$, $i = 1, \ldots, N$, the global gradient and Hessian of $x_i$, and that we wish to compute, for $z$, the global gradient $\nabla z$ and the global Hessian $\nabla^2 z$.

Step 1: Preaccumulation of local derivatives. The variables on the right side of the statement are considered to be independent, and we compute "local" derivative objects $\frac{\partial z}{\partial x_i}$ and $\frac{\partial^2 z}{\partial x_i\,\partial x_j}$, $i, j = 1, \ldots, N$, with respect to the right-hand side variables. This can be done using either the forward or the Taylor series mode.

Step 2: Accumulation of global derivatives. We accumulate the global gradient and Hessian of $z$. When using the forward mode for global propagation of derivatives, this is done as follows:

$$\nabla z = \sum_{i=1}^{N} \frac{\partial z}{\partial x_i}\,\nabla x_i$$

$$\nabla^2 z = \sum_{i=1}^{N} \frac{\partial z}{\partial x_i}\,\nabla^2 x_i + \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{\partial^2 z}{\partial x_i\,\partial x_j}\left(\nabla x_i \cdot \nabla x_j^T\right)$$

The rules for Taylor series expansions can be generalized in a similar fashion.
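As a small worked example (ours, not from the original text), consider the statement z = sin(x1*x2). Step 1 evaluates the local derivatives once at the current values of $x_1$ and $x_2$:

$$\frac{\partial z}{\partial x_1} = x_2\cos(x_1 x_2), \qquad \frac{\partial z}{\partial x_2} = x_1\cos(x_1 x_2),$$

$$\frac{\partial^2 z}{\partial x_1^2} = -x_2^2\sin(x_1 x_2), \qquad \frac{\partial^2 z}{\partial x_2^2} = -x_1^2\sin(x_1 x_2), \qquad \frac{\partial^2 z}{\partial x_1\,\partial x_2} = \cos(x_1 x_2) - x_1 x_2\sin(x_1 x_2).$$

Step 2 then makes a single pass over the global derivative objects (the length-$n$ gradients and packed $\frac{1}{2}n(n+1)$ Hessians) using the accumulation formulas above, rather than updating them once for each intermediate operation in the evaluation of sin(x1*x2).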
Gradient codes produced by ADIFOR and ADIC currently employ statement-level preaccumulation for all assignment statements more complicated than a single binary operation. Experiments with similar "global" preaccumulation strategies for computing Hessians have produced inconsistent results across various codes and machines. No global strategy outperformed all other strategies on all test codes and all machines.

Thus, we have developed an adaptive strategy where the costs of using and not using statement-level preaccumulation are computed and compared when the derivative code is generated. These costs are estimated based on machine-specific performance models of the actual propagation code. Thus, the Hessian module decides which strategy to use based on the structure of a particular computation. We believe that such context-sensitive strategies are crucial for future improvement of AD tools.
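The flavor of this choice can be sketched in a few lines of C. The cost functions and constants below are hypothetical placeholders invented for illustration only; they are not the machine-specific performance models actually used by the Hessian module.

    #include <stddef.h>

    typedef struct {
        size_t n_ops;   /* binary operations on the right-hand side       */
        size_t n_rhs;   /* distinct variables on the right-hand side (N)  */
    } stmt_info;

    /* Hypothetical cost of updating one gradient/packed-Hessian pair of
     * dimension d (d = n for global objects, d = N for local ones).       */
    static double sweep_cost(size_t d)
    {
        return 2.0 * (double)d + 0.5 * (double)d * (double)(d + 1);
    }

    static int use_preaccumulation(const stmt_info *s, size_t n)
    {
        /* Operation-level forward mode: every operation updates the
         * global derivative objects.                                      */
        double plain = (double)s->n_ops * sweep_cost(n);

        /* Preaccumulation: each operation updates small local objects,
         * followed by one accumulation into the global objects.           */
        double preacc = (double)s->n_ops * sweep_cost(s->n_rhs)
                      + (double)s->n_rhs * sweep_cost(n);

        return preacc < plain;   /* generate whichever variant is cheaper */
    }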
Figure 1: Hessian performance of the Shubin Hessian test code with 20 independent variables. (Bar chart of the Hessian/function execution time ratio, 0 to 400, with bars labeled Forward, Adaptive Forward, and Sparse Taylor Series.)
3 Hessian Performance on a CFD Code

Hessian code was generated for a steady shock tracking code provided by Greg Shubin of the Boeing Company [14]. Because of memory constraints, a 20 x 20 section of the full 190 x 190 Hessian was computed for each of the 190 dependent variables. The section of the Hessian being studied exhibits some sparsity, with 72 nonzero entries on or above the diagonal.

Hessian codes were generated using four different strategies. Figure 1 shows the ratio of the Hessian computation time to the function computation time, while Figure 2 shows the memory requirements of the augmented Hessian codes on a Sun UltraSparc 1. The original code required 8.0 x seconds of execution time and used 360 kB of memory. The first strategy, labeled "Twice ADIFOR", was generated by first producing a gradient code with ADIFOR 2.0 and then running the gradient code through ADIFOR again. The "Forward" case implements the forward mode on a binary operation level. The "Adaptive Forward" code uses the forward mode, with preaccumulation at a statement level where deemed appropriate. The "Sparse Taylor Series" mode uses the Taylor series mode to compute just the entries that are known to be nonzero.

Clearly, the "Twice ADIFOR" scheme can be easily beaten by exploiting the symmetry of the Hessian, both in terms of execution speed and memory usage, as is done in both the "Forward" and "Adaptive Forward" codes. This result also shows that the use of an adaptive preaccumulation strategy can outperform the operation-level forward mode. Improvements in the strategy used to decide when to use preaccumulation should further increase the efficiency of the adaptive scheme. Finally, the "Sparse Taylor Series" code shows that, if the sparsity structure of a problem is known, it can be exploited for additional savings.
4 Language and Tool Independence with AIF

The algorithms of automatic differentiation are, for the most part, independent of the language to which they are applied. For example, the Fortran assignment statement

    z = 2.0 + x * y

References

Louis B. Rall. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science 120, Springer-Verlag, 1981.

Adifor 2.0: Automatic differentiation of Fortran 77 programs.

ADIC: An extensible automatic differentiation tool for ANSI-C.

The ADIFOR 2.0 system for the automatic differentiation of Fortran 77 programs.