What are the contributions mentioned in the paper "Multiobjective optimization of fpga-based medical image registration" ?

This paper presents a multiobjective optimization strategy developed in the context of fieldprogrammable gate array–based implementation of medical image registration. Within this framework, the authors compare several search methods and demonstrate the applicability of an evolutionary algorithm–based search for efficiently identifying superior multiobjective tradeoff curves.

What is the first step in the calculation of a voxel?

The initial step in MI calculation involves applying a candidate transformation (T), to each voxel coordinate ( rv ) in the RI to find the corresponding voxel coordinates in the FI ( fv ).

What other heuristic techniques are limited to software implementations?

Other heuristic techniques that take into account tradeoffs between hardware cost and implementation error and enable automatic conversion from floating-point to fixed-point representations are limited to software implementations only [26].

What is the MI value of the FPGA-based architecture?

The authors have developed a parameterized, bit-true emulation of the FPGA-based architecture that is capable of calculating the MI valuecorresponding to any feasible configuration for a given image transformation.

What is the importance of comparing the Pareto-optimized solution sets?

Quantitative comparison of the Pareto-optimized solution sets is essential in order to compare more precisely the effectiveness of various search methods.

What is the effect of the MI calculation on registration accuracy?

These factors, along with its execution time, in their experience, may render registration accuracy as an unsuitable objective function, especially if there is nonmonotonic behavior with respect to the wordlength of design variables.

(Open Access) Multiobjective Optimization of FPGA-Based Medical Image Registration (2008) | Omkar Dandekar

Multiobjective Optimization of FPGA-Based Medical Image Registration

Omkar Dandekar

1,2

, William Plishker

1,2

Shuvra Bhattacharyya

Department of Electrical and Computer Engineering,

University of Maryland,

College Park, MD 20742

{omkar, plishker, ssb}@umd.edu

Raj Shekhar

1,2

Department of Diagnostic Radiology,

University of Maryland,

Baltimore, MD 21201

rshekhar@umm.edu

Abstract

With a multitude of technological innovations, one

emerging trend in image processing, and medical image

processing, in particular, is custom hardware

implementation of computationally intensive algorithms in

the quest to achieve real-time performance. For reasons of

area-efficiency and performance, these implementations

often employ limited-precision datapaths. Identifying

effective wordlengths for these datapaths while accounting

for tradeoffs between design complexity and accuracy is a

critical and time consuming aspect of this design process.

Having access to optimized tradeoff curves can equip

designers to adapt their designs to different performance

requirements and target specific devices while reducing

design time. This paper presents a multiobjective

optimization strategy developed in the context of field-

programmable gate array–based implementation of

medical image registration. Within this framework, we

compare several search methods and demonstrate the

applicability of an evolutionary algorithm–based search

for efficiently identifying superior multiobjective tradeoff

curves. This strategy can easily be adapted to a wide

range of signal processing applications, including areas

of image and video processing beyond the medical

domain.

1. Introduction

An emerging trend in real-time signal processing

systems is to accelerate computationally intensive

algorithmic components by mapping them to custom or

reconfigurable hardware platforms, such as application-

specific integrated circuits (ASICs) and field-

programmable gate arrays (FPGAs). Most of these

algorithms are initially developed in software using

floating-point representation and later migrated to

hardware using finite precision (e.g., fixed-point

representation) for achieving improved computational

performance and reduced hardware cost. These

implementations are often parameterized, so that a wide

range of finite precision representations can be supported

[1] by choosing an appropriate wordlength for each

internal variable. As a consequence, the accuracy and

hardware resource requirements of such a system are

functions of the wordlengths used to represent the internal

variables. Determining an optimal wordlength

configuration has been shown to be NP-hard [2] and can

take up to 50% of the design time for complex systems

[3]. Moreover, a single optimal solution may not exist,

especially in the presence of multiple conflicting

objectives. In addition, a new configuration generally

needs to be derived when the design constraints are

altered.

The problem of finding optimal wordlength

configurations can be formulated as a multiobjective

optimization, where different objectives — for example,

accuracy and area — generally conflict with one another.

Although this approach increases the complexity of the

search, it can find a set of Pareto-optimized configurations

representing strategically-chosen tradeoffs among the

various objectives. This allows a designer to choose an

efficient configuration that satisfies given design

constraints and provides ease and flexibility in modifying

the design configuration as the constraints change.

An optimum wordlength configuration can be

identified by analytically solving the quantization error

equation as described in [4-8]. This analytical

representation, however, can be difficult to obtain for

complex systems. Techniques based on local search or

gradient-based search [9] have also been employed, but

these methods are limited to finding a single feasible

solution as opposed to an optimized tradeoff curve. An

exhaustive search of the entire design space is guaranteed

16th International Symposium on Field-Programmable Custom Computing Machines

DOI 10.1109/FCCM.2008.50

183

Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

to find Pareto-optimal configurations. Execution time for

such exhaustive search, however, increases exponentially

with the number of design parameters, making it

unfeasible for most practical systems. Methods that

transform this problem into a linear programming problem

have also been reported [4], but these techniques are

limited to cases in which the objectives can be modeled as

linear functions of the design parameters. Other

approaches based on linear aggregation of objectives may

not find proper Pareto-optimal solutions when the search

space is nonconvex [10]. Techniques based on

evolutionary methods have been shown to be effective in

searching large search spaces in an efficient manner [11,

12]. Furthermore, these techniques are inherently capable

of performing multipoint searches. As a result, techniques

based on evolutionary algorithms (EA) have been

employed in the context of multiobjective optimization

(SPEA2 [13], NSGA-II [14]).

This article presents a novel multiobjective

optimization strategy developed in the context of FPGA-

based implementation of medical image registration. The

tradeoff between FPGA resources (area and memory) and

implementation accuracy is explored, and Pareto-

optimized solutions are identified. This analysis is

performed by treating the wordlengths of the internal

variables as design variables. We also compare several

search methods for finding Pareto-optimized solutions and

demonstrate the applicability of search based on

evolutionary techniques for efficiently identifying superior

multiobjective tradeoff curves. This optimization strategy

can easily be adapted to a wide range of signal and image

processing applications.

This paper is organized as follows. Section 2 provides

background on image registration and outlines an

architecture for its FPGA-based implementation. The

formulation of the multiobjective optimization and various

search methods to find Pareto-optimized solutions are

described in Section 3. Section 4 presents experimental

results and compares various search methods. In Section 5,

related work for optimum wordlength search and

multiobjective optimization is presented. Section 6

concludes the paper.

2. Image registration

Medical image registration is the process of aligning

two images that represent the same anatomy at different

times, from different viewing angles, or using different

imaging modalities. This method attempts to find the

transformation (

) that optimally aligns a reference image

(RI) with coordinates x, y, and z and a floating image (FI)

under an image similarity measure (

F ):

arg max ( ( , , ), ( ( , , ))).

TRIxyzFITxyz= F (1)

Many image similarity measures, such as the sum of

squared differences and cross correlation, have been used,

but mutual information (MI) has recently emerged as the

preferred similarity measure. MI-based image registration

has been shown to be robust and effective in

multimodality image registration [15]. However, this form

of registration typically requires thousands of iterations

(MI evaluations), depending on image complexity and the

degree of initial misalignment between images. Castro-

Pareja et al. [16] have shown that, calculation of MI for

different candidate transformations is a factor limiting the

performance of MI-based image registration. We have,

therefore, developed an FPGA-based architecture for

accelerated calculation of MI [17], which is capable of

computing MI 40-times faster as compared to software

implementation.

2.1. FPGA-based implementation of mutual

information calculation

During the execution of image registration using this

architecture, the optimization process is executed from a

host workstation. The host provides a candidate

transformation, while the FPGA-based implementation

applies it to the images and performs the corresponding

MI computation. The computed MI value is then further

used by the host to update the candidate transformation

and eventually find the optimal alignment between the RI

and FI. Figure 1 shows the top-level block diagram of the

aforementioned architecture. The important modules in

this design are described in the following subsections.

2.1.1. Voxel counter. Calculation of MI requires

processing each voxel in the RI. In addition, because the

implemented algorithm processes the images on a

subvolume basis, RI voxels within a 3D neighborhood

corresponding to an individual subvolume must be

processed sequentially. The host programs the FPGA-

based MI calculator with subvolume start and end

addresses, and the voxel counter computes the address

corresponding to each voxel within that subvolume in

−

x order.

2.1.2. Coordinate transformation. The initial step in MI

calculation involves applying a candidate transformation

(T), to each voxel coordinate (



) in the RI to find the

corresponding voxel coordinates in the FI (



). This is

mathematically expressed as:

vTv

⋅



(2)

The deformation model employed is a six-parameter

rigid transformation model and is represented using a

4 × 4 matrix. The host calculates this matrix based on the

current candidate transformation provided by the

184

Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

optimization routine and sends it to the MI calculator. A

fixed-point representation is used to store the individual

elements of this matrix. The coordinate transformation is

accomplished by a simple matrix multiplication.

2.1.3 Partial volume interpolation. The coordinates

mapped in the FI space (



) do not normally coincide

with a grid point (integer location), thus requiring

interpolation. Nearest neighbor and trilinear interpolation

schemes have been used most often for this purpose;

however, partial volume (PV) interpolation, introduced by

Maes et al. [15] has been shown to provide smooth

changes in the histogram values with small changes in

transformation. The reported architecture consequently

implements PV interpolation as the choice of interpolation

scheme.



in general, will have both fractional and

integer components and will land within an FI

neighborhood of size 2 × 2 × 2. The interpolation weights

required for the PV interpolation are calculated using the

fractional components of



. Fixed-point arithmetic is

used to compute these interpolation weights. The

corresponding floating voxel intensities are fetched by the

image controller in parallel using the integer components



. The image controller also fetches the voxel

intensity corresponding to



. The MH then must be

updated for each pair of reference and floating voxel

intensities (eight in all), using the corresponding weights

computed by the PV interpolator.

2.1.4. Image memory access. The typical size of 3D

medical images prevents the use of high-speed memory

internal to the FPGA for their storage. Between the two

images, the RI has more relaxed access requirements,

because it is accessed in a sequential manner (in

−

order). This kind of access benefits from burst accesses

and memory caching techniques, allowing the use of

modern dynamic random access memories (DRAMs) for

image storage. For the architecture presented, both the RI

and FI are stored in separate logical partitions of the same

DRAM module. Because the access to the RI is sequential

and predictable, the architecture uses internal memory to

cache a block of RI voxels. Thus, during the processing of

that block of RI voxels, the image controller has parallel

access to both RI and FI voxels. The RI voxels are fetched

from the internal FPGA memory, whereas the FI voxels

are fetched directly from the external memory.

The FI, however, must be accessed randomly

(depending on the current transformation

T) and eight FI

voxels (a 2 × 2 × 2 neighborhood) must be fetched for

every RI image voxel to be processed. To meet this

memory access requirement, the reported architecture

employs a memory addressing scheme similar to the cubic

addressing technique reported in the context of volume

rendering [18]. A salient feature of this technique is that it

allows simultaneous access to the entire 2 × 2 × 2 voxel

neighborhood. The reported architecture implements this

technique by storing four copies of the FI and taking

advantage of the burst mode accesses native to modern

DRAMs. The image voxels are arranged sequentially such

that, performing a size two burst fetches two adjacent

2 × 2 neighborhood planes, thus making the entire

neighborhood available simultaneously. The image

intensities of this neighborhood are then further used for

updating the MH.

2.1.5. Updating the mutual histogram. For a given RI

voxel (

RV), there are eight intensity pairs (RV, FV

: FV

)

and corresponding interpolation weights. Because the MH

must be updated (read–modify–write) at these eight

locations, this amounts to 16 accesses to MH memory for

each RI voxel. This high memory access requirement is

handled by using the high-speed, dual-ported memories

internal to the FPGA to store the MH. The operation of

updating the MH is pipelined and, hence, read-after-write

(RAW) hazards can arise if consecutive transactions

attempt to update identical locations within the MH. The

reported design addresses this issue by introducing pre-

accumulate buffers, which aggregate the weights from all

conflicting transactions. Thus, all the transactions leading

to a RAW hazard are converted into a single update to the

MH, thereby eliminating any RAW hazards.

While the MH is being computed, the individual

histogram accumulator unit computes the histograms for

the RI and FI. These individual histograms are also stored

using internal, dual-ported memories. The valid voxel

Figure 1: Top-level block diagram of FPGA-based

architecture for MI calculation

185

Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

counter module keeps track of the number of valid voxels

accumulated in the MH and calculates its reciprocal value.

The resulting value is then used by the entropy calculation

unit for calculating the individual and joint probabilities.

2.1.6. Entropy Calculation. The final step in MI

calculation is to compute joint and individual entropies

using the joint and individual probabilities, respectively.

To calculate entropy, it is necessary to evaluate the

function

f(p) = p·ln(p) for all the probabilities. As each

probability

p takes values between [0,1], the

corresponding range for the function

f(p) is [–e

–1

,0]. Thus,

f(p) has a finite dynamic range and is defined for all values

p. Several methods for calculating logarithmic functions

in hardware have been reported [19], but of particular

interest is the multiple lookup table (LUT)–based

approach introduced by Castro-Pareja et al. [20]. This

approach minimizes the error in representing

f(p) for a

given number and size of LUTs and, hence, is accurate

and efficient. Following this approach, the reported design

implements

f(p) using multiple LUT–based piecewise

polynomial approximation.

3. Multiobjective optimization

The aforementioned architecture is designed to

accelerate the calculation of MI for performing medical

image registration. We have demonstrated this architecture

to be capable of offering execution performance superior

to that of a software implementation [17]. The accuracy of

MI calculation (and by extension, that of image

registration) offered by this implementation, however, is a

function of wordlengths chosen for the internal variables

of the design. Similarly, these wordlengths also control the

hardware implementation cost of the design. For medical

applications, the ability of an implementation to achieve

the desired level of accuracy is of paramount importance.

It is, therefore, necessary to understand the tradeoff

between accuracy and hardware implementation cost for

this design and to identify wordlength configurations that

provide effective tradeoffs between these conflicting

criteria. This multiobjective optimization will allow a

designer to systematically maximize accuracy for a given

hardware cost limitation (imposed by a target device, for

example) or minimize hardware resources to meet the

accuracy requirements of a medical application.

The following section provides a formal definition of

this problem and the subsequent section describes a

framework for multiobjective optimization of FPGA-

based medical image registration.

3.1. Problem statement

Consider a system

Q that is parameterized by N

parameters

(i = 1, 2, …, N), where each parameter can

take a single value from a corresponding set of valid

values (

). Let the design configuration space

corresponding to this system be

S, which is defined by a

set consisting of all

N-tuples generated by the Cartesian

product of the sets

∀i :

123

Svvv v

×××× (3)

The size of this design configuration space is then equal

to the cardinality of the set

S or, in other words, the

product of cardinalities of the sets

123

Svvv v=×××× (4)

For most systems, not all configurations that belong to

may be valid or practical. We therefore define a subset ℑ

(ℑ ⊂

S), such that it contains all the feasible system

configurations. Now consider

m objective functions (f

, f

…,

) defined for system Q, such that each function

associates a real value for every feasible configuration

c∈ℑ.

The problem of multiobjective optimization is then to

find a set of solutions that simultaneously optimize the

objective functions according to an appropriate criterion.

The most commonly adopted notion of optimality in

multiobjective optimization is that of Pareto optimality.

According to this notion, a solution

∗

is Pareto optimal if

there does not exist another solution

c∈ℑ such that

∗

), for all i, and f

∗

), for at least one j.

Given a multiobjective optimization problem and a

heuristic technique for this problem that attempts to derive

Pareto-optimal or near-Pareto-optimal solutions, we refer

to solutions derived by the heuristic as “Pareto-optimized”

solutions.

3.2. Multiobjective optimization framework

Figure 2 illustrates the framework that we have

developed for multiobjective optimization of the

aforementioned architecture. There are two basic

components of this framework. The first component is the

search algorithm that explores the design space and

generates feasible candidate solutions; and the second

component is the objective function evaluation module

that evaluates candidate solutions. The solutions and

associated objective values are fed back to the search

algorithm, so that they can be used to refine the search.

These two components are loosely coupled so that

different search algorithms can be easily incorporated into

the framework. Moreover, the objective function

evaluation module is parallelized using a message passing

interface (MPI) on a 32-processor cluster. With this

parallel implementation, multiple solutions can be

evaluated in parallel, thereby increasing search

186

Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

performance. These components are described in detail in

the following sections.

3.2.1. Design parameters.

As described in the earlier

section, the architecture performs MI calculation using a

fixed-point datapath. As a result, the accuracy of MI

calculation depends on the precision (wordlength) offered

by this datapath. The design parameters in this datapath

define the design space and are identified and listed along

with the corresponding design module (see Figure 1) in

Table 1.

A fixed-point representation consists of an integer part

and a fractional part. The number of bits assigned to these

two parts are called the integer wordlength (IWL) and

fractional wordlength (FWL), respectively. The collective

number of bits allocated to these parts control the range

and precision of the fixed-point representation. For this

architecture, the IWL required for each design parameter

can be deduced from the range information specific to the

image registration application. For example, in order to

support translations in the range of [–64, 63] voxels, 7 bits

of IWL (with 1 bit assigned as a sign bit) are required for

the translation parameter. We used similar range

information to choose the IWL for all the parameters, and

these values are reported in Table 1. The precision

required for each parameter, which is determined by its

FWL, is not known a priori. We, therefore, determine this

by performing multiobjective optimization using the FWL

of each parameter as a design variable. In our experiments,

we used the design range of [1, 32] bits for FWLs of all

the parameters. The optimization framework can support

different wordlength ranges for different parameters,

which can be used to account for additional design

constraints, such as, for example, certain kinds of

constraints imposed by third-party intellectual property.

The entropy calculation module is implemented using a

multiple LUT–based approach and also employs fixed-

point arithmetic. However, this module has already been

optimized for accuracy and hardware resources, as

described in [20]. The optimization strategy employed in

[20] uses an analytical approach that is specific to entropy

calculation and is distinct from the strategy presented in

this work. This module, therefore, does not participate in

the multiobjective optimization framework of this paper,

and we simply use the optimized configuration identified

earlier. This further demonstrates the flexibility of our

optimization framework to accommodate arbitrary

designer- or externally-optimized modules.

3.2.2. Search algorithms.

An exhaustive search that

explores the entire design space is guaranteed to find all

Pareto-optimal solutions. However, this search can lead to

unreasonable execution time, especially when the

objective function evaluation is computationally intensive.

For example, with four design variables, each taking one

of 32 possible values, the design space consists of 32

solutions. If the objective function evaluation takes 1

minute per trial (which is quite realistic for multiple MI

calculation using large images), the exhaustive search will

take 2 years. Consequently, we considered other search

methods as described below.

The first method is

partial search, which explores only

a portion of the entire design space. For every design

variable, the number of possible values it can take is

reduced by half by choosing every alternate value. A

complete search is then performed in this reduced search

space. This method, although not exhaustive, can

effectively sample the breadth of the design space. The

second method is

random search, which involves

randomly generating a fixed number of feasible solutions.

For both of these methods, Pareto-optimized solutions are

identified from the set of solutions explored.

Figure 2: Framework for multiobjective optimization of

FPGA-based image registration

Table 1: Design variables for FPGA-based architecture. Integer wordlengths are determined based on application-specific

range information, and fractional wordlengths are used as parameters in the multiobjective optimization framework

Architectural

Module

Design

Variable

Integer wordlength

(IWL ) (bits)

Fractional wordlength (FWL)

range (bits)

Translation vecto

7 [1,32] Voxel coordinate

transformation

Rotation matrix 4 [1,32]

Partial volume interpolation Floating image address 27 [1,32]

Mutual histogram accumulation Mutual histogram bin 25 [1,32]

187

Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

Multiobjective Optimization of FPGA-Based Medical Image Registration

Figures

Citations

Reconfigurable acceleration of 3D image registration

An Indexed Bibliography of Genetic Algorithms in Image Registration

A Review of Medical Applications of Genetic and Evolutionary Computation

High-Performance 3D Image Processing Architectures for Image-Guided Interventions

References

A fast and elitist multiobjective genetic algorithm: NSGA-II

Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach

Multimodality image registration by maximization of mutual information

SPEA2: Improving the Strength Pareto Evolutionary Algorithm For Multiobjective Optimization

Discrete-time signal processing (2nd ed.)

Related Papers (5)

Model-based synthesis and optimization of static multi-rate image processing algorithms

An unified approach for convolution-based image filtering on reconfigurable systems

Using C-to-gates to program streaming image processing kernels efficiently on FPGAs

Synthesis and optimization of image processing accelerators using domain knowledge

Parameterized design framework for hardware implementation of particle filters

Frequently Asked Questions (8)

Q1. What are the contributions mentioned in the paper "Multiobjective optimization of fpga-based medical image registration" ?

Q2. What is the trend in real-time signal processing systems?

Q3. What is the first step in the calculation of a voxel?

Q4. What other heuristic techniques are limited to software implementations?

Q5. What is the MI value of the FPGA-based architecture?

Q6. What is the importance of comparing the Pareto-optimized solution sets?

Q7. What is the effect of the MI calculation on registration accuracy?

Q8. How many accesses to MH memory for each RI voxel?