scispace - formally typeset
Open AccessProceedings ArticleDOI

Multiobjective Optimization of FPGA-Based Medical Image Registration

TLDR
This paper presents a multiobjective optimization strategy developed in the context of field-programmable gate array-based implementation of medical image registration that can easily be adapted to a wide range of signal processing applications, including areas of image and video processing beyond the medical domain.
Abstract
With a multitude of technological innovations, one emerging trend in image processing, and medical image processing, in particular, is custom hardware implementation of computationally intensive algorithms in the quest to achieve real-time performance. For reasons of area-efficiency and performance, these implementations often employ limited-precision datapaths. Identifying effective wordlengths for these datapaths while accounting for tradeoffs between design complexity and accuracy is a critical and time consuming aspect of this design process. Having access to optimized tradeoff curves can equip designers to adapt their designs to different performance requirements and target specific devices while reducing design time. This paper presents a multiobjective optimization strategy developed in the context of field-programmable gate array-based implementation of medical image registration. Within this framework, we compare several search methods and demonstrate the applicability of an evolutionary algorithm-based search for efficiently identifying superior multiobjective tradeoff curves. This strategy can easily be adapted to a wide range of signal processing applications, including areas of image and video processing beyond the medical domain.

read more

Content maybe subject to copyright    Report

Multiobjective Optimization of FPGA-Based Medical Image Registration
Omkar Dandekar
1,2
, William Plishker
1,2
,
Shuvra Bhattacharyya
1
1
Department of Electrical and Computer Engineering,
University of Maryland,
College Park, MD 20742
{omkar, plishker, ssb}@umd.edu
Raj Shekhar
1,2
2
Department of Diagnostic Radiology,
University of Maryland,
Baltimore, MD 21201
rshekhar@umm.edu
Abstract
With a multitude of technological innovations, one
emerging trend in image processing, and medical image
processing, in particular, is custom hardware
implementation of computationally intensive algorithms in
the quest to achieve real-time performance. For reasons of
area-efficiency and performance, these implementations
often employ limited-precision datapaths. Identifying
effective wordlengths for these datapaths while accounting
for tradeoffs between design complexity and accuracy is a
critical and time consuming aspect of this design process.
Having access to optimized tradeoff curves can equip
designers to adapt their designs to different performance
requirements and target specific devices while reducing
design time. This paper presents a multiobjective
optimization strategy developed in the context of field-
programmable gate array–based implementation of
medical image registration. Within this framework, we
compare several search methods and demonstrate the
applicability of an evolutionary algorithm–based search
for efficiently identifying superior multiobjective tradeoff
curves. This strategy can easily be adapted to a wide
range of signal processing applications, including areas
of image and video processing beyond the medical
domain.
1. Introduction
An emerging trend in real-time signal processing
systems is to accelerate computationally intensive
algorithmic components by mapping them to custom or
reconfigurable hardware platforms, such as application-
specific integrated circuits (ASICs) and field-
programmable gate arrays (FPGAs). Most of these
algorithms are initially developed in software using
floating-point representation and later migrated to
hardware using finite precision (e.g., fixed-point
representation) for achieving improved computational
performance and reduced hardware cost. These
implementations are often parameterized, so that a wide
range of finite precision representations can be supported
[1] by choosing an appropriate wordlength for each
internal variable. As a consequence, the accuracy and
hardware resource requirements of such a system are
functions of the wordlengths used to represent the internal
variables. Determining an optimal wordlength
configuration has been shown to be NP-hard [2] and can
take up to 50% of the design time for complex systems
[3]. Moreover, a single optimal solution may not exist,
especially in the presence of multiple conflicting
objectives. In addition, a new configuration generally
needs to be derived when the design constraints are
altered.
The problem of finding optimal wordlength
configurations can be formulated as a multiobjective
optimization, where different objectives — for example,
accuracy and area — generally conflict with one another.
Although this approach increases the complexity of the
search, it can find a set of Pareto-optimized configurations
representing strategically-chosen tradeoffs among the
various objectives. This allows a designer to choose an
efficient configuration that satisfies given design
constraints and provides ease and flexibility in modifying
the design configuration as the constraints change.
An optimum wordlength configuration can be
identified by analytically solving the quantization error
equation as described in [4-8]. This analytical
representation, however, can be difficult to obtain for
complex systems. Techniques based on local search or
gradient-based search [9] have also been employed, but
these methods are limited to finding a single feasible
solution as opposed to an optimized tradeoff curve. An
exhaustive search of the entire design space is guaranteed
16th International Symposium on Field-Programmable Custom Computing Machines
978-0-7695-3307-0/08 $25.00 © 2008 IEEE
DOI 10.1109/FCCM.2008.50
183
Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

to find Pareto-optimal configurations. Execution time for
such exhaustive search, however, increases exponentially
with the number of design parameters, making it
unfeasible for most practical systems. Methods that
transform this problem into a linear programming problem
have also been reported [4], but these techniques are
limited to cases in which the objectives can be modeled as
linear functions of the design parameters. Other
approaches based on linear aggregation of objectives may
not find proper Pareto-optimal solutions when the search
space is nonconvex [10]. Techniques based on
evolutionary methods have been shown to be effective in
searching large search spaces in an efficient manner [11,
12]. Furthermore, these techniques are inherently capable
of performing multipoint searches. As a result, techniques
based on evolutionary algorithms (EA) have been
employed in the context of multiobjective optimization
(SPEA2 [13], NSGA-II [14]).
This article presents a novel multiobjective
optimization strategy developed in the context of FPGA-
based implementation of medical image registration. The
tradeoff between FPGA resources (area and memory) and
implementation accuracy is explored, and Pareto-
optimized solutions are identified. This analysis is
performed by treating the wordlengths of the internal
variables as design variables. We also compare several
search methods for finding Pareto-optimized solutions and
demonstrate the applicability of search based on
evolutionary techniques for efficiently identifying superior
multiobjective tradeoff curves. This optimization strategy
can easily be adapted to a wide range of signal and image
processing applications.
This paper is organized as follows. Section 2 provides
background on image registration and outlines an
architecture for its FPGA-based implementation. The
formulation of the multiobjective optimization and various
search methods to find Pareto-optimized solutions are
described in Section 3. Section 4 presents experimental
results and compares various search methods. In Section 5,
related work for optimum wordlength search and
multiobjective optimization is presented. Section 6
concludes the paper.
2. Image registration
Medical image registration is the process of aligning
two images that represent the same anatomy at different
times, from different viewing angles, or using different
imaging modalities. This method attempts to find the
transformation (
ˆ
T
) that optimally aligns a reference image
(RI) with coordinates x, y, and z and a floating image (FI)
under an image similarity measure (
F ):
ˆ
arg max ( ( , , ), ( ( , , ))).
T
TRIxyzFITxyz= F (1)
Many image similarity measures, such as the sum of
squared differences and cross correlation, have been used,
but mutual information (MI) has recently emerged as the
preferred similarity measure. MI-based image registration
has been shown to be robust and effective in
multimodality image registration [15]. However, this form
of registration typically requires thousands of iterations
(MI evaluations), depending on image complexity and the
degree of initial misalignment between images. Castro-
Pareja et al. [16] have shown that, calculation of MI for
different candidate transformations is a factor limiting the
performance of MI-based image registration. We have,
therefore, developed an FPGA-based architecture for
accelerated calculation of MI [17], which is capable of
computing MI 40-times faster as compared to software
implementation.
2.1. FPGA-based implementation of mutual
information calculation
During the execution of image registration using this
architecture, the optimization process is executed from a
host workstation. The host provides a candidate
transformation, while the FPGA-based implementation
applies it to the images and performs the corresponding
MI computation. The computed MI value is then further
used by the host to update the candidate transformation
and eventually find the optimal alignment between the RI
and FI. Figure 1 shows the top-level block diagram of the
aforementioned architecture. The important modules in
this design are described in the following subsections.
2.1.1. Voxel counter. Calculation of MI requires
processing each voxel in the RI. In addition, because the
implemented algorithm processes the images on a
subvolume basis, RI voxels within a 3D neighborhood
corresponding to an individual subvolume must be
processed sequentially. The host programs the FPGA-
based MI calculator with subvolume start and end
addresses, and the voxel counter computes the address
corresponding to each voxel within that subvolume in
z
y
x order.
2.1.2. Coordinate transformation. The initial step in MI
calculation involves applying a candidate transformation
(T), to each voxel coordinate (
r
v
) in the RI to find the
corresponding voxel coordinates in the FI (
f
v
). This is
mathematically expressed as:
.
f
r
vTv
=
(2)
The deformation model employed is a six-parameter
rigid transformation model and is represented using a
4 × 4 matrix. The host calculates this matrix based on the
current candidate transformation provided by the
184
Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

optimization routine and sends it to the MI calculator. A
fixed-point representation is used to store the individual
elements of this matrix. The coordinate transformation is
accomplished by a simple matrix multiplication.
2.1.3 Partial volume interpolation. The coordinates
mapped in the FI space (
f
v
) do not normally coincide
with a grid point (integer location), thus requiring
interpolation. Nearest neighbor and trilinear interpolation
schemes have been used most often for this purpose;
however, partial volume (PV) interpolation, introduced by
Maes et al. [15] has been shown to provide smooth
changes in the histogram values with small changes in
transformation. The reported architecture consequently
implements PV interpolation as the choice of interpolation
scheme.
f
v
in general, will have both fractional and
integer components and will land within an FI
neighborhood of size 2 × 2 × 2. The interpolation weights
required for the PV interpolation are calculated using the
fractional components of
f
v
. Fixed-point arithmetic is
used to compute these interpolation weights. The
corresponding floating voxel intensities are fetched by the
image controller in parallel using the integer components
of
f
v
. The image controller also fetches the voxel
intensity corresponding to
r
v
. The MH then must be
updated for each pair of reference and floating voxel
intensities (eight in all), using the corresponding weights
computed by the PV interpolator.
2.1.4. Image memory access. The typical size of 3D
medical images prevents the use of high-speed memory
internal to the FPGA for their storage. Between the two
images, the RI has more relaxed access requirements,
because it is accessed in a sequential manner (in
z
y
x
order). This kind of access benefits from burst accesses
and memory caching techniques, allowing the use of
modern dynamic random access memories (DRAMs) for
image storage. For the architecture presented, both the RI
and FI are stored in separate logical partitions of the same
DRAM module. Because the access to the RI is sequential
and predictable, the architecture uses internal memory to
cache a block of RI voxels. Thus, during the processing of
that block of RI voxels, the image controller has parallel
access to both RI and FI voxels. The RI voxels are fetched
from the internal FPGA memory, whereas the FI voxels
are fetched directly from the external memory.
The FI, however, must be accessed randomly
(depending on the current transformation
T) and eight FI
voxels (a 2 × 2 × 2 neighborhood) must be fetched for
every RI image voxel to be processed. To meet this
memory access requirement, the reported architecture
employs a memory addressing scheme similar to the cubic
addressing technique reported in the context of volume
rendering [18]. A salient feature of this technique is that it
allows simultaneous access to the entire 2 × 2 × 2 voxel
neighborhood. The reported architecture implements this
technique by storing four copies of the FI and taking
advantage of the burst mode accesses native to modern
DRAMs. The image voxels are arranged sequentially such
that, performing a size two burst fetches two adjacent
2 × 2 neighborhood planes, thus making the entire
neighborhood available simultaneously. The image
intensities of this neighborhood are then further used for
updating the MH.
2.1.5. Updating the mutual histogram. For a given RI
voxel (
RV), there are eight intensity pairs (RV, FV
0
: FV
7
)
and corresponding interpolation weights. Because the MH
must be updated (read–modify–write) at these eight
locations, this amounts to 16 accesses to MH memory for
each RI voxel. This high memory access requirement is
handled by using the high-speed, dual-ported memories
internal to the FPGA to store the MH. The operation of
updating the MH is pipelined and, hence, read-after-write
(RAW) hazards can arise if consecutive transactions
attempt to update identical locations within the MH. The
reported design addresses this issue by introducing pre-
accumulate buffers, which aggregate the weights from all
conflicting transactions. Thus, all the transactions leading
to a RAW hazard are converted into a single update to the
MH, thereby eliminating any RAW hazards.
While the MH is being computed, the individual
histogram accumulator unit computes the histograms for
the RI and FI. These individual histograms are also stored
using internal, dual-ported memories. The valid voxel
Figure 1: Top-level block diagram of FPGA-based
architecture for MI calculation
185
Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

counter module keeps track of the number of valid voxels
accumulated in the MH and calculates its reciprocal value.
The resulting value is then used by the entropy calculation
unit for calculating the individual and joint probabilities.
2.1.6. Entropy Calculation. The final step in MI
calculation is to compute joint and individual entropies
using the joint and individual probabilities, respectively.
To calculate entropy, it is necessary to evaluate the
function
f(p) = p·ln(p) for all the probabilities. As each
probability
p takes values between [0,1], the
corresponding range for the function
f(p) is [e
1
,0]. Thus,
f(p) has a finite dynamic range and is defined for all values
of
p. Several methods for calculating logarithmic functions
in hardware have been reported [19], but of particular
interest is the multiple lookup table (LUT)–based
approach introduced by Castro-Pareja et al. [20]. This
approach minimizes the error in representing
f(p) for a
given number and size of LUTs and, hence, is accurate
and efficient. Following this approach, the reported design
implements
f(p) using multiple LUT–based piecewise
polynomial approximation.
3. Multiobjective optimization
The aforementioned architecture is designed to
accelerate the calculation of MI for performing medical
image registration. We have demonstrated this architecture
to be capable of offering execution performance superior
to that of a software implementation [17]. The accuracy of
MI calculation (and by extension, that of image
registration) offered by this implementation, however, is a
function of wordlengths chosen for the internal variables
of the design. Similarly, these wordlengths also control the
hardware implementation cost of the design. For medical
applications, the ability of an implementation to achieve
the desired level of accuracy is of paramount importance.
It is, therefore, necessary to understand the tradeoff
between accuracy and hardware implementation cost for
this design and to identify wordlength configurations that
provide effective tradeoffs between these conflicting
criteria. This multiobjective optimization will allow a
designer to systematically maximize accuracy for a given
hardware cost limitation (imposed by a target device, for
example) or minimize hardware resources to meet the
accuracy requirements of a medical application.
The following section provides a formal definition of
this problem and the subsequent section describes a
framework for multiobjective optimization of FPGA-
based medical image registration.
3.1. Problem statement
Consider a system
Q that is parameterized by N
parameters
n
i
(i = 1, 2, …, N), where each parameter can
take a single value from a corresponding set of valid
values (
v
i
). Let the design configuration space
corresponding to this system be
S, which is defined by a
set consisting of all
N-tuples generated by the Cartesian
product of the sets
v
i,
i :
123
.
N
Svvv v
=
×××× (3)
The size of this design configuration space is then equal
to the cardinality of the set
S or, in other words, the
product of cardinalities of the sets
v
i
:
123
.
N
Svvv v=×××× (4)
For most systems, not all configurations that belong to
S
may be valid or practical. We therefore define a subset
(
S), such that it contains all the feasible system
configurations. Now consider
m objective functions (f
1
, f
2
,
…,
f
m
) defined for system Q, such that each function
associates a real value for every feasible configuration
c∈ℑ.
The problem of multiobjective optimization is then to
find a set of solutions that simultaneously optimize the
m
objective functions according to an appropriate criterion.
The most commonly adopted notion of optimality in
multiobjective optimization is that of Pareto optimality.
According to this notion, a solution
c
is Pareto optimal if
there does not exist another solution
c∈ℑ such that
f
i
(c) f
i
(c
), for all i, and f
j
(c) < f
j
(c
), for at least one j.
Given a multiobjective optimization problem and a
heuristic technique for this problem that attempts to derive
Pareto-optimal or near-Pareto-optimal solutions, we refer
to solutions derived by the heuristic as “Pareto-optimized”
solutions.
3.2. Multiobjective optimization framework
Figure 2 illustrates the framework that we have
developed for multiobjective optimization of the
aforementioned architecture. There are two basic
components of this framework. The first component is the
search algorithm that explores the design space and
generates feasible candidate solutions; and the second
component is the objective function evaluation module
that evaluates candidate solutions. The solutions and
associated objective values are fed back to the search
algorithm, so that they can be used to refine the search.
These two components are loosely coupled so that
different search algorithms can be easily incorporated into
the framework. Moreover, the objective function
evaluation module is parallelized using a message passing
interface (MPI) on a 32-processor cluster. With this
parallel implementation, multiple solutions can be
evaluated in parallel, thereby increasing search
186
Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

performance. These components are described in detail in
the following sections.
3.2.1. Design parameters.
As described in the earlier
section, the architecture performs MI calculation using a
fixed-point datapath. As a result, the accuracy of MI
calculation depends on the precision (wordlength) offered
by this datapath. The design parameters in this datapath
define the design space and are identified and listed along
with the corresponding design module (see Figure 1) in
Table 1.
A fixed-point representation consists of an integer part
and a fractional part. The number of bits assigned to these
two parts are called the integer wordlength (IWL) and
fractional wordlength (FWL), respectively. The collective
number of bits allocated to these parts control the range
and precision of the fixed-point representation. For this
architecture, the IWL required for each design parameter
can be deduced from the range information specific to the
image registration application. For example, in order to
support translations in the range of [–64, 63] voxels, 7 bits
of IWL (with 1 bit assigned as a sign bit) are required for
the translation parameter. We used similar range
information to choose the IWL for all the parameters, and
these values are reported in Table 1. The precision
required for each parameter, which is determined by its
FWL, is not known a priori. We, therefore, determine this
by performing multiobjective optimization using the FWL
of each parameter as a design variable. In our experiments,
we used the design range of [1, 32] bits for FWLs of all
the parameters. The optimization framework can support
different wordlength ranges for different parameters,
which can be used to account for additional design
constraints, such as, for example, certain kinds of
constraints imposed by third-party intellectual property.
The entropy calculation module is implemented using a
multiple LUT–based approach and also employs fixed-
point arithmetic. However, this module has already been
optimized for accuracy and hardware resources, as
described in [20]. The optimization strategy employed in
[20] uses an analytical approach that is specific to entropy
calculation and is distinct from the strategy presented in
this work. This module, therefore, does not participate in
the multiobjective optimization framework of this paper,
and we simply use the optimized configuration identified
earlier. This further demonstrates the flexibility of our
optimization framework to accommodate arbitrary
designer- or externally-optimized modules.
3.2.2. Search algorithms.
An exhaustive search that
explores the entire design space is guaranteed to find all
Pareto-optimal solutions. However, this search can lead to
unreasonable execution time, especially when the
objective function evaluation is computationally intensive.
For example, with four design variables, each taking one
of 32 possible values, the design space consists of 32
4
solutions. If the objective function evaluation takes 1
minute per trial (which is quite realistic for multiple MI
calculation using large images), the exhaustive search will
take 2 years. Consequently, we considered other search
methods as described below.
The first method is
partial search, which explores only
a portion of the entire design space. For every design
variable, the number of possible values it can take is
reduced by half by choosing every alternate value. A
complete search is then performed in this reduced search
space. This method, although not exhaustive, can
effectively sample the breadth of the design space. The
second method is
random search, which involves
randomly generating a fixed number of feasible solutions.
For both of these methods, Pareto-optimized solutions are
identified from the set of solutions explored.
Figure 2: Framework for multiobjective optimization of
FPGA-based image registration
Table 1: Design variables for FPGA-based architecture. Integer wordlengths are determined based on application-specific
range information, and fractional wordlengths are used as parameters in the multiobjective optimization framework
Architectural
Module
Design
Variable
Integer wordlength
(IWL ) (bits)
Fractional wordlength (FWL)
range (bits)
Translation vecto
r
7 [1,32] Voxel coordinate
transformation
Rotation matrix 4 [1,32]
Partial volume interpolation Floating image address 27 [1,32]
Mutual histogram accumulation Mutual histogram bin 25 [1,32]
187
Authorized licensed use limited to: University of Maryland College Park. Downloaded on January 17, 2009 at 16:06 from IEEE Xplore. Restrictions apply.

Citations
More filters
Proceedings ArticleDOI

Reconfigurable acceleration of 3D image registration

TL;DR: It is shown that a single core on 412MHz XC5VLX330T FPGA can evaluate a rigid transformation of a 3D image with 16 million voxels in 35ms, over 108 times faster than a multi-threaded implementation running on a 2.5GHz Intel Quad-Core Xeon platform.
ReportDOI

High-Performance 3D Image Processing Architectures for Image-Guided Interventions

TL;DR: This dissertation is geared toward developing high-performance 3D image processing architectures, which will enable improved intraprocedural visualization and navigation capabilities during IGIs.
References
More filters
Journal ArticleDOI

A fast and elitist multiobjective genetic algorithm: NSGA-II

TL;DR: This paper suggests a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties, and modify the definition of dominance in order to solve constrained multi-objective problems efficiently.
Journal ArticleDOI

Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach

TL;DR: The proof-of-principle results obtained on two artificial problems as well as a larger problem, the synthesis of a digital hardware-software multiprocessor system, suggest that SPEA can be very effective in sampling from along the entire Pareto-optimal front and distributing the generated solutions over the tradeoff surface.
Journal ArticleDOI

Multimodality image registration by maximization of mutual information

TL;DR: The results demonstrate that subvoxel accuracy with respect to the stereotactic reference solution can be achieved completely automatically and without any prior segmentation, feature extraction, or other preprocessing steps which makes this method very well suited for clinical applications.
Related Papers (5)
Frequently Asked Questions (8)
Q1. What are the contributions mentioned in the paper "Multiobjective optimization of fpga-based medical image registration" ?

This paper presents a multiobjective optimization strategy developed in the context of fieldprogrammable gate array–based implementation of medical image registration. Within this framework, the authors compare several search methods and demonstrate the applicability of an evolutionary algorithm–based search for efficiently identifying superior multiobjective tradeoff curves. 

An emerging trend in real-time signal processing systems is to accelerate computationally intensive algorithmic components by mapping them to custom or reconfigurable hardware platforms, such as applicationspecific integrated circuits (ASICs) and fieldprogrammable gate arrays (FPGAs). 

The initial step in MI calculation involves applying a candidate transformation (T), to each voxel coordinate ( rv ) in the RI to find the corresponding voxel coordinates in the FI ( fv ). 

Other heuristic techniques that take into account tradeoffs between hardware cost and implementation error and enable automatic conversion from floating-point to fixed-point representations are limited to software implementations only [26]. 

The authors have developed a parameterized, bit-true emulation of the FPGA-based architecture that is capable of calculating the MI valuecorresponding to any feasible configuration for a given image transformation. 

Quantitative comparison of the Pareto-optimized solution sets is essential in order to compare more precisely the effectiveness of various search methods. 

These factors, along with its execution time, in their experience, may render registration accuracy as an unsuitable objective function, especially if there is nonmonotonic behavior with respect to the wordlength of design variables. 

Because the MH must be updated (read–modify–write) at these eight locations, this amounts to 16 accesses to MH memory for each RI voxel.