
RECONFIGURABLE ACCELERATION OF 3D IMAGE REGISTRATION
Kuen Hung Tsoi, Daniel Rueckert, Chun Hok Ho and Wayne Luk
Department of Computing
Imperial College London, UK
{khtsoi,dr,cho,wl}@doc.ic.ac.uk
ABSTRACT
This paper proposes techniques for accelerating a software based image registration algorithm for 3D medical images targeting a reconfigurable hardware platform. Various methods, including dedicated fixed-point arithmetic, error model based bit-width analysis, architecture exploration and application-specific memory modules, are applied to address issues from the software algorithm and to maximize the performance of FPGA technology. Based on the reconfigurability of FPGA devices, the system can be extended to swap modules optimized for different parameters, and to adopt more advanced registration algorithms. We show that a single core on a 412MHz XC5VLX330T FPGA can evaluate a rigid transformation of a 3D image with 16 million voxels in 35ms. With 30 cores on an FPGA, it is over 108 times faster than a multi-threaded implementation running on a 2.5GHz Intel Quad-Core Xeon platform.
1. Introduction
Reconfigurable platforms have been widely adopted to accelerate computationally intensive digital signal processing algorithms. One example of an application well suited to reconfigurable computing is medical image analysis. Medical images can be obtained from various sources including X-rays, radionuclide scans such as Single Photon Emission Computed Tomography (SPECT) and Positron Emission Tomography (PET), CT scans, ultrasound and magnetic resonance imaging (MRI). In practice, images acquired by sampling the subject at different times may have different coordinate systems due to axis rotation, changes in subject position or equipment variation. The process of transforming these images into a unified coordinate system is called image registration (IR). Comparison or integration of different images of the subject can be performed after the IR process.
The main objective of the IR process is not to extract features from the images directly but to optimize the set of parameters used to transform target images into the coordinate system of the source images. During the optimization process, the translation, rotation and scaling parameters are tuned iteratively, and the cost function is based on the similarity of the transformed images.
The Image Registration Toolkit (ITK) [1] is an image registration framework which provides a broad range of registration solutions for both 2D and 3D MR images in an object-oriented C++ design. Despite the flexibility of software implementations, the improving resolution of imaging equipment, the increasing number of images to be processed and the need for real-time diagnosis make dedicated hardware platforms increasingly attractive in practical IR applications, as higher computing power is required.
Several studies have investigated IR implementations on reconfigurable platforms. In 2003, a field programmable gate array (FPGA) based IR implementation using a B-spline free-form deformation algorithm was presented [2]. The design is implemented using the Handel-C language on a Xilinx Virtex-II device (XC2V6000). Clocked at 67MHz, it achieves a 3.2x speedup over the software version on a 2.6GHz Xeon CPU. In 2006, Altera released a video and image processing suite for common image functions on FPGAs [3]. With a streaming interface and coupling with high-level descriptions in MATLAB, this development environment enables fast and optimized implementations of medical image processing. An FPGA based mutual information evaluation system for IR is proposed in [4]. The system utilizes a data-flow model to improve hardware parallelism through sub-volume division. In 2007, a reconfigurable computing platform was used for classifying brain tissues [5]. With the help of four Xilinx Virtex-4 LX200 FPGAs running at 100MHz, the software/hardware co-designed system achieves a 3.5x speedup over a pure software implementation on an SGI Altix 350 system. In 2008, a multi-objective optimization framework for trading off precision and resources was proposed [6]. The system uses multiple copies of the image in external memory for simultaneous voxel access. However, these studies do not provide a detailed error analysis of the accuracy. They also do not consider the reconfigurability of FPGA devices, which can facilitate optimization for specific and changing parameters.
This paper describes a novel approach for optimizing a reconfigurable accelerator for an IR kernel derived from the ITK package, taking into account data- and platform-dependent parameters. The contributions are:

- Transforming a software floating-point 3D IR kernel into a fixed-point, multiple bit-width reconfigurable system. Optimization of the operator bit widths is carried out based on an analytic error model. The new design is more cost effective in terms of resource utilization, and this is achieved without sacrificing precision or accuracy.

- Classes of application-specific cache systems to address the large memory bandwidth requirement. Based on the transformation parameters, different levels of reduction in external memory references can be achieved with different on-chip memory capacities.

- A modularized framework for accelerating 3D IR on a reconfigurable platform. Utilizing the reconfigurability of FPGA devices, users can select and load designs optimized for different system environments during the IR process.
The remainder of the paper is organized as follows. Section 2 introduces the 3D IR algorithm and the associated system design challenges. Section 3 analyzes the computational requirements of the IR algorithm and discusses the proposed arithmetic scheme. Section 4 analyzes the problem of external memory references and introduces a set of caching schemes for different transformation parameters. Section 5 presents the modularized framework for image registration. Implementation and results are shown in Section 6, followed by the conclusion drawn in Section 7.
2. 3D Image Registration
Inputs to the IR process are specially formatted 3D medical images with X x Y x Z pixels. Common CT and MRI formats are 256^3 or 512^3 pixels in size [7]. A 16-bit integer value is assigned to each pixel to represent its gray-scale intensity.
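To make the data layout concrete, the sketch below shows a minimal C++ container for such an image, assuming row-major storage with X varying fastest; the class and helper names are ours, not taken from the ITK code.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Minimal 3D image holding 16-bit grey-scale intensities.
// Storage is row-major with X varying fastest, so the linear
// offset of (x, y, z) is z*X*Y + y*X + x.
class Image3D {
public:
    Image3D(std::size_t X, std::size_t Y, std::size_t Z)
        : X_(X), Y_(Y), Z_(Z), voxels_(X * Y * Z, 0) {}

    std::size_t offset(std::size_t x, std::size_t y, std::size_t z) const {
        return z * X_ * Y_ + y * X_ + x;
    }

    std::int16_t& at(std::size_t x, std::size_t y, std::size_t z) {
        return voxels_[offset(x, y, z)];
    }
    std::int16_t at(std::size_t x, std::size_t y, std::size_t z) const {
        return voxels_[offset(x, y, z)];
    }

    std::size_t X() const { return X_; }
    std::size_t Y() const { return Y_; }
    std::size_t Z() const { return Z_; }

private:
    std::size_t X_, Y_, Z_;
    std::vector<std::int16_t> voxels_;  // 16-bit intensity per pixel
};
```

With this layout, the offsets of the eight neighbours used later in the interpolation are 0, 1, X, X+1, X*Y, X*Y+1, X*Y+X and X*Y+X+1.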
The registration process finds the optimal transform for aligning two 3D images. In our example, the Rigid Registration kernel (RReg) [1] aligns 3D images through rotations and translations. After alignment, the similarity of the images is measured by the averaged Sum of Squared Differences (SSD) over all pixels of interest. Using the SSD value, a monitoring program iterates the RReg kernel with a new set of transformation parameters generated by the optimization algorithm. Operations in the RReg kernel include transforming pixel coordinates, fetching reference pixel values, and interpolating the expected values of the transformed pixels. Fig. 1 abstracts the process of the RReg kernel.
A homogeneous transformation function, T, is applied to each coordinate in the target image to find the projected coordinate, (x, y, z) = T(i, j, k), in the source image. In the RReg kernel, the T function is simplified to three vectors: xd, yd and zd. These vectors are added to the current projected coordinate along the corresponding traversal directions to generate the next projected coordinate. Since the transformation parameters are floating-point values, the projected coordinate is also represented in floating-point format. Thus tri-linear interpolation is used to estimate the gray-scale value at the projected position using eight neighboring pixel values, P1 to P8, and two weight vectors, w1 and w2, as shown in Fig. 1. The SSD value is updated by comparing this result to the pixel value in the target image.
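As a software reference for this datapath, the sketch below performs the tri-linear interpolation in plain floating point. The pairing of the weight components (t, u, v) with the X, Y and Z axes, and the function name, are our assumptions for illustration.

```cpp
#include <cstdio>

// Tri-linear interpolation of the eight neighbouring pixel values
// P1..P8 around the projected coordinate. w1 = (t1, u1, v1) are the
// fractional parts of the coordinate and w2 = 1 - w1 (Fig. 1(b)-(c)).
double trilinear(const double p[8], double t1, double u1, double v1) {
    const double t2 = 1.0 - t1, u2 = 1.0 - u1, v2 = 1.0 - v1;
    // Leaf level: interpolate along X over the four edges of the cube.
    const double e12 = p[0] * t2 + p[1] * t1;   // P1-P2
    const double e34 = p[2] * t2 + p[3] * t1;   // P3-P4
    const double e56 = p[4] * t2 + p[5] * t1;   // P5-P6
    const double e78 = p[6] * t2 + p[7] * t1;   // P7-P8
    // Middle level: interpolate along Y.
    const double f1 = e12 * u2 + e34 * u1;
    const double f2 = e56 * u2 + e78 * u1;
    // Root level: interpolate along Z; the result is compared against
    // the target pixel to update the SSD.
    return f1 * v2 + f2 * v1;
}

int main() {
    const double p[8] = {100, 110, 120, 130, 140, 150, 160, 170};
    std::printf("%.2f\n", trilinear(p, 0.25, 0.5, 0.5));
}
```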
For medical purposes, only certain areas of the images are needed for diagnosis, so not all pixels in the target image are involved in the similarity measurement. To avoid unnecessary computation and memory accesses, a skipping scheme is encoded in the image: for any pixel with a negative value, s, the kernel skips the following |s| pixels along the X direction in the target image.
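A minimal sketch of how such a skip code might be consumed while scanning one X-row of the target image; anything beyond "a negative value s means skip the following |s| pixels" is our assumption.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Walk one X-row of the target image. A non-negative value is a pixel
// of interest; a negative value s means the following |s| pixels can be
// skipped without being transformed or interpolated.
int main() {
    // Hypothetical row: two pixels of interest, a "skip 3" marker, then more data.
    const std::vector<std::int16_t> row = {50, 60, -3, 0, 0, 0, 70, 80};
    for (std::size_t x = 0; x < row.size(); ++x) {
        const std::int16_t v = row[x];
        if (v < 0) {
            x += static_cast<std::size_t>(-v);   // skip |s| pixels
            continue;
        }
        std::printf("process pixel %zu (value %d)\n", x, v);  // transform + interpolate
    }
}
```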
The original RReg kernel is implemented in object-oriented C++ using IEEE 754 single-precision floating-point arithmetic. While this software implementation provides a convenient means for researchers to test comprehensive registration algorithms, it suffers from the long processing times common to most software based IR implementations. In general, it takes the ITK software tens of minutes to align two images on a PC platform. We focus on improving this kernel since it lies in the main loop, which is iterated several hundred times, e.g. over 300 times in our example.
Analysis of the software kernel indicates that 92% of the computation consists of floating-point add and multiply operations, which are expensive in reconfigurable hardware in terms of resource consumption and critical path length. We observe that half of the inputs to the tri-linear interpolation tree are 16-bit non-negative integers (the pixel values) and the other half are values within the range [0, 1] (the weight vectors). This suggests that more efficient dedicated operators can be used instead of the original floating-point version. It is also important to maintain correct outputs after applying the dedicated operators. Due to the read-only pixel memory accesses and the independent transformation and interpolation of each pixel, it is possible to distribute the workload across parallel kernels in hardware and to deeply pipeline each kernel for maximum throughput. The limiting factor of the hardware kernel is memory bandwidth.
Since the fast on-chip memory in modern FPGA devices is insufficient to store the 32MB to 256MB 3D images, it is unavoidable to store the pixel data in off-chip memory, which is relatively slow and requires multi-cycle access times. The solution is to reduce off-chip memory accesses by using on-chip memory blocks as caches. The memory access pattern in the IR process is regular, non-contiguous and highly dependent on the transformation parameters. This suggests that customized caching systems are needed to achieve better performance than traditional CPU caches.
There are two domains to be optimized: (a) parameterized kernels and memory systems with different throughputs, and (b) platform-dependent constraints including logic resources and memory bandwidth. The challenges here are to analyze the relationship between the two domains and to
provide a framework to utilize suitable designs under given conditions, so that the mapping from various algorithms and problem sizes to various target reconfigurable platforms has predictable performance and can be automated.

Fig. 1. RReg kernel operations. (a): Coordinate transformation from target image to source image; (b): Finding the weight vectors and values of neighboring pixels; (c): Tri-linear interpolation.
3. Kernel Architecture
One critical component of the hardware IR kernel is the tree structure for interpolation, as shown on the right of Fig. 1. It is built by recursively applying the basic sum-of-product structure shown in Fig. 2. Here, the c1 and c2 inputs are components from the weight vectors. The p1 and p2 inputs are from pixel values or previous sum-of-product structures, with range [p_min, p_max]. h and k are the outputs of the multipliers and r is the output of the adder. The main objective of the accelerator is to improve the performance of these multiply and add operators.
Fig. 2. Basic structure of the interpolation tree: c1 + c2 = 1; p1, p2 in [p_min, p_max]; number formats: p1, p2 : [i_p : f_p], c1, c2 : [0 : f_w], h, k : [i_m : f_m], r : [i_a : f_a].
Although fixed-point arithmetic has a clear advantage in area utilization over its floating-point counterpart in reconfigurable logic, it suffers from limited dynamic range. In order to employ fixed-point arithmetic in the IR kernel without introducing errors, it is necessary to characterize the inputs, the internal tasks performed and the output requirements. In this paper, we use the notation [i : f] for fixed-point number formats, where i is the number of bits for the integral part and f is the number of bits for the fractional part.
The format of the pixel values is [16 : 0]. The projected coordinate is of the form ([log2(X) : f], [log2(Y) : f], [log2(Z) : f]). Coordinates outside the X, Y and Z boundaries of the source image are discarded. Using this representation, the coordinate of the base pixel P1 and the first weight vector w1 are simply the integral and fractional parts of the projected coordinate. The second weight vector can be computed using an f-bit fixed-point subtraction as w2 = 1 - w1.
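The sketch below illustrates this split of one axis of the fixed-point projected coordinate into the base pixel index and the two weight components. The fraction width of 22 bits anticipates the f_w chosen later in this section; the helper names are illustrative.

```cpp
#include <cstdint>
#include <cstdio>

// One axis of the projected coordinate in [i : F] fixed point.
// The integral part is the base pixel index (e.g. the X index of P1)
// and the fractional part is the corresponding component of w1.
// w2 = 1 - w1 is an F-bit subtraction.
constexpr unsigned F = 22;               // fraction bits (f_w in the text)
constexpr std::uint32_t ONE  = 1u << F;  // fixed-point 1.0
constexpr std::uint32_t FRAC = ONE - 1;  // fraction mask

struct AxisSplit {
    std::uint32_t index;   // integral part: base pixel coordinate
    std::uint32_t w1;      // fractional part, F bits
    std::uint32_t w2;      // 1 - w1
};

AxisSplit split_axis(std::uint32_t coord_fixed) {
    AxisSplit s;
    s.index = coord_fixed >> F;
    s.w1    = coord_fixed & FRAC;
    s.w2    = ONE - s.w1;
    return s;
}

int main() {
    const std::uint32_t c = (37u << F) | (ONE >> 2);   // 37.25 in [i : 22]
    const AxisSplit s = split_axis(c);
    std::printf("index=%u  w1=%g  w2=%g\n",
                static_cast<unsigned>(s.index),
                static_cast<double>(s.w1) / ONE,
                static_cast<double>(s.w2) / ONE);
}
```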
The value of f affects the size and speed of the operators in the interpolation tree. In this design, it is determined by applying the precision analysis of Affine Arithmetic (AA) [8] recursively to the tree structure. Let
E(c̃1) = 2^{-f_w-1} ε1,   E(c̃2) = 2^{-f_w-1} ε2,
E(p̃1) = 2^{-f_p-1} ε3,   E(p̃2) = 2^{-f_p-1} ε4,
E(h̃) = w1 E(p̃1) + p1 E(c̃1) + E(p̃1) E(c̃1) + 2^{-f_m-1} ε5,
E(k̃) = w2 E(p̃2) + p2 E(c̃2) + E(p̃2) E(c̃2) + 2^{-f_m-1} ε6,
E(r̃) = E(h̃) + E(k̃) + 2^{-f_a-1} ε7    (1)
be the error model of each variable (the edges in the tree). Here f_w, f_p, f_m and f_a are the fraction sizes in bits of the input w, the input p, the multipliers and the adder; h and k use the same fraction size due to the symmetric tree structure. ε1..ε7 in [-1, 1] are the error sources from the round-to-nearest process. To preserve accuracy, the required output error, Ê(r̃), should be less than or equal to 1 ulp. That is,
Ê(r̃) ≤ 2^{-f_r}.    (2)
Given the symmetric structure and the fact that w1 + w2 = 1, the relation between the fraction sizes of the edges of the tree structure can be found by substituting Equation 1 into Equation 2 and taking the maximum value of ε to be 1 and the maximum value of p to be P. Thus
2^{-f_a-1} ≥ 2^{-f_p-1} + P·2^{-f_w} + 2^{-f_p-f_w-2} + 2^{-f_m}.    (3)
This method is applied recursively to the interpolation tree to find the relation between the fraction sizes of all variables in the tree and the final rounding precision. A more detailed error analysis can be found in [8]. In this application, several attributes of the structure help to simplify the total error analysis. By partitioning the interpolation tree into three levels of sub-trees, L1, L2 and L3, each with the basic structure shown in Fig. 2, we have:

- All values from the weight vectors have the same format, and f_w > max(log2(X), log2(Y), log2(Z)).

- All p inputs at the first (leaf) level are exact integer values from the 3D image array, so E(p_L1) = 0.

- All operators on the same level of the interpolation tree have the same input and output formats.

- In every sub-tree, w1 + w2 = 1 and the two p inputs have the same range, so the output r has the same range limit as the p inputs.

- The final output of the interpolation is rounded to a 16-bit integer for the SSD computation, i.e. f_a3 = 0.
Thus the error model at each level can be constructed as

Ê(L1) = P·2^{-f_w} + 2^{-f_m1} + 2^{-f_a1-1},    (4)
Ê(L2) = Ê(L1) + P·2^{-f_w} + Ê(L1)·2^{-f_w-1} + 2^{-f_m2} + 2^{-f_a2-1},    (5)
Ê(L3) = Ê(L2) + P·2^{-f_w} + Ê(L2)·2^{-f_w-1} + 2^{-f_m3} + 2^{-1},  and    (6)
Ê(L3) ≤ 1.    (7)
Substitution and expansion of the above inequalities show the relationship between the system parameters and the precision of the interpolated output. The interpolation tree can be optimized for more efficient area utilization by reducing the sizes of f_a, f_m and f_w respectively. The following characteristics of reconfigurable platforms are considered when evaluating the performance:

- A multiplier is more expensive than an adder when implemented using LUT primitives.

- Dedicated multiplier blocks in modern FPGA devices have fixed bit widths, so the cost of a multiplier increases discretely with increasing bit width.

- Reducing the bit width of nodes near the leaves has a larger impact on area than reducing that of nodes near the root of the tree structure.
Assuming X = Y = Z = 2^8, one set of parameters optimized for area and fulfilling the precision requirement in Equation 7 is: f_w = 22; f_a1 = 6; f_m1 = 2; f_a2 = 7; f_m2 = 3; f_m3 = 4.
4. Memory System
Speeding up memory access is essential in this design, since the interpolation process can start only after all eight pixel values from the source image are ready. Let ρ be the fraction of pixels of interest for registration in an image with N pixels. The time for an evaluation process in a fully pipelined architecture is

T_eva = N × ρ × (M_target + M_source) / f,    (8)
where M_target and M_source are the numbers of clock cycles for memory accesses reading data from the target and source images, and f is the working frequency of the RReg kernel. The memory accesses required for skipping uninterested regions are ignored here, as they contribute less than 1% of T_eva. In our analysis, we assume 2^24 pixels in each image, with 95% of them of interest for registration. The memory bottleneck is the result of asymmetric input and output data rates: for a single interpolated output value, which updates the final SSD, 1 pixel from the target image and 8 pixels from the source image are required. To prevent the hardware from being idle, the memory bandwidth in random access mode must be at least 9 times the processing speed, which is not realistic in most environments.
In a system with SDRAM, the CAS latency, T_CL, dominates the evaluation time. For example, when using a single 32-bit channel of DDR2 SDRAM for both the target and source images, M_target = T_CL and M_source = T_CL × 4. Following Equation 8, a DDR2-based design at 200MHz with T_CL = 3 can complete an evaluation in 598ms.
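Equation 8 is straightforward to tabulate. The sketch below does so for two effective per-pixel access costs. The effective costs on real DDR2 depend on burst behaviour, bus width and clock ratios, so the pairs used here are simply values under which the model reproduces the 598ms figure above and the 120ms figure for the multi-bank design discussed next.

```cpp
#include <cstdio>

// Equation 8: T_eva = N * rho * (M_target + M_source) / f.
// N   - pixels per image, rho - fraction of pixels of interest,
// M_* - average kernel-clock cycles per target/source access,
// f   - kernel clock frequency in Hz.
double evaluation_time(double N, double rho,
                       double M_target, double M_source, double f) {
    return N * rho * (M_target + M_source) / f;
}

int main() {
    const double N = 1 << 24, rho = 0.95, f = 200e6;
    // Illustrative effective per-pixel cycle counts (assumptions):
    //   {1.5, 6.0} -> ~598 ms, single-channel DDR2 example;
    //   {0.0, 1.5} -> ~120 ms, four banks with streamed target pixels.
    const double cases[][2] = { {1.5, 6.0}, {0.0, 1.5} };
    for (const auto& c : cases)
        std::printf("M_t=%4.1f M_s=%4.1f -> T_eva=%6.1f ms\n",
                    c[0], c[1], evaluation_time(N, rho, c[0], c[1], f) * 1e3);
}
```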
Multiple memory channels can be used to improve the performance of a reconfigurable platform. For example, four independent DDR2 SDRAM banks are attached to the Virtex-5 FPGA on the Alpha Data ADM-XRC-5T2 platform [9]. To utilize this bandwidth, the source image is split into four groups according to the even and odd positions in the Y and Z directions. Thus the four edges, P1P2, P3P4, P5P6 and P7P8, of the neighboring pixel cube in Fig. 1(c) reside evenly in the four groups. Mapping the groups to the memory banks in the 5T2 system reduces the source image access time, i.e. M_source = T_CL. The pixels of the target image can be streamed from the host system to the FPGA internal memory as an extra memory channel. A fully pipelined design on the 5T2 at 200MHz can complete an evaluation process in 120ms.
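A sketch of this even/odd interleaving, checking that the four X-aligned edges of any neighbouring-pixel cube land in four distinct banks; the bank numbering is our own convention.

```cpp
#include <cstdio>

// Bank selection from the parity of the Y and Z coordinates.
// With this interleaving the four X-aligned edges P1P2, P3P4,
// P5P6, P7P8 of a neighbouring cube fall in four distinct banks.
unsigned bank(unsigned y, unsigned z) {
    return (y & 1u) | ((z & 1u) << 1);   // 0..3
}

int main() {
    const unsigned y = 101, z = 42;      // arbitrary base coordinate
    std::printf("P1P2 -> bank %u\n", bank(y,     z));
    std::printf("P3P4 -> bank %u\n", bank(y + 1, z));
    std::printf("P5P6 -> bank %u\n", bank(y,     z + 1));
    std::printf("P7P8 -> bank %u\n", bank(y + 1, z + 1));
}
```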
The fast internal memory of the FPGA can be used as a cache layer to further improve overall performance. The small cache size in the FPGA is the limiting factor of this improvement. Under the limited cache size, we can improve cache efficiency by capturing application-specific information, such as a custom cache design for a predictable access pattern. In this work, an optimized cache system is constructed to reduce external memory references, with larger reductions requiring more on-chip memory.
This cache system provides faster access to the neighboring pixel values used in the interpolation process. The efficiency of the cache depends on the number of pixels that can be retrieved from it in each interpolation. Since the RReg kernel traverses along the X direction by accumulating the xd vector for over 99.5% of the transformation, we analyze the cache performance along the X direction as an example. In the following discussion, we present different cache systems suitable for transformation vector values in different ranges, as shown in Fig. 3.
Fig. 3. Dedicated cache systems for the RReg kernel: (a) Uncached system; (b) Previous-half cache; (c) Bottom-line cache; (d) Side-wall cache.

For |xd| ≤ 0.5, 1/|xd| accumulation steps are required to advance the projected coordinate to the next integral pixel cube. Thus a cache system keeping the current eight pixel values will have a miss rate of |xd|, which is always less than 50%. This means the M_source term in Equation 8 is reduced at least by half.
For 0.5 < |xd| < 1, the probability of advancing to the next integral pixel cube along the X direction after each accumulation is |xd|. A cache system is proposed to cache the four pixel values nearest the next integral pixel cube. As shown by the solid circles in Fig. 3(b), p2, p4, p6 and p8 of the current interpolation are stored and may be reused as p1, p3, p5 and p7 in the next interpolation. On a cache hit, the external memory references to the source image are reduced by half, so the source image memory cycles become (1 - |xd|/2) × M_source. The same analysis applies when 1 < |xd| < 1.5, where the cache hit rate becomes 2 - |xd|.
The above analysis is based on the assumption that the Y and Z components of xd are close to zero. The actual cache performance decreases as the Y and Z components increase. Larger caches can further reduce the M_source term by caching the immediately adjacent lines and planes along the Y and Z directions, as shown in Fig. 3(c) and Fig. 3(d). These systems are applied when 0.5 < |yd| < 1.5 and/or 0.5 < |zd| < 1.5. On cache hits, the line and wall caches reduce the external memory references to the source image to 25% and 12.5% respectively.
For 1.5 < |xd|, the probability of advancing to the next integral pixel cube after each accumulation is 2 - |xd|. As it is impossible to stay in the previously projected pixel cube, the cache system for |xd| < 0.5 is not applicable in this case. The cache systems in Fig. 3 will have less than 50% hit rate, and this reduces as |xd| increases. When 2 < |xd|, no previously fetched pixel values can be reused along the X direction, and thus no cache can help to reduce the external memory references.
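Putting these ranges together, a host-side selector might look like the sketch below. The thresholds follow the discussion above, but the choice of when to prefer the line cache over the wall cache, and all names, are our own reading rather than the framework's actual logic.

```cpp
#include <cmath>
#include <cstdio>

enum class CacheScheme { Uncached, CurrentCube, PreviousHalf, BottomLine, SideWall };

// Choose a cache module from the per-step transformation vector,
// following the |xd|, |yd|, |zd| ranges discussed in Section 4.
CacheScheme select_cache(double xd, double yd, double zd) {
    const double ax = std::fabs(xd), ay = std::fabs(yd), az = std::fabs(zd);
    if (ax >= 2.0)
        return CacheScheme::Uncached;       // nothing reusable along X
    if (ay > 0.5 && ay < 1.5 && az > 0.5 && az < 1.5)
        return CacheScheme::SideWall;       // cache a plane of pixels
    if ((ay > 0.5 && ay < 1.5) || (az > 0.5 && az < 1.5))
        return CacheScheme::BottomLine;     // cache a line of pixels
    if (ax <= 0.5)
        return CacheScheme::CurrentCube;    // keep the current 8 pixels
    return CacheScheme::PreviousHalf;       // 0.5 < |xd| < 2: reuse the previous half
}

int main() {
    const char* kNames[] = {"Uncached", "CurrentCube", "PreviousHalf", "BottomLine", "SideWall"};
    std::printf("%s\n", kNames[static_cast<int>(select_cache(0.3, 0.01, 0.02))]);
    std::printf("%s\n", kNames[static_cast<int>(select_cache(0.9, 0.70, 0.02))]);
}
```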
5. Reconfigurable Framework
Custom ASIC designs usually provide better die-size utilization and higher working frequencies than FPGA designs, but reconfigurable devices have the ability to adapt dynamically to changes in system parameters. In IR applications, the software optimizer constantly adjusts the transformation vectors and sends them to the hardware RReg kernel after each similarity evaluation. From the analysis in Section 4, different types of cache systems should be used according to the values of these vectors.
In the proposed framework, the optimizer program has the ability to detect changes in system parameters and to load the corresponding cache modules into hardware during the optimization process. The transformation vectors are registered and updated by the optimizer after each similarity check, and a suitable caching scheme is then selected based on the analysis in Section 4. Only the cache system is reconfigured; the other parts of the RReg kernel stay unchanged. To facilitate this adaptive feature, the RReg kernel is designed in a modular structure with fixed interfaces between modules. The regions of the partial reconfigurable modules (RMs) and the locations of the interface bus macros are defined with the Xilinx PlanAhead tool. We also implement all the cache systems as RMs within the defined region. Finally, the iMPACT programming tool, which can identify a partial bitstream, is called as an external program to reconfigure the FPGA as needed [10].
The time spent on hardware reconfiguration between evaluation processes is considered overhead. The Virtex-5 device has lower reconfiguration latency than earlier generations of FPGA devices and fewer constraints on the shape and location of the RMs [11]. Reconfiguring less than 5% of the XC5VLX330T device, the overhead time, T_cfg, is less than 10ms.
This overhead can be justified by the reduction of external memory references in the successive evaluation processes, as shown below:

T_cfg ≤ E_successive × M_reduced / f,    (9)

where E_successive is the number of successive evaluations before the next reconfiguration, M_reduced is the average number of external memory references saved in each evaluation, and f is the working frequency of the RReg kernel. For example, a 200MHz kernel reconfigured to achieve a 50% reduction in external memory references saves over 36ms in a single evaluation iteration.
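A sketch of the corresponding host-side check, directly encoding Equation 9; the numbers in the example are hypothetical.

```cpp
#include <cstdio>

// Equation 9: reconfiguration pays off when
//   T_cfg <= E_successive * M_reduced / f.
// T_cfg        - reconfiguration overhead in seconds,
// E_successive - evaluations run before the next reconfiguration,
// M_reduced    - average external-memory cycles saved per evaluation,
// f            - kernel clock frequency in Hz.
bool reconfigure_pays_off(double T_cfg, double E_successive,
                          double M_reduced, double f) {
    return T_cfg <= E_successive * M_reduced / f;
}

int main() {
    // Hypothetical numbers: 10 ms overhead, 7.2e6 cycles saved per
    // evaluation at 200 MHz - worthwhile after a single evaluation.
    std::printf("%s\n", reconfigure_pays_off(0.010, 1, 7.2e6, 200e6)
                        ? "reconfigure" : "keep current module");
}
```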

References
- Nonrigid registration using free-form deformations: application to breast MR images.
- Medical image registration.
- Accuracy-Guaranteed Bit-Width Optimization.
- E. Eto. Difference-Based Partial Reconfiguration.
- FPGA-based computation of free-form deformations in medical image registration.