
RECONFIGURABLE ACCELERATION OF 3D IMAGE REGISTRATION
Kuen Hung Tsoi, Daniel Rueckert, Chun Hok Ho and Wayne Luk
Department of Computing
Imperial College London, UK
{khtsoi,dr,cho,wl}@doc.ic.ac.uk
ABSTRACT
This paper proposes techniques for accelerating a software based image registration algorithm for 3D medical images targeting a reconfigurable hardware platform. Various methods, including dedicated fixed-point arithmetic, error model based bit-width analysis, architecture exploration and application-specific memory modules, are applied to address issues from the software algorithm and to maximize the performance of FPGA technology. Based on the reconfigurability of FPGA devices, the system can be extended to swap modules optimized for different parameters, and to adopt more advanced registration algorithms. We show that a single core on a 412MHz XC5VLX330T FPGA can evaluate a rigid transformation of a 3D image with 16 million voxels in 35ms. With 30 cores on an FPGA, it is over 108 times faster than a multi-threaded implementation running on a 2.5GHz Intel Quad-Core Xeon platform.
1. Introduction
Reconfigurable platforms have been widely adopted to accelerate computationally intensive digital signal processing algorithms. One example of an application well suited to reconfigurable computing is medical image analysis. Medical images can be obtained from various sources including X-rays, radionuclide scans such as Single Photon Emission Computed Tomography (SPECT) and Positron Emission Tomography (PET), CT scans, ultrasound and magnetic resonance imaging (MRI). In practice, images acquired by sampling the subject at different times may have different coordinate systems due to axis rotation, changes in subject position or equipment variation. The process of transforming these images into a unified coordinate system is called image registration (IR). Comparison or integration of different images of the subject can be performed after the IR process.
The main objective of the IR process is not to extract features from the images directly but to optimize the set of parameters used to transform target images into the coordinate system of the source images. During the optimization process, the translation, rotation and scaling parameters are tuned iteratively, and the cost function is based on the similarity of the transformed images.
The Image Registration Toolkit (ITK) [1] is an image registration framework which provides a broad range of registration solutions for both 2D and 3D MR images in an object-oriented C++ design. Despite the flexibility of software implementations, the improving resolution of imaging equipment, the increasing number of images to be processed and the need for real-time diagnosis make dedicated hardware platforms increasingly attractive in practical IR applications, as higher computing power is required.
Several studies have investigated IR implementations on reconfigurable platforms. In 2003, a field programmable gate array (FPGA) based IR implementation using a B-spline free-form deformation algorithm was presented [2]. The design is implemented using the Handel-C language on a Xilinx Virtex-II device (XC2V6000). Clocked at 67MHz, it achieves a 3.2x speedup over the software version on a 2.6GHz Xeon CPU. In 2006, Altera released a video and image processing suite for common image functions on FPGAs [3]. With a streaming interface and coupling with high-level descriptions in MATLAB, this development environment enables fast and optimized implementations of medical image processing. An FPGA based mutual information evaluation system for IR is proposed in [4]. The system utilizes a data-flow model to improve hardware parallelism through sub-volume division. In 2007, a reconfigurable computing platform was used for classifying brain tissues [5]. With the help of four Xilinx Virtex-4 LX200 FPGAs running at 100MHz, the software/hardware co-designed system achieves a 3.5x speedup over a pure software implementation on an SGI Altix 350 system. In 2008, a multi-objective optimization framework for trading off precision and resources was proposed [6]. The system uses multiple copies of the image in external memory for simultaneous voxel access. However, these studies do not provide a detailed error analysis of the accuracy. They also do not consider the reconfigurability of FPGA devices, which can facilitate optimization for specific and changing parameters.
This paper describes a novel approach for optimizing a reconfigurable accelerator for an IR kernel derived from the ITK package, taking into account data- and platform-dependent parameters. The contributions are:

- Transforming a software floating-point 3D IR kernel into a fixed-point, multiple bit-width reconfigurable system. Optimization of the operator bit widths is carried out based on an analytic error model. The new design is more cost effective in terms of resource utilization, and this is achieved without sacrificing precision or accuracy.

- Classes of application-specific cache systems to address the large memory bandwidth requirement. Based on the transformation parameters, different levels of reduction in external memory references can be achieved with different on-chip memory capacities.

- A modularized framework for accelerating 3D IR on a reconfigurable platform. Utilizing the reconfigurability of FPGA devices, users can select and load designs optimized for different system environments during the IR process.
The remainder of the paper is organized as follows. Section 2 introduces the 3D IR algorithm and the associated system design challenges. Section 3 analyzes the computational requirements of the IR algorithm and discusses the proposed arithmetic scheme. Section 4 analyzes the problem of external memory references and introduces a set of caching schemes for different transformation parameters. Section 5 presents the modularized framework for image registration. Implementation and results are shown in Section 6, followed by the conclusion drawn in Section 7.
2. 3D Image Registration
Inputs to the IR process are specially formatted 3D medical images with X x Y x Z pixels. Common CT and MRI formats are 256^3 or 512^3 pixels in size [7]. A 16-bit integer value is assigned to each pixel to represent its gray-scale intensity.
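To make the data layout concrete, the sketch below shows a minimal C++ container for such an image, assuming row-major storage with X varying fastest; the class and helper names are ours, not taken from the ITK code.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Minimal 3D image holding 16-bit grey-scale intensities.
// Storage is row-major with X varying fastest, so the linear
// offset of (x, y, z) is z*X*Y + y*X + x.
class Image3D {
public:
    Image3D(std::size_t X, std::size_t Y, std::size_t Z)
        : X_(X), Y_(Y), Z_(Z), voxels_(X * Y * Z, 0) {}

    std::size_t offset(std::size_t x, std::size_t y, std::size_t z) const {
        return z * X_ * Y_ + y * X_ + x;
    }

    std::int16_t& at(std::size_t x, std::size_t y, std::size_t z) {
        return voxels_[offset(x, y, z)];
    }
    std::int16_t at(std::size_t x, std::size_t y, std::size_t z) const {
        return voxels_[offset(x, y, z)];
    }

    std::size_t X() const { return X_; }
    std::size_t Y() const { return Y_; }
    std::size_t Z() const { return Z_; }

private:
    std::size_t X_, Y_, Z_;
    std::vector<std::int16_t> voxels_;  // 16-bit intensity per pixel
};
```

With this layout, the offsets of the eight neighbours used later in the interpolation are 0, 1, X, X+1, X*Y, X*Y+1, X*Y+X and X*Y+X+1.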
The registration process finds the optimal transform for aligning two 3D images. In our example, the Rigid Registration kernel (RReg) [1] aligns 3D images through rotations and translations. After alignment, the similarity of the images is measured by the averaged Sum of Squared Differences (SSD) over all pixels of interest. Using the SSD value, a monitoring program iterates the RReg kernel with a new set of transformation parameters generated by the optimization algorithm. Operations in the RReg kernel include transforming pixel coordinates, fetching reference pixel values, and interpolating the expected values of the transformed pixels. Fig. 1 abstracts the process of the RReg kernel.
A homogeneous transformation function, T, is applied to each coordinate in the target image to find the projected coordinate, (x, y, z) = T(i, j, k), in the source image. In the RReg kernel, the T function is simplified to three vectors: xd, yd and zd. These vectors are added to the current projected coordinate along the corresponding traversal directions to generate the next projected coordinate. Since the transformation parameters are floating-point values, the projected coordinate is also represented in floating-point format. Thus tri-linear interpolation is used to estimate the gray-scale value at the projected position using eight neighboring pixel values, P1 to P8, and two weight vectors, w1 and w2, as shown in Fig. 1. The SSD value is updated by comparing this result to the pixel value in the target image.
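As a software reference for this datapath, the sketch below performs the tri-linear interpolation in plain floating point. The pairing of the weight components (t, u, v) with the X, Y and Z axes, and the function name, are our assumptions for illustration.

```cpp
#include <cstdio>

// Tri-linear interpolation of the eight neighbouring pixel values
// P1..P8 around the projected coordinate. w1 = (t1, u1, v1) are the
// fractional parts of the coordinate and w2 = 1 - w1 (Fig. 1(b)-(c)).
double trilinear(const double p[8], double t1, double u1, double v1) {
    const double t2 = 1.0 - t1, u2 = 1.0 - u1, v2 = 1.0 - v1;
    // Leaf level: interpolate along X over the four edges of the cube.
    const double e12 = p[0] * t2 + p[1] * t1;   // P1-P2
    const double e34 = p[2] * t2 + p[3] * t1;   // P3-P4
    const double e56 = p[4] * t2 + p[5] * t1;   // P5-P6
    const double e78 = p[6] * t2 + p[7] * t1;   // P7-P8
    // Middle level: interpolate along Y.
    const double f1 = e12 * u2 + e34 * u1;
    const double f2 = e56 * u2 + e78 * u1;
    // Root level: interpolate along Z; the result is compared against
    // the target pixel to update the SSD.
    return f1 * v2 + f2 * v1;
}

int main() {
    const double p[8] = {100, 110, 120, 130, 140, 150, 160, 170};
    std::printf("%.2f\n", trilinear(p, 0.25, 0.5, 0.5));
}
```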
For medical purposes, only certain areas of the images are needed for diagnosis, so not all pixels in the target image are involved in the similarity measurement. To avoid unnecessary computation and memory accesses, a skipping scheme is encoded in the image: for any pixel with a negative value, s, the kernel skips the following |s| pixels along the X direction in the target image.
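A minimal sketch of how such a skip code might be consumed while scanning one X-row of the target image; anything beyond "a negative value s means skip the following |s| pixels" is our assumption.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Walk one X-row of the target image. A non-negative value is a pixel
// of interest; a negative value s means the following |s| pixels can be
// skipped without being transformed or interpolated.
int main() {
    // Hypothetical row: two pixels of interest, a "skip 3" marker, then more data.
    const std::vector<std::int16_t> row = {50, 60, -3, 0, 0, 0, 70, 80};
    for (std::size_t x = 0; x < row.size(); ++x) {
        const std::int16_t v = row[x];
        if (v < 0) {
            x += static_cast<std::size_t>(-v);   // skip |s| pixels
            continue;
        }
        std::printf("process pixel %zu (value %d)\n", x, v);  // transform + interpolate
    }
}
```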
The original RReg kernel is implemented in object-oriented C++ using IEEE 754 single-precision floating-point arithmetic. While this software implementation provides a convenient means for researchers to test comprehensive registration algorithms, it suffers from the long processing times common to most software based IR implementations. In general, it takes the ITK software tens of minutes to align two images on a PC platform. We focus on improving this kernel since it lies in the main loop, which is iterated several hundred times, e.g. over 300 times in our example.
Analysis of the software kernel indicates that 92% of the computation consists of floating-point add and multiply operations, which are expensive in reconfigurable hardware in terms of resource consumption and critical path length. We observe that half of the inputs to the tri-linear interpolation tree are 16-bit non-negative integers (the pixel values) and the other half are values within the range [0, 1] (the weight vectors). This suggests that more efficient dedicated operators can be used instead of the original floating-point version. It is also important to maintain correct outputs after applying the dedicated operators. Due to the read-only pixel memory accesses and the independent transformation and interpolation of each pixel, it is possible to distribute the workload across parallel kernels in hardware and to deeply pipeline each kernel for maximum throughput. The limiting factor of the hardware kernel is memory bandwidth.
Since the fast on-chip memory in modern FPGA devices is insufficient to store the 32MB to 256MB 3D images, it is unavoidable to store the pixel data in off-chip memory, which is relatively slow and requires multi-cycle access times. The solution is to reduce off-chip memory accesses by using on-chip memory blocks as caches. The memory access pattern in the IR process is regular, non-contiguous and highly dependent on the transformation parameters. This suggests that customized caching systems are needed to achieve better performance than traditional CPU caches.
There are two domains to be optimized: (a) parameterized kernels and memory systems with different throughputs, and (b) platform-dependent constraints including logic resources and memory bandwidth. The challenges here are to analyze the relationship between the two domains and to
provide a framework to utilize suitable designs under given conditions, so that the mapping from various algorithms and problem sizes to various target reconfigurable platforms has predictable performance and can be automated.

Fig. 1. RReg kernel operations. (a): Coordinate transformation from target image to source image; (b): Finding the weight vectors and values of neighboring pixels; (c): Tri-linear interpolation.
3. Kernel Architecture
One critical component of the hardware IR kernel is the tree structure for interpolation, as shown on the right of Fig. 1. It is built by recursively applying the basic sum-of-product structure shown in Fig. 2. Here, the c1 and c2 inputs are components from the weight vectors. The p1 and p2 inputs are from pixel values or previous sum-of-product structures, with range [p_min, p_max]. h and k are the outputs of the multipliers and r is the output of the adder. The main objective of the accelerator is to improve the performance of these multiply and add operators.
Fig. 2. Basic structure of the interpolation tree: c1 + c2 = 1; p1, p2 in [p_min, p_max]; number formats: p1, p2 : [i_p : f_p], c1, c2 : [0 : f_w], h, k : [i_m : f_m], r : [i_a : f_a].
Although fixed-point arithmetic has a clear advantage in area utilization over its floating-point counterpart in reconfigurable logic, it suffers from limited dynamic range. In order to employ fixed-point arithmetic in the IR kernel without introducing errors, it is necessary to characterize the inputs, the internal tasks performed and the output requirements. In this paper, we use the notation [i : f] for fixed-point number formats, where i is the number of bits for the integral part and f is the number of bits for the fractional part.
The format of the pixel values is [16 : 0]. The projected coordinate is of the form ([log2(X) : f], [log2(Y) : f], [log2(Z) : f]). Coordinates outside the X, Y and Z boundaries of the source image are discarded. Using this representation, the coordinate of the base pixel P1 and the first weight vector w1 are simply the integral and fractional parts of the projected coordinate. The second weight vector can be computed using an f-bit fixed-point subtraction as w2 = 1 - w1.
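The sketch below illustrates this split of one axis of the fixed-point projected coordinate into the base pixel index and the two weight components. The fraction width of 22 bits anticipates the f_w chosen later in this section; the helper names are illustrative.

```cpp
#include <cstdint>
#include <cstdio>

// One axis of the projected coordinate in [i : F] fixed point.
// The integral part is the base pixel index (e.g. the X index of P1)
// and the fractional part is the corresponding component of w1.
// w2 = 1 - w1 is an F-bit subtraction.
constexpr unsigned F = 22;               // fraction bits (f_w in the text)
constexpr std::uint32_t ONE  = 1u << F;  // fixed-point 1.0
constexpr std::uint32_t FRAC = ONE - 1;  // fraction mask

struct AxisSplit {
    std::uint32_t index;   // integral part: base pixel coordinate
    std::uint32_t w1;      // fractional part, F bits
    std::uint32_t w2;      // 1 - w1
};

AxisSplit split_axis(std::uint32_t coord_fixed) {
    AxisSplit s;
    s.index = coord_fixed >> F;
    s.w1    = coord_fixed & FRAC;
    s.w2    = ONE - s.w1;
    return s;
}

int main() {
    const std::uint32_t c = (37u << F) | (ONE >> 2);   // 37.25 in [i : 22]
    const AxisSplit s = split_axis(c);
    std::printf("index=%u  w1=%g  w2=%g\n",
                static_cast<unsigned>(s.index),
                static_cast<double>(s.w1) / ONE,
                static_cast<double>(s.w2) / ONE);
}
```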
The value of f affects the size and speed of the operators in the interpolation tree. In this design, it is determined by applying the precision analysis of Affine Arithmetic (AA) [8] recursively to the tree structure. Let
E(c̃1) = 2^{-f_w-1} ε1,   E(c̃2) = 2^{-f_w-1} ε2,
E(p̃1) = 2^{-f_p-1} ε3,   E(p̃2) = 2^{-f_p-1} ε4,
E(h̃) = w1 E(p̃1) + p1 E(c̃1) + E(p̃1) E(c̃1) + 2^{-f_m-1} ε5,
E(k̃) = w2 E(p̃2) + p2 E(c̃2) + E(p̃2) E(c̃2) + 2^{-f_m-1} ε6,
E(r̃) = E(h̃) + E(k̃) + 2^{-f_a-1} ε7    (1)
be the error model of each variable (the edges in the tree). Here f_w, f_p, f_m and f_a are the fraction sizes in bits of the input w, the input p, the multipliers and the adder; h and k use the same fraction size due to the symmetric tree structure. ε1..ε7 in [-1, 1] are the error sources from the round-to-nearest process. To preserve accuracy, the required output error, Ê(r̃), should be less than or equal to 1 ulp. That is,
Ê(r̃) ≤ 2^{-f_r}.    (2)
Given the symmetric structure and the fact that w1 + w2 = 1, the relation between the fraction sizes of the edges of the tree structure can be found by substituting Equation 1 into Equation 2 and taking the maximum value of ε to be 1 and the maximum value of p to be P. Thus
2^{-f_a-1} ≥ 2^{-f_p-1} + P·2^{-f_w} + 2^{-f_p-f_w-2} + 2^{-f_m}.    (3)
This method is applied recursively to the interpolation tree to find the relation between the fraction sizes of all variables in the tree and the final rounding precision. A more detailed error analysis can be found in [8]. In this application, several attributes of the structure help to simplify the total error analysis. By partitioning the interpolation tree into three levels of sub-trees, L1, L2 and L3, each with the basic structure shown in Fig. 2, we have:

- All values from the weight vectors have the same format, and f_w > max(log2(X), log2(Y), log2(Z)).

- All p inputs at the first (leaf) level are exact integer values from the 3D image array, so E(p_L1) = 0.

- All operators on the same level of the interpolation tree have the same input and output formats.

- In every sub-tree, w1 + w2 = 1 and the two p inputs have the same range, so the output r has the same range limit as the p inputs.

- The final output of the interpolation is rounded to a 16-bit integer for the SSD computation, i.e. f_a3 = 0.
Thus the error model at each level can be constructed as

Ê(L1) = P·2^{-f_w} + 2^{-f_m1} + 2^{-f_a1-1},    (4)
Ê(L2) = Ê(L1) + P·2^{-f_w} + Ê(L1)·2^{-f_w-1} + 2^{-f_m2} + 2^{-f_a2-1},    (5)
Ê(L3) = Ê(L2) + P·2^{-f_w} + Ê(L2)·2^{-f_w-1} + 2^{-f_m3} + 2^{-1},  and    (6)
Ê(L3) ≤ 1.    (7)
Substitution and expansion of the above inequalities show the relationship between the system parameters and the precision of the interpolated output. The interpolation tree can be optimized for more efficient area utilization by reducing the sizes of f_a, f_m and f_w respectively. The following characteristics of reconfigurable platforms are considered when evaluating the performance:

- A multiplier is more expensive than an adder when implemented using LUT primitives.

- Dedicated multiplier blocks in modern FPGA devices have fixed bit widths, so the cost of a multiplier increases discretely with increasing bit width.

- Reducing the bit width of nodes near the leaves has a larger impact on area than reducing that of nodes near the root of the tree structure.
Assuming X = Y = Z = 2^8, one set of parameters optimized for area and fulfilling the precision requirement in Equation 7 is: f_w = 22; f_a1 = 6; f_m1 = 2; f_a2 = 7; f_m2 = 3; f_m3 = 4.
4. Memory System
Speeding up memory access is essential in this design, since the interpolation process can start only after all eight pixel values from the source image are ready. Let ρ be the fraction of pixels of interest for registration in an image with N pixels. The time for an evaluation process in a fully pipelined architecture is

T_eva = N × ρ × (M_target + M_source) / f,    (8)
where M_target and M_source are the numbers of clock cycles for memory accesses reading data from the target and source images, and f is the working frequency of the RReg kernel. The memory accesses required for skipping uninterested regions are ignored here, as they contribute less than 1% of T_eva. In our analysis, we assume 2^24 pixels in each image, with 95% of them of interest for registration. The memory bottleneck is the result of asymmetric input and output data rates: for a single interpolated output value, which updates the final SSD, 1 pixel from the target image and 8 pixels from the source image are required. To prevent the hardware from being idle, the memory bandwidth in random access mode must be at least 9 times the processing speed, which is not realistic in most environments.
In a system with SDRAM, the CAS latency, T_CL, dominates the evaluation time. For example, when using a single 32-bit channel of DDR2 SDRAM for both the target and source images, M_target = T_CL and M_source = T_CL × 4. Following Equation 8, a DDR2-based design at 200MHz with T_CL = 3 can complete an evaluation in 598ms.
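Equation 8 is straightforward to tabulate. The sketch below does so for two effective per-pixel access costs. The effective costs on real DDR2 depend on burst behaviour, bus width and clock ratios, so the pairs used here are simply values under which the model reproduces the 598ms figure above and the 120ms figure for the multi-bank design discussed next.

```cpp
#include <cstdio>

// Equation 8: T_eva = N * rho * (M_target + M_source) / f.
// N   - pixels per image, rho - fraction of pixels of interest,
// M_* - average kernel-clock cycles per target/source access,
// f   - kernel clock frequency in Hz.
double evaluation_time(double N, double rho,
                       double M_target, double M_source, double f) {
    return N * rho * (M_target + M_source) / f;
}

int main() {
    const double N = 1 << 24, rho = 0.95, f = 200e6;
    // Illustrative effective per-pixel cycle counts (assumptions):
    //   {1.5, 6.0} -> ~598 ms, single-channel DDR2 example;
    //   {0.0, 1.5} -> ~120 ms, four banks with streamed target pixels.
    const double cases[][2] = { {1.5, 6.0}, {0.0, 1.5} };
    for (const auto& c : cases)
        std::printf("M_t=%4.1f M_s=%4.1f -> T_eva=%6.1f ms\n",
                    c[0], c[1], evaluation_time(N, rho, c[0], c[1], f) * 1e3);
}
```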
Multiple memory channels can be used to improve the performance of a reconfigurable platform. For example, four independent DDR2 SDRAM banks are attached to the Virtex-5 FPGA on the Alpha Data ADM-XRC-5T2 platform [9]. To utilize this bandwidth, the source image is split into four groups according to the even and odd positions in the Y and Z directions. Thus the four edges, P1P2, P3P4, P5P6 and P7P8, of the neighboring pixel cube in Fig. 1(c) reside evenly in the four groups. Mapping the groups to the memory banks in the 5T2 system reduces the source image access time, i.e. M_source = T_CL. The pixels of the target image can be streamed from the host system to the FPGA internal memory as an extra memory channel. A fully pipelined design on the 5T2 at 200MHz can complete an evaluation process in 120ms.
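A sketch of this even/odd interleaving, checking that the four X-aligned edges of any neighbouring-pixel cube land in four distinct banks; the bank numbering is our own convention.

```cpp
#include <cstdio>

// Bank selection from the parity of the Y and Z coordinates.
// With this interleaving the four X-aligned edges P1P2, P3P4,
// P5P6, P7P8 of a neighbouring cube fall in four distinct banks.
unsigned bank(unsigned y, unsigned z) {
    return (y & 1u) | ((z & 1u) << 1);   // 0..3
}

int main() {
    const unsigned y = 101, z = 42;      // arbitrary base coordinate
    std::printf("P1P2 -> bank %u\n", bank(y,     z));
    std::printf("P3P4 -> bank %u\n", bank(y + 1, z));
    std::printf("P5P6 -> bank %u\n", bank(y,     z + 1));
    std::printf("P7P8 -> bank %u\n", bank(y + 1, z + 1));
}
```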
The fast internal memory of the FPGA can be used as a cache layer to further improve overall performance. The small cache size in the FPGA is the limiting factor of this improvement. Under the limited cache size, we can improve cache efficiency by capturing application-specific information, such as a custom cache design for a predictable access pattern. In this work, an optimized cache system is constructed to reduce external memory references, with larger reductions requiring more on-chip memory.
This cache system provides faster access to the neighboring pixel values used in the interpolation process. The efficiency of the cache depends on the number of pixels that can be retrieved from it in each interpolation. Since the RReg kernel traverses along the X direction by accumulating the xd vector for over 99.5% of the transformation, we analyze the cache performance along the X direction as an example. In the following discussion, we present different cache systems suitable for transformation vector values in different ranges, as shown in Fig. 3.
Fig. 3. Dedicated cache systems for the RReg kernel: (a) Uncached system; (b) Previous-half cache; (c) Bottom-line cache; (d) Side-wall cache.

For |xd| ≤ 0.5, 1/|xd| accumulation steps are required to advance the projected coordinate to the next integral pixel cube. Thus a cache system keeping the current eight pixel values will have a miss rate of |xd|, which is always less than 50%. This means the M_source term in Equation 8 is reduced at least by half.
For 0.5 < |xd| < 1, the probability of advancing to the next integral pixel cube along the X direction after each accumulation is |xd|. A cache system is proposed to cache the four pixel values nearest the next integral pixel cube. As shown by the solid circles in Fig. 3(b), p2, p4, p6 and p8 of the current interpolation are stored and may be reused as p1, p3, p5 and p7 in the next interpolation. On a cache hit, the external memory references to the source image are reduced by half, so the source image memory cycles become (1 - |xd|/2) × M_source. The same analysis applies when 1 < |xd| < 1.5, where the cache hit rate becomes 2 - |xd|.
The above analysis is based on the assumption that the Y and Z components of xd are close to zero. The actual cache performance decreases as the Y and Z components increase. Larger caches can further reduce the M_source term by caching the immediately adjacent lines and planes along the Y and Z directions, as shown in Fig. 3(c) and Fig. 3(d). These systems are applied when 0.5 < |yd| < 1.5 and/or 0.5 < |zd| < 1.5. On cache hits, the line and wall caches reduce the external memory references to the source image to 25% and 12.5% respectively.
For 1.5 < |xd|, the probability of advancing to the next integral pixel cube after each accumulation is 2 - |xd|. As it is impossible to stay in the previously projected pixel cube, the cache system for |xd| < 0.5 is not applicable in this case. The cache systems in Fig. 3 will have less than 50% hit rate, and this reduces as |xd| increases. When 2 < |xd|, no previously fetched pixel values can be reused along the X direction, and thus no cache can help to reduce the external memory references.
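Putting these ranges together, a host-side selector might look like the sketch below. The thresholds follow the discussion above, but the choice of when to prefer the line cache over the wall cache, and all names, are our own reading rather than the framework's actual logic.

```cpp
#include <cmath>
#include <cstdio>

enum class CacheScheme { Uncached, CurrentCube, PreviousHalf, BottomLine, SideWall };

// Choose a cache module from the per-step transformation vector,
// following the |xd|, |yd|, |zd| ranges discussed in Section 4.
CacheScheme select_cache(double xd, double yd, double zd) {
    const double ax = std::fabs(xd), ay = std::fabs(yd), az = std::fabs(zd);
    if (ax >= 2.0)
        return CacheScheme::Uncached;       // nothing reusable along X
    if (ay > 0.5 && ay < 1.5 && az > 0.5 && az < 1.5)
        return CacheScheme::SideWall;       // cache a plane of pixels
    if ((ay > 0.5 && ay < 1.5) || (az > 0.5 && az < 1.5))
        return CacheScheme::BottomLine;     // cache a line of pixels
    if (ax <= 0.5)
        return CacheScheme::CurrentCube;    // keep the current 8 pixels
    return CacheScheme::PreviousHalf;       // 0.5 < |xd| < 2: reuse the previous half
}

int main() {
    const char* kNames[] = {"Uncached", "CurrentCube", "PreviousHalf", "BottomLine", "SideWall"};
    std::printf("%s\n", kNames[static_cast<int>(select_cache(0.3, 0.01, 0.02))]);
    std::printf("%s\n", kNames[static_cast<int>(select_cache(0.9, 0.70, 0.02))]);
}
```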
5. Reconfigurable Framework
Custom ASIC designs usually provide better die-size utilization and higher working frequencies than FPGA designs, but reconfigurable devices have the ability to adapt dynamically to changes in system parameters. In IR applications, the software optimizer constantly adjusts the transformation vectors and sends them to the hardware RReg kernel after each similarity evaluation. From the analysis in Section 4, different types of cache systems should be used according to the values of these vectors.
In the proposed framework, the optimizer program has the ability to detect changes in system parameters and to load the corresponding cache modules into hardware during the optimization process. The transformation vectors are registered and updated by the optimizer after each similarity check, and a suitable caching scheme is then selected based on the analysis in Section 4. Only the cache system is reconfigured; the other parts of the RReg kernel stay unchanged. To facilitate this adaptive feature, the RReg kernel is designed in a modular structure with fixed interfaces between modules. The regions of the partial reconfigurable modules (RMs) and the locations of the interface bus macros are defined with the Xilinx PlanAhead tool. We also implement all the cache systems as RMs within the defined region. Finally, the iMPACT programming tool, which can identify a partial bitstream, is called as an external program to reconfigure the FPGA as needed [10].
The time spent on hardware reconfiguration between evaluation processes is considered overhead. The Virtex-5 device has lower reconfiguration latency than earlier generations of FPGA devices and fewer constraints on the shape and location of the RMs [11]. Reconfiguring less than 5% of the XC5VLX330T device, the overhead time, T_cfg, is less than 10ms.
This overhead can be justified by the reduction of external memory references in the successive evaluation processes, as shown below:

T_cfg ≤ E_successive × M_reduced / f,    (9)

where E_successive is the number of successive evaluations before the next reconfiguration, M_reduced is the average number of external memory references saved in each evaluation, and f is the working frequency of the RReg kernel. For example, a 200MHz kernel reconfigured to achieve a 50% reduction in external memory references saves over 36ms in a single evaluation iteration.
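A sketch of the corresponding host-side check, directly encoding Equation 9; the numbers in the example are hypothetical.

```cpp
#include <cstdio>

// Equation 9: reconfiguration pays off when
//   T_cfg <= E_successive * M_reduced / f.
// T_cfg        - reconfiguration overhead in seconds,
// E_successive - evaluations run before the next reconfiguration,
// M_reduced    - average external-memory cycles saved per evaluation,
// f            - kernel clock frequency in Hz.
bool reconfigure_pays_off(double T_cfg, double E_successive,
                          double M_reduced, double f) {
    return T_cfg <= E_successive * M_reduced / f;
}

int main() {
    // Hypothetical numbers: 10 ms overhead, 7.2e6 cycles saved per
    // evaluation at 200 MHz - worthwhile after a single evaluation.
    std::printf("%s\n", reconfigure_pays_off(0.010, 1, 7.2e6, 200e6)
                        ? "reconfigure" : "keep current module");
}
```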

References
- Nonrigid registration using free-form deformations: application to breast MR images.
- Medical image registration.
- Accuracy-Guaranteed Bit-Width Optimization.
- E. Eto. Difference-Based Partial Reconfiguration.
- FPGA-based computation of free-form deformations in medical image registration.