
Proceedings Article

FPGA Acceleration of the Phylogenetic Parsimony Kernel

05 Sep 2011-pp 417-422

TL;DR: A versatile FPGA implementation of the phylogenetic parsimony function is presented and its performance is compared to a highly optimized SSE3- and AVX-vectorized software implementation and it is concluded that, a competitive spirit between SW and HW application developers can contribute toward obtaining more objective performance comparisons.
Abstract: The phylogenetic parsimony function is a popular, discrete criterion for reconstructing evolutionary trees based on molecular sequence data. Parsimony strives to find the phylogenetic tree that explains the evolutionary history of organisms by the least number of mutations. Because parsimony is a discrete function, it should fit well to FPGAs. We present a versatile FPGA implementation of the parsimony function and compare its performance to a highly optimized SSE3- and AVX-vectorized software implementation. We find that, because of a particular constellation in our lab, the speedups that can be achieved by using an FPGA, are substantially less impressive, than usually reported in papers on FPGA acceleration of bioinformatics kernels. We conclude that, a competitive spirit between SW and HW application developers can contribute toward obtaining more objective performance comparisons.
Topics: Tree rearrangement (61%), Maximum parsimony (56%)


FPGA Acceleration of the Phylogenetic Parsimony Kernel
Nikolaos Alachiotis, Alexandros Stamatakis
The Exelixis Lab, Scientific Computing Group
Heidelberg Institute for Theoretical Studies
Heidelberg, Germany
Emails: {Nikolaos.Alachiotis,Alexandros.Stamatakis}@h-its.org
Abstract: The phylogenetic parsimony function is a popular,
discrete criterion for reconstructing evolutionary trees
based on molecular sequence data. Parsimony strives to find
the phylogenetic tree that explains the evolutionary history
of organisms by the least number of mutations. Because
parsimony is a discrete function, it should fit well to FPGAs.
We present a versatile FPGA implementation of the parsimony
function and compare its performance to a highly optimized
SSE3- and AVX-vectorized software implementation. We find
that, because of a particular constellation in our lab, the
speedups that can be achieved by using an FPGA are
substantially less impressive than usually reported in papers
on FPGA acceleration of bioinformatics kernels. We conclude
that a competitive spirit between SW and HW application
developers can contribute toward obtaining more objective
performance comparisons.
Keywords-FPGA; SIMD; parsimony; performance analysis
I. INTRODUCTION
The inference of evolutionary (phylogenetic) trees from
molecular sequence data has many important applications
in biological and medical research (e.g., [1]).
Input: The input for a phylogenetic analysis is a list
of organism names and their associated DNA sequence
data. Since DNA sequences for distinct organisms typi-
cally have different lengths, a so-called multiple sequence
alignment (MSA) of the DNA sequences is computed prior
to conducting a phylogenetic analysis using character-
based methods (Maximum Parsimony [2] or Maximum
Likelihood [3]). The goal of MSA is to determine which
nucleotides of the organisms share a common evolutionary
history. Because nucleotide insertions or deletions may
have occurred during the evolutionary history of the
organisms, deletion events are denoted by inserting the gap
symbol - into the sequences during the MSA process.
After the MSA step, all n sequences have the same length
m, that is, the MSA has m alignment columns (also called:
characters, sites, positions).
Output: The output of a phylogenetic analysis is an
unrooted binary tree topology. The present-day organisms
under study (for which DNA data can be sequenced) are
assigned to the leaves (tips) of such a tree, whereas the
inner nodes represent extinct common ancestors.
Combinatorial Optimization: To reconstruct a phy-
logenetic tree from a MSA, criteria are required to as-
sess how well a specific tree topology explains (fits) the
underlying molecular sequence data. One may think of
this as an abstract function f () that scores alternative tree
topologies for a given MSA. The goal of phylogenetic
algorithms is to find the tree topology with the best
score according to f (), that is, phylogenetic inference
is a combinatorial optimization problem. The algorithmic
problem in phylogenetics is characterized by the number
of possible distinct unrooted binary tree topologies for n
organisms, which is given by $\prod_{i=3}^{n} (2i - 5)$.
Finding the best tree for MSA-based criteria f () such as
Maximum Likelihood [4] or Maximum Parsimony [5] is NP-hard.
Apart from developing efficient heuristic search strategies,
the optimization of the scoring function f (), that is in-
voked millions of times during a heuristic tree search, and
hence dominates execution times, represents an important
research objective in phylogenetics.
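
To give a feeling for how quickly the search space grows, the product formula for the number of unrooted binary tree topologies can be evaluated directly. The following is a minimal sketch; the function name `num_unrooted_trees` is ours and is not part of any program discussed in this paper.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of distinct unrooted binary tree topologies for n taxa.

    Evaluates the product of (2i - 5) for i = 3 .. n.
    """
    if n < 3:
        raise ValueError("the formula is defined for n >= 3 taxa")
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count


if __name__ == "__main__":
    for n in (4, 5, 12, 50):
        print(n, num_unrooted_trees(n))
```

Already for 12 organisms the product exceeds 650 million topologies, which illustrates why exhaustive evaluation does not scale to larger trees.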
Likelihood and parsimony are currently among the most
popular methods for phylogenetic inference. Compared to
the likelihood criterion, parsimony requires significantly
less memory and computations to calculate f () on a
given tree topology which is important for analyzing very
large datasets [1]. Here, we focus on the acceleration
of the parsimony kernel via a pipelined reconfigurable
architecture and by deploying 256-bit wide AVX vector
instructions on a general purpose CPU. The constellation
at our lab is rather atypical, since NA is a computer
engineer and AS has a parallel computing background
and 10 years of experience in developing and tuning
phylogeny programs (e.g., the widely-used RAxML code;
the three main papers have accumulated 2000 citations
on Google Scholar; March 24, 2011). Hence, there is a
permanent, vivid discussion whether FPGAs are suitable
for accelerating phylogenetic kernels or not.
Thus, we compare our HW design with a highly opti-
mized (at the bit level) open-source SW implementation
that deploys 128-bit SSE3 vector instructions and the
relatively recent 256-bit AVX vector instructions to accel-
erate the parsimony kernel. We find that, given a realistic
parsimony kernel usage model, there is no significant
difference in execution speeds between a high-end FPGA
and an Intel i7 processor with AVX support. However,
there are substantial differences in engineering effort: The
design, implementation, and verification of the hardware
architecture required almost a month, while porting the
already existing SSE3 vectorization to AVX required only
half a day. To allow for reproduction of all results in
this paper, the hardware description of the architecture
is available as open-source code at:
http://wwwkramer.in.tum.de/exelixis/FPGA_MaxPars.tar.bz2.
The remainder of this paper is organized as follows: in
Section II we address related work and in Section III we
describe how to compute the parsimony score on a tree. In
Section IV we outline the architecture and in Section V we
present performance results. We conclude in Section VI.
II. RELATED WORK
Few phylogenetic kernels have been mapped to hard-
ware. Mak and Lam [6], [7], Alachiotis et al. [8], [9], [10],
and Zierke and Bakos [11] map the floating point intensive
likelihood function to FPGAs. Davis et al. [12] presented
an implementation of the UPGMA method (Unweighted
Pair Group Method with Arithmetic Mean) which is a
simple tree reconstruction algorithm that is practically not
used for phylogenetic analyses any more. In [13], Bakos
et al. focused on tree reconstruction using gene order
data, that is, the arrangement of corresponding genes in
the genomes of different organisms is used to reconstruct
trees.
Kasap and Benkrid [14], [15] recently presented what is,
to the best of our knowledge, the first reconfigurable
architecture for the parsimony kernel and assessed performance
on an FPGA supercomputer by exploiting fine-grain and
coarse-grain parallelism. The implementation is limited
to trees with a maximum of 12 organisms, which are
very small by today's standards; the largest published
parsimony-based tree has 73,060 taxa [1]. The authors
use an exhaustive search algorithm to evaluate all pos-
sible trees with 12 organisms in parallel for finding the
tree with the best parsimony score. An evaluation of all
possible trees, even in parallel, is evidently not possible
for parsimony-based analyses of larger trees because of
the super-exponential increase in the possible number
of trees. Parsimony-based programs for large datasets
deploy heuristic search strategies (e.g., Subtree Pruning
and ReGrafting (SPR) or Tree Bisection and Reconnection
(TBR)). These search strategies (as implemented, for
instance, in TNT, parsimonator (our code), or PAUP*)
do not require a de-novo computation of the parsimony
score based on a full post-order tree traversal as
implemented in [14], [15]. Instead, they only require the update
of a comparatively small fraction of ancestral parsimony
vectors. Hence, a fundamentally different approach to
implementing the parsimony function on a reconfigurable
architecture for such commonly used heuristic search
strategies is required.
Kasap and Benkrid report speedups between a factor of
5 and up to a factor of 32,414 for utilizing 1, 2, 4, and 8
nodes (each node is equipped with a Xilinx Virtex4 FX100
FPGA) on the Maxwell system compared to a 2.2GHz
Intel Centrino Duo processor. However, the speedups
reported are only relative speedups with respect to the
parsimony implementation in PAUP* [16] and not with respect
to the fastest-known implementation of parsimony in the
TNT program package used in [1]. Unfortunately, neither
PAUP* nor TNT is open-source, and therefore they do not allow
for an accurate performance analysis and comparison of
the parsimony kernel. Therefore, we use our in-house code
parsimonator (available at:
http://wwwkramer.in.tum.de/exelixis/software.html), which
implements a representative, yet simple, search strategy
based on SPR moves. The parsimony kernel in parsimonator is
highly optimized and the program can compute parsimony trees
on DNA datasets with up to 116,408 organisms and ten genes.

Figure 1. Virtual rooting and post-order traversal of a phylogenetic tree:
(1) place the virtual root into an arbitrary branch; (2) compute the
ancestral vectors from the sequence data via a post-order traversal;
(3) compute the overall score by summing over per-site scores.
III. THE PARSIMONY KERNEL
The parsimony kernel operates directly on the MSA
and the tree. The sequences in the MSA are assigned to
the leaves of the tree and an overall score for the tree is
computed via a post-order tree traversal with respect to
a virtual root. An important property of the parsimony
function is that parsimony scores are invariant to the
placement of such a virtual root. Parsimony is charac-
terized by two additional properties: (i) it assumes that
MSA columns have evolved independently, that is, given
a fixed tree topology, one can simultaneously compute the
parsimony score for each MSA column in parallel. To
obtain the overall score of the tree, the sum over all m per-
column parsimony scores at the virtual root is computed.
(ii) parsimony scores are computed via a post-order tree
traversal that proceeds from the tips towards the virtual
root and computes ancestral parsimony vectors of length
m at each inner node that is visited (see Figure 1).
The parsimony criterion intends to minimize the number
of nucleotide changes on a tree. Hence, for a given,
fixed, tree topology we need to compute the smallest
number of changes (mutations) required to generate the
tree. This minimum number can be computed via a post-
order traversal of the tree under consideration. Given an
arbitrarily rooted tree, one can proceed bottom up from the
tips toward the virtual root to compute ancestral parsimony
vectors and count mutations, based on the two (previously
computed; post-order traversal!) child vectors. To store tip
vectors (containing the actual DNA data) and ancestral
parsimony vectors (containing the ancestral sequences),
we need to allocate 4 bits per alignment site at each
node of the tree. Hence, the total memory required to store
the parsimony vectors is $m \cdot 4 \cdot (2n - 2)$ bits, where m
is the number of sites and $2n - 2$ the total number of
inner and outer nodes in an unrooted binary tree (n is the
number of tips). In the following we will only describe the
computational steps required to compute the parsimony
score on a tree (please refer to [17] for a justification and
further details).
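
As a quick plausibility check of the memory formula, the following sketch evaluates it for a given dataset size. The function names are hypothetical and only serve this illustration.

```python
def parsimony_vector_bits(n_tips: int, m_sites: int) -> int:
    """Total bits for all tip and ancestral parsimony vectors.

    Uses 4 bits per site at each of the 2n - 2 nodes of an
    unrooted binary tree with n tips.
    """
    if n_tips < 2 or m_sites < 0:
        raise ValueError("need n_tips >= 2 and m_sites >= 0")
    return m_sites * 4 * (2 * n_tips - 2)


def parsimony_vector_bytes(n_tips: int, m_sites: int) -> int:
    """Same quantity rounded up to whole bytes."""
    return (parsimony_vector_bits(n_tips, m_sites) + 7) // 8


if __name__ == "__main__":
    # A toy tree with 4 tips and 10 alignment sites.
    print(parsimony_vector_bits(4, 10), "bits")
    print(parsimony_vector_bytes(4, 10), "bytes")
```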
The parsimony vectors (bit vectors) at the tips are
initialized as follows: for a nucleotide A at a position i,
where $i = 0, \ldots, m - 1$, we assign A:=1000 (respectively
C:=0100, G:=0010, T:=0001). When the tip vectors
have been initialized, one can start computing the
parsimony score of the tree. We will focus on computing the
parsimony score $ps_i$ (minimum number of mutations) for
a single site i, since the overall score is simply the sum
over all per-site scores at the virtual root:
$\sum_{i=0}^{m-1} ps_i$.

Figure 2. Parsimony vector and score updates with and without
mutation. Update with a mutation: qv = 1100, qs = 5, rv = 0010,
rs = 4, giving pv = 1100 | 0010 = 1110 and ps = 4 + 5 + 1 = 10.
Update without a mutation: qv = 1100, qs = 3, rv = 1000, rs = 2,
giving pv = 1100 & 1000 = 1000 and ps = 3 + 2 = 5.
Given two already computed child vectors q and r,
we compute the parent vector p at site i as follows (see
Figure 2). The parsimony score is initially set to the sum of
the parsimony scores of the two child vectors,
$ps_i := qs_i + rs_i$, that is, we take into account how many
mutations were required to explain the two subtrees rooted
at q and r for site i. Then, we compare the 4-bit vectors of
q and r with a bit-wise and operation.
If this bit-wise and yields 0, this means, for instance,
that site i in subtree q may only contain As ($qv_i = 1000$)
and r may only contain Cs ($rv_i = 0100$) at site i.
Hence, we need to add a mutation and increment the
parsimony score by one: $ps_i := ps_i + 1$. The parsimony
vector at position i of p is then calculated as
$pv_i := qv_i \;\mathrm{OR}_{\mathrm{bitwise}}\; rv_i$, that is, we
conduct a bit-wise or on $qv_i$ and $rv_i$ to obtain a new
state that now comprises A or C. Thus, $pv_i := 1100$, which
means that the ancestral state can be A or C because we have
already counted the required mutation. If the initial
bit-wise and on $qv_i$ and $rv_i$ does not yield zero, we do
not need to count a mutation, and simply set
$pv_i := qv_i \;\mathrm{AND}_{\mathrm{bitwise}}\; rv_i$,
thereby essentially saving the shared state between $qv_i$ and
$rv_i$ in $pv_i$. When the virtual root is reached, we conduct
exactly the same computations on child vectors q and r
for updating the parsimony score at the root p, but we do
not need to store the ancestral state pv, since we are
only interested in the score (mutation count) at p.
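
The per-site update rule described above can be summarized in a few lines of code. The following is a minimal sketch for a single site, not the parsimonator kernel itself; the names `ENCODING` and `update_site` are ours. The two checks in the usage part reproduce the values of Figure 2.

```python
# Tip encoding from the text: one bit per nucleotide state.
ENCODING = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}


def update_site(qv: int, qs: int, rv: int, rs: int):
    """Return (pv, ps) for one site, given child vectors and scores."""
    ps = qs + rs                 # mutations already counted in both subtrees
    shared = qv & rv             # states the two children have in common
    if shared == 0:
        return qv | rv, ps + 1   # no common state: union, count a mutation
    return shared, ps            # common state: intersection, no mutation


if __name__ == "__main__":
    # Figure 2, update with a mutation.
    print(update_site(0b1100, 5, 0b0010, 4) == (0b1110, 10))
    # Figure 2, update without a mutation.
    print(update_site(0b1100, 3, 0b1000, 2) == (0b1000, 5))

    # One site of a toy tree ((A,C),(A,G)) with the virtual root
    # placed on the central branch.
    q = update_site(ENCODING["A"], 0, ENCODING["C"], 0)
    r = update_site(ENCODING["A"], 0, ENCODING["G"], 0)
    _, score = update_site(q[0], q[1], r[0], r[1])
    print("parsimony score:", score)
```

In the toy tree, each of the two cherries requires one mutation and the root update finds the shared state A, so no further mutation is counted.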
Only a few open-source implementations of the parsimony
kernel exist. The PHYLIP package [18] contains a
proof-of-concept parsimony implementation that
is not optimized at the bit-level. As already mentioned,
we have recently released an optimized code called
parsimonator. The parsimonator manual also includes
a performance comparison of the non-vectorized,
SSE3- and AVX-vectorized versions. We believe that this
is currently the fastest open-source parsimony
implementation with respect to the parsimony kernel
implementation, although the search algorithm it uses is
rather naïve because it is designed to generate starting
trees for maximum likelihood analyses. The fastest available
parsimony program is TNT [19]. PAUP* [16] is also a popular
program for parsimony analysis, but significantly slower than
TNT and parsimonator. Since we focus on designing
an architecture for the parsimony kernel implementation,
irrespective of the actual search algorithm, we use our
in-house code to accurately measure and compare execution
times.

Figure 3. Architecture of the basic parsimony processing unit.
IV. RECONFIGURABLE ARCHITECTURE
In the following we describe the reconfigurable
parsimony architecture. We denote ancestral vector
computations as NV operations and score computations at the
virtual root as EV operations.
A. Processing Unit (PRU) Architecture
Figure 3 illustrates the basic processing unit (PRU) of
the reconfigurable parsimony kernel. Each PRU operates
on two child vector entries (two sites). The PRU archi-
tecture deploys a pair of dual-port memories, that is, one
memory instance is used for storing tip vectors and one
for storing inner vectors. Each memory instance can store
a maximum of 2048 addressable bytes. The rationale for
selecting this specific memory size is that, thereby we
occupy a single 18Kb block RAM slice per PRU memory
instance. If each PRU only requires a limited amount
of memory blocks, the overall reconfigurable parsimony
system can be extended by additional PRUs in a seamless
way (see Subsection IV-B).
To initiate a parsimony analysis, only the TIP MEM-
ORY has to be initialized with the bit-encoded DNA
sequences in the MSA. Every tip and inner vector is
assigned a static address space in the respective memory
prior to executing any operation. During a post-order tree
traversal, the following three memory access situations can
occur regarding the input child vectors at nodes q and r:
(i) q and r are both tips, (ii) exactly one of q and r is a
tip, (iii) q and r are inner nodes. For the TIP-TIP and
TIP-INNER cases, the input vectors are retrieved and read from
the corresponding TIP and INNER memories, respectively.
The result vector at node p (when it needs to be stored
for a NV operation), is stored in the INNER memory.
In analogy, a NV operation for the INNER-INNER case
would require an INNER memory with three memory
ports: two ports for reading the q and r vectors and a
third port for writing the p vector. To efficiently implement
the INNER-INNER case for NV (p, q, r are inner nodes)
using present FPGA technology that only provides two
ports per memory block, the p vector is temporarily stored
in a special memory. We denote this special memory,
which forms part of the TIP MEMORY, as EXTRA
space. At each clock cycle, two 4-to-1 multiplexers (see
Figure 3) are used to select the correct memory buses
with valid vector input data. The multiplexer selection
bits signal the corresponding TIP-TIP, TIP-INNER, and
INNER-INNER cases. The group of logic gates in the
center of Figure 3 implements the bit-wise operations
to compute the parsimony kernel (see Section III). An
additional 2-to-1 multiplexer is used to distinguish between
the two data buses that provide input to the second TIP
MEMORY port. One bus provides the DNA sequences of
the MSA during the memory initialization process while
the second bus provides ancestral states during NV and
EV operations if the EXTRA space of the TIP MEMORY
is used.

Figure 4. Worst-case tree topology in terms of EXTRA space
requirements (labels: tips and inner nodes, a highlighted 4-tip group
at Level 1, inner vectors A and B at Level 2, their common ancestor C,
and the virtual root).
The size of the EXTRA space depends on the dimension
of the input dataset, that is, the number n of taxa in the
MSA and the number m of nucleotides per DNA sequence.
It also depends on the tree shape. Figure 4 illustrates the
worst-case tree in terms of EXTRA space requirements.
Fully balanced trees require maximum EXTRA space,
which amounts to 50% of the memory required to store the
input tip sequences in TIP MEMORY. In a fully balanced
tree, for every group of 4 tips, one inner node needs to
be stored in EXTRA space. In Figure 4, the highlighted
group of 4 tips at Level 1 (tip level) has two parent
nodes/vectors at Level 2 (one level closer to the virtual
root). Vectors A and B (the direct ancestors of the tips),
can be stored in INNER MEMORY. Since both A and
B are inner vectors (stored in INNER MEMORY), their
common ancestor C must be stored in EXTRA space to
avoid a memory port conflict. In this worst case scenario,
every inner vector in the highlighted grey area of Figure 4
must be stored in EXTRA space. Decisions for writing
inner vectors to EXTRA space are orchestrated by an
appropriately adapted parsimonator version. Thus, a
dedicated, reconfigurable EXTRA space control unit is
not necessary. For a fully balanced tree with n tips, the
maximum number of inner nodes $IN_{EX}$ that need to
be stored in EXTRA space during a phylogenetic analysis
is given by the following equation:

$$
IN_{EX} =
\begin{cases}
n/2 - 2 & \text{if } n \bmod 4 = 0 \\
(n + 4 - n \bmod 4)/2 - 2 & \text{if } n \bmod 4 \neq 0
\end{cases}
$$
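
The piecewise bound on the number of inner vectors in EXTRA space translates directly into code. This is a small sketch of the equation only; the function name `max_extra_inner_nodes` is hypothetical and does not reflect how the adapted parsimonator assigns EXTRA space addresses.

```python
def max_extra_inner_nodes(n: int) -> int:
    """Worst-case number of inner vectors kept in EXTRA space (IN_EX).

    n is the number of tips; if n is not a multiple of 4 it is first
    rounded up to the next multiple of 4, as in the second case of
    the equation.
    """
    if n < 4:
        raise ValueError("the bound is only meaningful for n >= 4 tips")
    if n % 4 == 0:
        return n // 2 - 2
    return (n + 4 - n % 4) // 2 - 2


if __name__ == "__main__":
    for n in (8, 9, 10, 12, 16):
        print(n, max_extra_inner_nodes(n))
```

Since n + 4 - (n mod 4) is always a multiple of 4, the integer division by 2 is exact in both cases.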
B. Pipelined Datapath
The generic input command that must be issued to the
pipeline during a clock cycle to initiate parsimony
computations is highlighted at the bottom of Figure 5. Both
operations (NV and EV) require a set of four (two-byte
long) read addresses; they contain the start addresses of the
parsimony child vectors (Q VEC ADDR, R VEC ADDR)
and corresponding parsimony scores (Q SCR ADDR, R
SCR ADDR) of child nodes q and r. An NV operation
also requires two additional write addresses for storing
the parent vector p and the respective score at p.

Figure 5. Top-level design of the pipelined architecture. Blocks shown:
the ADDR GEN stage, the PORT SEL stage (PA, PB, PC, PD, WRP), the PRU
ARRAY stage, the POP CNT stage (population counter), the SCR ACCUM
stage (score accumulator, SCORE MEM), and the CNTRL FSM. Command
fields: Q VEC ADDR, R VEC ADDR, P VEC ADDR, SEL BITS, Q/R SCR ADDR, CMD.
The top-level design of the pipelined datapath has five
stages (see Figure 5). Read/write memory addresses are
initially generated by using 11-bit counters during the ad-
dress generation stage (ADDR GEN). In the next pipeline
stage (PORT SEL), the PA, PB, PC, and PD components
implement multiplexers and logic that decides to which
three (out of four) PRU memory ports the read/write
addresses will be sent. The WRP component generates
the write enable signal for the selected write port.
The PRU array comprises several parallel and com-
pletely independent PRUs. The number of PRUs deter-
mines the array size. All PRUs in the array receive
the same read/write addresses, write enable signal, and
selection bits to perform the same operation on differ-
ent parsimony vector entries (sites of the MSA). Each
PRU contains two memory components and logic. Since
the PRU array is a vector-like component (each PRU
operates independently), using only two 18Kb block
RAM slices allows for tailoring the array size to the
available FPGA according to the number of unoccupied
memory blocks on the device. Therefore, we implemented
a program called FPGA_MaxPars_Gen (included
in FPGA_MaxPars.tar.bz2) for generating VHDL wrapper
files. Thereby, one can instantiate PRU arrays with
user-defined length.
The POP CNT pipeline stage contains a population
counter (to count the number of bits set to 1) for the
computation of partial parsimony scores across alignment
sites (PRUs). If one regards the tip and inner memory
instances of the PRU array as two larger memory compo-
nents, each with depth of 2048 and an array-size dependent
memory line length, a partial score refers to the total
number of mutations across sites that can be stored in
a PRU array memory line (see Section IV-C for details on
the population counter).
Finally, in the SCR ACCUM stage, the parsimony score
is computed with three 32-bit adders. The SCORE MEM
memory is used to store intermediate scores (parsimony
scores for each inner node that defines a subtree). One
adder is used to calculate the sum of the input child node
scores at q and r. The scores for q and r are retrieved
from the SCORE MEM, based on the input SCR ADDR
addresses. The second adder/accumulator sums up the
partial scores that are produced by the population counter.
The last adder computes the final score by adding the
output of the accumulator to the sum of the scores at q
and r. For NV operations, the parsimony score is written
to SCORE MEM.
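
To make the interplay of the PRU array, the population counter, and the three adders more concrete, the following behavioural sketch models one NV operation in software. It is an illustration under simplifying assumptions, not a description of the VHDL: vectors are plain lists of 4-bit integers, `sites_per_line` stands for the number of sites in one PRU array memory line, and the function name `nv_operation` is ours.

```python
MASK32 = 0xFFFFFFFF  # the adders in the SCR ACCUM stage are 32 bits wide


def nv_operation(q_vec, r_vec, qs, rs, sites_per_line):
    """Model one NV operation: return (p_vec, parent_score).

    qs and rs are the subtree scores of the children as they would be
    read from SCORE MEM.
    """
    if len(q_vec) != len(r_vec):
        raise ValueError("child vectors must have the same length")
    p_vec = []
    accumulator = 0
    for start in range(0, len(q_vec), sites_per_line):
        mutation_flags = []
        q_line = q_vec[start:start + sites_per_line]
        r_line = r_vec[start:start + sites_per_line]
        for qv, rv in zip(q_line, r_line):
            shared = qv & rv
            mutation_flags.append(1 if shared == 0 else 0)
            p_vec.append(shared if shared else qv | rv)
        partial = sum(mutation_flags)                     # population counter
        accumulator = (accumulator + partial) & MASK32    # adder 2
    child_sum = (qs + rs) & MASK32                        # adder 1
    return p_vec, (child_sum + accumulator) & MASK32      # adder 3


if __name__ == "__main__":
    q = [0b1000, 0b1000, 0b0100]
    r = [0b0100, 0b1000, 0b0100]
    print(nv_operation(q, r, 0, 0, sites_per_line=2))
```

An EV operation at the virtual root would perform the same computation but discard `p_vec`.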
The size of SCORE MEM (2048 integers) is the upper
limit for the number n of organisms (DNA sequences)
our architecture can accommodate. Accordingly, this max-
imum number of 2048 organisms decreases proportionally
with the number of PRU array memory lines required by
each sequence. This number of PRU array memory lines
increases with the number m of MSA sites/columns.
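
The capacity trade-off between the number of organisms and the alignment length can be estimated as follows. This sketch is only one possible reading of the statement above: it assumes that each sequence occupies ceil(m / (2 * PRUs)) memory lines and that the limit of 2048 is simply divided by that number, ignoring EXTRA space. All names are hypothetical.

```python
import math

MEMORY_LINES = 2048   # depth of the memories / size of SCORE MEM
SITES_PER_PRU = 2     # each PRU processes two sites


def lines_per_sequence(m_sites: int, num_prus: int) -> int:
    """PRU array memory lines needed to hold one sequence of m sites."""
    if m_sites < 1 or num_prus < 1:
        raise ValueError("m_sites and num_prus must be positive")
    return math.ceil(m_sites / (SITES_PER_PRU * num_prus))


def max_organisms(m_sites: int, num_prus: int) -> int:
    """Assumed upper bound on the number of organisms (see lead-in)."""
    return MEMORY_LINES // lines_per_sequence(m_sites, num_prus)


if __name__ == "__main__":
    for m in (1024, 1025, 2048, 8192):
        print(m, lines_per_sequence(m, 512), max_organisms(m, 512))
```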
C. Population Counter
The population counter is implemented as a tree of
adders with increasing width, that is, at each level of
the adder-tree the adders are one bit wider than the
previous level. The input bit-vector size for the first
level depends on the PRU array size. The stand-alone
FPGA_MaxPars_Gen software can be used to generate
a population counter with a user-defined size and
latency. Pipeline registers are optionally inserted as needed
between adder levels in the tree to alleviate the
negative impact of a very large (deep) population counter
component on the overall operating clock frequency. To
the best of our knowledge, FPGA_MaxPars_Gen is the
only open-source population counter generator for FPGAs.
Because the latency of the population counter influences
the total latency of the parsimony architecture pipeline,
FPGA_MaxPars_Gen instantiates a shift register (in the
VHDL wrapper file) to synchronize the population count
computations with the rest of the system.
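
The adder-tree structure of the population counter can be mimicked in software to see how the number of levels grows with the input width. The sketch below is a behavioural model with a hypothetical function name; it is not the generator code of FPGA_MaxPars_Gen and does not model pipeline registers.

```python
def adder_tree_popcount(bits):
    """Count the bits set to 1 with a balanced tree of adders.

    Returns (count, levels). At level k (starting from 1) every
    partial sum is at most 2**k, so it fits into k + 1 bits, which
    matches adders that grow by one bit per level.
    """
    level = list(bits)
    if any(b not in (0, 1) for b in level):
        raise ValueError("input must consist of 0/1 values")
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    print(adder_tree_popcount([1, 0, 1, 1, 0, 1, 1, 1]))
    print(adder_tree_popcount([1] * 128))
```

The number of levels grows logarithmically with the input width, which is why optional pipeline registers between levels help to maintain the clock frequency for large PRU arrays.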
V. IMPLEMENTATION, VERIFICATION, AND RESULTS
We describe the verification of the reconfigurable ar-
chitecture in Section V-A. Then, we present a PC-FPGA
prototype system (Section V-B) and a performance evalu-
ation for a larger reconfigurable system (Section V-C).
A. Verification of the Parsimony Architecture
Initially, we modeled our architecture (including the
address assignment to EXTRA space) in parsimonator
using C. We replaced the standard NV and EV functions
(accounting for 99% of total execution time) by imple-
mentations that reflected our reconfigurable architecture
to assess and confirm the correctness of our approach.
Thereafter, the reconfigurable architecture was
implemented in VHDL and mapped on a Virtex 5 SX95T-1 FPGA.
We verified the correctness of the hardware
system by extensive post place and route simulations using
Modelsim 6.3f by Mentor Graphics and tests on an actual
chip using a HTG-V5-PCIE development board with a
Virtex 5 SX95T FPGA. Chipscope Pro Analyzer was used
to monitor the input and output ports of the design.

Table I
RESOURCES/PERFORMANCE OF VIRTEX 5 AND VIRTEX 6 SYSTEMS.

                     64-PRU System     512-PRU System
Device               Virtex 5 SX95T    Virtex 6 SX475T
Slice Registers      5,568 (9%)        41,091 (6%)
Slice LUTs           4,133 (7%)        22,520 (7%)
Occupied Slices      1,933 (13%)       9,608 (12%)
Block RAMs (18Kb)    132 (27%)         1,028 (48%)
Frequency (MHz)      192.374           188.456
B. PC-FPGA Prototype System
After successful verification, a fully operational
PC-FPGA prototype system was designed to test this
implementation of parsimonator using actual biological
datasets on an actual board. We used the C interface of our
open-source PC-FPGA communication platform [20],
available at http://opencores.org/project,pc_fpga_com, to
transfer bit-encoded DNA sequences and issue 13-byte
long NV/EV commands to the board. On the FPGA side,
the DNA sequences were used to initialize the TIP MEMORY,
and the NV/EV commands to trigger computations.
The receiving background reader mechanism provided by
this platform was used to receive parsimony scores on
the PC side after an EV command had been issued to
the board. The FPGA_MaxPars_Gen program was used
to generate a PRU array of size/width 64 as well as a
correctly sized population counter for the prototype
system, that is, 64 PRUs were placed in parallel, allowing 128
alignment sites to be processed simultaneously (remember
that each PRU can compute the parsimony score for two
sites; see Section IV-A).
C. Results
To present a fair performance assessment for our
accelerator architecture we created a high performance instance
of our architecture (using FPGA_MaxPars_Gen) with an
array of 512 PRUs and mapped it on a Virtex 6 SX475T-2
FPGA. Furthermore, we vectorized parsimonator
with 256-bit wide AVX SIMD instructions. An evaluation
of the prototype and high performance systems regarding
resources and clock frequencies is provided in Table I.
Note that the largest currently available FPGA with
respect to available block RAM slices (Virtex 7 VX865T)
can accommodate an array of 1800 PRUs and thus allows
for computing 3600 sites in parallel.
Table II shows execution times (in seconds) for real-
world biological datasets using the SSE3 and AVX ver-
sions of parsimonator (using one core of an Intel i7-
2600 CPU at 3.40GHz) and the reconfigurable architecture
with 512 PRUs (mapped on the Virtex 6 device). The
FPGA accelerator is up to 9.65 times faster than the
optimized software.


Journal ArticleDOI
TL;DR: The Phylogenetic Likelihood Library is introduced, a highly optimized application programming interface for developing likelihood-based phylogenetic inference and postanalysis software that improves the sequential performance of current software by a factor of 2–10 while requiring only 1 month of programming time for integration.
Abstract: We introduce the Phylogenetic Likelihood Library (PLL), a highly optimized application programming interface for developing likelihood-based phylogenetic inference and postanalysis software. The PLL implements appropriate data structures and functions that allow users to quickly implement common, error-prone, and labor-intensive tasks, such as likelihood calculations, model parameter as well as branch length optimization, and tree space exploration. The highly optimized and parallelized implementation of the phylogenetic likelihood function and a thorough documentation provide a framework for rapid development of scalable parallel phylogenetic software. By example of two likelihood-based phylogenetic codes we show that the PLL improves the sequential performance of current software by a factor of 2-10 while requiring only 1 month of programming time for integration. We show that, when numerical scaling for preventing floating point underflow is enabled, the double precision likelihood calculations in the PLL are up to 1.9 times faster than those in BEAGLE. On an empirical DNA dataset with 2000 taxa the AVX version of PLL is 4 times faster than BEAGLE (scaling enabled and required). The PLL is available at http://www.libpll.org under the GNU General Public License (GPL).

72 citations


Cites methods from "FPGA Acceleration of the Phylogenet..."

  • ...Finally, PLL also includes a highly optimized and vectorized parsimony implementation (Alachiotis and Stamatakis 2011) that can be used to generate parsimony starting trees or to filter promising SPR and TBR moves based on their parsimony scores....

    [...]


01 Jan 2017
TL;DR: An overview of the different topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units like e.g. CPUs is provided.
Abstract: Since their introduction, FPGAs can be seen in more and more different fields of applications. The key advantage is the combination of software-like flexibility with the performance otherwise common to hardware. Nevertheless, every application field introduces special requirements to the used computational architecture. This paper provides an overview of the different topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units like e.g. CPUs.

22 citations


Journal ArticleDOI
TL;DR: This accelerated version of PaPaRa provides a significant performance improvement that allows for analyzing larger datasets in less time; the authors observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the “competing programmer approach” is deployed.
Abstract: Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel. We optimized and parallelized PaPaRa on CPUs and GPUs. Via SSE 4.1 SIMD (Single Instruction, Multiple Data) intrinsics for x86 SIMD architectures and multi-threading, we obtained a 9-fold acceleration on a single core as well as linear speedups with respect to the number of cores. The peak CPU performance amounts to 18.1 GCUPS (Giga Cell Updates per Second) using all four physical cores on an Intel i7 2600 CPU running at 3.4 GHz. The average CPU performance (averaged over all test runs) is 12.33 GCUPS. We also used OpenCL to execute PaPaRa on a GPU SIMT (Single Instruction, Multiple Threads) architecture. An NVIDIA GeForce 560 GPU delivered peak and average performance of 22.1 and 18.4 GCUPS, respectively. Finally, we combined the SIMD and SIMT implementations into a hybrid CPU-GPU system that achieved an accumulated peak performance of 33.8 GCUPS. This accelerated version of PaPaRa (available at http://www.exelixis-lab.org/software.html) provides a significant performance improvement that allows for analyzing larger datasets in less time. We observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the “competing programmer approach” is deployed. Finally, we show that overall performance can be substantially increased by designing a hybrid CPU-GPU system with appropriate load distribution mechanisms.

18 citations


Cites methods from "FPGA Acceleration of the Phylogenet..."

  • ...As in previous work on using accelerators (FPGA versus x86 with AVX intrinsics [15]) for computing the phylogenetic parsimony kernel [16,17], SB explicitly worked on obtaining the best possible x86 performance and NA competed with SB to obtain the best possible GPU performance....



Proceedings ArticleDOI
01 Dec 2012
TL;DR: A novel feature to automatically configure (previously hard-coded) internal settings on the FPGA is provided, which substantially reduces the installation overhead when an FPGA communicates with several different PCs.
Abstract: We present a substantially improved version of our popular UDP/IP core for simple and fast PC ↔ FPGA communication over Gigabit Ethernet. We provide a novel feature to automatically configure (previously hard-coded) internal settings on the FPGA. Thereby, we substantially reduce the installation overhead when an FPGA communicates with several different PCs. The UDP/IP core is designed to occupy a minimum amount of hardware resources on the FPGA. On the PC side, this new automatic configuration protocol can be invoked via a C software interface that provides convenient functions for setting up the connection to the FPGA device and sending/retrieving arrays of common C data types to/from the UDP/IP core on the FPGA. The initial UDP/IP core version is available under the LGPL license at http://opencores.org/project,udp_ip__core, while the improved version of the core, including the C software interface (also under LGPL), is available at http://opencores.org/project,pc_fpga_com.

15 citations


Cites methods from "FPGA Acceleration of the Phylogenet..."

  • ...Phylogenetic Parsimony Kernel: The reconfigurable architecture we used for the third experiment was designed to accelerate the parsimony kernel [16] for building phylogenetic trees as implemented in the open-source Parsimonator code [17]....



Journal ArticleDOI
Pablo A. Goloboff
TL;DR: This contribution discusses new heuristic methods for parsimony analysis, including methods highly praised by their authors, such as Hydra, Sampars and GA + PR + LS.
Abstract: In recent years, several publications in computer science journals have proposed new heuristic methods for parsimony analysis. This contribution discusses those papers, including methods highly praised by their authors, such as Hydra, Sampars and GA + PR + LS. Trees of comparable or better scores can be obtained using the program TNT, but from one to three orders of magnitude faster. In some cases, the search methods are very similar to others long in use in phylogenetics, but the enormous speed differences seem to correspond more to poor implementations than to actual differences in the methods themselves.

8 citations


Cites background from "FPGA Acceleration of the Phylogenet..."

  • ...Alachiotis and Stamatakis (2011) found similar problems with Kasap and Benkrid’s (2010) earlier claim of acceleration of parsimony calculations by hundreds or thousands of times: the virtues of their FPGA implementation had been grossly overestimated due to improper comparisons – Kasap and Benkrid…...



References

Journal Article

16,846 citations



Journal ArticleDOI
Joseph Felsenstein
TL;DR: A computationally feasible method for finding such maximum likelihood estimates is developed, and a computer program is available that allows the testing of hypotheses about the constancy of evolutionary rates by likelihood ratio tests.
Abstract: The application of maximum likelihood techniques to the estimation of evolutionary trees from nucleic acid sequence data is discussed. A computationally feasible method for finding such maximum likelihood estimates is developed, and a computer program is available. This method has advantages over the traditional parsimony algorithms, which can give misleading results if rates of evolution differ in different lineages. It also allows the testing of hypotheses about the constancy of evolutionary rates by likelihood ratio tests, and gives rough indication of the error of the estimate of the tree.

12,078 citations


"FPGA Acceleration of the Phylogenet..." refers methods in this paper

  • ...Since DNA sequences for distinct organisms typically have different lengths, a so-called multiple sequence alignment (MSA) of the DNA sequences is computed prior to conducting a phylogenetic analysis using character-based methods (Maximum Parsimony [2] or Maximum Likelihood [3])....



Journal ArticleDOI
20 Jan 1967-Science

3,417 citations


"FPGA Acceleration of the Phylogenet..." refers methods in this paper

  • ...Since DNA sequences for distinct organisms typically have different lengths, a so-called multiple sequence alignment (MSA) of the DNA sequences is computed prior to conducting a phylogenetic analysis using character-based methods (Maximum Parsimony [2] or Maximum Likelihood [3])....



Journal ArticleDOI
TL;DR: New methods for parsimony analysis of large data sets are presented, including sectorial searches, tree‐drifting, and tree‐fusing which find a shortest tree in less than 10 min and perform well in other cases analyzed.
Abstract: New methods for parsimony analysis of large data sets are presented. The new methods are sectorial searches, tree-drifting, and tree-fusing. For Chase et al.'s 500-taxon data set these methods (on a 266-MHz Pentium II) find a shortest tree in less than 10 min (i.e., over 15,000 times faster than PAUP and 1000 times faster than PAUP*). Making a complete parsimony analysis requires hitting minimum length several times independently, but not necessarily all “islands”; for Chase et al.'s data set, this can be done in 4 to 6 h. The new methods also perform well in other cases analyzed (which range from 170 to 854 taxa).

850 citations


"FPGA Acceleration of the Phylogenet..." refers methods in this paper

  • ...The fastest available parsimony program is TNT [19]....



Performance Metrics
No. of citations received by the paper in previous years

Year  Citations
2022  1
2020  1
2019  1
2017  4
2015  4
2014  2