
Shouji: a fast and efficient pre-alignment filter for sequence alignment

Sequence analysis
Mohammed Alser 1,2,3,*, Hasan Hassan 1, Akash Kumar 2, Onur Mutlu 1,3,* and Can Alkan 3,*
1 Computer Science Department, ETH Zürich, Zürich 8092, Switzerland
2 Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, 01062 Dresden, Germany
3 Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey
*To whom correspondence should be addressed.
Associate Editor: Inanc Birol
Received on September 13, 2018; revised on February 27, 2019; editorial decision on March 7, 2019; accepted on March 27, 2019
Abstract
Motivation: The ability to generate massive amounts of sequencing data continues to overwhelm
the processing capability of existing algorithms and compute infrastructures. In this work, we
explore the use of hardware/software co-design and hardware acceleration to significantly reduce
the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes.
We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces
the need for computationally-costly dynamic programming algorithms. The first key idea of our
proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all com-
mon subsequences shared between two given sequences. The second key idea is to design a
hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to
further boost the performance of our algorithm.
Results: Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders
of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our
FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU imple-
mentation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji
with five state-of-the-art sequence aligners, designed for different computing platforms. The addition
of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art
sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that
performs sequence alignment for verification. Unlike most existing methods that aim to accelerate
sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify
or replace the alignment step.
Availability and implementation: https://github.com/CMU-SAFARI/Shouji.
Contact: mohammed.alser@inf.ethz.ch or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Bioinformatics, 35(21), 2019, 4255–4263
doi: 10.1093/bioinformatics/btz234
Advance Access Publication Date: 28 March 2019
Original Paper
1 Introduction
One of the most fundamental computational steps in most bioinformatics
analyses is the detection of the differences/similarities between
two genomic sequences. Edit distance and pairwise alignment
are two approaches to achieve this step, formulated as approximate
string matching (Navarro, 2001). The edit distance approach is a
measure of how much two sequences differ. It calculates the minimum
number of edits needed to convert one sequence into the other.
The higher the edit distance, the more the sequences differ from
one another. Commonly allowed edit operations include deletion,
insertion and substitution of characters in one or both sequences.
Pairwise alignment is a measure of how much the sequences are
alike. It calculates the alignment, which is an ordered list of characters
representing possible edit operations and matches required to
change one of the two given sequences into the other. As any two
sequences can have several different arrangements of the edit opera-
tions and matches (and hence different alignments), the alignment
algorithm usually involves a backtracking step. This step finds the
alignment that has the highest alignment score (called optimal align-
ment). The alignment score is the sum of the scores of all edits and
matches along the alignment implied by a user-defined scoring func-
tion. The edit distance and pairwise alignment approaches are non-
additive measures (Calude et al., 2002). This means that if we divide
the sequence pair into two consecutive subsequence pairs, the edit
distance of the entire sequence pair is not necessarily equivalent to
the sum of the edit distances of the shorter pairs. Instead, we need to
examine all possible prefixes of the two input sequences and keep
track of the pairs of prefixes that provide an optimal solution.
Enumerating all possible prefixes is necessary for tolerating edits
that result from both sequencing errors (Fox et al., 2014) and genet-
ic variations (McKernan et al., 2009). Therefore, the edit distance
and pairwise alignment approaches are typically implemented as
dynamic programming algorithms to avoid re-examining the same
prefixes many times. These implementations, such as Levenshtein
distance (Levenshtein, 1966), Smith–Waterman (Smith and
Waterman, 1981) and Needleman–Wunsch (Needleman and
Wunsch, 1970), are inefficient as they have quadratic time and space
complexity [i.e. $O(m^2)$ for a sequence length of m]. Many attempts
were made to boost the performance of existing sequence aligners.
Despite more than three decades of attempts, the fastest known edit
distance algorithm (Masek and Paterson, 1980) has a running time
of $O(m^2/\log_2 m)$ for sequences of length m, which is still nearly
quadratic (Backurs and Indyk, 2017). Therefore, more recent works
tend to follow one of two key new directions to boost the performance
of sequence alignment and edit distance implementations:
(i) accelerating the dynamic programming algorithms using hardware
accelerators and (ii) developing filtering heuristics that reduce the
need for the dynamic programming algorithms, given an edit distance
threshold.
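To make the quadratic cost concrete, the following is a minimal Python sketch of the classical Levenshtein dynamic program. It is an illustration only, not code from the paper; the function name `edit_distance` is hypothetical. It keeps just two rows of the matrix, so it uses linear space, but it still evaluates every one of the m × m entries. Recovering the alignment itself through backtracking would require keeping the full matrix.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classical dynamic program (illustrative sketch)."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Each entry depends on its left, upper and upper-left neighbors, which is the data dependency that makes the computation hard to parallelize.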
Hardware accelerators are becoming increasingly popular for
speeding up the computationally expensive alignment and edit dis-
tance algorithms (Al Kawam et al., 2017; Aluru and Jammula,
2014; Ng et al., 2017; Sandes et al., 2016). Hardware accelerators
include multi-core and single instruction multiple data (SIMD) cap-
able central processing units (CPUs), graphics processing units
(GPUs) and field-programmable gate arrays (FPGAs). The classical
dynamic programming algorithms are typically accelerated by computing
only the necessary regions (i.e. diagonal vectors) of the
dynamic programming matrix rather than the entire matrix, as proposed
in Ukkonen’s banded algorithm (Ukkonen, 1985). The number
of the diagonal bands required for computing the dynamic
programming matrix is 2E + 1, where E is a user-defined edit distance
threshold. The banded algorithm is still beneficial even with
its recent sequential implementations as in Edlib (Šošić and Šikić,
2017). The Edlib algorithm is implemented in C for standard CPUs
and it calculates the banded Levenshtein distance. Parasail (Daily,
2016) exploits both Ukkonen’s banded algorithm and SIMD-
capable CPUs to compute a banded alignment for a sequence pair
with a user-defined scoring function. SIMD instructions offer signifi-
cant parallelism to the matrix computation by executing the same
vector operation on multiple operands at once. The multi-core archi-
tecture of CPUs and GPUs provides the ability to compute align-
ments of many sequence pairs independently and concurrently
(Georganas et al., 2015; Liu and Schmidt, 2015). GSWABE (Liu and
Schmidt, 2015) exploits GPUs (Tesla K40) for highly parallel com-
putation of global alignment with a user-defined scoring function.
CUDASW++ 3.0 (Liu et al., 2013) exploits the SIMD capability of
both CPUs and GPUs (GTX690) to accelerate the computation of
the Smith–Waterman algorithm with a user-defined scoring function.
CUDASW++ 3.0 provides only the optimal score, not the optimal
alignment (i.e. no backtracking step). Other designs, for
instance FPGASW (Fei et al., 2018), exploit the very large number
of hardware execution units in FPGAs (Xilinx VC707) to form a lin-
ear systolic array (Kung, 1982). Each execution unit in the systolic
array is responsible for computing the value of a single entry of the
dynamic programming matrix. The systolic array computes a single
vector of the matrix at a time. The data dependencies between the
entries restrict the systolic array to computing the vectors sequential-
ly (e.g. top-to-bottom, left-to-right or in an anti-diagonal manner).
FPGA accelerators seem to yield the highest performance gain compared
to the other hardware accelerators (Banerjee et al., 2018;
Chen et al., 2016; Fei et al., 2018; Waidyasooriya and Hariyama,
2015). However, many of these efforts either simplify the scoring
function, or only take into account accelerating the computation of
the dynamic programming matrix without providing the optimal
alignment as in Chen et al. (2014), Liu et al. (2013) and Nishimura
et al. (2017). Different and more sophisticated scoring functions are
typically needed to better quantify the similarity between two
sequences (Henikoff and Henikoff, 1992; Wang et al., 2011). The
backtracking step required for the optimal alignment computation
involves unpredictable and irregular memory access patterns, which
poses a difficult challenge for efficient hardware implementation.
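The banded idea described above can be sketched in Python as follows. This is a hedged illustration of restricting the recurrence to the 2E + 1 diagonals around the main diagonal; it is not Edlib's or Ukkonen's actual implementation, and the name `banded_edit_distance` is hypothetical. For simplicity each row is still allocated at full length, and only the entries inside the band are evaluated.

```python
def banded_edit_distance(a: str, b: str, E: int):
    """Return the edit distance if it is at most E, else None (illustrative sketch)."""
    n, m = len(a), len(b)
    if abs(n - m) > E:
        return None  # the length difference alone already exceeds the threshold
    INF = E + 1      # any value above E means "outside the band or too costly"
    prev = [j if j <= E else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo, hi = max(0, i - E), min(m, i + E)  # band: |i - j| <= E
        for j in range(lo, hi + 1):
            if j == 0:
                cur[0] = i
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
            cur[j] = min(best, INF)  # cap so out-of-threshold values stay bounded
        prev = cur
    return prev[m] if prev[m] <= E else None


print(banded_edit_distance("kitten", "sitting", 3))
print(banded_edit_distance("kitten", "sitting", 2))
```

Any alignment with at most E edits never leaves the band, which is why the entries outside it can be skipped without changing the answer for pairs within the threshold.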
Pre-alignment filtering heuristics aim to quickly eliminate some
of the dissimilar sequences before using the computationally expen-
sive optimal alignment algorithms. There are a few existing filtering
techniques, such as the Adjacency Filter (Xin et al., 2013), which is
implemented for standard CPUs as part of FastHASH (Xin et al.,
2013). SHD (Xin et al., 2015) is a SIMD-friendly bit-vector filter
that provides higher filtering accuracy compared to the Adjacency
Filter. GRIM-Filter (Kim et al., 2018) exploits the high memory
bandwidth and the logic layer of 3D-stacked memory to perform
highly-parallel filtering in the DRAM chip itself. GateKeeper (Alser
et al., 2017a) is designed to utilize the large amounts of parallelism
offered by FPGA architectures. MAGNET (Alser et al., 2017b)
shows a low number of falsely accepted sequence pairs but its cur-
rent implementation is much slower than that of SHD or
GateKeeper. GateKeeper (Alser et al., 2017a) provides a high filtering
speed but suffers from a relatively high number of falsely accepted
sequence pairs.
Our goal in this work is to significantly reduce the time spent on
calculating the optimal alignment of short sequences and maintain
high filtering accuracy. To this end, we introduce Shouji (named
after a traditional Japanese door that is designed to slide open,
http://www.aisf.or.jp/jaanus/deta/s/shouji.htm), a new, fast and very
accurate pre-alignment filter. Shouji is based on two key ideas: (i) a
new filtering algorithm that remarkably reduces the need for computationally
expensive banded optimal alignment by rapidly excluding
dissimilar sequences from the optimal alignment calculation and (ii)
judicious use of the parallelism-friendly architecture of modern
FPGAs to greatly speed up this new filtering algorithm.
The contributions of this paper are as follows:
• We introduce Shouji, a highly parallel and highly accurate pre-alignment filter, which uses a sliding search window approach to quickly identify dissimilar sequences without the need for computationally expensive alignment algorithms.
• We overcome the implementation limitations of MAGNET (Alser et al., 2017b). We build two hardware accelerator designs that adopt modern FPGA architectures to boost the performance of both Shouji and MAGNET.
• We provide a comprehensive analysis of the run time and space complexity of the Shouji and MAGNET algorithms. Shouji and MAGNET are asymptotically inexpensive and run in linear time with respect to the sequence length and the edit distance threshold.
• We demonstrate that Shouji and MAGNET significantly improve the accuracy of pre-alignment filtering by up to two and four orders of magnitude, respectively, compared to GateKeeper and SHD.
• We demonstrate that our FPGA implementations of Shouji and MAGNET are two to three orders of magnitude faster than their CPU implementations. We demonstrate that integrating Shouji with five state-of-the-art aligners reduces the execution time of the sequence aligner by up to 18.8×.
2 Materials and methods
2.1 Overview
Our goal is to quickly reject dissimilar sequences with high accuracy
such that we reduce the need for the computationally-costly alignment
step. To this end, we propose the Shouji algorithm to achieve
highly accurate filtering. Then, we accelerate Shouji by taking
advantage of the parallelism of FPGAs to achieve fast filtering operations.
The key filtering strategy of Shouji is inspired by the pigeonhole
principle, which states that if E items are distributed into E + 1
boxes, then one or more boxes would remain empty. In the context
of pre-alignment filtering, this principle provides the following key
observation: if two sequences differ by E edits, then the two sequences
should share at least a single common subsequence (i.e. free of
edits) and at most E + 1 non-overlapping common subsequences,
where E is the edit distance threshold. With the existence of at most
E edits, the total length of these non-overlapping common subsequences
should not be less than m − E, where m is the sequence length.
Shouji employs the pigeonhole principle to decide whether or not
two sequences are potentially similar. Shouji finds all the
non-overlapping subsequences that exist in both sequences. If the total
length of these common subsequences is less than m − E, then there exist more
edits than the allowed edit distance threshold, and hence Shouji
rejects the two given sequences. Otherwise, Shouji accepts the two
sequences. Next, we discuss the details of Shouji.
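As a worked instance of this check, consider the example pair that the paper presents in Figure 1, with m = 12, E = 3 and the three common subsequences GGTG, AGAG and T. The arithmetic below is only an illustration of the acceptance condition stated above.

```latex
\[
  \underbrace{|GGTG| + |AGAG| + |T|}_{\text{total common length}}
  = 4 + 4 + 1 = 9
  \;\ge\; m - E = 12 - 3 = 9
  \quad\Longrightarrow\quad \text{accept.}
\]
```

Had the total common length been 8 or less for the same m and E, the condition would fail and the pair would be rejected.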
2.2 Shouji pre-alignment filter
Shouji identifies the dissimilar sequences, without calculating the
optimal alignment, in three main steps. (i) The first step is to con-
struct what we call a neighborhood map that visualizes the pairwise
matches and mismatches between two sequences given an edit dis-
tance threshold of E characters. (ii) The second step is to find all the
non-overlapping common subsequences in the neighborhood map
using a sliding search window approach. (iii) The last step is to ac-
cept or reject the given sequence pairs based on the length of the
found matches. If the length of the found matches is small, then
Shouji rejects the input sequence pair.
2.2.1 Building the neighborhood map
The neighborhood map, N, is a binary m by m matrix, where m is the
sequence length. Given a text sequence T[1...m], a pattern
sequence P[1...m], and an edit distance threshold E, the neighborhood
map represents the comparison result of the ith character of P with
the jth character of T, where i and j satisfy $1 \le i \le m$ and
$i - E \le j \le i + E$. The entry N[i, j] of the neighborhood map can be
calculated as follows:
$$N[i, j] = \begin{cases} 0, & \text{if } P[i] = T[j] \\ 1, & \text{if } P[i] \neq T[j] \end{cases} \qquad (1)$$
We present in Figure 1 an example of a neighborhood map for
two sequences, where a pattern P differs from a text T by three
edits.
The entry N[i, j] is set to zero if the ith character of the pattern
matches the jth character of the text. Otherwise, it is set to one. The
way we build our neighborhood map ensures that computing each
of its entries is independent of every other, and thus the entire map
can be computed all at once in a parallel fashion. Hence, our neigh-
borhood map is well suited for highly parallel computing platforms
(Alser et al., 2017a; Seshadri et al., 2017). Note that in sequence
alignment algorithms, computing each entry of the dynamic programming
matrix depends on the values of its immediate left, upper-left
and upper neighboring entries. Unlike the ‘dot plot’ or ‘dot
matrix’ (a visual representation of the similarities between two closely
similar genomic sequences) that is used in FASTA/FASTP (Lipman
and Pearson, 1985), our neighborhood map computes only the necessary
diagonals near the main diagonal of the matrix (e.g. seven diagonals
shown in Fig. 1).
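Equation (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the function name `neighborhood_map` is hypothetical, and it assumes that both sequences have the same length m. Each row i keeps only the entries with i − E ≤ j ≤ i + E, and every entry is computed independently of all others.

```python
def neighborhood_map(text: str, pattern: str, E: int):
    """Rows of the banded neighborhood map: 0 for a match, 1 for a mismatch."""
    m = len(text)
    assert len(pattern) == m, "this sketch assumes equal-length sequences"
    rows = []
    for i in range(m):  # 0-based row index into the pattern
        lo, hi = max(0, i - E), min(m - 1, i + E)
        rows.append([0 if pattern[i] == text[j] else 1 for j in range(lo, hi + 1)])
    return rows


# The pair used in the paper's Figure 1 example, with E = 3.
for row in neighborhood_map("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3):
    print("".join(map(str, row)))
```

Because no entry reads another entry, the two nested loops could in principle be evaluated in any order or all at once, which is the property the paragraph above relies on.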
[Figure 1: neighborhood map with the text T along the columns (j = 1, ..., 12) and the pattern P along the rows (i = 1, ..., 12), marking the three common subsequences, the search windows (#1 to #8 are labeled) and the last bottom right entry. Shouji bit-vector: 000010000101.]
Fig. 1. Neighborhood map (N) and the Shouji bit-vector, for text
T = GGTGCAGAGCTC and pattern P = GGTGAGAGTTGT for E = 3. The three
common subsequences (i.e. GGTG, AGAG and T) are highlighted in gray. We
use a search window of size four columns (two examples of which are
highlighted in black) with a step size of a single column. Shouji searches
diagonally within each search window for the 4-bit vector that has the largest number
of zeros. Once found, Shouji examines if the found 4-bit vector maximizes the
number of zeros at the corresponding location of the 4-bit vector in the Shouji
bit-vector. If so, then Shouji stores this 4-bit vector in the Shouji bit-vector at
its corresponding location.
2.2.2 Identifying the diagonally consecutive matches
The key goal of this step is to accurately find all the non-overlapping
common subsequences shared between a pair of sequences. The
accuracy of finding these subsequences is crucial for the overall fil-
tering accuracy, as the filtering decision is made solely based on total
subsequence length. With the existence of E edits, there are at most
E + 1 non-overlapping common subsequences (based on the
pigeonhole principle) shared between a pair of sequences. Each
non-overlapping common subsequence is represented as a streak of
diagonally consecutive zeros in the neighborhood map (as highlighted in
Fig. 1). These streaks of diagonally consecutive zeros are
distributed along the diagonals of the neighborhood map without
any prior information about their length or number. One way of
finding these common subsequences is to use a brute-force approach,
which examines all the streaks of diagonally consecutive
zeros that start at the first column and selects the streak that has the
largest number of zeros as the first common subsequence. It then
iterates over the remaining part of the neighborhood map to find the
other common subsequences. However, this brute-force approach is
infeasible for highly optimized hardware implementation as the
search space is unknown at design time. Shouji overcomes this issue
by dividing the neighborhood map into equal-size parts. We call
each part a search window. Limiting the size of the search space
from the entire neighborhood map to a search window has three key
benefits. (i) It helps to provide a scalable architecture that can be
implemented for any sequence length and edit distance threshold.
(ii) Downsizing the search space into a reasonably small sub-matrix
with a known dimension at design time limits the number of all possible
permutations of each bit-vector to $2^n$, where n is the search
window size. This reduces the size of the look-up tables (LUTs)
required for an FPGA implementation and simplifies the overall
design. (iii) Each search window is considered as a smaller
sub-problem that can be solved independently and rapidly with high
parallelism. Shouji uses a search window that is four columns wide, as
we illustrate in Figure 1. We need m search windows for processing
two sequences, each of which is of length m characters. Each search
window overlaps with its next neighboring search window by three
columns. This ensures covering the entire neighborhood map and
finding all the common subsequences regardless of their starting
location. We select the width of each search window to be four col-
umns to guarantee finding the shortest possible common subse-
quence, which is a single match located between two mismatches
(i.e. ‘101’). However, we observe that the bit pattern ‘101’ is not
always a part of the correct alignment (or the common
subsequences). For example, the bit pattern ‘101’ exists once as a
part of the correct alignment in Figure 1, but it also appears five
times in other different locations that are not included in the correct
alignment. To improve the accuracy of finding the diagonally con-
secutive matches, we increase the length of the diagonal vector to be
examined to four bits. We also experimentally evaluate different
search window sizes in Supplementary Materials, Section 6.1. We
find that a search window size of four columns provides the highest
filtering accuracy without falsely rejecting similar sequences.
Shouji finds the diagonally consecutive matches that are part of
the common subsequences in the neighborhood map in two main
steps. Step 1: for each search window, Shouji finds a 4-bit diagonal
vector that has the largest number of zeros. Shouji greedily considers
this vector as a part of the common subsequence as it has the least
possible number of edits (i.e. 1’s). Always finding the maximum
number of matches is necessary to avoid overestimating the actual
number of edits and thus to preserve all similar sequences.
Shouji achieves this step by comparing the 4 bits of each of the
2E + 1 diagonal vectors within a search window and selecting the
4-bit vector that has the largest number of zeros. In the case where
two 4-bit vectors have the same number of zeros, Shouji
breaks the tie by selecting the first one that has a leading zero.
Then, Shouji slides the search window by a single column (i.e. step
size = 1 column) toward the last bottom right entry of the neighborhood
map and repeats the previous computations. Thus, Shouji
performs ‘Step 1’ m times using m search windows, where m is the
sequence length. Step 2: the last step is to gather the results found
for each search window (i.e. 4-bit vector that has the largest number
of zeros) and construct back all the diagonally consecutive matches.
For this purpose, Shouji maintains a Shouji bit-vector of length m
that stores all the zeros found in the neighborhood map as we illus-
trate in Figure 1. For each sliding search window, Shouji examines if
the selected 4-bit vector maximizes the number of zeros in the
Shouji bit-vector at the same corresponding location. If so, Shouji
stores the selected 4-bit vector in the Shouji bit-vector at the same
corresponding location. This is necessary to avoid overestimating
the number of edits between two given sequences. The common sub-
sequences are represented as streaks of consecutive zeros in the
Shouji bit-vector.
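The two steps above can be sketched in Python as follows. This is a hedged sketch under several assumptions that the text does not fix, and it is not the authors' pseudocode (which the paper places in the Supplementary Materials): the function names are hypothetical, entries that fall outside the sequences are treated as mismatches, diagonals are scanned from −E to +E, and only the m − 3 full-width windows are processed rather than the m windows mentioned in the text.

```python
def diagonal_vectors(text: str, pattern: str, E: int):
    """One length-m bit list per diagonal d = j - i, for d in -E..E."""
    m = len(text)
    diagonals = []
    for d in range(-E, E + 1):
        bits = []
        for j in range(m):
            i = j - d
            inside = 0 <= i < m
            # Assumption: positions outside the pattern count as mismatches.
            bits.append(0 if inside and pattern[i] == text[j] else 1)
        diagonals.append(bits)
    return diagonals


def shouji_bit_vector(text: str, pattern: str, E: int, width: int = 4):
    m = len(text)
    diagonals = diagonal_vectors(text, pattern, E)
    shouji = [1] * m  # start pessimistic: every position is an edit
    for start in range(m - width + 1):  # slide the window one column at a time
        candidates = [d[start:start + width] for d in diagonals]
        most_zeros = max(c.count(0) for c in candidates)
        tied = [c for c in candidates if c.count(0) == most_zeros]
        # Tie-break: first candidate with a leading zero, else the first one.
        best = next((c for c in tied if c[0] == 0), tied[0])
        # Store the window only if it increases the zeros at this location.
        if best.count(0) > shouji[start:start + width].count(0):
            shouji[start:start + width] = best
    return shouji


bits = shouji_bit_vector("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3)
print("".join(map(str, bits)))  # 000010000101 for the Figure 1 pair
```

Under these assumptions the sketch reproduces the Shouji bit-vector shown in Figure 1. The leading-zero tie-break matters in this example: in the window that covers columns 4 to 7, choosing 1100 instead of 0110 would leave a spurious one at column 4.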
2.2.3 Filtering out dissimilar sequences
The last step of Shouji is to calculate the total number of edits (i.e.
ones) in the Shouji bit-vector. Shouji examines if the total number of
ones in the Shouji bit-vector is greater than E. If so, Shouji excludes the two
sequences from the optimal alignment calculation. Otherwise,
Shouji considers the two sequences similar within the allowed edit
distance threshold and allows their optimal alignment to be com-
puted using optimal alignment algorithms. The Shouji bit-vector
represents the differences between two sequences along the entire
length of the sequence, m. However, Shouji is not limited to end-to-
end edit distance calculation. Shouji is also able to provide edit dis-
tance calculation in local and glocal (semi-global) fashion. For ex-
ample, achieving local edit distance calculation requires ignoring the
ones that are located at the two ends of the Shouji bit-vector.
We present an example of local edit distance between two sequences
of different length in Supplementary Materials, Section 8. Achieving
glocal edit distance calculation requires excluding the ones that are
located at one of the two ends of the Shouji bit-vector from the total
count of the ones in the Shouji bit-vector. This is important for cor-
rect pre-alignment filtering for global, local and glocal alignment
algorithms. We provide the pseudocode of Shouji and discuss its
computational complexity in Supplementary Materials, Section 6.2.
We also present two examples of applying the Shouji filtering algo-
rithm in Supplementary Materials, Section 8.
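The filtering decision can be sketched as a count over the Shouji bit-vector. The sketch below is illustrative only: the names `count_edits` and `shouji_accepts` and the two keyword flags are hypothetical, and the treatment of the ends (dropping the whole run of ones at an end) is one possible reading of the local and glocal variants described above.

```python
def count_edits(bits, ignore_leading=False, ignore_trailing=False):
    """Count the ones in a Shouji bit-vector, optionally skipping runs at the ends."""
    lo, hi = 0, len(bits)
    if ignore_leading:
        while lo < hi and bits[lo] == 1:
            lo += 1
    if ignore_trailing:
        while hi > lo and bits[hi - 1] == 1:
            hi -= 1
    return sum(bits[lo:hi])


def shouji_accepts(bits, E, **end_options):
    """Accept the pair when the estimated number of edits does not exceed E."""
    return count_edits(bits, **end_options) <= E


vector = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]  # bit-vector from Figure 1
print(shouji_accepts(vector, E=3))                        # global: 3 ones
print(shouji_accepts(vector, E=2))                        # global: rejected
print(shouji_accepts(vector, E=2, ignore_trailing=True))  # one end ignored
```

Setting both flags corresponds to the local case, setting exactly one corresponds to the glocal case, and setting neither corresponds to the end-to-end case.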
2.3 Accelerator architecture
Our second aim is to substantially accelerate Shouji, by leveraging
the parallelism of FPGAs. In this section, we present our hardware
accelerator that is designed to exploit the large amounts of parallel-
ism offered by modern FPGA architectures (Aluru and Jammula,
2014; Herbordt et al., 2007; Trimberger, 2015). We then outline the
implementation of Shouji to be used in our accelerator design.
Figure 2 shows the hardware architecture of the accelerator. It con-
tains a user-configurable number of filtering units. Each filtering
unit provides pre-alignment filtering independently from other units.
The workflow of the accelerator starts with transmitting the se-
quence pair to the FPGA through the fastest communication
medium available on the FPGA board (i.e. PCIe). The sequence con-
troller manages and provides the necessary input signals for each fil-
tering unit in the accelerator. Each filtering unit requires two
sequences of the same length and an edit distance threshold. The re-
sult controller gathers the output result (i.e. a single bit of value ‘1’
for similar sequences and ‘0’ for dissimilar sequences) of each filter-
ing unit and transmits them back to the host side in the same order
as their sequences are transmitted to the FPGAs.
The host-FPGA communication is achieved using RIFFA 2.2
(Jacobsen et al., 2015). To make the best use of the available resources in
the FPGA chip, our algorithm utilizes the operations that are easily sup-
ported on an FPGA, such as bitwise operations, bit shifts and bit count.
To build the neighborhood map on the FPGA, we use the observation
that the main diagonal can be implemented using a bitwise XOR oper-
ation between the two given sequences. The upper E diagonals can be
implemented by gradually shifting the pattern (P) to the right-hand direc-
tion and then performing bitwise XOR with the text (T). This allows
each character of P to be compared with the right-hand neighbor characters
(up to E characters) of its corresponding character of T. The lower E
diagonals can be implemented in a way similar to the upper E diagonals,
but here the shift operation is performed in the left-hand direction. This
ensures that each character of P is compared with the left-hand neighbor
characters (up to E characters) of its corresponding character of T.
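The shift-and-XOR construction can be mimicked in software with Python integers. The sketch below is an assumption-laden illustration rather than the paper's hardware design: the 2-bit base encoding, the names `encode` and `diagonal_mask`, and the choice to mark the vacated positions as mismatches are all hypothetical. A positive `shift` moves the pattern to the right (an upper diagonal) and a negative `shift` moves it to the left (a lower diagonal).

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit encoding


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def diagonal_mask(pattern: str, text: str, shift: int):
    """Per-base mismatch bits for one diagonal, built with shift and XOR."""
    m = len(text)
    full = (1 << (2 * m)) - 1
    p, t = encode(pattern), encode(text)
    shifted = p >> (2 * shift) if shift >= 0 else (p << (-2 * shift)) & full
    diff = shifted ^ t  # a nonzero 2-bit pair marks a mismatching base
    bits = []
    for pos in range(m):
        pair = (diff >> (2 * (m - 1 - pos))) & 0b11
        vacated = not (0 <= pos - shift < m)  # no pattern base reached this slot
        bits.append(1 if pair or vacated else 0)
    return bits


P, T = "GGTGAGAGTTGT", "GGTGCAGAGCTC"
for s in (0, 1, -1):
    print(s, "".join(map(str, diagonal_mask(P, T, s))))
```

Without the `vacated` check, the zeros shifted in at the ends would be read as the base encoded by 00 and could produce spurious matches.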
We also build an efficient hardware architecture for each search
window of the Shouji algorithm. It quickly finds the number of zeros in
each 4-bit vector using a hardware look-up table that stores the 16 pos-
sible permutations of a 4-bit vector along with the number of zeros for
each permutation. We present the block diagram of the search window
architecture in Supplementary Materials, Section 6.3. Our hardware
implementation of the Shouji filtering unit is independent of the specific
FPGA-platform as it does not rely on any vendor-specific computing
elements (e.g. intellectual property cores). However, each FPGA board
has different resources and hardware capabilities that can directly or in-
directly affect the performance and the data throughput of the design.
The maximum data throughput of the design and the available FPGA
resources determine the number of filtering units in the accelerator.
Thus, if the memory bandwidth is saturated, for example, then increasing the
number of filtering units would not improve performance.
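The look-up table mentioned above for counting zeros in a 4-bit vector can be pictured with a small software analogue. This is only a sketch of the idea; the name `ZERO_COUNT_LUT` is hypothetical, and the actual hardware table is described in the paper's Supplementary Materials.

```python
# One entry per possible 4-bit vector: index = vector value, entry = number of zeros.
ZERO_COUNT_LUT = [4 - bin(value).count("1") for value in range(16)]


def zeros_in(vector_bits: str) -> int:
    """Look up the number of zeros for a 4-character bit string such as '0110'."""
    return ZERO_COUNT_LUT[int(vector_bits, 2)]


print(zeros_in("0110"))
```

Because the window width is fixed at design time, the table has exactly 16 entries no matter how long the sequences are.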
3 Results
In this section, we evaluate (i) the filtering accuracy, (ii) the FPGA
resource utilization, (iii) the execution time of Shouji, our hardware
implementation of MAGNET (Alser et al., 2017b), GateKeeper
(Alser et al., 2017a) and SHD (Xin et al., 2015), (iv) the benefits of
the pre-alignment filters together with state-of-the-art aligners and
(v) the benefits of Shouji together with state-of-the-art read map-
pers. As we mention in Section 1, MAGNET leads to a small num-
ber of falsely accepted sequence pairs but suffers from poor
performance. We comprehensively explore this algorithm and pro-
vide an efficient and fast hardware implementation of MAGNET in
Supplementary Materials, Section 7. We run all experiments using a
3.6 GHz Intel i7-3820 CPU with 8 GB RAM. We use a Xilinx
Virtex 7 VC709 board (Xilinx, 2014) to implement our accelerator
architecture (for both Shouji and MAGNET). We build the FPGA
design using Vivado 2015.4 in synthesizable Verilog.
3.1 Dataset description
Our experimental evaluation uses 12 different real datasets. Each
dataset contains 30 million real sequence pairs. We obtain three dif-
ferent read sets (ERR240727_1, SRR826460_1 and SRR826471_1)
of the whole human genome that include three different read lengths
(100, 150 and 250 bp). We download these three read sets from
EMBL-ENA (www.ebi.ac.uk/ena). We map each read set to the
human reference genome (GRCh37) using the mrFAST (Alkan et al.,
2009) mapper. We obtain the human reference genome from the
1000 Genomes Project (1000 Genomes Project Consortium, 2012).
For each read set, we use four different maximum numbers of
allowed edits using the -e parameter of mrFAST to generate four
real datasets. Each dataset contains the sequence pairs that are gen-
erated by the mrFAST mapper before the read alignment step. This
enables us to measure the effectiveness of the filters using both
aligned and unaligned sequences over a wide range of edit distance
thresholds. We summarize the details of these 12 datasets in
Supplementary Materials, Section 9. For the reader’s convenience,
when referring to these datasets, we number them from 1 to 12 (e.g.
set_1 to set_12). We use Edlib (Šošić and Šikić, 2017) to generate
the ground truth edit distance value for each sequence pair.
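As a rough illustration of what a ground-truth edit distance is, the sketch below computes it with a plain dynamic-programming recurrence. It is a stand-in written for this example, not Edlib's implementation or API, and the helper names `edit_distance` and `is_similar` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact edit distance (substitutions, insertions, deletions), two-row DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def is_similar(read: str, reference: str, threshold: int) -> bool:
    """A pair is similar when its edit distance does not exceed the threshold E."""
    return edit_distance(read, reference) <= threshold


print(edit_distance("ACGTACGT", "ACGAACGT"))  # one substitution
```

A filter's decision for each pair can then be compared against `is_similar` to decide whether an accept or a reject was correct.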
3.2 Filtering accuracy
We evaluate the accuracy of a pre-alignment filter by computing its
false accept rate and false reject rate. We first assess the false accept
rate of Shouji, MAGNET (Alser et al., 2017b), SHD (Xin et al.,
2015) and GateKeeper (Alser et al., 2017a) across different edit distance
thresholds and datasets. The false accept rate is the ratio of the
number of dissimilar sequences that are falsely accepted by the filter
to the number of dissimilar sequences that are rejected by the optimal
sequence alignment algorithm. We aim to minimize the false
accept rate to maximize the number of dissimilar sequences that are
eliminated. In Figure 3, we provide the false accept rate of the four
filters across our 12 datasets and edit distance thresholds of 0–10%
of the sequence length (we provide the exact values in Section 10 in
Supplementary Materials).
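The false accept rate defined above can be written down in a few lines. The sketch below is only illustrative: the function names are hypothetical, and rounding the percentage threshold down to an integer number of edits is an assumption made for the example.

```python
def edit_threshold(read_length: int, percent: int) -> int:
    """Convert a threshold given as a percentage of the read length into edits.

    Assumption: fractional values are rounded down.
    """
    return (read_length * percent) // 100


def false_accept_rate(true_distances, accepted, threshold: int) -> float:
    """Falsely accepted dissimilar pairs divided by all dissimilar pairs.

    true_distances: ground-truth edit distance of each pair.
    accepted: filter decision for each pair (True means the pair passes).
    """
    dissimilar = [ok for d, ok in zip(true_distances, accepted) if d > threshold]
    if not dissimilar:
        return 0.0  # no dissimilar pairs, so nothing can be falsely accepted
    return sum(dissimilar) / len(dissimilar)


# Four pairs, E = 5: two are dissimilar and the filter lets one of them through.
print(false_accept_rate([0, 3, 6, 9], [True, True, True, False], 5))
```

A perfect filter would return 0.0 here, while a filter that accepts every pair would return 1.0.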
Based on Figure 3, we make four key observations. (i) We ob-
serve that Shouji, MAGNET, SHD and GateKeeper are less accurate
in examining the low-edit sequences (i.e. datasets 1, 2, 5, 6, 9 and
10) than the high-edit sequences (i.e. datasets 3, 4, 7, 8, 11 and 12).
(ii) SHD (Xin et al., 2015) and GateKeeper (Alser et al., 2017a)
become ineffective for edit distance thresholds of >8% (E = 8), 5%
(E = 7) and 3% (E = 7) for sequence lengths of 100, 150 and 250
characters, respectively. This causes them to examine each sequence
pair unnecessarily twice (i.e. once by GateKeeper or SHD and once
by the alignment algorithm). (iii) For high-edit datasets, Shouji provides
up to 17.2×, 73× and 467× (2.4×, 2.7× and 38× for low-edit
Fig. 2. Overview of our hardware accelerator architecture. The filtering units
can be replicated as many times as possible based on the resources available
on the FPGA

Frequently Asked Questions (18)
Q1. What contributions have the authors mentioned in the paper "Shouji: a fast and efficient pre-alignment filter for sequence alignment" ?

In this work, the authors explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. The authors introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of their proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of their algorithm.

Another potential target of their research is to explore the possibility of accelerating optimal alignment calculations for longer sequences (few tens of thousands of characters) (Senol et al., 2018) using pre-alignment filtering.

Hardware accelerators include multi-core and single instruction multiple data (SIMD) capable central processing units (CPUs), graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). 

The backtracking step required for the optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation. 

(i) The design for a single MAGNET filtering unit requires about 10.5 and 37.8% of the available LUTs for edit distance thresholds of 2 and 5, respectively. 

To make the best use of the available resources in the FPGA chip, their algorithm utilizes the operations that are easily supported on an FPGA, such as bitwise operations, bit shifts and bit count. 

One of the most fundamental computational steps in most bioinformatics analyses is the detection of the differences/similarities between two genomic sequences. 

The multi-core architecture of CPUs and GPUs provides the ability to compute alignments of many sequence pairs independently and concurrently (Georganas et al., 2015; Liu and Schmidt, 2015). 

The workflow of the accelerator starts with transmitting the sequence pair to the FPGA through the fastest communication medium available on the FPGA board (i.e. PCIe).

Enumerating all possible prefixes is necessary for tolerating edits that result from both sequencing errors (Fox et al., 2014) and genetic variations (McKernan et al., 2009). 

The upper E diagonals can be implemented by gradually shifting the pattern (P) to the right-hand direction and then performing bitwise XOR with the text (T). 

(iv) MAGNET shows up to 1577×, 3550× and 25 552× lower false accept rates for high-edit datasets (3.5×, 14.7× and 135× for low-edit datasets) compared to GateKeeper and SHD for sequence lengths of 100, 150 and 250 characters, respectively.

The authors observe that if the execution time of the aligner is much larger than that of the pre-alignment filter (which is the case for Edlib, Parasail and GSWABE for E = 5 characters), then MAGNET provides up to 1.3× more end-to-end speedup over Shouji.

MAGNET also shows up to 205×, 951× and 16 760× lower false accept rates for high-edit datasets (2.7×, 10× and 88× for low-edit datasets) over Shouji for sequence lengths of 100, 150 and 250 characters, respectively.

One way of finding these common subsequences is to use a brute-force approach, which examines all the streaks of diagonally consecutive zeros that start at the first column and selects the streak that has the largest number of zeros as the first common subsequences. 

Shouji improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the best-performing existing pre-alignment filter, GateKeeper. 

It quickly finds the number of zeros in each 4-bit vector using a hardware look-up table that stores the 16 possible permutations of a 4-bit vector along with the number of zeros for each permutation. 

Shouji offers the ability to make the best use of existing aligners without sacrificing any of their capabilities, as it does not modify or replace the alignment step.
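
Two of the operations mentioned above, building an upper diagonal by shifting the pattern to the right and XOR-ing it with the text, and counting the zeros of a 4-bit vector with a 16-entry look-up table, can be mimicked in software. The sketch below is a simplified illustration, not the Shouji hardware design. The 2-bit base encoding, the function names and the choice to treat positions that have no pattern character as mismatches are all assumptions.

```python
# 2-bit base encoding; the XOR of two codes is zero exactly when the bases match.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# Look-up table: ZEROS_IN_NIBBLE[v] is the number of 0 bits in the 4-bit value v.
ZEROS_IN_NIBBLE = [4 - bin(v).count("1") for v in range(16)]


def upper_diagonal(text: str, pattern: str, shift: int) -> list:
    """Bit-vector for one upper diagonal: 0 marks a match, 1 marks a mismatch."""
    bits = []
    for i, t in enumerate(text):
        j = i - shift  # pattern index after shifting the pattern right by `shift`
        if j < 0:
            bits.append(1)  # assumption: no pattern character here, count a mismatch
        else:
            bits.append(1 if BASE_CODE[t] ^ BASE_CODE[pattern[j]] else 0)
    return bits


def zeros_in_window(bits: list, start: int) -> int:
    """Count the zeros in the 4-bit window starting at `start` via the table."""
    window = bits[start:start + 4]
    window = window + [1] * (4 - len(window))  # pad short windows with mismatches
    value = 0
    for b in window:
        value = (value << 1) | b
    return ZEROS_IN_NIBBLE[value]


text, pattern = "AACGTACG", "ACGTACGT"  # the text has an extra leading 'A'
print(upper_diagonal(text, pattern, 0))
print(upper_diagonal(text, pattern, 1))
```

In this example the unshifted diagonal is mostly mismatches, while the diagonal with a shift of one recovers the long run of matches that the extra leading character had displaced.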