What contributions have the authors mentioned in the paper "Shouji: a fast and efficient pre-alignment filter for sequence alignment" ?

In this work, the authors explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. The authors introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of their proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of their algorithm.

What have the authors stated for future works in "Shouji: a fast and efficient pre-alignment filter for sequence alignment" ?

Another potential target of their research is to explore the possibility of accelerating optimal alignment calculations for longer sequences (few tens of thousands of characters) (Senol et al., 2018) using pre-alignment filtering.

How many LUTs are required for a single MAGNET filtering unit?

The design for a single MAGNET filtering unit requires about 10.5 and 37.8% of the available LUTs for edit distance thresholds of 2 and 5, respectively.

How many false accept rates does MAGNET show for high-edit datasets?

MAGNET shows up to 1577×, 3550× and 25 552× lower false accept rates for high-edit datasets (3.5×, 14.7× and 135× for low-edit datasets) compared to GateKeeper and SHD for sequence lengths of 100, 150 and 250 characters, respectively.

What is the effect of MAGNET on the execution time of the aligner?

The authors observe that if the execution time of the aligner is much larger than that of the pre-alignment filter (which is the case for Edlib, Parasail and GSWABE for E = 5 characters), then MAGNET provides up to 1.3× more end-to-end speedup over Shouji.

How many false accept rates does MAGNET show for high-edit datasets over Shouji?

MAGNET also shows up to 205×, 951× and 16 760× lower false accept rates for high-edit datasets (2.7×, 10× and 88× for low-edit datasets) over Shouji for sequence lengths of 100, 150 and 250 characters, respectively.

How does Shouji improve the accuracy of pre-alignment filtering?

Shouji improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the best-performing existing pre-alignment filter, GateKeeper.

What is the way to make the use of existing aligners?

Shouji offers the ability to make the best use of existing aligners without sacrificing any of their capabilities, as it does not modify or replace the alignment step.

(Open Access) Shouji: a fast and efficient pre-alignment filter for sequence alignment (2019) | Mohammed Alser

Q: What is the way to compute a sequence pair?

The multi-core architecture of CPUs and GPUs provides the ability to compute alignments of many sequence pairs independently and concurrently (Georganas et al., 2015; Liu and Schmidt, 2015).

Sequence analysis

Shouji: a fast and efficient pre-alignment filter for sequence alignment

Mohammed Alser 1,2,3,*, Hasan Hassan, Akash Kumar, Onur Mutlu 1,3 and Can Alkan

Computer Science Department, ETH Zürich, Zürich 8092, Switzerland, Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, 01062 Dresden, Germany and Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey

*To whom correspondence should be addressed.

Associate Editor: Inanc Birol

Received on September 13, 2018; revised on February 27, 2019; editorial decision on March 7, 2019; accepted on March 27, 2019

Abstract

Motivation: The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm.

Results: Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step.

Availability and implementation: https://github.com/CMU-SAFARI/Shouji.

Contact: mohammed.alser@inf.ethz.ch or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr

Supplementary information: Supplementary data are available at Bioinformatics online.

Bioinformatics, 35(21), 2019, 4255–4263

doi: 10.1093/bioinformatics/btz234

Advance Access Publication Date: 28 March 2019

Original Paper

1 Introduction

One of the most fundamental computational steps in most bioinformatics analyses is the detection of the differences/similarities between two genomic sequences. Edit distance and pairwise alignment are two approaches to achieve this step, formulated as approximate string matching (Navarro, 2001). The edit distance approach is a measure of how much two sequences differ. It calculates the minimum number of edits needed to convert one sequence into the other. The higher the edit distance, the more the sequences differ from one another. Commonly allowed edit operations include deletion, insertion and substitution of characters in one or both sequences. Pairwise alignment is a measure of how much the sequences are alike. It calculates the alignment, which is an ordered list of characters representing possible edit operations and matches required to change one of the two given sequences into the other. As any two sequences can have several different arrangements of the edit operations and matches (and hence different alignments), the alignment algorithm usually involves a backtracking step. This step finds the alignment that has the highest alignment score (called the optimal alignment). The alignment score is the sum of the scores of all edits and matches along the alignment implied by a user-defined scoring function. The edit distance and pairwise alignment approaches are non-additive measures (Calude et al., 2002). This means that if we divide the sequence pair into two consecutive subsequence pairs, the edit distance of the entire sequence pair is not necessarily equivalent to the sum of the edit distances of the shorter pairs. Instead, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide an optimal solution. Enumerating all possible prefixes is necessary for tolerating edits that result from both sequencing errors (Fox et al., 2014) and genetic variations (McKernan et al., 2009). Therefore, the edit distance and pairwise alignment approaches are typically implemented as dynamic programming algorithms to avoid re-examining the same prefixes many times. These implementations, such as Levenshtein distance (Levenshtein, 1966), Smith–Waterman (Smith and Waterman, 1981) and Needleman–Wunsch (Needleman and Wunsch, 1970), are inefficient as they have quadratic time and space complexity [i.e. $O(m^2)$ for a sequence length of m]. Many attempts were made to boost the performance of existing sequence aligners. Despite more than three decades of attempts, the fastest known edit distance algorithm (Masek and Paterson, 1980) has a running time of $O(m^2/\log_2 m)$ for sequences of length m, which is still nearly quadratic (Backurs and Indyk, 2017). Therefore, more recent works tend to follow one of two key new directions to boost the performance of sequence alignment and edit distance implementations: (i) accelerating the dynamic programming algorithms using hardware accelerators and (ii) developing filtering heuristics that reduce the need for the dynamic programming algorithms, given an edit distance threshold.
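The quadratic cost mentioned above can be seen in a minimal sketch of the textbook Levenshtein recurrence. This is an illustration only, not code from any of the cited tools; the function name `edit_distance` is hypothetical. It keeps two rows of the matrix, so it uses linear space, but it still visits every one of the (m + 1) × (m + 1) prefix pairs. Recovering the alignment itself through backtracking would require keeping the whole matrix.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic dynamic programming recurrence."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # empty prefix of a versus every prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # a[:i] versus the empty prefix of b: i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Every cell depends on its left, upper and upper-left neighbors, which is why the inner loop cannot simply be run for all cells at once.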

Hardware accelerators are becoming increasingly popular for speeding up the computationally expensive alignment and edit distance algorithms (Al Kawam et al., 2017; Aluru and Jammula, 2014; Ng et al., 2017; Sandes et al., 2016). Hardware accelerators include multi-core and single instruction multiple data (SIMD) capable central processing units (CPUs), graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). The classical dynamic programming algorithms are typically accelerated by computing only the necessary regions (i.e. diagonal vectors) of the dynamic programming matrix rather than the entire matrix, as proposed in Ukkonen’s banded algorithm (Ukkonen, 1985). The number of the diagonal bands required for computing the dynamic programming matrix is 2E + 1, where E is a user-defined edit distance threshold. The banded algorithm is still beneficial even with its recent sequential implementations as in Edlib (Šošić and Šikić, 2017). The Edlib algorithm is implemented in C for standard CPUs and it calculates the banded Levenshtein distance. Parasail (Daily, 2016) exploits both Ukkonen’s banded algorithm and SIMD-capable CPUs to compute a banded alignment for a sequence pair with a user-defined scoring function. SIMD instructions offer significant parallelism to the matrix computation by executing the same vector operation on multiple operands at once. The multi-core architecture of CPUs and GPUs provides the ability to compute alignments of many sequence pairs independently and concurrently (Georganas et al., 2015; Liu and Schmidt, 2015). GSWABE (Liu and Schmidt, 2015) exploits GPUs (Tesla K40) for highly parallel computation of global alignment with a user-defined scoring function. CUDASW++ 3.0 (Liu et al., 2013) exploits the SIMD capability of both CPUs and GPUs (GTX690) to accelerate the computation of the Smith–Waterman algorithm with a user-defined scoring function. CUDASW++ 3.0 provides only the optimal score, not the optimal alignment (i.e. no backtracking step). Other designs, for instance FPGASW (Fei et al., 2018), exploit the very large number of hardware execution units in FPGAs (Xilinx VC707) to form a linear systolic array (Kung, 1982). Each execution unit in the systolic array is responsible for computing the value of a single entry of the dynamic programming matrix. The systolic array computes a single vector of the matrix at a time. The data dependencies between the entries restrict the systolic array to computing the vectors sequentially (e.g. top-to-bottom, left-to-right or in an anti-diagonal manner). FPGA accelerators seem to yield the highest performance gain compared to the other hardware accelerators (Banerjee et al., 2018; Chen et al., 2016; Fei et al., 2018; Waidyasooriya and Hariyama, 2015). However, many of these efforts either simplify the scoring function, or only take into account accelerating the computation of the dynamic programming matrix without providing the optimal alignment as in Chen et al. (2014), Liu et al. (2013) and Nishimura et al. (2017). Different and more sophisticated scoring functions are typically needed to better quantify the similarity between two sequences (Henikoff and Henikoff, 1992; Wang et al., 2011). The backtracking step required for the optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation.
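To illustrate the banded idea described above, the following sketch restricts the Levenshtein recurrence to the 2E + 1 diagonals around the main diagonal. It is a simplified illustration written for this text, not the Edlib or Parasail implementation; the name `banded_edit_distance` and the choice to return `None` when the threshold is exceeded are assumptions.

```python
def banded_edit_distance(a, b, e):
    """Edit distance restricted to the 2e + 1 central diagonals.

    Returns the distance if it is at most e, otherwise None.
    """
    m, n = len(a), len(b)
    if abs(m - n) > e:
        return None  # the final cell lies outside the band
    inf = e + 1  # any value above e means "more edits than allowed"
    # prev maps column j -> distance for row i - 1, only for cells inside the band.
    prev = {j: j for j in range(min(n, e) + 1)}
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - e), min(n, i + e) + 1):
            if j == 0:
                cur[j] = i  # a[:i] versus the empty prefix of b
                continue
            cur[j] = min(
                prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),  # match or substitution
                prev.get(j, inf) + 1,                           # deletion
                cur.get(j - 1, inf) + 1,                        # insertion
                inf,                                            # cap values outside the threshold
            )
        prev = cur
    distance = prev.get(n, inf)
    return distance if distance <= e else None


print(banded_edit_distance("kitten", "sitting", 3))
print(banded_edit_distance("kitten", "sitting", 2))
```

Each row touches at most 2E + 1 cells, so the work grows with m × (2E + 1) rather than with m × m.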

Pre-alignment filtering heuristics aim to quickly eliminate some of the dissimilar sequences before using the computationally expensive optimal alignment algorithms. There are a few existing filtering techniques, such as the Adjacency Filter (Xin et al., 2013), which is implemented for standard CPUs as part of FastHASH (Xin et al., 2013). SHD (Xin et al., 2015) is a SIMD-friendly bit-vector filter that provides higher filtering accuracy compared to the Adjacency Filter. GRIM-Filter (Kim et al., 2018) exploits the high memory bandwidth and the logic layer of 3D-stacked memory to perform highly-parallel filtering in the DRAM chip itself. GateKeeper (Alser et al., 2017a) is designed to utilize the large amounts of parallelism offered by FPGA architectures. MAGNET (Alser et al., 2017b) shows a low number of falsely accepted sequence pairs but its current implementation is much slower than that of SHD or GateKeeper. GateKeeper (Alser et al., 2017a) provides a high filtering speed but suffers from a relatively high number of falsely accepted sequence pairs.

Our goal in this work is to significantly reduce the time spent on calculating the optimal alignment of short sequences and maintain high filtering accuracy. To this end, we introduce Shouji (named after a traditional Japanese door that is designed to slide open, http://www.aisf.or.jp/jaanus/deta/s/shouji.htm), a new, fast and very accurate pre-alignment filter. Shouji is based on two key ideas: (i) a new filtering algorithm that remarkably reduces the need for computationally expensive banded optimal alignment by rapidly excluding dissimilar sequences from the optimal alignment calculation and (ii) judicious use of the parallelism-friendly architecture of modern FPGAs to greatly speed up this new filtering algorithm.

The contributions of this paper are as follows:

• We introduce Shouji, a highly parallel and highly accurate pre-alignment filter, which uses a sliding search window approach to quickly identify dissimilar sequences without the need for computationally expensive alignment algorithms. We overcome the implementation limitations of MAGNET (Alser et al., 2017b). We build two hardware accelerator designs that adopt modern FPGA architectures to boost the performance of both Shouji and MAGNET.

• We provide a comprehensive analysis of the run time and space complexity of Shouji and MAGNET algorithms. Shouji and MAGNET are asymptotically inexpensive and run in linear time with respect to the sequence length and the edit distance threshold.

• We demonstrate that Shouji and MAGNET significantly improve the accuracy of pre-alignment filtering by up to two and four orders of magnitude, respectively, compared to GateKeeper and SHD.

• We demonstrate that our FPGA implementations of Shouji and MAGNET are two to three orders of magnitude faster than their CPU implementations. We demonstrate that integrating Shouji with five state-of-the-art aligners reduces the execution time of the sequence aligner by up to 18.8×.

2 Materials and methods

2.1 Overview

Our goal is to quickly reject dissimilar sequences with high accuracy such that we reduce the need for the computationally-costly alignment step. To this end, we propose the Shouji algorithm to achieve highly accurate filtering. Then, we accelerate Shouji by taking advantage of the parallelism of FPGAs to achieve fast filtering operations. The key filtering strategy of Shouji is inspired by the pigeonhole principle, which states that if E items are distributed into E + 1 boxes, then one or more boxes would remain empty. In the context of pre-alignment filtering, this principle provides the following key observation: if two sequences differ by E edits, then the two sequences should share at least a single common subsequence (i.e. free of edits) and at most E + 1 non-overlapping common subsequences, where E is the edit distance threshold. With the existence of at most E edits, the total length of these non-overlapping common subsequences should not be less than m − E, where m is the sequence length. Shouji employs the pigeonhole principle to decide whether or not two sequences are potentially similar. Shouji finds all the non-overlapping subsequences that exist in both sequences. If the total length of these common subsequences is less than m − E, then there exist more edits than the allowed edit distance threshold, and hence Shouji rejects the two given sequences. Otherwise, Shouji accepts the two sequences. Next, we discuss the details of Shouji.
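The accept/reject rule above reduces to a single comparison. The sketch below only illustrates that arithmetic; the helper name `passes_pigeonhole` is hypothetical, and the subsequence lengths are assumed to have been found already. For the pair shown in Figure 1 (m = 12, E = 3), the common subsequences GGTG, AGAG and T have a total length of 4 + 4 + 1 = 9, which is not less than m − E = 9, so the pair is accepted.

```python
def passes_pigeonhole(common_lengths, m, e):
    """Accept a pair when the non-overlapping common subsequences cover at least m - e characters."""
    return sum(common_lengths) >= m - e


# Lengths of GGTG, AGAG and T from the Figure 1 example, with m = 12 and E = 3.
print(passes_pigeonhole([4, 4, 1], m=12, e=3))
```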

2.2 Shouji pre-alignment filter

Shouji identifies the dissimilar sequences, without calculating the optimal alignment, in three main steps. (i) The first step is to construct what we call a neighborhood map that visualizes the pairwise matches and mismatches between two sequences given an edit distance threshold of E characters. (ii) The second step is to find all the non-overlapping common subsequences in the neighborhood map using a sliding search window approach. (iii) The last step is to accept or reject the given sequence pairs based on the length of the found matches. If the length of the found matches is small, then Shouji rejects the input sequence pair.

2.2.1 Building the neighborhood map

The neighborhood map, N, is a binary m by m matrix, where m is the sequence length. Given a text sequence T[1...m], a pattern sequence P[1...m], and an edit distance threshold E, the neighborhood map represents the comparison result of the ith character of P with the jth character of T, where i and j satisfy 1 ≤ i ≤ m and i − E ≤ j ≤ i + E. The entry N[i, j] of the neighborhood map can be calculated as follows:

$$N[i, j] = \begin{cases} 0, & \text{if } P[i] = T[j] \\ 1, & \text{if } P[i] \neq T[j] \end{cases} \qquad (1)$$

We present in Figure 1 an example of a neighborhood map for two sequences, where a pattern P differs from a text T by three edits.

The entry N[i, j] is set to zero if the ith character of the pattern matches the jth character of the text. Otherwise, it is set to one. The way we build our neighborhood map ensures that computing each of its entries is independent of every other, and thus the entire map can be computed all at once in a parallel fashion. Hence, our neighborhood map is well suited for highly parallel computing platforms (Alser et al., 2017a; Seshadri et al., 2017). Note that in sequence alignment algorithms, computing each entry of the dynamic programming matrix depends on the values of the immediate left, upper left and upper entries of its own. Different from ‘dot plot’ or ‘dot matrix’ (visual representation of the similarities between two closely similar genomic sequences) that is used in FASTA/FASTP (Lipman and Pearson, 1985), our neighborhood map computes only necessary diagonals near the main diagonal of the matrix (e.g. seven diagonals shown in Fig. 1).
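Equation (1) can be sketched directly in software. The function below is an illustrative sketch, not the authors' implementation; the name `neighborhood_map` is hypothetical, indices are 0-based, and both sequences are assumed to have the same length. It returns, for each pattern position i, only the entries inside the band i − E ≤ j ≤ i + E, and is applied here to the text and pattern of Figure 1.

```python
def neighborhood_map(text, pattern, e):
    """Banded neighborhood map: row i holds N[i, j] for max(0, i - e) <= j <= min(m - 1, i + e)."""
    m = len(pattern)
    rows = []
    for i in range(m):
        lo, hi = max(0, i - e), min(m - 1, i + e)
        # 0 marks a match between pattern[i] and text[j], 1 marks a mismatch.
        rows.append([0 if pattern[i] == text[j] else 1 for j in range(lo, hi + 1)])
    return rows


# Sequences from Figure 1, with E = 3 (at most 2E + 1 = 7 entries per row).
for row in neighborhood_map("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3):
    print("".join(map(str, row)))
```

No entry depends on any other, so the two loops could run in any order or fully in parallel, which is the property the text relies on for hardware acceleration.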

[Figure 1: neighborhood map over text columns j = 1 to 12 (GGTGCAGAGCTC), the sliding search windows, and the resulting Shouji bit-vector 000010000101.]

Fig. 1. Neighborhood map (N) and the Shouji bit-vector, for text T = GGTGCAGAGCTC and pattern P = GGTGAGAGTTGT for E = 3. The three common subsequences (i.e. GGTG, AGAG and T) are highlighted in gray. We use a search window of size four columns (two examples of which are highlighted in black) with a step size of a single column. Shouji searches diagonally within each search window for the 4-bit vector that has the largest number of zeros. Once found, Shouji examines if the found 4-bit vector maximizes the number of zeros at the corresponding location of the 4-bit vector in the Shouji bit-vector. If so, then Shouji stores this 4-bit vector in the Shouji bit-vector at its corresponding location.


2.2.2 Identifying the diagonally consecutive matches

The key goal of this step is to accurately find all the non-overlapping common subsequences shared between a pair of sequences. The accuracy of finding these subsequences is crucial for the overall filtering accuracy, as the filtering decision is made solely based on total subsequence length. With the existence of E edits, there are at most E + 1 non-overlapping common subsequences (based on the pigeonhole principle) shared between a pair of sequences. Each non-overlapping common subsequence is represented as a streak of diagonally consecutive zeros in the neighborhood map (as highlighted in yellow in Fig. 1). These streaks of diagonally consecutive zeros are distributed along the diagonals of the neighborhood map without any prior information about their length or number. One way of finding these common subsequences is to use a brute-force approach, which examines all the streaks of diagonally consecutive zeros that start at the first column and selects the streak that has the largest number of zeros as the first common subsequence. It then iterates over the remaining part of the neighborhood map to find the other common subsequences. However, this brute-force approach is infeasible for highly optimized hardware implementation as the search space is unknown at design time. Shouji overcomes this issue by dividing the neighborhood map into equal-size parts. We call each part a search window. Limiting the size of the search space from the entire neighborhood map to a search window has three key benefits. (i) It helps to provide a scalable architecture that can be implemented for any sequence length and edit distance threshold. (ii) Downsizing the search space into a reasonably small sub-matrix with a known dimension at design time limits the number of all possible permutations of each bit-vector to $2^n$, where n is the search window size. This reduces the size of the look-up tables (LUTs) required for an FPGA implementation and simplifies the overall design. (iii) Each search window is considered as a smaller sub-problem that can be solved independently and rapidly with high parallelism. Shouji uses a search window that is four columns wide, as we illustrate in Figure 1. We need m search windows for processing two sequences, each of which is of length m characters. Each search window overlaps with its next neighboring search window by three columns. This ensures covering the entire neighborhood map and finding all the common subsequences regardless of their starting location. We select the width of each search window to be four columns to guarantee finding the shortest possible common subsequence, which is a single match located between two mismatches (i.e. ‘101’). However, we observe that the bit pattern ‘101’ is not always necessarily a part of the correct alignment (or the common subsequences). For example, the bit pattern ‘101’ exists once as a part of the correct alignment in Figure 1, but it also appears five times in other different locations that are not included in the correct alignment. To improve the accuracy of finding the diagonally consecutive matches, we increase the length of the diagonal vector to be examined to four bits. We also experimentally evaluate different search window sizes in Supplementary Materials, Section 6.1. We find that a search window size of four columns provides the highest filtering accuracy without falsely rejecting similar sequences.

Shouji finds the diagonally consecutive matches that are part of the common subsequences in the neighborhood map in two main steps. Step 1: for each search window, Shouji finds a 4-bit diagonal vector that has the largest number of zeros. Shouji greedily considers this vector as a part of the common subsequence as it has the least possible number of edits (i.e. 1’s). Always finding the maximum number of matches is necessary to avoid overestimating the actual number of edits and thus to preserve all similar sequences. Shouji achieves this step by comparing the 4 bits of each of the 2E + 1 diagonal vectors within a search window and selecting the 4-bit vector that has the largest number of zeros. In the case where two 4-bit subsequences have the same number of zeros, Shouji breaks the tie by selecting the first one that has a leading zero. Then, Shouji slides the search window by a single column (i.e. step size = 1 column) toward the last bottom right entry of the neighborhood map and repeats the previous computations. Thus, Shouji performs ‘Step 1’ m times using m search windows, where m is the sequence length. Step 2: the last step is to gather the results found for each search window (i.e. the 4-bit vector that has the largest number of zeros) and reconstruct all the diagonally consecutive matches. For this purpose, Shouji maintains a Shouji bit-vector of length m that stores all the zeros found in the neighborhood map as we illustrate in Figure 1. For each sliding search window, Shouji examines if the selected 4-bit vector maximizes the number of zeros in the Shouji bit-vector at the same corresponding location. If so, Shouji stores the selected 4-bit vector in the Shouji bit-vector at the same corresponding location. This is necessary to avoid overestimating the number of edits between two given sequences. The common subsequences are represented as streaks of consecutive zeros in the Shouji bit-vector.
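The two steps above can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the pseudocode from the Supplementary Materials: the function names are hypothetical, diagonals are indexed by text position, positions that fall outside the pattern are treated as mismatches, the diagonals are scanned from the lowest to the uppermost, and the last windows are simply truncated at the end of the sequence.

```python
def diagonals(text, pattern, e):
    """Return the 2e + 1 diagonal bit-vectors, each indexed by text position j."""
    m = len(text)
    result = []
    for k in range(-e, e + 1):  # k = j - i, from the lowest to the uppermost diagonal
        vec = []
        for j in range(m):
            i = j - k
            inside = 0 <= i < m
            # Assumption: entries outside the pattern count as mismatches (1).
            vec.append(0 if inside and pattern[i] == text[j] else 1)
        result.append(vec)
    return result


def shouji_bit_vector(text, pattern, e, width=4):
    """Slide a search window over the diagonals and keep the best 4-bit vectors."""
    m = len(text)
    diags = diagonals(text, pattern, e)
    shouji = [1] * m  # start from "all mismatches"
    for start in range(m):  # one search window per column, step size of one column
        window = [d[start:start + width] for d in diags]
        # Step 1: most zeros wins; ties go to the first vector with a leading zero.
        best = max(window, key=lambda v: (v.count(0), v[0] == 0))
        # Step 2: store it only if it increases the zeros at this location.
        if best.count(0) > shouji[start:start + width].count(0):
            shouji[start:start + width] = best
    return shouji


bits = shouji_bit_vector("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3)
print("".join(map(str, bits)))
```

For the sequences of Figure 1 this sketch yields a bit-vector with three ones, consistent with the three edits discussed in the text.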

2.2.3 Filtering out dissimilar sequences

The last step of Shouji is to calculate the total number of edits (i.e. ones) in the Shouji bit-vector. Shouji examines if the total number of ones in the Shouji bit-vector is greater than E. If so, Shouji excludes the two sequences from the optimal alignment calculation. Otherwise, Shouji considers the two sequences similar within the allowed edit distance threshold and allows their optimal alignment to be computed using optimal alignment algorithms. The Shouji bit-vector represents the differences between two sequences along the entire length of the sequence, m. However, Shouji is not limited to end-to-end edit distance calculation. Shouji is also able to provide edit distance calculation in local and glocal (semi-global) fashion. For example, achieving local edit distance calculation requires ignoring the ones that are located at the two ends of the Shouji bit-vector. We present an example of local edit distance between two sequences of different length in Supplementary Materials, Section 8. Achieving glocal edit distance calculation requires excluding the ones that are located at one of the two ends of the Shouji bit-vector from the total count of the ones in the Shouji bit-vector. This is important for correct pre-alignment filtering for global, local and glocal alignment algorithms. We provide the pseudocode of Shouji and discuss its computational complexity in Supplementary Materials, Section 6.2. We also present two examples of applying the Shouji filtering algorithm in Supplementary Materials, Section 8.
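The final decision can be sketched as a count of ones with optional trimming at the ends. The helper names `count_edits` and `accepts` and the two boolean flags are hypothetical and chosen for illustration: no trimming corresponds to the global case, trimming both ends to the local case, and trimming one end to the glocal case. Which end to trim in the glocal case is left to the caller, since the text does not specify it.

```python
def count_edits(bit_vector, ignore_leading=False, ignore_trailing=False):
    """Count the ones in a Shouji bit-vector, optionally ignoring runs of ones at the ends."""
    bits = "".join(str(b) for b in bit_vector)  # accepts a list of ints or a string
    if ignore_leading:
        bits = bits.lstrip("1")
    if ignore_trailing:
        bits = bits.rstrip("1")
    return bits.count("1")


def accepts(bit_vector, e, **trim):
    """Accept the pair when the counted edits do not exceed the threshold e."""
    return count_edits(bit_vector, **trim) <= e


vector = "000010000101"  # Shouji bit-vector of the Figure 1 example
print(count_edits(vector), accepts(vector, 3), accepts(vector, 2))
```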

2.3 Accelerator architecture

Our second aim is to substantially accelerate Shouji, by leveraging the parallelism of FPGAs. In this section, we present our hardware accelerator that is designed to exploit the large amounts of parallelism offered by modern FPGA architectures (Aluru and Jammula, 2014; Herbordt et al., 2007; Trimberger, 2015). We then outline the implementation of Shouji to be used in our accelerator design. Figure 2 shows the hardware architecture of the accelerator. It contains a user-configurable number of filtering units. Each filtering unit provides pre-alignment filtering independently from other units. The workflow of the accelerator starts with transmitting the sequence pair to the FPGA through the fastest communication medium available on the FPGA board (i.e. PCIe). The sequence controller manages and provides the necessary input signals for each filtering unit in the accelerator. Each filtering unit requires two sequences of the same length and an edit distance threshold. The result controller gathers the output result (i.e. a single bit of value ‘1’ for similar sequences and ‘0’ for dissimilar sequences) of each filtering unit and transmits them back to the host side in the same order as their sequences are transmitted to the FPGAs.

The host-FPGA communication is achieved using RIFFA 2.2 (Jacobsen et al., 2015). To make the best use of the available resources in the FPGA chip, our algorithm utilizes the operations that are easily supported on an FPGA, such as bitwise operations, bit shifts and bit count. To build the neighborhood map on the FPGA, we use the observation that the main diagonal can be implemented using a bitwise XOR operation between the two given sequences. The upper E diagonals can be implemented by gradually shifting the pattern (P) to the right-hand direction and then performing bitwise XOR with the text (T). This allows each character of P to be compared with the right-hand neighbor characters (up to E characters) of its corresponding character of T. The lower E diagonals can be implemented in a way similar to the upper E diagonals, but here the shift operation is performed in the left-hand direction. This ensures that each character of P is compared with the left-hand neighbor characters (up to E characters) of its corresponding character of T.
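The shift-and-XOR construction can be mimicked in software with arbitrary-precision integers. The sketch below is an illustration, not the Verilog design: the 2-bit encoding of A, C, G and T, the function names, and the choice to mark positions shifted in from outside the pattern as mismatches are all assumptions made for this example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding per base


def encode(seq):
    """Pack a DNA string into an integer, first character in the most significant bits."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODE[ch]
    return value


def diagonal(text, pattern, k):
    """Diagonal k of the neighborhood map: compares pattern[j - k] with text[j] for every j."""
    m = len(text)
    full = (1 << (2 * m)) - 1
    p = encode(pattern)
    # Upper diagonals (k > 0) shift the pattern right, lower ones (k < 0) shift it left.
    shifted = p >> (2 * k) if k >= 0 else (p << (2 * -k)) & full
    x = encode(text) ^ shifted           # non-zero 2-bit group means mismatch
    low = full // 3                      # 0b0101...01: low bit of every 2-bit group
    mismatch = (x | (x >> 1)) & low      # OR the two bits of each group together
    bits = [(mismatch >> (2 * (m - 1 - j))) & 1 for j in range(m)]
    for j in range(m):
        if not 0 <= j - k < m:
            bits[j] = 1                  # assumption: shifted-in positions are mismatches
    return bits


text, pattern = "GGTGCAGAGCTC", "GGTGAGAGTTGT"
for k in (-1, 0, 1):
    print(k, "".join(map(str, diagonal(text, pattern, k))))
```

Each diagonal costs one shift, one XOR and one OR-reduction over the whole sequence, independent of the other diagonals.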

We also build an efficient hardware architecture for each search window of the Shouji algorithm. It quickly finds the number of zeros in each 4-bit vector using a hardware look-up table that stores the 16 possible permutations of a 4-bit vector along with the number of zeros for each permutation. We present the block diagram of the search window architecture in Supplementary Materials, Section 6.3. Our hardware implementation of the Shouji filtering unit is independent of the specific FPGA-platform as it does not rely on any vendor-specific computing elements (e.g. intellectual property cores). However, each FPGA board has different resources and hardware capabilities that can directly or indirectly affect the performance and the data throughput of the design. The maximum data throughput of the design and the available FPGA resources determine the number of filtering units in the accelerator. Thus, if, e.g. the memory bandwidth is saturated, then increasing the number of filtering units would not improve performance.
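The 16-entry look-up table mentioned above has a direct software analogue. The following sketch is only an illustration of the idea; the names `ZERO_COUNT` and `zeros_in` are hypothetical, and the actual hardware table is described in the Supplementary Materials.

```python
# One entry per 4-bit pattern: the number of zeros it contains.
ZERO_COUNT = [4 - bin(value).count("1") for value in range(16)]


def zeros_in(bits):
    """Look up the number of zeros in a 4-bit vector given as a list of four 0/1 values."""
    index = bits[0] << 3 | bits[1] << 2 | bits[2] << 1 | bits[3]
    return ZERO_COUNT[index]


print(zeros_in([0, 1, 1, 0]), zeros_in([0, 0, 0, 0]))
```

Because the window width is fixed at four, the table size is fixed at 2^4 = 16 entries regardless of the sequence length or the edit distance threshold.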

3 Results

In this section, we evaluate (i) the filtering accuracy, (ii) the FPGA resource utilization, (iii) the execution time of Shouji, our hardware implementation of MAGNET (Alser et al., 2017b), GateKeeper (Alser et al., 2017a) and SHD (Xin et al., 2015), (iv) the benefits of the pre-alignment filters together with state-of-the-art aligners and (v) the benefits of Shouji together with state-of-the-art read mappers. As we mention in Section 1, MAGNET leads to a small number of falsely accepted sequence pairs but suffers from poor performance. We comprehensively explore this algorithm and provide an efficient and fast hardware implementation of MAGNET in Supplementary Materials, Section 7. We run all experiments using a 3.6 GHz Intel i7-3820 CPU with 8 GB RAM. We use a Xilinx Virtex 7 VC709 board (Xilinx, 2014) to implement our accelerator architecture (for both Shouji and MAGNET). We build the FPGA design using Vivado 2015.4 in synthesizable Verilog.

3.1 Dataset description

Our experimental evaluation uses 12 different real datasets. Each

dataset contains 30 million real sequence pairs. We obtain three dif-

ferent read sets (ERR240727_1, SRR826460_1 and SRR826471_1)

of the whole human genome that include three different read lengths

(100, 150 and 250 bp). We download these three read sets from

EMBL-ENA (www.ebi.ac.uk/ena). We map each read set to the

human reference genome (GRCh37) using the mrFAST (Alkan et al.,

2009) mapper. We obtain the human reference genome from the

1000 Genomes Project (1000 Genomes Project Consortium, 2012).

For each read set, we use four different maximum numbers of

allowed edits using the e parameter of mrFAST to generate four

real datasets. Each dataset contains the sequence pairs that are gen-

erated by the mrFAST mapper before the read alignment step. This

enables us to measure the effectiveness of the filters using both

aligned and unaligned sequences over a wide range of edit distance

thresholds. We summarize the details of these 12 datasets in

Supplementary Materials, Section 9. For the reader’s convenience,

when referring to these datasets, we number them from 1 to 12 (e.g.

set_1 to set_12). We use Edlib (Šošić and Šikić, 2017) to generate

the ground truth edit distance value for each sequence pair.
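
To make the ground-truth labeling concrete, the sketch below computes the global edit distance with a plain dynamic programming routine and labels a pair as similar when the distance does not exceed the threshold E. The routine is a simple stand-in for Edlib, not Edlib itself, and the names `edit_distance` and `is_similar` are hypothetical.

```python
def edit_distance(a, b):
    """Global (Levenshtein) edit distance between strings a and b.

    Plain O(len(a) * len(b)) dynamic programming, used here only as a
    stand-in for an optimized library such as Edlib.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def is_similar(read, reference, threshold):
    """Ground-truth label: True if the pair is within `threshold` edits."""
    return edit_distance(read, reference) <= threshold
```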

3.2 Filtering accuracy

We evaluate the accuracy of a pre-alignment filter by computing its

false accept rate and false reject rate. We first assess the false accept

rate of Shouji, MAGNET (Alser et al., 2017b), SHD (Xin et al.,

2015) and GateKeeper (Alser et al., 2017a) across different edit dis-

tance thresholds and datasets. The false accept rate is the ratio of the

number of dissimilar sequences that are falsely accepted by the filter

and the number of dissimilar sequences that are rejected by the opti-

mal sequence alignment algorithm. We aim to minimize the false

accept rate to maximize the number of dissimilar sequences that are

eliminated. In Figure 3, we provide the false accept rate of the four

filters across our 12 datasets and edit distance thresholds of 0–10%

of the sequence length (we provide the exact values in Section 10 in

Supplementary Materials).
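
The false accept rate defined above can be computed from per-pair filter decisions and ground-truth edit distances, as in the following minimal sketch. The function name and argument layout are assumptions made for illustration; returning 0.0 when a dataset contains no dissimilar pair is likewise a convention chosen here.

```python
def false_accept_rate(filter_accepts, true_distances, threshold):
    """Fraction of truly dissimilar pairs that the filter accepts.

    filter_accepts: iterable of booleans, True if the filter accepted the pair.
    true_distances: ground-truth edit distance of each pair.
    threshold:      edit distance threshold E; a pair is dissimilar when its
                    true distance exceeds E.
    """
    decisions_on_dissimilar = [
        accepted
        for accepted, distance in zip(filter_accepts, true_distances)
        if distance > threshold
    ]
    if not decisions_on_dissimilar:
        return 0.0  # convention assumed for datasets with no dissimilar pair
    return sum(decisions_on_dissimilar) / len(decisions_on_dissimilar)
```

For example, with E = 5, decisions `[True, True, False, True]` and true distances `[1, 7, 9, 6]`, three pairs are dissimilar and two of them are accepted, so the false accept rate is 2/3.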

Based on Figure 3, we make four key observations. (i) We ob-

serve that Shouji, MAGNET, SHD and GateKeeper are less accurate

in examining the low-edit sequences (i.e. datasets 1, 2, 5, 6, 9 and

10) than the high-edit sequences (i.e. datasets 3, 4, 7, 8, 11 and 12).

(ii) SHD (Xin et al., 2015) and GateKeeper (Alser et al., 2017a)

become ineffective for edit distance thresholds of >8% (E = 8), 5%

(E = 7) and 3% (E = 7) for sequence lengths of 100, 150 and 250

characters, respectively. This causes them to examine each sequence

pair unnecessarily twice (i.e. once by GateKeeper or SHD and once

by the alignment algorithm). (iii) For high-edit datasets, Shouji provides

up to 17.2×, 73× and 467× (2.4×, 2.7× and 38× for low-edit

Fig. 2. Overview of our hardware accelerator architecture. The filtering units

can be replicated as many times as possible based on the resources available

on the FPGA

