# An Algorithm-Hardware Co-Optimized Framework for Accelerating *N:M* Sparse Transformers

Chao Fang, Graduate Student Member, IEEE, Aojun Zhou, and Zhongfeng Wang, Fellow, IEEE

Abstract—The Transformer has become an indispensable staple in deep learning. However, deploying Transformers efficiently in real-life applications is very challenging because of the immense number of parameters and operations in these models. Exploiting sparsity is an effective approach to relieve this burden and accelerate Transformers. Newly emerging Ampere GPUs leverage a 2:4 sparsity pattern to achieve model acceleration, but this single pattern can hardly meet the diverse algorithm and hardware constraints that arise when deploying models. By contrast, we propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns. (1) From the algorithm perspective, we propose a sparsity inheritance mechanism along with inherited dynamic pruning (IDP) to obtain a series of N:M sparse candidate Transformers rapidly. A model compression scheme is further proposed to significantly reduce the storage requirement for deployment. (2) From the hardware perspective, we present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers. STA features not only a computing engine unifying both sparse-dense and dense-dense matrix multiplications with high computational efficiency but also a scalable softmax module eliminating the latency from intermediate off-chip data communication. Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average of 6.7% improvement in accuracy with high training efficiency. Moreover, STA achieves $14.47\times$ and $11.33\times$ speedup compared to Intel i9-9900X and NVIDIA RTX 2080 Ti, respectively, and performs $2.00 \sim 19.47\times$ faster inference than the state-of-the-art FPGA-based accelerators for Transformers.

Index Terms—Algorithm-hardware co-design, Transformer, hardware accelerator, pruning, model compression.

## I. INTRODUCTION

**T**RANSFORMER-BASED networks are a formidable force in deep learning [1]. Tremendous impact in many fields, such as neural machine translation [2], language understanding [3], and image processing [4], has been made since the innovation of Transformers. Nevertheless, the impressive performance of Transformers comes with heavy computing and memory costs, which become a significant barrier to the efficient deployment of Transformer-based applications. Notably, BERT, a representative Transformer-based model [3], requires 440MB memory and over 176G floating-point operations. Such severe requirements on memory and computation make it critical to find an efficient solution for deploying Transformers.

A. Zhou is with CUHK-Sensetime Joint Lab, CUHK, Hong Kong, China (e-mail: aojunzhou@gmail.com).



Fig. 1. Accelerating N:M sparse Transformer-based models (a) using modern Ampere GPUs and (b) using diverse FPGAs with our framework. Compared to (a), (b) can generate a series of N:M sparse Transformers along with the dedicated accelerators for efficient model deployment.

Sparsity is an important feature that can be utilized to improve the efficiency of DNN deployment on dedicated accelerators. Among the pioneering works, OPTIMUS [5] and EdgeBERT [6], the latest ASIC accelerators, leverage unstructured sparsity to realize efficient deployment for Transformers. Nevertheless, it is hard to predict the unstructured sparsity in advance, and therefore the acceleration performance can be greatly degraded. Recent studies [7] demonstrate that deep neural networks leveraging N:M fine-grained structured sparsity, where at most N out of every M consecutive parameters are non-zero, can achieve performance comparable to those leveraging unstructured sparsity [8]. However, the acceleration of N:M sparse networks is significantly restricted on current hardware platforms. As shown in Fig. 1 (a), the only existing solution is ASP with Ampere GPUs, which focuses on the middle-level (2:4), i.e., 50%, sparse ratio. Based on our experiments, a heavy Transformer can be dramatically slimmed by weight pruning with an aggressive N:M pattern, e.g., 2:8 or 1:8, achieving a considerable reduction in the amount of both parameters and operations. Having uniform 2:4 sparsity as the only choice limits performance when deploying Transformer-based models, making it inflexible to meet different hardware constraints (e.g., latency, energy). Compared with the uniform 2:4 sparsity, the more flexible general N:M sparsity can satisfy various algorithm and hardware constraints under different deployment scenarios in real applications. However, there is currently no integrated framework for investigating the deployment of Transformers with general N:M sparse patterns. To bridge this gap, as presented in Fig. 1 (b), we propose an algorithm-hardware co-optimized framework for accelerating N:M sparse Transformers, which addresses two significant issues: (1) how to produce a series of N:M sparse Transformers in an efficient way; (2) how to design a flexible and efficient dedicated architecture for N:M sparse Transformers on diverse FPGA platforms.

This work was supported in part by the National Natural Science Foundation of China under Grant 62174084, 62104097 and in part by the High-Level Personnel Project of Jiangsu Province under Grant JSSCBS20210034, the Key Research Plan of Jiangsu Province of China under Grant BE2019003-4. (*Corresponding author: Zhongfeng Wang.*)

C. Fang and Z. Wang are with the School of Electronic Science and Engineering, Nanjing University, Nanjing 210008, China (e-mail: fantasy-see@smail.nju.edu.cn; zfwang@nju.edu.cn).

Although the advanced optimization algorithms ASP [9] and SR-STE [7] can maintain performance at the middle-level (2:4) sparsity via static and dynamic fine-tuning, we observe that existing methods degrade the performance significantly under high sparse ratios (e.g., $\geq 75\%$). In addition, the ASP and SR-STE schemes leverage single-shot magnitude-based pruning for specified hyper-parameters N and M. This traditional recipe results in a significant performance drop at higher sparse ratios and restricts the deployment of flexible N:M sparse models on the FPGA platform. To overcome the aforementioned problems, we propose a sparsity inheritance mechanism, which increases the sparsity progressively to enable efficient searching for N:M sparse Transformers under various sparsity configurations (e.g., 2:8, 1:8). We also propose a pruning method, namely inherited dynamic pruning (IDP), which shrinks pre-pruning models progressively; initializing from the converged pre-pruning models aids the convergence of the following sub-networks. Extensive experiments on Transformer-based models show that models generated by IDP with the sparsity inheritance mechanism outperform those obtained with ASP and SR-STE across various sparsity ratios. Moreover, for efficient model deployment, we apply a simple but effective bitmap-based compression scheme, which dramatically reduces the storage requirements for N:M sparse Transformers.

To enable flexible and efficient deployment on various FPGA devices, we design a highly configurable dedicated accelerator for *N:M* sparse Transformers, namely STA. STA fully explores the parallelism of Transformers in three aspects, including head parallelism, row parallelism, and column parallelism, which significantly improves computational efficiency. It features two computing cores, a diverse matrix multiplication (MatMul) engine, called DMME, and a scalable softmax module, both of which are highly configurable. Operations of *N:M* sparse Transformers are dominated by two types of MatMuls. One is the sparse-dense MatMul with *N:M* sparse network parameters, and the other is the dense-dense MatMul free of parameters. DMME performs both sparse-dense and dense-dense MatMuls on-the-fly, and achieves much higher computational efficiency than the prior work [10] under both modes. Especially for sparse-dense MatMul, DMME only performs operations related to the remaining non-zero parameters, which greatly improves computational efficiency. The scalable softmax module performs the softmax function in Transformers. It keeps all the intermediate results fully local, eliminating latency from intermediate off-chip data communication. According to the given architectural settings, STA can be rapidly implemented on FPGAs to realize efficient deployment for specific *N:M* sparse Transformers.

To summarize, the contributions of our paper are as follows:

- To the best of our knowledge, this is the first work that presents an algorithm-hardware co-optimized framework to systematically study the efficiency of fine-grained *N:M* sparse Transformers on FPGA. The proposed framework can adjust to diverse hardware constraints for flexible and efficient model deployment.
- To generate a series of *N:M* sparse Transformers simultaneously, we propose a sparsity inheritance mechanism along with the inherited dynamic pruning (IDP) algorithm, which improves the accuracy of Transformers under high sparsity by about 6.7% compared with state-of-the-art methods.
- We present a simple but effective bitmap-based compression scheme for N:M sparse Transformers, which, compared to multiple sparse indexing formats, dramatically reduces the storage requirements by up to $5.33\times$.
- We propose a dedicated hardware architecture, namely STA, to realize flexible and efficient deployment of *N:M* sparse Transformers. It features two novel hardware modules handling intensive operations of Transformers, including a diverse matrix multiplication engine (DMME) that unifies dense and sparse MatMul operations with high computational efficiency, and a scalable softmax module to avoid frequent off-chip memory accesses.
- Extensive experiments have been conducted on four NLP tasks and four Transformer-based models to evaluate the effectiveness of the proposed framework, which achieves up to  $19.47 \times$  speedup over Intel i9-9900X, NVIDIA RTX 2080 Ti, and prior FPGA-based accelerators for Transformers.

The rest of this paper is organized as follows. Section II presents an overview of Transformers and state-of-the-art works on accelerating Transformers with innovations in hardware architecture. Section III introduces the workflow of our proposed algorithm-hardware co-optimization framework. Section IV and Section V elaborate on the optimizations of the pruning algorithm and the hardware architecture, respectively. Comprehensive experimental results are presented in Section VI to show the significant potential of our proposed co-optimization framework in Transformer-based applications.

## II. BACKGROUND AND MOTIVATION

In this section, we provide an overview of key structures in Transformers, and review related work on hardware accelerators for Transformers.

#### A. Transformer Overview

The key architectures of the Transformer [11] are characterized by a multi-head attention (MHA) residual block (ResBlock) and a position-wise feed-forward network (FFN) ResBlock. Fig. 2 (a) and (b) illustrate the inner structures of the MHA and FFN ResBlocks, respectively. The input and output of the FFN ResBlock are connected by a residual connector, and the FFN ResBlock contains two linear transformation modules along with an activation function. The structure of the MHA ResBlock is more complicated. Its inputs are first split into multiple parallel heads with corresponding linear projections. The results are then fed in parallel into the attention mechanism, and finally the outputs of the attention heads are concatenated and passed into a linear layer to obtain the output linear projection. Note that the attention mechanism is totally different from the linear layer, as it performs parameter-free MatMuls. Thus, the computing engine for Transformers is required to support both sparse and dense MatMuls even though sparsity is introduced to parameters. The residual connector of the MHA ResBlock is organized in the same way as that of the FFN ResBlock.



Fig. 2. The operations in (a) the MHA ResBlock and (b) the FFN ResBlock under the N:M sparsity pattern. Both ResBlocks are the key structures of the Transformer. (c) An illustration of 2:4 sparse parameters in a linear layer.

#### B. Recent Advances for Transformer Acceleration

Extensive research has concentrated on the design of high-performance and energy-efficient DNN hardware accelerators [12]–[28]. However, most of these works focus on CNN and RNN computations, and not as much scrutiny has been given to accelerating Transformer-based networks with self-attention mechanisms.

As a pioneering work, [10] proposed a dense systolic array accelerator along with a partitioning scheme for FPGA-based acceleration of Transformers. Moreover, FTRANS [29] exploited a block-circulant matrix-based weight representation for Transformer acceleration. However, both of them fail to utilize the sparsity of parameters in Transformers, thereby limiting the speedup of model deployment. A<sup>3</sup> [30], SpAtten [31], and Sanger [32] merely focused on the speedup potential of the sparse attention mechanism, and can hardly satisfy the needs of agile and efficient deployment of Transformer models. OPTIMUS [5] and EdgeBERT [6] holistically accelerate Transformers with unstructured sparse matrix multiplications and save energy by skipping the computations related to zero-valued parameters. Nevertheless, the unstructured sparsity leads to irregular data access, making both designs suffer from low computational efficiency. The work in [33] exploited a coarse-grained block-based sparsity pattern for accelerating Transformers, but this sparsity pattern is so coarse that the models can hardly achieve a considerable sparsity ratio with acceptable accuracy.

In summary, it is hard for all these works to achieve satisfying speedup and efficiency of Transformer deployment due to the lack of attention to model sparsity, limited exploration of the sparse potential of the whole Transformer model, or restricted computational efficiency for sparse Transformers. To address the above issues, this work presents an algorithm-hardware co-optimization framework to realize flexible and efficient deployment of Transformers by leveraging general N:M sparsity patterns. For algorithm optimization, we focus on how to generate a series of N:M sparse Transformers with high quality and efficiency. For hardware optimization, we concentrate on designing a flexible and efficient dedicated architecture that can accelerate N:M sparse Transformers with high computational efficiency.

## III. OVERVIEW OF CO-OPTIMIZATION

To achieve agile and efficient deployment of Transformers, we propose an algorithm-hardware co-optimized framework. The overview of our framework is presented in Fig. 3. According to the given specific requirements, our framework can quickly obtain the required N:M sparse Transformer model with high accuracy, and provide corresponding Transformer accelerators on FPGA devices to realize efficient model deployment. In this section, we elaborate on the workflow of our algorithm-hardware co-optimized framework.

At the algorithm level, we focus on quickly obtaining any desired N:M sparse Transformer model and on compressing the N:M sparse Transformer effectively. The algorithm optimization is divided into two stages. As shown in Fig. 3, the first stage is IDP based on the sparsity inheritance mechanism. Compared with single-shot training [7], [9], our method can utilize the knowledge of the previous N:M sparse model, which contributes to faster and better convergence. The second stage is model compression. Only the non-zero parameters in the N:M sparse Transformer are stored, along with an additional binary mask that indicates the positions of all recorded elements. The methods of pruning and model compression are presented in Sec. IV.

At the hardware level, we concentrate on efficient and flexible hardware architecture design that boosts the computational efficiency for *N:M* sparse Transformers. The hardware optimization features an efficient hardware architecture for *N:M* sparse Transformers along with an automatic hardware generator, which can meet requirements on various Transformer models, FPGA devices, and *N:M* sparsity. The automatic hardware generator is composed of an instruction generator and a hardware template library. According to the *N:M* configuration of the winner model and the network structure of the deployment model, the instruction generator can automatically produce instructions that guide STA to perform the operations of the winner *N:M* sparse Transformer. As shown in Fig. 3, instructions are divided into three categories: load/store data, sparse/dense MatMul operators, and fused vector operators. The hardware template library can quickly generate a dedicated STA based on the pre-defined hardware configurations and the *N:M* configuration of the winner model. STA performs inference tasks for the Transformer model by accessing the compact sparse parameters and the pre-generated instructions. STA, whose hardware architecture is elaborated in Sec. V, can achieve significant improvement in computational efficiency by eliminating all zero-valued parameter operations.

Fig. 3. The workflow of our proposed algorithm-hardware co-optimized framework. At the algorithm level, *N:M* sparse Transformers can be rapidly generated by inherited dynamic pruning, and significantly compressed for further deployment. At the hardware level, the dedicated accelerator, STA, is implemented on the FPGA platform to accelerate the deployed *N:M* sparse Transformer.

In real-life deployment, the choice of N:M may vary when there are multiple FPGA devices with different hardware resources and varying deployment constraints, including latency and model accuracy. However, once an N:M model has been selected with all of the above factors taken into account, that model meets the needs of the practical application. Therefore, N:M may change before a model is deployed but does not change after deployment. Compared to the Ampere GPU dedicated to 2:4 sparse acceleration, an STA for a specific N:M can be flexibly configured and automatically generated on the selected FPGA device, with significant performance gains from dedicated N:M sparse acceleration. As N:M changes, our framework benefits from the algorithm-hardware co-optimization. At the algorithm level, the proposed IDP provides a series of N:M models with varying computing complexity and model accuracy, among which we can select the most suitable one for deployment. At the hardware level, the proposed STA can be flexibly generated based on the selected N:M and other configurations, achieving significant acceleration of N:M Transformers.

## IV. ALGORITHM OPTIMIZATION

In this section, we elaborate on the algorithm optimizations of our framework. First, we demonstrate the advantages of the *N:M* sparsity pattern in Sec. IV-A by comparing it with other popular sparsity patterns. Then, the pruning algorithm and the compression scheme for *N*:*M* sparse Transformers are presented in Sec. IV-B and Sec. IV-C, respectively.

#### A. N:M Sparsity Pattern

A dense parameter matrix can be pruned to a sparsity ratio of 50% using any of three existing sparsity patterns: unstructured sparsity [5], block-based structured sparsity [33], and N:M group-based structured sparsity [7]. Table I summarizes these pruning patterns. With the unstructured sparsity pattern, elements at any position of the parameter matrix can be pruned. An unstructured sparse model can achieve a considerable compression ratio while maintaining accuracy comparable to the dense model. However, its speedup on hardware is limited [5], [6] due to the irregular pattern. For block-based pruning, the parameter matrix is first divided into multiple blocks, and unimportant blocks are then dropped to reduce storage and computation. A block-based sparse model has a regular pattern and can therefore achieve high computational efficiency on hardware. Nevertheless, the acceleration of block-based sparse models [33] remains inefficient, since only a limited compression ratio is attainable with the block-based pattern. As for N:M group-based structured sparsity, the parameter matrix is divided into multiple groups, where each group gathers consecutive column-wise elements of the matrix. Each group has M elements and contains at most N non-zero elements. N:M sparsity can achieve a high compression ratio along with computational efficiency on hardware due to its fine-grained regular pattern. Hence, Transformers with the N:M sparsity pattern, which remain largely unexplored, have much more speedup potential than those with unstructured and block-based sparsity.

#### B. Pruning Algorithm

Given a pre-trained dense Transformer model, an N:M sparse Transformer can generally be trained with the objective

$$\min_{S(\mathcal{W},N,M)} \mathcal{L}(\mathcal{W};\mathcal{D}),\tag{1}$$

where $\mathcal{D}$ denotes the observed data, $\mathcal{L}$ represents the loss function, $\mathcal{W}$ indicates the parameters of the Transformer, and $S(\mathcal{W}, N, M)$ is the sparse Transformer with the N:M sparsity pattern, in which N is the number of non-zero values in each group of M. The dense model $\mathcal{W}$ is equivalent to $S(\mathcal{W}, N = M, M)$.

TABLE I

Comparison between the three existing sparsity patterns

|               | Unstructured | Block-based | N:M Group-based |
|---------------|--------------|-------------|-----------------|
| Visualization |              |             |                 |
| Accuracy      | High √       | Medium      | High √          |
| Efficiency    | Low          | High √      | High √          |
| Speedup       | Medium       | High √      | High √          |

The existing methods NVIDIA ASP [9] and SR-STE [7] leverage single-shot magnitude-based pruning and dynamic sparse training from dense models $\mathcal{W}$, respectively. The specific sparse models $S(\mathcal{W}, N, M)$ inherit from the global dense models $S(\mathcal{W}, M, M)$, with pre-trained weights in ASP and random initialization in SR-STE. This may lead to suboptimal solutions, and we observe that ASP and SR-STE hurt the performance significantly on Transformer-based models at higher sparse ratios (e.g., $\geq 75\%$). In addition, ASP and SR-STE require intensive training computation when different hardware constraints call for multiple sparsity levels (e.g., 1:8, 2:8, 3:8, and 4:8), since each level is trained separately.

Therefore, we propose a general and simple algorithm, namely IDP, for generating models with general N:M sparse patterns, which can produce a series of sparse models with different N:M configurations. Algorithm 1 presents the details of IDP. To handle the optimization difficulty of the sparse sub-networks inherited from large dense models, we introduce a novel co-training scheme, which optimizes multiple N:M sparse models of different levels simultaneously (e.g., 1:8, 2:8, 3:8, 4:8, and 5:8). During the training phase, we gradually reduce the number of non-zero parameters N, which helps ensure that the super models converge well. The general inheritance mechanism of IDP is as follows:

$$S(\mathcal{W}, N_1, M) \leftarrow S(\mathcal{W}, N_2, M) \leftarrow \dots \leftarrow S(\mathcal{W}, M, M),\tag{2}$$

where $S(\mathcal{W}, N_i, M)$ are N:M sparse models with $N_1 < N_2 < \cdots < M$, and $S(\mathcal{W}, M, M)$ represents the dense model. The arrow $\leftarrow$ means that the smaller model $S(\mathcal{W}, N - 1, M)$ is pruned from $S(\mathcal{W}, N, M)$, which we name the **inheritance mechanism**. It requires merely one hyper-parameter n, the final value of N at which the iterations end. With the novel inheritance mechanism, our IDP training method can be summarized in four steps:

- <u>Step 1</u>: Initialize $N = M - 1$ and set the dense pre-trained model as the first winner model.
- <u>Step 2</u>: Sparsity inheritance uses the kept parameters of the winner among all the $S(\mathcal{W}, N + 1, M)$ candidates to initialize the following sparse model $S(\mathcal{W}, N, M)$.
- <u>Step 3</u>: Sparse training of the *N:M* sparse candidates for several epochs. Parameters are adjusted in every epoch by updating the mask based on their magnitude. This step generates a new winner model, which is the converged model at the last epoch.
- <u>Step 4</u>: If $N = n$, the whole process is finished; otherwise, $N \leftarrow N - 1$, and then go to Step 2.

We expect to obtain $M - n + 1$ preserved winner models with different N:M sparsity for subsequent deployment, counting the dense pre-trained model as the first winner. Additionally, in the forward pass, we leverage the popular group-wise magnitude pruning [7], [9]. Parameter matrices are partitioned into multiple groups, each of which contains M consecutive column-wise elements, as shown in Table I. We keep the N largest-magnitude parameters in each group and generate the corresponding masks $\mathcal{B} \in \{0,1\}^d$. Specifically, if the *i*-th parameter of $\mathcal{W}$ survives in the pruned sub-network, we set $\mathcal{B}_i = 1$; otherwise, $\mathcal{B}_i = 0$. In the backward pass, recent studies [7], [34] demonstrate that dynamic sparse training can benefit both model convergence and accuracy, and we follow their methods to calculate gradients.
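The group-wise magnitude pruning used in the forward pass can be illustrated with a minimal sketch. It is written in plain Python for a single flattened row of weights and assumes that each group consists of M consecutive entries of that row. The function name `nm_mask` and the example values are hypothetical, and ties are broken by position, which the text does not specify.

```python
def nm_mask(row, n, m):
    """Return a 0/1 mask that keeps the n largest-magnitude entries in every group of m.

    Illustrative sketch only: `row` is one flattened row of weights, and groups
    are assumed to be m consecutive entries of that row.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |w| in this group (ties broken by position).
        keep = set(sorted(range(m), key=lambda i: (-abs(group[i]), i))[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]  # hypothetical weights
mask = nm_mask(row, n=2, m=4)
pruned = [w if b else 0.0 for w, b in zip(row, mask)]
print(mask)    # [1, 0, 0, 1, 0, 1, 1, 0]
print(pruned)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Each group of four entries keeps exactly two survivors, so the resulting mask satisfies the 2:4 pattern by construction.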

**Algorithm 1** Inherited Dynamic Pruning

**Input:** Pre-trained dense weights $\mathcal{W}$, datasets $\mathcal{D}$, initial learning rate $\gamma_0$, and the end of iterations n.

1: **for** $N = M - 1, M - 2, \ldots, n$ **do**

2: $S(\mathcal{W}, N, M) \leftarrow$ the winner of $S(\mathcal{W}, N+1, M)$

3: **for** each training iteration t **do**

4: **Forward Pass**: generate $\mathcal{B}_t$ by group-wise magnitude pruning.

5: **Backward Pass**: $\mathcal{W}_{t+1} = \mathcal{W}_t - \gamma_t g(\mathcal{W}_t \odot \mathcal{B}_t) + \lambda((1 - \mathcal{B}_t) \odot \mathcal{W}_t).$

6: **end for**

7: **end for**

**Output:** A series of *N*:*M* sparse models with different computation complexity and corresponding masks: the winners of $S(\mathcal{W}, N = M - 1, M)$, $S(\mathcal{W}, N = M - 2, M)$, ..., $S(\mathcal{W}, N = n, M)$.
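The control flow of the inheritance loop can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the training code of the paper: `decay_pruned` is a hypothetical stand-in for a real training iteration (it only shrinks the currently pruned weights and computes no gradient), `nm_mask` treats M consecutive list entries as one group, and all names and values are illustrative.

```python
def nm_mask(weights, n, m):
    """0/1 mask keeping the n largest-magnitude entries in each group of m."""
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        keep = set(sorted(range(len(group)), key=lambda i: (-abs(group[i]), i))[:n])
        mask.extend(1 if i in keep else 0 for i in range(len(group)))
    return mask


def decay_pruned(weights, mask, rate=0.5):
    """Hypothetical placeholder for one training step: shrink the pruned weights."""
    return [w if b else w * rate for w, b in zip(weights, mask)]


def idp(dense, m, n_end, train_step=decay_pruned, iters=3):
    """Prune progressively from N = M-1 down to n_end, each stage starting from
    the previous winner (sparsity inheritance)."""
    winners = {m: list(dense)}            # the dense model is the first winner
    for n in range(m - 1, n_end - 1, -1):
        w = list(winners[n + 1])          # inherit from the winner of S(W, N+1, M)
        for _ in range(iters):
            mask = nm_mask(w, n, m)       # forward pass: refresh the mask
            w = train_step(w, mask)       # backward pass: placeholder update
        mask = nm_mask(w, n, m)
        winners[n] = [x if b else 0.0 for x, b in zip(w, mask)]
    return winners


dense = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.6]   # hypothetical weights
winners = idp(dense, m=4, n_end=1)
for n in sorted(winners, reverse=True):
    print(f"{n}:4", winners[n])
```

In this toy setting the non-zero positions of each winner form a subset of those of the previous winner. With real training the mask is recomputed every iteration, so that nesting is not guaranteed in general.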



Fig. 4. Compact storage scheme example for N:M sparse parameter matrix.

#### C. Packing N:M Sparse Parameters

An N:M sparse Transformer can be obtained after IDP, where each group of all parameter matrices contains at most N non-zero elements. However, it still occupies a large amount of memory, since the parameter storage scheme is the same as that of the dense Transformer. We apply a bitmap-based compression scheme to obtain a compact N:M sparse Transformer, which greatly reduces the storage required for deployment. Compared to COO, CSC, CSR [35], and step indexing [36], our scheme has better compression performance in the range of practical N:M sparsity. Fig. 4 presents the compression scheme of an N:M sparse parameter matrix using 2:4 sparsity as an example. For a parameter matrix $W \in \mathbb{R}^{R \times C}$, after IDP, there are at most 2 non-zero elements in a group, so the entire parameter matrix has at most $\frac{R}{2} \times C$ remaining non-zero elements. In our scheme, we merely preserve the non-zero elements in each group and use a binary mask to indicate their positions. With our scheme, a 2:4 parameter matrix $W \in \mathbb{R}^{R \times C}$ can be stored with $\frac{R}{2} \times C$ valid elements and $R \times C$ bits for the mask, instead of the $R \times C$ elements.
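As a minimal sketch of the bitmap-based packing idea, the following plain-Python functions pack and unpack one row of an N:M sparse matrix. The names `pack_nm` and `unpack_nm` are hypothetical. Groups are assumed to be M consecutive entries of the row, and a group with fewer than N non-zeros is padded with zeros so that every group occupies exactly N value slots. That padding is an assumption of this sketch rather than a detail stated in the text.

```python
def pack_nm(row, n, m):
    """Pack one N:M sparse row into (values, bitmap).

    values: n slots per group, holding the non-zeros (zero-padded if fewer than n).
    bitmap: one bit per original element, 1 where the element is non-zero.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    values, bitmap = [], []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        bits = [1 if w != 0 else 0 for w in group]
        if sum(bits) > n:
            raise ValueError("group violates the N:M pattern")
        kept = [w for w in group if w != 0]
        kept += [0.0] * (n - len(kept))   # pad so each group uses exactly n slots
        values.extend(kept)
        bitmap.extend(bits)
    return values, bitmap


def unpack_nm(values, bitmap, n, m):
    """Rebuild the original row from the packed values and the bitmap."""
    row = []
    for g in range(len(bitmap) // m):
        slots = iter(values[g * n:(g + 1) * n])
        for bit in bitmap[g * m:(g + 1) * m]:
            row.append(next(slots) if bit else 0.0)
    return row


sparse_row = [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]  # hypothetical 2:4 row
values, bitmap = pack_nm(sparse_row, n=2, m=4)
print(values)  # [0.9, -0.7, 0.3, -0.4]
print(bitmap)  # [1, 0, 0, 1, 0, 1, 1, 0]
print(unpack_nm(values, bitmap, n=2, m=4) == sparse_row)  # True
```

Only half of the values are stored for the 2:4 row, at the cost of one extra mask bit per original element.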

Consider a dense parameter matrix $W \in \mathbb{R}^{R \times C}$ in which all elements are quantized using q bits. W can be compressed to $\tilde{W} \in \mathbb{R}^{R \times \lceil \frac{C}{M} \rceil N}$, where there are $R \lceil \frac{C}{M} \rceil$ groups and each group has at most N non-zero parameters. The storage requirement of W is qRC bits. After pruning, we only need to store the N:M sparse matrix $\tilde{W}$ compactly with $qR \lceil \frac{C}{M} \rceil N$ bits, plus an additional binary mask with RC bits. Therefore, the compression ratio (CR) can be represented as:

$$CR = \frac{qC}{q\lceil \frac{C}{M}\rceil N + C}.\tag{3}$$
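To make Eq. (3) concrete, the short sketch below evaluates the compression ratio for a few N:M settings. The values q = 16 and C = 768 are assumptions chosen for illustration (16 bits matches the operand width described for the PEs, and 768 is a hypothetical matrix width), and the function name is likewise illustrative.

```python
import math


def compression_ratio(q, c, n, m):
    """Eq. (3): dense bits per row over (packed non-zeros + bitmap) bits per row."""
    return (q * c) / (q * math.ceil(c / m) * n + c)


# Assumed settings for illustration: q = 16 bits, C = 768 columns.
for n, m in [(2, 4), (2, 8), (1, 8)]:
    print(f"{n}:{m} -> {compression_ratio(16, 768, n, m):.2f}x")
# 2:4 -> 1.78x
# 2:8 -> 3.20x
# 1:8 -> 5.33x
```

Under these assumptions, the 1:8 case gives $16/3 \approx 5.33$, which appears consistent with the "up to $5.33\times$" storage reduction stated in the contributions.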

## V. HARDWARE OPTIMIZATION

In this section, we develop the flexible and efficient hardware architecture, namely STA, for N:M sparse Transformers. We first present the overall architecture of STA, and then elaborate on the designs of its core computing engines, including the DMME and the scalable softmax module.



Fig. 5. The overall architecture of STA. It is composed of computing, storage, and control function blocks. These red arrows pass control signals, while those black arrows transfer data.

#### A. Overall Architecture

The overall architecture of STA is shown in Fig. 5. It consists of three major function blocks: computing, storage, and control. The computing blocks consist of a diverse MatMul computing engine, namely DMME, a scalable softmax module, a vector unit, and a data reshuffle network. The dominant operations of N:M sparse Transformers, i.e., sparse-dense or dense-dense MatMuls, are performed by DMME on-the-fly with dynamic configuration and high computational efficiency. The scalable softmax module is responsible for the softmax operation in MHA ResBlocks, eliminating the off-chip transfer of intermediate data. The vector unit takes charge of operations with low computational density, including bias addition, residual addition, and activation functions. The reshuffle network reorders the temporary results before they are written back to the intermediate on-chip memory. The on-chip storage is partitioned into three parts: the weight memory, the input memory, and the intermediate memory. The weight and input memories store the model parameters and the input data of Transformers loaded from the off-chip memory, respectively. The results of a ResBlock are also written back to the input memory and passed to the off-chip memory. All the temporary results in a ResBlock are stored in the intermediate memory with no communication to the external memory.



Fig. 6. The hierarchical architecture of DMME. It consists of H parallel unified MatMul computing engines (a). Each engine contains $R \times C$ unified systolic PEs capable of handling both sparse-dense and dense-dense dot products (b). The key components of the PEs, the non-zero element selector and the *N*-parallel MAC, are shown in (c) and (d), respectively.

#### B. DMME

DMME unifies both sparse-dense and dense-dense MatMuls with a high computational efficiency in *N*:*M* sparse Transformers. When it performs sparse-dense MatMuls, it merely loads nonzero weight parameters and selects corresponding activations to compute, thereby improving computational efficiency.

The architecture of the DMME is illustrated in Fig. 6. It is a two-level hierarchical design that fully explores the parallelism inside the MatMuls of N:M sparse Transformers. The exploited parallelism consists of head parallelism, row parallelism, and column parallelism, which are denoted as H, R, and C, respectively. The DMME is composed of H parallel $R \times C$ unified MatMul computing engines (Fig. 6 (a)), each of which can efficiently realize both sparse-dense and dense-dense MatMuls in a time-division multiplexing manner. The capability of performing sparse-dense and dense-dense MatMuls comes from the inner unified systolic PE (Fig. 6 (b)). It is composed of a non-zero element selector, an *N*-parallel MAC, multiple multiplexers, and registers. The non-zero element selector, which is only activated in the sparse-dense MatMul mode, selects the proper activations according to the input bitmask. The N-parallel MAC accepts N 16-bit input data and parameters, computes their inner product, and then accumulates the result with the local 32-bit output partial sum. The multiplexers and registers are used for datapath selection and temporary data storage, respectively. The design of the non-zero element selector is presented in Fig. 6 (c). It takes as input an *M*-bit mask, in which only *N* bits are set to 1 to indicate the positions of non-zero elements, and then generates N one-hot encoding masks to select the corresponding N data for dot product computation. The translation to N one-hot encoding masks is performed by cascading simple bit-arithmetic blocks and XOR gates. With the help of the N one-hot encoding masks, the data related to non-zero parameters are fed into the N-parallel MAC along with the non-zero parameters of one group. It should be noted that the non-zero element selector can be further optimized by pruning the redundant indexing indicators and element candidates. The N-parallel MAC, as shown in Fig. 6 (d), is composed of N parallel multipliers, an adder tree, and a final accumulator. Both the non-zero element selector and the N-parallel MAC are fully pipelined to maximize the throughput of DMME.
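The behavior of the non-zero element selector and the N-parallel MAC can be mimicked in software with the following plain-Python sketch. It is a functional model only, not a description of the actual circuit: isolating the lowest set bit with `mask & -mask` and clearing it with XOR is one software analogue of deriving one-hot masks from an M-bit mask. The sketch also assumes that bit i of the mask corresponds to position i within the group. All function names and values are hypothetical.

```python
def one_hot_masks(mask):
    """Split a bitmask into one-hot masks, one per set bit, lowest position first."""
    masks = []
    while mask:
        lowest = mask & -mask   # isolate the lowest set bit
        masks.append(lowest)
        mask ^= lowest          # clear that bit with XOR and continue
    return masks


def select_activations(activations, mask):
    """Pick the activations whose positions are flagged in the mask (bit i -> index i)."""
    return [activations[hot.bit_length() - 1] for hot in one_hot_masks(mask)]


def n_parallel_mac(weights, activations, partial_sum=0):
    """N parallel multiplications, a sum (adder tree), and accumulation."""
    products = [w * a for w, a in zip(weights, activations)]
    return partial_sum + sum(products)


mask = 0b1001                 # non-zero weights at positions 0 and 3 of a 2:4 group
acts = [10, 20, 30, 40]       # the M = 4 activations of the group
nonzero_weights = [2, 3]      # the N = 2 kept weights, in position order

picked = select_activations(acts, mask)
print([bin(h) for h in one_hot_masks(mask)])    # ['0b1', '0b1000']
print(picked)                                   # [10, 40]
print(n_parallel_mac(nonzero_weights, picked))  # 140
```

The result equals the dot product of the zero-filled group `[2, 0, 0, 3]` with the four activations, while only N = 2 multiplications are performed.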



Fig. 7. The activated datapath of PEs under (a) dense-dense and (b) sparse-dense modes.

The activated datapaths under dense-dense and sparse-dense MatMuls are presented in Fig. 7 (a) and (b), respectively. The dense-dense mode of the PEs is activated only when performing the self-attention operation of MHA. Both input operands are arranged in dense sequences in this mode. In this case, N elements of each input sequence are streamed in parallel into the PE as one operand, from the west and the north, respectively. In each cycle, the PE performs a dot product of size N under dense-dense mode. The partial sum is stored in the local registers, and the input operands from the west and the north stored in the registers are passed into the adjacent PEs on the east and south, respectively. For energy saving, the non-zero element selector is bypassed to avoid signal switching. Under sparse-dense MatMul mode, as shown in Fig. 7 (b), the input operands are different from those under the dense-dense mode. In each cycle, N non-zero parameters in a group along with the corresponding M-bit mask are streamed into the PE from the west, while M data in one group are fed from the north. The N valid data paired with the input parameters are picked by the non-zero element selector, and the PE then performs a dot product between them and these input parameters. When the computing task is done, under either dense-dense or sparse-dense mode, the PE switches to the shifting mode, in which it accepts the results from its western PE into its local registers and transfers its local result registers to the east.



Fig. 8. Computing dataflows of DMME when it performs (a) dense-dense and (c) sparse-dense MatMuls. Compared to (b) as a baseline, (c) eliminates all zero-valued redundant operations under sparse-dense mode, thus improving computational efficiency.

#### C. Supporting Efficient Matrix Computations

STA is capable of supporting both sparse-dense and dense-dense MatMuls of *N:M* sparse Transformers in an efficient way. We demonstrate this capability from four aspects: the computing dataflow of DMME, the data access pattern of DMME, the data mapping of the input memory, and the datapath from the input memory to DMME.

Fig. 8 illustrates the efficient computing dataflows of DMME under both dense-dense and sparse-dense modes. For simplicity, we assume N:M is 1:2, and the MatMul is performed on $2 \times 4$ input sequences and $4 \times 2$ parameter sequences. Here, we consider the computing engine in [10] as a baseline, which is orchestrated as a classic systolic array. In Fig. 8 (a), DMME finishes the dense-dense MatMul of the given computing task in 5 cycles, the same number of cycles as the baseline. Hence, DMME achieves the same computational efficiency as the baseline when performing dense-dense MatMuls. As for sparse-dense MatMuls, Fig. 8 (b) and (c) present the computational manner of the baseline and DMME, respectively. The baseline takes 5 cycles to finish the task, while DMME takes merely 3 cycles since the redundant operations are skipped without wasting computing cycles. For sparse-dense MatMuls, DMME thus improves the computational efficiency by eliminating redundant computations, thereby reducing latency and energy.
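To make the eliminated operations concrete, the following sketch computes the same $2 \times 4$ by $4 \times 2$ MatMul from a compressed 1:2 representation and counts the multiplications. It is an illustration under stated assumptions: the helper names are hypothetical, the N:M groups are assumed to run along the reduction dimension of each parameter column, and the example values are made up. It counts multiplications only and does not model the systolic timing behind the 5-cycle and 3-cycle results of Fig. 8.

```python
def compress_nm(column, n, m):
    """Compress one parameter column into per-group (values, bitmask) pairs."""
    groups = []
    for start in range(0, len(column), m):
        group = column[start:start + m]
        keep = [i for i, w in enumerate(group) if w != 0]
        if len(keep) > n:
            raise ValueError("group violates the N:M pattern")
        # Pad with zero-valued positions so every group stores exactly n values.
        spare = [i for i in range(len(group)) if i not in keep]
        keep = sorted(keep + spare[:n - len(keep)])
        mask = sum(1 << i for i in keep)
        groups.append(([group[i] for i in keep], mask))
    return groups


def sparse_dense_matmul(x, compressed_columns, m):
    """Multiply dense rows of x by compressed columns; also count multiplies."""
    out = [[0] * len(compressed_columns) for _ in x]
    multiplies = 0
    for r, row in enumerate(x):
        for c, groups in enumerate(compressed_columns):
            acc = 0
            for g, (values, mask) in enumerate(groups):
                acts = row[g * m:(g + 1) * m]
                picked = [acts[i] for i in range(m) if (mask >> i) & 1]
                acc += sum(v * a for v, a in zip(values, picked))
                multiplies += len(values)
            out[r][c] = acc
    return out, multiplies


def dense_matmul(x, w):
    """Reference dense MatMul used to check the sparse result."""
    return [[sum(x[r][k] * w[k][c] for k in range(len(w)))
             for c in range(len(w[0]))] for r in range(len(x))]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]           # 2 x 4 inputs
w = [[0, 2], [3, 0], [4, 0], [0, -1]]      # 4 x 2 parameters, 1:2 sparse
columns = [[w[k][c] for k in range(4)] for c in range(2)]
compressed = [compress_nm(col, n=1, m=2) for col in columns]
sparse_out, sparse_multiplies = sparse_dense_matmul(x, compressed, m=2)
dense_multiplies = len(x) * len(w) * len(w[0])
```

For this example the compressed path produces the same result as the dense reference with 8 multiplications instead of 16.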

Fig. 9 (a) and (b) present the data access patterns of DMME when it performs dense-dense and sparse-dense MatMuls, respectively. For dense-dense MatMuls, the attention heads used as inputs are both separated into H tiles. In this case, DMME can be decomposed into H independent systolic arrays, each of which fetches elements from the corresponding tiles to



Fig. 9. Data access pattern of DMME to support efficient MatMuls under (a) dense-dense and (b) sparse-dense modes.

the top-most and left-most PEs, respectively. For sparse-dense MatMuls, compressed weight parameters are divided into H tiles. Every cycle DMME fetches NR weight elements from all H tiles in parallel, and casts them one-on-one to the left-most systolic PEs in H systolic arrays. DMME is also required to access MC activation elements, and broadcasts them to the top-most PEs in all H unified systolic arrays.

To balance the bandwidth of the input memory when switching between dense-dense and sparse-dense modes, we make NH equal to M in STA. STA has C banks for input data storage. Fig. 10 illustrates the data mapping of the input memory and the datapath from the input memory to DMME by assuming N is 2, M is 4, H is 2, and C is 4. The data storage structure in the input memory varies with the computing mode. In dense-dense mode, DMME performs 2 parallel dense-dense MatMuls for the attention mechanism of Transformers. There are 2 tiles for the loaded input data. As shown in Fig. 10 (a), the first bank is connected to the first column of DMME, and the first address of the bank indexes the first 2 elements in the first column of tile one and tile two, respectively. In sparse-dense mode, DMME performs MatMuls with the N:M sparse parameters. As depicted in Fig. 10 (b), we do not tile input data for sparse-dense MatMul. The first address of bank one, which is connected to the first column of DMME, indexes the first 4 elements in the first column of the input data. The same indexing principle applies to the other banks in both computing modes.



Fig. 10. Data mapping of input memory and datapath from input memory to DMME under (a) dense-dense and (b) sparse-dense modes.

As for the datapath from the input memory to DMME, we take the first column of DMME as an example. There are 4 elements accessed from the first bank and streamed to the first column, which has a head dimension of 2. In dense-dense mode, as shown in Fig. 10 (a), the unified systolic PE in the first head of the first column directly receives these 4 elements, and the lowest 2 elements are fed into the *N*-parallel MAC for computing. However, the unified systolic PE in the second head of the first column requires 2-to-1 MUXs to select the correct 2 elements from the 4 accessed elements for computing. In sparse-dense mode, as presented in Fig. 10 (b), the data accessed from the first bank are broadcast to the unified systolic PEs in all head dimensions of the first column of DMME.
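The routing of one bank word to the PEs of a single DMME column can be sketched as below for N = 2, M = 4, and H = 2. The function name and the string labels are hypothetical. The text only states that the first head uses the lowest 2 elements, so the assumption that head h takes elements h·N to (h+1)·N − 1 in dense-dense mode is ours.

```python
def route_column(bank_word, n, h, mode):
    """Return the operands delivered to each of the h PEs in one column."""
    if mode == "dense-dense":
        if len(bank_word) != n * h:
            raise ValueError("a dense-dense bank word holds N*H elements")
        # Assumed MUX behavior: head k consumes its own slice of N elements.
        return [bank_word[k * n:(k + 1) * n] for k in range(h)]
    if mode == "sparse-dense":
        # All heads receive the same M-element group by broadcast.
        return [list(bank_word) for _ in range(h)]
    raise ValueError("unknown mode: " + mode)


N, M, H = 2, 4, 2
word = ["e0", "e1", "e2", "e3"]           # one address of the first bank
dense_routing = route_column(word, N, H, "dense-dense")
sparse_routing = route_column(word, N, H, "sparse-dense")
```

Because N·H equals M, the same 4-element bank word serves both modes, which is the bandwidth balance described above.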

## D. Scalable Softmax Module

The softmax function takes as input a vector  $\mathbf{x}$  of n real numbers, and normalizes it into a probability distribution consisting of n probabilities proportional to the exponentials of the input numbers. It is critical for Transformer dedicated accelerators to contain a softmax hardware implementation since the softmax function appears in every MHA module of Transformers. Fig. 11 presents the details of our proposed scalable softmax architecture, which is capable of performing softmax functions of arbitrary length. It keeps all the intermediate results fully local, avoiding off-chip data communication.

The architecture has two adjustable parameters, P and Q, where P denotes the parallelism of the architecture, and Q represents the pipeline depth as well as the output precision. P input data are streamed into the softmax module in parallel and transformed into the exponent outputs. The exponent outputs are not only temporarily stored in the data buffer, but also used as input for further accumulation. Once the accumulation process is done, the divider module takes both the accumulated result and the exponent outputs as input to perform Q-level pipelined division, and generates P softmax function outputs represented in Q bits.



Fig. 11. The architecture of the scalable softmax operator.

As shown in Fig. 11, the scalable softmax module consists of three major parts: an area-efficient exponential function, a partial sum accumulator, and a scalable divider. The exponential function is approximated using a lookup table combined with a first-order Taylor expansion. An exponential operator can be implemented using only one multiplier and one adder. The configurable partial sum accumulator can adapt to input vectors of various lengths, which improves the flexibility of the hardware. To reduce the latency of the division, we design a highly parallel divider by cascading multiple divider blocks with pipelines, where a divider block is composed of subtractors and shifters with little cost on hardware.
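The three parts can be modeled in software as in the sketch below. It is a behavioral approximation only: the lookup-table range and step, the fixed-point width, and the subtraction of the maximum input before exponentiation are assumptions made for this example rather than details given in the text, and all names are hypothetical.

```python
import math

LUT_MIN = -16.0   # assumed lower bound of the exponent input
LUT_STEP = 0.125  # assumed spacing of the lookup table entries
EXP_LUT = [math.exp(LUT_MIN + i * LUT_STEP)
           for i in range(int(-LUT_MIN / LUT_STEP) + 1)]


def approx_exp(x):
    """exp(x) for x in [LUT_MIN, 0]: table lookup plus first-order Taylor."""
    x = max(LUT_MIN, min(0.0, x))
    index = int((x - LUT_MIN) / LUT_STEP)
    base = EXP_LUT[index]
    residual = x - (LUT_MIN + index * LUT_STEP)
    return base + base * residual  # one multiplication and one addition


def shift_subtract_divide(numerator, denominator, q):
    """Q stages of shift and subtract; returns floor(num / den * 2**q)."""
    remainder, quotient = numerator, 0
    for _ in range(q):
        remainder <<= 1
        quotient <<= 1
        if remainder >= denominator:
            remainder -= denominator
            quotient |= 1
    return quotient


def softmax_fixed(xs, q=8, frac_bits=16):
    """Softmax with Q-bit outputs; each output represents value / 2**q."""
    peak = max(xs)  # assumption: inputs are shifted to be non-positive
    exps = [int(approx_exp(x - peak) * (1 << frac_bits)) for x in xs]
    total = sum(exps)  # partial sum accumulator over the whole vector
    return [shift_subtract_divide(e, total, q) for e in exps]


outputs = softmax_fixed([0.3, -1.2, 2.05], q=8)
probabilities = [o / 256 for o in outputs]
```

Each divider stage needs only a shift, a comparison, and a subtraction, which is why cascading Q such stages maps naturally onto a Q-level pipeline.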

#### VI. EXPERIMENTAL RESULTS

In this section, we comprehensively evaluate both algorithm and hardware optimizations of the proposed framework. Three benchmark sets with varying size and complexity are applied to evaluate the proposed framework.

#### A. Experimental Setup

#### 1) Benchmark Sets:

The first set focuses on the evaluation of algorithm optimizations, by comprehensively presenting improvements on both model accuracy and compression ratio under various *N:M* configurations. This benchmark set comprises a BERT model [3], the well-known Transformer-based model, and four evaluation datasets from the GLUE benchmark [37], including WNLI, QNLI, QQP, and MRPC. WNLI is a reading comprehension task. QNLI is a question-answering dataset consisting of question-paragraph pairs. QQP is a collection of question pairs from the community question-answering website Quora. MRPC is a corpus of sentence pairs automatically extracted from online news sources. For these four tasks, we report accuracy on the validation sets. We also report the compression ratio on BERT under various *N:M* configurations and quantized bitwidths of parameters.

The second set examines hardware resource consumption. Firstly, we study the resource consumption of DMMEs, the core computing engine in the STA, at common scales of *N:M* sparsity, and then we explore the hardware utilization of representative STAs on various FPGA devices. For the former evaluation, DMMEs are not allowed to be synthesized using DSP blocks, which can be done by setting the property *MAX\_DSP* to zero. Hence, the consumption of LUTs and FFs can measure the cost of combinational logic and sequential logic for DMMEs, respectively. The utilization of LUTs and FFs is reported at the synthesis stage as the metric of hardware resource requirements. For the latter evaluation, the resource utilization of STAs on various FPGA devices is reported at the implementation stage, including the number of LUTs, FFs, BRAMs, and DSPs consumed.

The third set studies the performance improvements of the overall STA hardware system on multiple FPGA platforms when deploying various Transformer-based models. All key configurations of the evaluated models in the benchmark sets are presented in Table II. At first, we evaluate the processing time with a single batch on all MHA and FFN ResBlocks in varying models from TinyBERT [38], Dino [39], and the Transformer-base model [11]. The selected models target different applications. TinyBERT is a lightweight BERT model for many language tasks. Dino, a tiny vision Transformer, can be the backbone for many computer vision tasks. The Transformer-base model is the classic one for the neural machine translation (NMT) task. Since NMT is a sequence-to-sequence task, we split the Transformer-base model into two parts, the stacked encoders and the stacked decoders. We finally make a fair comparison of the implemented STAs with previous works and commercial products using a shallow Transformer, which is the commonly used benchmark model in [29], [33]. Latency, throughput, power, energy efficiency, and MAC efficiency are key metrics for applications, and are thus used for performance evaluation.

TABLE II Key configurations of Transformer-based models in benchmark sets

| Benchmark | Model                                | Num. of<br>Encoders | Num. of<br>Decoders | Sequence<br>length | Attention<br>heads | Hidden<br>size | Intermediate<br>size |
|-----------|--------------------------------------|---------------------|---------------------|--------------------|--------------------|----------------|----------------------|
| Set I     | BERT                                 | 12                  | 0                   | 128                | 12                 | 768            | 3072                 |
| Set III   | TinyBERT4                            | 4                   | 0                   | 128                | 12                 | 312            | 1200                 |
|           | Dino-vits8                           | 12                  | 0                   | 64                 | 6                  | 384            | 1536                 |
|           | Transformer-base<br>stacked encoders | 6                   | 0                   | 64                 | 8                  | 512            | 2048                 |
|           | Transformer-base<br>stacked decoders | 0                   | 6                   | 64                 | 8                  | 512            | 2048                 |
|           | Shallow<br>Transformer               | 2                   | 1                   | 64                 | 4                  | 200            | 800                  |

#### 2) Implementation Details:

For algorithm implementation (Set I), the pre-trained models, the scripts and datasets are provided by the HuggingFace repository [40]. All models are implemented and executed using PyTorch v1.5.

As for hardware implementation (Set II & III), all modules of STA are designed in synthesizable SystemVerilog with the aid of hardware components from the BaseJump standard template library [41] and the PULP platform [42]. Xilinx Vivado 2018.2 is the tool for synthesis and implementation. We implement STA on three types of FPGA devices with various scales, including Xilinx ZYNQ Z7020 (XC7Z020), Xilinx Virtex-7 FPGA (XC7VX485T), and Xilinx UltraScale+ FPGA (XCVU13P). Specifically, XC7Z020 is a low-cost and low-resource System-on-Chip device equipped with a dual-core ARM Cortex-A9 processor and an FPGA, which is fabricated in the 28 nm technology node. XC7VX485T is a relatively large FPGA device fabricated in the 28 nm technology node. XCVU13P, fabricated in the 16 nm technology node, is an extremely expensive and advanced FPGA device with abundant hardware resources.

## B. Benchmark Set I: Algorithm Optimizations

For benchmark set I, ASP [9] and SR-STE [7], the two existing methods for acquiring *N:M* sparse models, are selected as our baselines. The reported accuracy of the baselines is obtained by training with the released open-source code. For a fair comparison, the *N:M* sparse models generated using ASP, SR-STE, and our method are obtained with identical finetuning epochs. For all tasks, we use a batch size of 32 and an initial learning rate of 2e-5. For WNLI, QNLI, and QQP, there are 3 epochs to recover accuracy for every step of N, while there are 5 epochs for MRPC.

Compared with existing methods, Fig. 12 shows that IDP can achieve comparable or better accuracy under a 75.00% sparse ratio. In addition, IDP outperforms the ASP and SR-STE methods significantly as the sparse ratio increases. For instance, under an 87.50% (2:16) sparse ratio, IDP consistently obtains large accuracy improvements over the baselines on all tasks (5.36% accuracy gain on MNLI, 10.38% accuracy gain on QNLI, 2.98% accuracy gain on QQP, and 8.08% accuracy gain on MRPC). Therefore, we can



Fig. 12. Pruning results on various tasks including (a) MNLI, (b) QNLI, (c) QQP, and (d) MRPC in comparison with ASP [9] and SR-STE [7].

obtain the state-of-the-art *N:M* sparse models for FPGA-based platform deployment with the plug-and-play IDP algorithm.

Based on our evaluations of model accuracy with respect to parameter sparsity, as shown in Fig. 12, it is observed that Transformers can hardly achieve a sparsity over 90% without impacting accuracy. A sparsity ranging from 50% to 87.5% is therefore more practical for N:M sparse Transformers. Next, BERT is taken as an example to evaluate the storage reduction of our compression scheme when using multiple quantized bit widths under various N:M sparsity configurations. We make an elaborate comparison between our bitmap-based scheme, COO, CSR, CSC, and step indexing [36]. Compared with the other mainstream methods, as shown in Fig. 13, our scheme achieves the highest compression ratio when the model sparsity is varied from 50% to 87.5%. The compression ratio keeps increasing as the model sparsity increases. An N:M sparse BERT can achieve a higher compression ratio when quantized in larger bit widths. BERT with 50.00% N:M sparsity can reach a $1.78 \times$ reduction in parameter storage. When BERT has a sparsity of 87.50%, it achieves a significant storage saving on parameters, up to $5.33 \times$. Our compression scheme can thus efficiently reduce the storage requirement for N:M sparse parameters. In subsequent hardware evaluations, we uniformly adopt a 16-bit fixed-point representation for Transformers to avoid negative impacts on model accuracy due to quantization.
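The reported ratios are consistent with a simple storage model, sketched below under the assumption that the bitmap-based scheme stores, for every group of M parameters, N quantized values plus one M-bit mask. The function name is hypothetical, and the model ignores any alignment or metadata overhead.

```python
def bitmap_compression_ratio(n, m, bits):
    """Dense storage over bitmap-compressed storage for one N:M group."""
    dense_bits = m * bits            # M values stored at full width
    compressed_bits = n * bits + m   # N values plus an M-bit mask
    return dense_bits / compressed_bits


ratio_50 = bitmap_compression_ratio(2, 4, bits=16)   # 50.00% sparsity
ratio_875 = bitmap_compression_ratio(1, 8, bits=16)  # 87.50% sparsity
ratio_50_8bit = bitmap_compression_ratio(2, 4, bits=8)
```

With 16-bit parameters this model gives about 1.78 at 50.00% sparsity and about 5.33 at 87.50% sparsity. Since the mask costs a fixed M bits per group, its relative overhead shrinks as the bit width grows, which matches the observation that larger bit widths yield higher compression ratios.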



Fig. 13. Compression ratio on BERT with various sparsity configurations.

## C. Benchmark Set II: Hardware Resource Consumption

The second benchmark set verifies hardware consumption of STAs that have not yet deployed Transformer models.

At first, we evaluate the hardware requirements of DMME in common sparse configurations by comparing it against multiple dense computing engines. For a fair comparison, the evaluated computing engines are all synthesized as a $2 \times 2$ unified MatMul computing engine, and only the PEs in these engines are configured into various *N:M* configurations. Note that the computing engines in which *N* equals *M* exclude the non-zero element selectors and merely support dense computing. The evaluated DMME can be regarded as the computing engine in [10] if both *N* and *M* are 1. These computing engines are not allowed to be synthesized using DSP blocks, which can be done by setting the property *MAX\_DSP* to zero. Hence, the utilization of LUTs and FFs from Vivado synthesis reports can be used to measure the consumption of combinational logic and sequential logic for DMMEs, respectively.

Fig. 14 presents the comparison of hardware resource consumption between DMMEs of various configurations. For simplicity, all results are normalized to the 4:4 dense baseline computing engine. Gray bars show the resource consumption of computing engines that support only dense matrix multiplication. Green, red, and yellow bars represent the hardware utilization of DMMEs when N is set to 1, 2, and 3, respectively. In Fig. 14, we can observe the hardware resources saved by DMMEs compared to the dense baseline computing engine under sparse matrix computing mode. When M = 16 and N is set to 1, 2, and 3, respectively, DMME, in contrast to the 16:16 baseline, saves combinational logic by up to $7.96 \times$, $4.76 \times$, and $3.17 \times$, while reducing sequential logic by $4.41\times$, $2.63\times$, and $1.99\times$. According to Fig. 14, we further evaluate the impact of separately increasing N and M in DMMEs on hardware resource consumption. For instance, the 3:4 DMME costs $2.42 \times$ the combinational logic and $2.78 \times$ the sequential logic of the 1:4 DMME. However, the 1:16 DMME merely requires $1.28 \times$ the combinational logic and $2.00 \times$ the sequential logic of the 1:4 DMME.

We finally evaluate hardware resource consumption of STA on three types of FPGA platforms, including XC7Z020, XC7VX485T, and XCVU13P. These FPGA platforms are used to represent diverse devices in Fig. 2. Considering the hardware resource and cost on these platforms, we intend to deploy



Fig. 14. The normalized resource consumption of unified computing engine over the classic computing engine on various scales including (a) combinational logic and (b) sequential logic.

XC7Z020 to the edge and XC7VX485T and XCVU13P on the clouds. There are many tunable parameters of STAs, especially in DMME, which have great impacts on performance. In order to determine the specific parameters of STA on each FPGA platform, we design a cycle-accurate simulator to evaluate actual inference performance based on the given specifications. STA on XC7Z020 adopts an aggressive 1:8 sparsity since latency is the critical metric on edge platforms. However, the STAs on both XC7VX485T and XCVU13P configure N:M as 2:8 because devices deployed on the clouds are concerned with model accuracy as well as latency. Table III shows the resource consumption of STAs deployed on three scales of FPGA platforms, namely STA-Tiny, STA-Small, and STA-Large, respectively. The FPGA resource and power breakdown of STA-Small is presented in Table IV. The N-parallel MAC (N-MAC) module dominates the DSP consumption of STA since it is the core of the computing engine for MAC operations. The non-zero element selector (NZES) module takes the majority of LUT consumption due to the cost of decoding and index selection. The routing module occupies most registers of DMME for datapath selection and temporary data storage. Moreover, the proposed DMME and softmax module account for 47.29% and 9.42% of the power consumption, respectively.

TABLE III FPGA RESOURCE UTILIZATION

| Platform      | Frequency | LUT      | FF       | BRAM     | DSP      |
|---------------|-----------|----------|----------|----------|----------|
| 1:8 STA-Tiny  | 150MHz    | 21K      | 75K      | 96       | 132      |
| (XC7Z020)     |           | (40.38%) | (71.21%) | (68.57%) | (60.00%) |
| 2:8 STA-Small | 200MHz    | 116K     | 337K     | 532      | 1,040    |
| (XC7VX485T)   |           | (38.42%) | (55.52%) | (51.65%) | (37.14%) |
| 2:8 STA-Large | 200MHz    | 464K     | 1,321K   | 1,192    | 4,160    |
| (XCVU13P)     |           | (26.88%) | (38.24%) | (44.35%) | (33.85%) |

## D. Benchmark Set III: Overall System Evaluation

The third benchmark set is used to evaluate performance when deploying various Transformer-based models on STAs.

Firstly, we study the inference speedup of STAs in contrast with CPUs, GPUs, and prior dedicated accelerators. The models selected for deployment are TinyBERT [38], Dino [39], and the classic Transformer model [11]. Here we consider the single-batch processing time in all the MHA and FFN ResBlocks of these Transformer-based models. For cross-platform comparison, the hardware setup used to execute the Transformer inference tasks is as follows. The CPU results are

TABLE IV FPGA RESOURCE AND POWER BREAKDOWN OF STA-SMALL

|         |         | LUT               | FF                | BRAM             | DSP               | Power (W)         |
|---------|---------|-------------------|-------------------|------------------|-------------------|-------------------|
| DMME    | N-MAC   | 16K<br>(13.79%)   | 68K<br>(20.18%)   | -                | 1024<br>(98.46%)  | 2.78<br>(28.17%)  |
|         | NZES    | 66K<br>(56.90%)   | 68K<br>(20.18%)   | -                | -                 | 0.83<br>(8.41%)   |
|         | Routing | 9K<br>(7.76%)     | 180K<br>(53.41%)  | -                | -                 | 1.06<br>(10.74%)  |
| Softmax |         | 13K<br>(11.21%)   | 8K<br>(2.37%)     | 16<br>(3.00%)    | 16<br>(1.54%)     | 0.93<br>(9.42%)   |
| Others  |         | 12K<br>(10.34%)   | 13K<br>(3.86%)    | 516<br>(97.00%)  | -                 | 4.27<br>(43.26%)  |
| Total   |         | 116K<br>(100.00%) | 337K<br>(100.00%) | 532<br>(100.00%) | 1040<br>(100.00%) | 9.87<br>(100.00%) |

measured using an ARM Cortex A57 and an Intel i9-9900X. The former commonly appears in mobile devices for edge applications, while the latter is a high-end CPU product for deploying cloud applications. The GPU results are measured using an NVIDIA Jetson Nano, an embedded GPU product for edge applications, an NVIDIA RTX 2080 Ti, and an NVIDIA RTX 3090 capable of 2:4 sparse acceleration. Following the comparison method in [32], we apply [10] as the baseline on FPGA platforms by scaling the size of its computing engine. Two existing sparse accelerators for Transformers, OPTIMUS [5] and EdgeBERT [6], are evaluated as well for a more comprehensive comparison with STA. Fig. 15 shows the idealized performance speedup of different hardware platforms, where edge and cloud platforms are normalized to the ARM Cortex A57 and Intel i9-9900X, respectively.



Fig. 15. The processing time of Transformer-based models on various (a) edge platforms and (b) cloud platforms.

As shown in Fig. 15 (a), among all edge platforms, STA-Tiny achieves a geometric mean speedup of $32.96 \times$, $2.69 \times$, and $5.38 \times$ over the CPU, the GPU, and the FPGA baseline, respectively. Fig. 15 (b) presents the performance comparison between various cloud platforms. For a fair comparison, Baseline-Small (red), EdgeBERT (purple), OPTIMUS (brown), and 2:8 STA-Small (pink) in Fig. 15 (b) are evaluated under the same number of MAC units and clock frequency.

- STA vs. Baseline-Small [10]: 2:8 STA-Small achieves 2.89× speedup on average over Baseline-Small, which enables dense Transformer acceleration using a large 2D systolic array. The baseline suffers from low utilization of MAC units due to its inflexible mapping scheme and the large skew latency of systolic-arranged data. STA achieves significant performance improvement by: 1) the innovation of DMME from the architectural aspect, and 2) the reduction of MAC operations in *N:M* sparse Transformers from the algorithmic aspect. We further present a performance breakdown of STA compared to Baseline-Small. As shown in Fig. 16, the architectural innovation of DMME achieves a 1.08× performance improvement, and there is a further 2.68× speedup on top of the architectural innovation by efficiently enabling 2:8 sparse acceleration.
- STA vs. EdgeBERT [6]: 2:8 STA-Small achieves 2.33× speedup on average over EdgeBERT-32, which is an energy-optimized Transformer accelerator exploiting unstructured sparsity. When performing sparse MAC operations, the processing units of EdgeBERT skip zero input values through a gating strategy, which can significantly reduce energy consumption but offers little benefit to latency. Compared to EdgeBERT, STA performs *N:M* sparse MAC operations by choosing non-zero inputs, which can both reduce energy and optimize latency.
- STA vs. OPTIMUS [5]: 2:8 STA-Small performs 1.20× better on average than OPTIMUS, which is a high-performance sparse accelerator for Transformers exploiting unstructured sparsity in weight parameters. When performing sparse MAC operations, OPTIMUS can hardly achieve high MAC utilization due to load imbalance and input load miss. STA can effectively overcome these two problems suffered by OPTIMUS. STA gets rid of load imbalance by arranging each MAC in DMME to perform operations in a balanced *N:M* group. In addition, STA loads a series of input *N:M* groups at each cycle and utilizes these elements multiple times in a systolic manner, which effectively addresses input load miss compared to OPTIMUS.

Finally, we compare STA with other previous FPGA-based works and commercial CPU and GPU products. Table V presents a fair performance comparison without batching on various platforms. Prior FPGA-based works for accelerating Transformers include [10], [29], and [33]. The dedicated accelerator in [10], equipped with a large 2D systolic array for dense operation, is the pioneering work on Transformer accelerators. The accelerator in [29] exploits the speedup potential of block-circulant weight representations. In [33], the proposed accelerator utilizes coarse-grained block-based sparsity to



Fig. 16. Performance and MAC operation breakdown of STA-Small.

speed up Transformer inference. The shallow Transformer used in [29] and [33] is applied as a benchmark for a fair evaluation. The comparison covers the CPU, GPUs, prior cutting-edge FPGA solutions, and STA on various FPGA platforms. We evaluate these designs in terms of latency, throughput, power, energy efficiency, and MAC efficiency, all of which are key metrics for a computing system.

As shown in Table V, STA-Tiny far outperforms the embedded GPU, Jetson Nano, and the high-end CPU, i9-9900X, in all evaluated metrics. STA-Small surpasses the CPU and GPU platforms in all metrics. Moreover, STA-Small is close to [10] and [33] in terms of latency and throughput, while using a relatively small number of MACs compared to them. The energy efficiency and MAC efficiency of STA-Small are also superior to those of all previous FPGA-based works. Compared to previous FPGA solutions, STA-Large achieves $2.00 \sim 19.47 \times$ throughput improvement, $1.26 \sim 16.47 \times$ energy efficiency improvement, and $1.80 \sim 36.00 \times$ MAC efficiency gain. The performance gain of STA comes from optimizations at two levels. At the algorithm level, we carefully exploit the potential of the N:M sparsity pattern, which can significantly reduce the computational cost of Transformer-based models. At the hardware level, STA can efficiently handle N:M sparse parameters, which significantly improves the utilization of computing units. In addition, our deployment framework, taking STA-Tiny, STA-Small, and STA-Large as examples, can realize flexible hardware generation for Transformers. The proposed framework can flexibly and efficiently meet the requirements for deploying Transformer-based models on various FPGA devices.
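The derived metrics in Table V can be reproduced from latency, power, and the number of MAC units, as the sketch below shows for the three STA configurations. The workload of 0.22 GOP per inference is not stated in the text. It is inferred here from the product of throughput and latency in Table V and should be read as an assumption, as should the function and variable names.

```python
OPS_PER_INFERENCE_GOP = 0.22  # assumed: throughput x latency from Table V


def derived_metrics(latency_ms, power_w, mac_units):
    """Return throughput (GOP/s), energy efficiency (GOP/J), MAC efficiency."""
    throughput = OPS_PER_INFERENCE_GOP / (latency_ms / 1000.0)
    energy_efficiency = throughput / power_w
    mac_efficiency = throughput / mac_units
    return throughput, energy_efficiency, mac_efficiency


# (latency in ms, power in W, number of MAC units) taken from Table V.
sta = {
    "STA-Tiny": (2.01, 2.71, 128),
    "STA-Small": (0.42, 9.87, 1024),
    "STA-Large": (0.15, 26.59, 4096),
}
metrics = {name: derived_metrics(*row) for name, row in sta.items()}

# Latency-based speedup of STA-Large over the CPU and the RTX 2080 Ti.
speedup_vs_cpu = 2.17 / 0.15
speedup_vs_2080ti = 1.70 / 0.15
```

Under this assumption the computed values agree with the tabulated throughput, energy efficiency, and MAC efficiency to two decimal places, and the latency ratios give the 14.47× and 11.33× speedups of STA-Large over the i9-9900X and the RTX 2080 Ti.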

### VII. CONCLUSION

In this paper, we present a flexible, agile, and efficient framework for deploying *N:M* sparse Transformers, which benefits from both algorithm and hardware optimizations, making it practical to significantly accelerate Transformer-based models on diverse FPGA devices. At the algorithm level, we propose a sparsity inheritance mechanism and an inherited dynamic pruning (IDP) method to obtain a series of *N:M* sparse Transformers with high accuracy. A further proposed compression scheme greatly reduces the storage requirements of models. At the hardware level, we present a flexible and efficient architecture, namely STA, to accelerate *N:M* sparse Transformers. STA is composed of a computing core, DMME,

TABLE V Comparison of STAs with previous works and commercial products

| Platform                    | i9-9900X | Jetson Nano | RTX 2080 Ti | RTX 3090                   | SOCC'20 [10]            | ISLPED'20 [29]                     | ISQED'21 [33]       | STA-Tiny (our work)     | STA-Small (our work)    | STA-Large (our work)    |
|-----------------------------|----------|-------------|-------------|----------------------------|-------------------------|------------------------------------|---------------------|-------------------------|-------------------------|-------------------------|
| Type                        | CPU      | GPU         | GPU         | GPU                        | FPGA                    | FPGA                               | FPGA                | FPGA                    | FPGA                    | FPGA                    |
| Chip                        | Skylake  | Tegra X1    | TU102       | GA102                      | XCVU13P                 | XCVU9P                             | XCU200              | XC7Z020                 | XC7VX485T               | XCVU13P                 |
| Technology                  | 14 nm    | 20 nm       | 12 nm       | 8 nm                       | 16 nm                   | 16 nm                              | 16 nm               | 28 nm                   | 28 nm                   | 16 nm                   |
| Frequency                   | 3.50 GHz | 640 MHz     | 1.35 GHz    | 1.70 GHz                   | 200 MHz                 | -                                  | -                   | 150 MHz                 | 200 MHz                 | 200 MHz                 |
| Methods                     | -        | -           | -           | 2:4 group-based<br>pruning | Low-bit<br>quantization | Block-circulant<br>matrix with FFT | Block-based pruning | N:M group-based pruning | N:M group-based pruning | N:M group-based pruning |
| # MAC units                 | -        | -           | -           | -                          | 4096                    | $\sim 5647$                        | $\sim 3368$         | 128                     | 1024                    | 4096                    |
| Bit Precision               | FP-32    | FP-32       | FP-32       | FP-32                      | FIX-8                   | FIX-16                             | -                   | FIX-16                  | FIX-16                  | FIX-16                  |
| Test Network                | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer | Shallow Transformer |
| Latency (ms)                | 2.17     | 16.24       | 1.70        | 0.46                       | 0.30                    | 2.94                               | 0.32                | 2.01                    | 0.42                    | 0.15                    |
| Batch-1 Throughput (GOP/s)  | 101.38   | 13.55       | 129.41      | 478.26                     | 733.33                  | 75.34                              | 687.50              | 109.45                  | 523.81                  | 1466.67                 |
| Power (W)                   | 165.00   | 7.56        | 250.00      | 350.00                     | 16.70                   | 22.45                              | -                   | 2.71                    | 9.87                    | 26.59                   |
| Energy Efficiency (GOP/J)   | 0.61     | 1.79        | 0.52        | 1.37                       | 43.91                   | 3.35                               | -                   | 40.39                   | 53.07                   | 55.16                   |
| MAC Efficiency (GOP/s/unit) | -        | -           | -           | -                          | 0.18                    | $\sim 0.01$                        | $\sim 0.20$         | 0.86                    | 0.51                    | 0.36                    |

that unifies both sparse-dense and dense-dense matrix multiplications in N:M sparse Transformers, and a scalable softmax module, which eliminates intermediate off-chip data accesses. The experimental results show that N:M sparse Transformers generated by IDP achieve an average of 6.7% improvement in accuracy over the state-of-the-art methods. The STA implementation significantly outperforms CPU, GPU, and prior FPGA-based Transformer accelerators in terms of latency, throughput, energy efficiency, and MAC efficiency, showing its significant potential in applications using Transformer-based models.
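
As a reading aid for Table V, the derived rows can be reproduced from the raw ones. The sketch below is not taken from the paper's tooling: the formulas (energy efficiency as throughput divided by power, MAC efficiency as throughput divided by the number of MAC units, and speedup as a ratio of latencies) are assumptions inferred from the units in the table, and the function names are hypothetical. The numeric inputs are copied from Table V.

```python
# Raw values copied from Table V. The relations between rows are assumptions
# inferred from the units (GOP/s, W, GOP/J, GOP/s/unit), not stated formulas.
ROWS = {
    # name: (latency_ms, throughput_gops, power_w, mac_units)
    "i9-9900X":    (2.17, 101.38, 165.00, None),
    "RTX 2080 Ti": (1.70, 129.41, 250.00, None),
    "SOCC'20":     (0.30, 733.33, 16.70, 4096),
    "STA-Tiny":    (2.01, 109.45, 2.71, 128),
    "STA-Small":   (0.42, 523.81, 9.87, 1024),
    "STA-Large":   (0.15, 1466.67, 26.59, 4096),
}


def energy_efficiency(throughput_gops, power_w):
    """Energy efficiency in GOP/J, assumed to be (GOP/s) / W."""
    return throughput_gops / power_w


def mac_efficiency(throughput_gops, mac_units):
    """MAC efficiency in GOP/s per MAC unit."""
    return throughput_gops / mac_units


def workload_gop(latency_ms, throughput_gops):
    """Operations per inference implied by latency x throughput, in GOP."""
    return latency_ms * 1e-3 * throughput_gops


def speedup(baseline, target):
    """Latency ratio baseline / target (larger means target is faster)."""
    return ROWS[baseline][0] / ROWS[target][0]


if __name__ == "__main__":
    for name, (lat, thr, pwr, macs) in ROWS.items():
        line = (f"{name:12s} workload={workload_gop(lat, thr):.3f} GOP  "
                f"energy_eff={energy_efficiency(thr, pwr):.2f} GOP/J")
        if macs is not None:
            line += f"  mac_eff={mac_efficiency(thr, macs):.2f} GOP/s/unit"
        print(line)
    print(f"STA-Large vs i9-9900X:    {speedup('i9-9900X', 'STA-Large'):.2f}x")
    print(f"STA-Large vs RTX 2080 Ti: {speedup('RTX 2080 Ti', 'STA-Large'):.2f}x")
```

Under these assumptions, every platform in the sketch implies roughly the same workload of about 0.22 GOP per inference, which is consistent with all entries running the same shallow Transformer, and the two latency ratios match the 14.47× and 11.33× speedups quoted in the abstract.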

#### ACKNOWLEDGMENT

We would like to sincerely thank our reviewers for their valuable feedback.

#### REFERENCES

- [1] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, "Efficient Transformers: A Survey," arXiv preprint arXiv:2009.06732, 2020.
- [2] K. Song, K. Wang, H. Yu, Y. Zhang, Z. Huang, W. Luo, X. Duan, and M. Zhang, "Alignment-Enhanced Transformer for Constraining NMT with Pre-specified Translations," in *Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)*, vol. 34, no. 05, 2020, pp. 8886–8893.
- [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 4171–4186.
- [4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in *International Conference on Learning Representations (ICLR)*, 2021.
- [5] J. Park, H. Yoon, D. Ahn, J. Choi, and J.-J. Kim, "OPTIMUS: OPTImized Matrix MUltiplication Structure for Transformer Neural Network Accelerator," in *Proceedings of Machine Learning and Systems (MLSys)*, 2020.
- [6] T. Tambe, C. Hooper, L. Pentecost, T. Jia, E.-Y. Yang, M. Donato, V. Sanh, P. N. Whatmough, A. M. Rush, D. Brooks, and G.-Y. Wei, "EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference," in *Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*, 2021.
- [7] A. Zhou, Y. Ma, J. Zhu, J. Liu, Z. Zhang, K. Yuan, W. Sun, and H. Li, "Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch," in *International Conference on Learning Representations (ICLR)*, 2021.
- [8] W. Sun, A. Zhou, S. Stuijk, R. Wijnhoven, A. O. Nelson, H. Corporaal et al., "DominoSearch: Find Layer-wise Fine-grained N:M Sparse Schemes from Dense Neural Networks," Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
- [9] A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, "Accelerating Sparse Deep Neural Networks," arXiv preprint arXiv:2104.08378, 2021.
- [10] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, "Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer," in 2020 IEEE 33rd International System-on-Chip Conference (SOCC), 2020.
- [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," in *Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS)*, 2017.
- [12] D. Wu, X. Fan, W. Cao, and L. Wang, "SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 29, no. 5, pp. 936–949, 2021.
- [13] S. Colleman and M. Verhelst, "High-Utilization, High-Flexibility Depth-First CNN Coprocessor for Image Pixel Processing on FPGA," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 29, no. 3, pp. 461–471, 2021.
- [14] H. E. Yantır, A. M. Eltawil, and K. N. Salama, "IMCA: An Efficient In-Memory Convolution Accelerator," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 29, no. 3, pp. 447–460, 2021.
- [15] S. Yin, Z. Jiang, M. Kim, T. Gupta, M. Seok, and J.-S. Seo, "Vesti: Energy-Efficient In-Memory Computing Accelerator for Deep Neural Networks," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 1, pp. 48–61, 2019.
- [16] G. Paulin, R. Andri, F. Conti, and L. Benini, "RNN-Based Radio Resource Management on Multicore RISC-V Accelerator Architectures," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 29, no. 9, pp. 1624–1637, 2021.
- [17] C. Fang, L. He, H. Wang, J. Wei, and Z. Wang, "Accelerating 3D Convolutional Neural Networks Using 3D Fast Fourier Transform," in 2021 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2021, pp. 1–5.
- [18] Y. Yu, T. Zhao, M. Wang, K. Wang, and L. He, "Uni-OPU: An FPGA-based Uniform Accelerator for Convolutional and Transposed Convolutional Networks," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 7, pp. 1545–1556, 2020.
- [19] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, "An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 9, pp. 1953–1965, 2020.
- [20] A. A. Moreno, J. Olivito, J. Resano, and H. Mecha, "Analysis of a Pipelined Architecture for Sparse DNNs on Embedded Systems," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 9, pp. 1993–2003, 2020.
- [21] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, "High-Performance FPGA-based CNN Accelerator with Block-Floating-Point Arithmetic," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 27, no. 8, pp. 1874–1885, 2019.
- [22] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, "High-Performance CNN Accelerator on FPGA using Unified Winograd-GEMM Architecture," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 27, no. 12, pp. 2816–2828, 2019.
- [23] X. Xie, J. Lin, Z. Wang, and J. Wei, "An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks," *IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I)*, vol. 68, no. 7, pp. 2936–2949, 2021.
- [24] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 1–13.
- [25] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 367–379.
- [26] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An Instruction Set Architecture for Neural Networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 393–405.
- [27] Z.-G. Liu, P. N. Whatmough, and M. Mattina, "Systolic Tensor Array: An Efficient Structured-sparse GEMM Accelerator for Mobile CNN Inference," *IEEE Computer Architecture Letters*, vol. 19, no. 1, pp. 34–37, 2020.
- [28] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks," in *Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA)*, 2017, pp. 27–40.
- [29] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding, "FTRANS: Energy-Efficient Acceleration of Transformers using FPGA," in *Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)*, 2020.
- [30] T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H. Park, S. Lee, K. Park, J. W. Lee *et al.*, "A<sup>3</sup>: Accelerating Attention Mechanisms in Neural Networks with Approximation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.
- [31] H. Wang, Z. Zhang, and S. Han, "SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021.
- [32] L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, and Y. Liang, "Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture," in *Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*, 2021.
- [33] H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, and C. Ding, "Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning," in 2021 22nd International Symposium on Quality Electronic Design (ISQED), 2021.
- [34] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, and M. Jaggi, "Dynamic model pruning with feedback," arXiv preprint arXiv:2006.07253, 2020.
- [35] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, "Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks," *Journal of Machine Learning Research (JMLR)*, vol. 22, no. 241, pp. 1–124, 2021.
- [36] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Cambricon-X: An accelerator for sparse neural networks," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.
- [37] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," in *Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP*, 2018, pp. 353–355.
- [38] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, "TinyBERT: Distilling BERT for Natural Language Understanding," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): Findings, 2020, pp. 4163–4174.
- [39] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging Properties in Self-Supervised Vision Transformers," in *Proceedings of the International Conference on Computer Vision (ICCV)*, 2021.
- [40] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Huggingface's Transformers: State-of-the-art Natural Language Processing," arXiv preprint arXiv:1910.03771, 2019.
- [41] M. B. Taylor, "Basejump STL: SystemVerilog Needs a Standard Template Library for Hardware Design," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
- [42] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, A. Capotondi, P. Flatresse, and L. Benini, "PULP: A Parallel Ultra Low Power Platform for Next Generation IoT Applications," in 2015 IEEE Hot Chips 27 Symposium (HCS). IEEE, 2015, pp. 1–39.