## Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs

Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai and Jing-Yang Jou

Department of Electronics Engineering, National Chiao Tung University, Taiwan

Email: hkkuo[at]ee.eda.nctu.edu.tw

**ASP-DAC 2013** 



## Outline

- Introduction
- GPGPU Background
- Motivational Examples
- Cache Capacity Aware Thread Scheduling
- Experimental Results
- Conclusions

## Introduction – GPGPU

## General Purpose Graphics Processing Unit

- An accelerator for general computing
- Numerous computing cores (> 512 cores/chip)
- Throughput-oriented







Techniques to alleviate memory bottleneck

- Memory Coalescing
- On-chip Shared Cache

Source: Nvidia, http://www.nvidia.com

## Introduction – Alleviate Memory Bottleneck

#### Memory Coalescing

- Combine several narrow accesses into a single wide one
- Effective and widely used in regular applications
  - Fast Fourier Transform (FFT) and Matrix Multiplications

#### On-chip Shared Cache

- Shared among several computing cores
- Automatically exploit data reuse

#### However, in Irregular Applications

Lack of coordinated memory access (Non-Coalescing)

Numerous threads with limited cache capacity (Cache Contention)
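The contrast between coalesced and non-coalesced access can be illustrated with a small Python sketch that counts how many aligned memory segments one warp touches. The 128-byte segment size, the 4-byte word size, and the scattered index pattern are assumptions chosen for illustration, not figures from the slides.

```python
def count_transactions(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp's accesses.

    Each distinct segment stands for one wide memory transaction
    (segment_bytes=128 is an assumed value).
    """
    return len({addr // segment_bytes for addr in addresses})


# Regular pattern: 32 threads read consecutive 4-byte words.
regular = [4 * t for t in range(32)]

# Irregular pattern: 32 threads read scattered words (hypothetical indices).
irregular = [4 * ((t * 37) % 1024) for t in range(32)]

regular_txn = count_transactions(regular)
irregular_txn = count_transactions(irregular)

print("regular accesses   ->", regular_txn, "transaction(s)")
print("irregular accesses ->", irregular_txn, "transaction(s)")
```

The regular pattern collapses into a single wide access, while the scattered pattern needs many separate ones, which is the non-coalescing behavior that irregular applications exhibit.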

## Introduction – Cache Contention

#### Cache Contention

- Happens when the cache capacity is insufficient for all the concurrent threads
- Example :



## **Introduction – Previous Studies**

#### Previous studies

#### Deng, et al. (ICCAD'09)

Scratch-pad memory to enhance coalescing

#### **Zhang, et al. (ASPLOS'11)**

Data and computation reordering to improve coalescing

#### Kuo, et al. (ASPDAC'12)

Thread clustering to enhance coalescing and mitigate cache contention

#### Without considering the Cache Capacity

Cannot fully resolve the Cache Contention issue

Y. Deng, et al., "Taming Irregular EDA Applications on GPUs," in *ICCAD*, 2009

E. Z. Zhang, et al., "On-the-Fly Elimination of Dynamic Irregularities for GPU Computing," in *ASPLOS*, 2011

H.-K. Kuo, et al., "Thread Affinity Mapping for Irregular Data Access on Shared Cache GPGPU," in *ASPDAC*, 2012

## Introduction – Contributions

#### This paper

Formulate a general thread scheduling problem on GPGPUs

**Cache Capacity Aware Thread Scheduling Problem** 

Carry out a comprehensive analysis on the variants of the problem

Nvidia's Fermi architecture is modeled as a special variant

Propose thread scheduling algorithms for different variants

- An average reduction of 44.7% in cache misses
- An average runtime improvement of 28.5%

## GPGPU Background – Programming Model

Nvidia's CUDA Programming Model

- Cooperative Thread Array (CTA)
  - A collection of threads

Kernel

A collection of CTAs



Source: Nvidia, http://www.nvidia.com

## GPGPU Background – GPGPU Architecture

- Nvidia's Fermi GPGPU Architecture
  - Streaming Multiprocessor (SM)
  - Unified L2 Cache
  - GigaThread Scheduler
    - Fixed number of concurrent CTAs
- This paper
  - Consider **re-configuring** the number of concurrent CTAs
    - Need synchronizations



#### **Unified L2 Cache**

Source: Nvidia, http://www.nvidia.com

## **Motivational Examples – Example 1**

## Assume that

- A collection of CTAs = {A, B, C, D, E, F, G, H, I, J, K, L}
- Working set sizes = {1, 8, 3, 1, 2, 2, 1, 7, 4, 4, 2, 5}
- Cache capacity = 10
- Maximum number of concurrent CTAs = 4

## Example 1

**Example 1 : Cache Capacity Agnostic Scheduling**

| Scheduling Steps | Concurrent CTAs | Cache Contention Evaluation          |
|------------------|-----------------|--------------------------------------|
| Step1            | A, B, C, D      | 1 + 8 + 3 + 1 = 13 > 10 (Contention) |
| Step2            | E, F, G, H      | 2 + 2 + 1 + 7 = 12 > 10 (Contention) |
| Step3            | I, J, K, L      | 4 + 4 + 2 + 5 = 15 > 10 (Contention) |
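The evaluation in Example 1 can be reproduced with a short Python sketch. The function and variable names are hypothetical; the CTA names, working set sizes, cache capacity, and concurrency limit are the ones assumed on this slide.

```python
# Working set sizes of CTAs A..L, as assumed in the motivational example.
WORKING_SET = dict(zip("ABCDEFGHIJKL", [1, 8, 3, 1, 2, 2, 1, 7, 4, 4, 2, 5]))
CACHE_CAPACITY = 10
MAX_CONCURRENT = 4


def agnostic_schedule(ctas, max_concurrent):
    """Group CTAs in their given order, ignoring working set sizes."""
    ctas = list(ctas)
    return [ctas[i:i + max_concurrent]
            for i in range(0, len(ctas), max_concurrent)]


def footprint(step):
    """Aggregate working set size of the CTAs running in one step."""
    return sum(WORKING_SET[c] for c in step)


steps = agnostic_schedule(WORKING_SET, MAX_CONCURRENT)
report = [(step, footprint(step), footprint(step) > CACHE_CAPACITY)
          for step in steps]

for number, (step, total, contended) in enumerate(report, start=1):
    verdict = "Contention" if contended else "Contention free"
    print(f"Step{number}: {', '.join(step)} -> {total} ({verdict})")
```

Every step exceeds the capacity of 10, so the capacity-agnostic order suffers contention in all three steps.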


## **Motivational Examples – Example 2**

#### Example 2

Too restrictive to schedule more concurrent CTAs

**Example 2 : Cache Capacity Aware Scheduling with Fixed Number of Concurrent CTAs**

| Scheduling Steps | Concurrent CTAs | Cache Contention Evaluation           |
|------------------|-----------------|---------------------------------------|
| Step1            | B, E            | $8 + 2 = 10 \le 10$ (Contention free) |
| Step2            | C, H            | $3 + 7 = 10 \le 10$ (Contention free) |
| Step3            | L, J            | $5 + 4 = 9 \le 10$ (Contention free)  |
| Step4            | F, I            | $2 + 4 = 6 \le 10$ (Contention free)  |
| Step5            | A, K            | $1 + 2 = 3 \le 10$ (Contention free)  |
| Step6            | D, G            | $1 + 1 = 2 \le 10$ (Contention free)  |

## **Motivational Examples – Example 3**

#### Example 3

Should also consider the synchronization cost

**Example 3 : Cache Capacity Aware Scheduling with Reconfigurable Number of Concurrent CTAs**

| Scheduling Steps | Concurrent CTAs | Cache Contention Evaluation                   |
|------------------|-----------------|-----------------------------------------------|
| Step1            | B, E            | $8 + 2 = 10 \le 10$ (Contention free)         |
| Step2            | C, H            | $3 + 7 = 10 \le 10$ (Contention free)         |
| Synchronize and re-configure the number of concurrent CTAs | | |
| Step3            | L, K, F, A      | $5 + 2 + 2 + 1 = 10 \le 10$ (Contention free) |
| Step4            | J, I, D, G      | $4 + 4 + 1 + 1 = 10 \le 10$ (Contention free) |

## Cache Capacity Aware Thread Scheduling – Problem Formulation (1/4)

#### Input

- $c^n$ : a collection of CTAs
  - $c^n = \{c_1, c_2, \cdots, c_n\}$
  - $w(c_i)$ : working set size of the CTA $c_i$

## Output

- $s^m$ : a schedule of CTAs (a series of scheduling steps)
  - $s^m = \{s_1, s_2, \cdots, s_m\}$
  - Each scheduling step $s_i$ is a subset of $c^n$
  - $conc(s_i)$ : concurrency of the scheduling step $s_i$
    - Number of CTAs belonging to $s_i$

## Cache Capacity Aware Thread Scheduling – Problem Formulation (2/4)

- Constraint (**Cache Capacity**)
  - $\forall s_i: \sum_{c_j \in s_i} w(c_j) \le Cap\_unified\_L2$
- Cost Function
  - $m + sync\_cost(s^m)$ : overall cost of the schedule $s^m$
  - $m$ : total number of scheduling steps
  - $sync\_cost(s^m)$ : total synchronization cost
    - $sync\_cost(s^m) = cps \times \sum_{i=1}^{m-1} sync(s_i, s_{i+1})$
  - $sync(s_i, s_{i+1})$ : necessity of synchronization
    - $sync(s_i, s_{i+1}) = \begin{cases} 0, \ conc(s_i) = conc(s_{i+1}) \\ 1, \ conc(s_i) \neq conc(s_{i+1}) \end{cases}$
  - **cps** : cost per synchronization
    - $cps \in \mathbb{R}, 0 < cps \leq 1$
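As a minimal sketch, the cost function can be evaluated in Python for the two capacity-aware schedules of the motivational examples. The function names and the value `cps = 0.5` are assumptions for illustration; the slides only require $0 < cps \le 1$.

```python
def conc(step):
    """Concurrency of a scheduling step: the number of CTAs it contains."""
    return len(step)


def sync_cost(schedule, cps):
    """cps times the number of adjacent steps whose concurrency differs."""
    changes = sum(1 for a, b in zip(schedule, schedule[1:])
                  if conc(a) != conc(b))
    return cps * changes


def overall_cost(schedule, cps):
    """Number of scheduling steps m plus the total synchronization cost."""
    return len(schedule) + sync_cost(schedule, cps)


# Fixed concurrency of 2 (Example 2).
fixed = [["B", "E"], ["C", "H"], ["L", "J"], ["F", "I"], ["A", "K"], ["D", "G"]]

# Concurrency re-configured from 2 to 4 (Example 3). Step 3 is taken as
# L, K, F, A, the grouping whose sizes are 5 + 2 + 2 + 1.
variable = [["B", "E"], ["C", "H"], ["L", "K", "F", "A"], ["J", "I", "D", "G"]]

cps = 0.5  # assumed cost per synchronization
fixed_cost = overall_cost(fixed, cps)
variable_cost = overall_cost(variable, cps)
print(fixed_cost, variable_cost)
```

With this assumed `cps`, the six-step fixed schedule pays no synchronization, while the four-step variable schedule pays for one re-configuration and still ends up cheaper overall.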

## Cache Capacity Aware Thread Scheduling – Problem Formulation (3/4)

#### Problem Definition

Cache Capacity Aware Thread Scheduling Problem : Given a collection of CTAs  $c^n$  with working set size  $w(c_i)$ , the problem is to find a schedule  $s^m$ where the overall cost is minimized subject to cache capacity constraint:

$$\text{minimize} \quad m + sync\_cost(s^{m})$$

$$\text{subject to} \quad \forall s_{i}: \sum_{c_{j} \in s_{i}} w(c_{j}) \leq Cap\_unified\_L2$$

$$\forall s_{i} \neq s_{j}: s_{i} \cap s_{j} = \emptyset$$

$$s_{1} \cup s_{2} \cup \cdots \cup s_{m} = c^{n}$$
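The three constraints can be checked mechanically. The following Python sketch uses hypothetical names and the CTA sizes from the motivational examples; it tests whether a candidate schedule respects the cache capacity, keeps the steps disjoint, and covers every CTA.

```python
def is_feasible(schedule, working_set, capacity):
    """Check the capacity, disjointness, and coverage constraints."""
    scheduled = [c for step in schedule for c in step]
    # Every step must fit into the unified L2 cache.
    fits = all(sum(working_set[c] for c in step) <= capacity
               for step in schedule)
    # No CTA may appear in more than one step.
    disjoint = len(scheduled) == len(set(scheduled))
    # The union of all steps must equal the CTA collection.
    complete = set(scheduled) == set(working_set)
    return fits and disjoint and complete


working_set = dict(zip("ABCDEFGHIJKL", [1, 8, 3, 1, 2, 2, 1, 7, 4, 4, 2, 5]))

# Capacity-agnostic grouping in the given order.
agnostic = [list("ABCD"), list("EFGH"), list("IJKL")]

# Capacity-aware grouping with re-configured concurrency; step 3 is taken
# as L, K, F, A, the grouping whose sizes are 5 + 2 + 2 + 1.
aware = [list("BE"), list("CH"), list("LKFA"), list("JIDG")]

agnostic_ok = is_feasible(agnostic, working_set, capacity=10)
aware_ok = is_feasible(aware, working_set, capacity=10)
print(agnostic_ok, aware_ok)
```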

## Cache Capacity Aware Thread Scheduling – Problem Formulation (4/4)

#### NP-hardness

Lemma 1 : The Cache Capacity Aware Thread Scheduling Problem is NP-hard

Proof : The NP-hard Bin Packing Problem can be reduced to this problem

#### If P ≠ NP

- No optimal algorithm in polynomial time
- Acceptable quality in polynomial time
  - Approximation algorithms

## Cache Capacity Aware Thread Scheduling – Fixed Concurrency (1/2)

#### Fixed Concurrency Constraint

$\forall s_i \neq s_j: conc(s_i) = conc(s_j)$

Implies no synchronization cost

Reduces to the k-Cardinality Bin Packing Problem

## k-Cardinality Bin Packing Problem

- Given : a set of items $a_1, a_2, \dots, a_n$, each with size $s(a_i)$, and the bin capacity *cap*
- Result : a division of the items into a minimum number of bins
- Constraints : each bin contains at most k items and its aggregated size cannot exceed the capacity *cap*
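To make the problem concrete, here is a minimal Python sketch of a first-fit decreasing heuristic with a cardinality limit. It is an illustrative stand-in and not one of the algorithms named in the slides; the function name `k_cardinality_ffd` is hypothetical, and the item sizes are borrowed from the motivational examples.

```python
def k_cardinality_ffd(sizes, cap, k):
    """Pack items into bins of capacity cap holding at most k items each.

    sizes maps item name -> size. Items are placed largest first into the
    first bin that has room in both capacity and cardinality.
    """
    if any(size > cap for size in sizes.values()):
        raise ValueError("an item is larger than the bin capacity")
    bins = []  # each bin is a dict with its current load and item list
    for item in sorted(sizes, key=sizes.get, reverse=True):
        for b in bins:
            if len(b["items"]) < k and b["load"] + sizes[item] <= cap:
                b["items"].append(item)
                b["load"] += sizes[item]
                break
        else:
            # No existing bin can take the item, so open a new one.
            bins.append({"items": [item], "load": sizes[item]})
    return [b["items"] for b in bins]


sizes = dict(zip("ABCDEFGHIJKL", [1, 8, 3, 1, 2, 2, 1, 7, 4, 4, 2, 5]))
bins_k2 = k_cardinality_ffd(sizes, cap=10, k=2)
bins_k4 = k_cardinality_ffd(sizes, cap=10, k=4)
print(len(bins_k2), bins_k2)
print(len(bins_k4), bins_k4)
```

Each returned bin corresponds to one scheduling step, the bin capacity to the unified L2 capacity, and k to the concurrency limit.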

## **Cache Capacity Aware Thread Scheduling** – Fixed Concurrency (2/2)

- k-Cardinality Bin Packing Algorithms
  - Largest Memory First (LMF) and Iterated Worst-Fit Decreasing (IWFD)

Constant approximation ratio

**Algorithm 1 : Thread Scheduling for Fixed Concurrency**

1. $k \leftarrow$ maximum possible concurrency
2. sort $c^n$ in descending order by working set size
3. **repeat**
4. $cap \leftarrow w(c_1) + w(c_2) + \dots + w(c_k)$
5. $k \leftarrow k - 1$
6. **until** $cap \leq Cap\_unified\_L2$
7. $cap \leftarrow Cap\_unified\_L2$
8. $s^m \leftarrow$ K-CARDINALITY-BIN-PACKING$(c^n, cap, k)$
9. **return** $s^m$

M. R. Garey, et al., "Worst-Case Analysis of Memory Allocation Algorithms," in ACM Symp. Theory of Computing, 1972

K. L. Krause, et al., "Analysis of Several Task-Scheduling Algorithms for a Model of Multiprogramming Computer Systems," J. ACM, vol. 22, pp. 522-550, 1975

## Cache Capacity Aware Thread Scheduling – Variable Concurrency (1/2)

#### **Cost Function:** $m + sync\_cost(s^m)$

Trade-off between the number of scheduling steps ($m$) and synchronization cost ($sync\_cost(s^m)$)

#### Interesting Findings

Lemma 2 : For any schedule $s^m$, the overall cost, $m + sync\_cost(s^m)$, is less than or equal to $2m - 1$

Lemma 3 : For any schedule  $s^m$ , the synchronization cost is minimum if the scheduling steps are sorted by the concurrency ( $conc(s_i)$ )
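Both lemmas can be checked on a small case. The Python sketch below uses a hypothetical sequence of step concurrencies and compares the number of synchronizations before and after sorting against an exhaustive search over all orderings. For Lemma 2 it relies on the observation that a schedule with $m$ steps has at most $m - 1$ changes of concurrency and that $cps \le 1$.

```python
from itertools import permutations


def sync_count(concurrencies):
    """Number of adjacent step pairs whose concurrency differs."""
    return sum(1 for a, b in zip(concurrencies, concurrencies[1:]) if a != b)


# Hypothetical concurrencies of five scheduling steps, in an arbitrary order.
concs = [2, 4, 2, 4, 3]

# Lemma 3: sorting by concurrency groups equal values together.
ordered = sorted(concs, reverse=True)

# Exhaustive minimum over every ordering of the same steps.
best_by_search = min(sync_count(p) for p in permutations(concs))

# Lemma 2: even with cps = 1 the cost is at most m + (m - 1).
m = len(concs)
worst_cost = m + 1.0 * sync_count(concs)

print(sync_count(concs), sync_count(ordered), best_by_search, worst_cost)
```

After sorting, the number of synchronizations equals the number of distinct concurrency values minus one, which is the smallest count the exhaustive search finds for this input.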

## Cache Capacity Aware Thread Scheduling – Variable Concurrency (2/2)

## Algorithm Design

Lemma 2 $\rightarrow$ Minimize the number of steps ($m$)

Lemma 3 $\rightarrow$ Minimize sync. cost ($sync\_cost(s^m)$)

**Algorithm 2 : Thread Scheduling for Variable Concurrency**

1. $k \leftarrow$ maximum possible concurrency
2. $cap \leftarrow Cap\_unified\_L2$
3. $s^{m} \leftarrow$ K-CARDINALITY-BIN-PACKING$(c^{n}, cap, k)$ (Lemma 2)
4. sort $s^{m}$ by concurrency to minimize synchronization cost (Lemma 3)
5. $old\_cost \leftarrow m + sync\_cost(s^{m})$
6. **repeat**
7. $k \leftarrow k - 1$
8. $s^{m'} \leftarrow$ K-CARDINALITY-BIN-PACKING$(c^{n}, cap, k)$
9. sort $s^{m'}$ by concurrency to minimize synchronization cost
10. $new\_cost \leftarrow m' + sync\_cost(s^{m'})$
11. **until** $new\_cost \ge old\_cost$
12. **return** $s^{m}$
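The overall flow of Algorithm 2 can be sketched in Python under several assumptions that are not stated in the slides: the packing routine is a simple first-fit decreasing heuristic standing in for K-CARDINALITY-BIN-PACKING, the loop keeps the cheapest schedule seen so far, and it stops once lowering $k$ no longer reduces the cost or $k$ reaches 1. All function names are hypothetical.

```python
def pack(sizes, cap, k):
    """First-fit decreasing with at most k items per bin (assumed stand-in)."""
    bins = []
    for item in sorted(sizes, key=sizes.get, reverse=True):
        for items in bins:
            load = sum(sizes[c] for c in items)
            if len(items) < k and load + sizes[item] <= cap:
                items.append(item)
                break
        else:
            bins.append([item])
    return bins


def cost(schedule, cps):
    """Number of steps plus cps per change of concurrency between steps."""
    syncs = sum(1 for a, b in zip(schedule, schedule[1:]) if len(a) != len(b))
    return len(schedule) + cps * syncs


def schedule_variable(sizes, cap, k_max, cps):
    """Try decreasing cardinality limits and keep the cheapest schedule."""
    def candidate(k):
        steps = pack(sizes, cap, k)
        steps.sort(key=len, reverse=True)  # Lemma 3: order by concurrency
        return steps, cost(steps, cps)

    k = k_max
    best, best_cost = candidate(k)
    while k > 1:
        k -= 1
        steps, new_cost = candidate(k)
        if new_cost >= best_cost:  # no improvement: stop searching
            break
        best, best_cost = steps, new_cost
    return best, best_cost


sizes = dict(zip("ABCDEFGHIJKL", [1, 8, 3, 1, 2, 2, 1, 7, 4, 4, 2, 5]))
best, best_cost = schedule_variable(sizes, cap=10, k_max=4, cps=0.5)
print(best_cost, best)
```

A greedy packing like this one is not guaranteed to find the four-step schedule of Example 3; the sketch only illustrates how packing, sorting by concurrency, and cost comparison fit together.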

## Experimental Results – Experiment Setup (1/2)

#### GPGPU-Sim (ISPASS'09) Simulation Setup

**Fermi's Architectural Configurations in GPGPU-Sim**

| Parameter               | Configuration |
|-------------------------|---------------|
| Number of SMs           | 15 |
| SM configuration        | 32-wide pipeline, 32 threads/warp, 1536 threads/SM, 32768 registers/SM, **number of CTAs/SM (dynamically reconfigurable, default 8)** |
| L2 cache                | unified 768KB, 8-way, 64 byte/block |
| DRAM                    | 6 GDDR5 channels, 2 chips/channel, 16 banks, 16 entries/chip, FR-FCFS policy |
| Interconnection network | single stage butterfly, 32-byte flit size |

- Thread clustering for CTA generation: Kuo, et al. (ASPDAC'12)
- Ocelot for working set size analysis: Ocelot (PACT'10)

A. Bakhoda, et al., "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in *ISPASS*, 2009

H.-K. Kuo, et al., "Thread Affinity Mapping for Irregular Data Access on Shared Cache GPGPU," in *ASPDAC*, 2012

G. F. Diamos, et al., "Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems," in *PACT*, 2010

## Experimental Results – Experiment Setup (2/2)

## Application Domains

**Irregular Massive Parallel Applications**

| Applications | Fields                             | Descriptions                                   | Sources     | Data set sizes |
|--------------|------------------------------------|------------------------------------------------|-------------|----------------|
| bfs          | Electronic Design Automation (EDA) | breadth first search                           | Kuo, et al. | 2.6 MB         |
| sta          | Electronic Design Automation (EDA) | static timing analysis                         | Kuo, et al. | 3.0 MB         |
| gsim         | Electronic Design Automation (EDA) | gate level logic simulation                    | Kuo, et al. | 3.5 MB         |
| nbf          | Molecular Dynamics (MD)            | kernel abstracted from the GROMOS code         | Cosmic      | 6.3 MB         |
| moldyn       | Molecular Dynamics (MD)            | force calculation in the CHARMM program        | Cosmic      | 10.2 MB        |
| irreg        | Computational Fluid Dynamics (CFD) | kernel of Partial Differential Equation solver | Cosmic      | 6.3 MB         |
| euler        | Computational Fluid Dynamics (CFD) | finite-difference approximations on mesh       | Chaos       | 8.5 MB         |
| unstructured | Computational Fluid Dynamics (CFD) | fluid dynamics with unstructured mesh          | Chaos       | 10.2 MB        |
H.-K. Kuo, et al., "Thread Affinity Mapping for Irregular Data Access on Shared Cache GPGPU," in ASPDAC, 2012

H. Han, et al., "Exploiting Locality for Irregular Scientific Codes," IEEE Trans. Parallel and Distributed Systems, vol. 17, pp. 606-618, 2006

R. Das, et al., "Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures," J. Parallel Distrib. Comput., vol. 22, pp. 462-478, 1994.


## Experimental Results – Cache Misses Reduction

sche\_agnostic, sche\_fixed and sche\_variable

cps : low (50 cycles), medium (100 cycles) and high (200 cycles)



W.-C. Feng, et al., "To GPU Synchronize or not GPU Synchronize?," in *ISCAS*, 2010

## Experimental Results – Execution Time Improvement

## sche\_fixed

Too restrictive to schedule more concurrent CTAs (moldyn and unstructured)



## Conclusions

#### **This paper**

Formulate a general thread scheduling problem, Cache Capacity Aware Thread Scheduling Problem

Not only prove the NP-hardness, but also propose two thread scheduling algorithms

Achieve an average of

- 44.7% reduction in cache misses
- 28.5% runtime improvement

Up to 62.5% for applications with more threads and higher complexity

## THANK YOU FOR YOUR ATTENTION

WE WELCOME YOUR QUESTIONS, COMMENTS AND SUGGESTIONS

#### Hsien-Kai Kuo hkkuo[at]ee.eda.nctu.edu.tw
