# Balance Principles for Algorithm-Architecture Co-design

Kent Czechowski, Casey Battaglino, Chris McClanahan, Aparna Chandramowlishwaran, Richard Vuduc (Georgia Tech)

May 31, 2011


#### Hardware/Software Co-design? Algorithm-Architecture Co-design?

## Position

**Position**: Principles (i.e., "theory") informing practice (co-design).

For some computation to scale efficiently on a future parallel processor:

- 1. Allocation of cores?
- 2. Allocation of cache?
- 3. How must latency/bandwidth increase to compensate?

Or alternatively, given a particular parallel architecture, what classes of computations will perform efficiently?

## Why theoretical models?

The best alternative (and perhaps the "status quo") in co-design is to put together a model of your chip and simulate your algorithm.

Very accurate, but by this point you've already invested lots of time and effort into a specific design.

We advocate a more principled approach that models the performance of a processor from a few of its highest-level characteristics, those known to be the main bottlenecks (communication, parallel scalability).

Such a model can be refined and extended as needed, e.g., based on cache characteristics or heterogeneity of the cores.

### Balance

We define balance as:

For some algorithm: $T_{mem} \leq T_{comp}$ <sup>1</sup>

For principled analysis, we need theoretical models for $T_{mem}$ and $T_{comp}$. To be relevant for current/future processors, these models must integrate:

- 1. Parallelism
- 2. Cache/Memory Locality

<sup>1</sup>Similar to classical notions of balance: [Kung 1986], [Callahan, et al 1988], [McCalpin 1995]


## Why Balance?

Importance of considering balance:

- 1. Inevitable trend towards imbalance: peak flops outpacing memory hierarchy.
- 2. Imbalance may be nonintuitive: for a particular algorithm, one can improve some aspect of a chip without realizing that other areas must also improve to compensate.

Balance is a particularly powerful lens for maintaining more *realistic* expectations for performance. Processor makers present raw performance figures such as peak flops and memory specs, which are very one-dimensional on their own (e.g., the CPU vs. GPU wars).

Balance marries the two in a way that also lets parallel scalability enter the picture, and recognizes that not all architectures are suitable for all applications.

## Assumptions

For our particular "principled" approach we use two models:

- $T_{mem}$: External Memory Model (I/O Model)
- $T_{comp}$: Parallel DAG Model / Work-Depth Model

For these models alone to be expressive, we make the following assumptions:

- 1. We are modeling work on a single socket. $n$ is large enough to not fit completely in the outer level of cache.
- 2. For our algorithm, we can easily deduce the structure of a dependency DAG for any $n$.
- 3. The developer can overlap computation and communication arbitrarily well.
- 4. Communication costs are dominated by misses between cache and RAM ($\therefore T_{comm} \propto$ cache misses $= Q(n)$).

## Parallel DAG Model for $T_{comp}$

$$(T_{mem} \leq T_{comp})$$



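Inherent parallelism: $\frac{W(n)}{D(n)}$, where $W(n)$ is the total work and $D(n)$ is the depth (critical-path length) of the DAG. This ratio spans the spectrum between embarrassingly parallel and inherently sequential (application: CPA).

<u>Desired:</u> work optimality, maximum parallelism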

<sup>2</sup>Source: Blelloch: Parallel Algorithms

Brent's Theorem [1974] maps the DAG model to the PRAM model:

$$T_p(n) = O(D(n) + \frac{W(n)}{p})$$



We model $T_{comp}$ with:

$$T_{comp}(n; p, C_0) = (D(n) + \frac{W(n)}{p}) \cdot \frac{1}{C_0}$$



Here $p$ is the number of cores and $C_0$ is the number of cycles per second. This gives a lower bound on compute time that an optimally crafted algorithm could theoretically achieve.
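The compute-time model above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `t_comp` and the reduction-tree example are hypothetical, and the work and depth values are assumed exact counts with no hidden constants.

```python
def t_comp(work, depth, p, c0):
    """Modeled compute time in seconds: (D(n) + W(n)/p) / C_0."""
    if p < 1 or c0 <= 0:
        raise ValueError("p must be >= 1 and c0 must be positive")
    return (depth + work / p) / c0


# Hypothetical example: summing n numbers with a balanced reduction tree.
n = 1 << 20
work = n - 1   # W(n): number of additions
depth = 20     # D(n): log2(n) levels in the tree
for p in (1, 16, 1024):
    print(p, t_comp(work, depth, p, c0=1e9))
```

As $p$ grows, the $W(n)/p$ term shrinks while the $D(n)$ term stays fixed, which is how the depth limits scalability in this model.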


## I/O Model for $T_{mem}$



$Q(n; Z, L)$: number of cache misses for a cache of size $Z$ with line size $L$. Thus, the volume of data transferred is $Q(n; Z, L) \times L$.


Our *intensity* is thus

$$\frac{W(n)}{Q(n;Z,L)\times L}$$

*Desired:* minimize work (work-optimality) while maximizing intensity (by minimizing cache complexity).

Intensity on its own is very descriptive: intuitively, we know that high-intensity operations such as matrix multiply perform well on GPUs, whereas low-intensity vector operations perform poorly. $W$ and $Q$ underlie this behavior.

## I/O Model: Matrix Multiply



$$Q(n; Z, L) = \Omega\left(\frac{n^3}{L\sqrt{Z}}\right)$$

Assumes contiguous layout. Result is optimal.
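As a worked instance of the intensity ratio, assume the classical algorithm with $W(n) = \Theta(n^3)$ (an assumption; the slide does not state $W$) and an algorithm that attains the miss bound above:

$$\frac{W(n)}{Q(n;Z,L)\times L} = \Theta\left(\frac{n^3}{\frac{n^3}{L\sqrt{Z}} \cdot L}\right) = \Theta\left(\sqrt{Z}\right)$$

Under these assumptions the intensity of matrix multiply grows with the square root of the cache size and does not depend on $n$ or $L$.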




We model  $T_{mem}$  with:

$$T_{mem}(n; p, Z, L, \alpha, \beta) = \alpha \cdot D(n) + \frac{Q_{p;Z,L}(n) \cdot L}{\beta}$$

- $Q$: number of cache misses
- $C_0$: number of cycles per second
- $p$: number of cores
- $Z$: cache size (bytes)
- $L$: line size (bytes)
- $\alpha$: latency (s)
- $\beta$: bandwidth (bytes/s)
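The memory-time model can be evaluated the same way as the compute-time model. This is a minimal sketch under stated assumptions: the function name `t_mem` is hypothetical, the latency, bandwidth and line size are borrowed from the Fermi C2050 figures quoted later in the deck, and the miss count and depth are made up for illustration.

```python
def t_mem(depth, misses, line_bytes, alpha, beta):
    """Modeled memory time in seconds: alpha * D(n) + Q * L / beta."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    latency_term = alpha * depth                 # latency paid along the critical path
    bandwidth_term = misses * line_bytes / beta  # bytes moved divided by bandwidth
    return latency_term + bandwidth_term


# Hypothetical inputs: 1e6 misses of 128-byte lines, depth 20,
# latency 347.8 ns, bandwidth 144 GB/s.
print(t_mem(depth=20, misses=1e6, line_bytes=128, alpha=347.8e-9, beta=144e9))
```

Comparing this value against the modeled compute time for the same algorithm is the balance test $T_{mem} \leq T_{comp}$.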


 $Q_1$ , sequential cache complexity, is well known for most algorithms.  $Q_p$ , parallel cache complexity, must be separately derived, but can be directly obtained from  $Q_1$  if certain *scheduling* principles are followed.


Example: Work-stealing + core-private caches.

$$Q_p(n; Z, L) < Q_1(n; Z, L) + \mathcal{O}\left(\frac{p \cdot Z \cdot D(n)}{L}\right)$$

Example: Parallel depth-first + all-cores shared caches.<sup>3</sup>

$$Q_p(n; Z + p \cdot L \cdot D(n), L) < Q_1(n; Z, L)$$

<sup>3</sup>Blelloch, Gibbons, Simhadri (2010). Low-depth cache-oblivious algorithms.


## $T_{comp}, T_{mem}$ : After some algebra
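The slide's own figure is not reproduced here. One way to carry out the algebra, starting from the two models above and writing $W$, $D$, $Q$ for $W(n)$, $D(n)$, $Q_{p;Z,L}(n)$, is sketched below. Start from $T_{mem} \leq T_{comp}$:

$$\alpha D + \frac{Q L}{\beta} \leq \left(D + \frac{W}{p}\right) \cdot \frac{1}{C_0}$$

Factor $\frac{QL}{\beta}$ out of the left side and $\frac{W}{p C_0}$ out of the right side:

$$\frac{Q L}{\beta}\left(1 + \frac{\alpha\beta/L}{Q/D}\right) \leq \frac{W}{p \cdot C_0}\left(1 + \frac{p}{W/D}\right)$$

Multiply both sides by $\frac{p \cdot C_0}{Q L}$:

$$\frac{p \cdot C_0}{\beta}\left(1 + \frac{\alpha\beta/L}{Q/D}\right) \leq \frac{W}{Q L}\left(1 + \frac{p}{W/D}\right)$$

The left side is the machine's flop-to-byte ratio with a latency correction, and the right side is the algorithm's intensity with a parallelism correction. When $Q/D \gg \alpha\beta/L$ and $W/D \gg p$, both corrections approach 1 and the condition reduces to comparing $\frac{p \cdot C_0}{\beta}$ against $\frac{W}{QL}$.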




### Projections

Irony et al.: parallel matrix-multiply bound:

$$Q_{p;Z,L}(n) \geq \frac{W(n)}{\sqrt{2} \cdot L\sqrt{Z/p}}$$

$\therefore$

$$\frac{p \cdot C_0}{\beta} \quad \leq \quad \mathcal{O}\left(\sqrt{\frac{Z}{p}}\right)$$

Example: Matrix-multiply + work-stealing
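The step from the miss bound to this condition can be sketched as follows, under the simplifying assumption (added here, not stated on the slide) that the latency and depth terms are negligible, so that balance reduces to $\frac{p \cdot C_0}{\beta} \leq \frac{W(n)}{Q_{p;Z,L}(n) \cdot L}$. Substituting the lower bound on $Q_{p;Z,L}(n)$ gives an upper bound on the intensity:

$$\frac{p \cdot C_0}{\beta} \leq \frac{W(n)}{Q_{p;Z,L}(n) \cdot L} \leq \frac{W(n)}{\frac{W(n)}{\sqrt{2} \cdot L\sqrt{Z/p}} \cdot L} = \sqrt{2} \cdot \sqrt{\frac{Z}{p}} = \mathcal{O}\left(\sqrt{\frac{Z}{p}}\right)$$

So no matrix-multiply schedule can stay balanced once $\frac{p \cdot C_0}{\beta}$ outgrows $\sqrt{Z/p}$ by more than a constant factor.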


$$\frac{p \cdot C_0}{\beta} \leq \mathcal{O}\left(\log \frac{Z}{p}\right)$$

Example: Cache-oblivious comparison-based sorting\* + work-stealing

\* Sort: the deterministic cache-oblivious algorithm by Blelloch et al. (SPAA '10), in which $W = n \log n$, $D = (\log n)^2$, $Q = (n/L) \log_Z n$.
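As a rough check on the logarithmic bound, the intensity of this sort can be computed from the stated $W$ and $Q$, dropping constants and assuming the same base for both logarithms:

$$\frac{W}{Q \cdot L} = \frac{n \log n}{\frac{n}{L} \log_Z n \cdot L} = \frac{\log n}{\log_Z n} = \log Z$$

Replacing $Z$ by a per-core share $Z/p$ (an assumption made here for illustration, matching the core-private caches of the work-stealing example) gives the $\log \frac{Z}{p}$ form above. Sorting therefore tolerates far less growth in $\frac{p \cdot C_0}{\beta}$ than matrix multiply does.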

## "Punchline": Projections (Matrix Multiply)

| Parameter                 | t = 0: NVIDIA Fermi C2050 | CPU doubling time (years) | 10-year projection |
|---------------------------|---------------------------|---------------------------|--------------------|
| Peak flops, $p \cdot C_0$ | 1.03 Tflop/s              | 1.7                       | 59 Tflop/s         |
| Peak bandwidth, $\beta$   | 144 GB/s                  | 2.8                       | 1.7 TB/s           |
| Latency, $\alpha$         | 347.8 ns                  | 10.5\*                    | 179.7 ns           |
| Transfer size, $L$        | 128 Bytes                 | 10.2                      | 256 Bytes          |
| Fast memory, $Z$          | 2.7 MB                    | 2.0                       | 83 MB              |
| Cores, $p$                | 448                       | 1.87                      | 18k                |
| $p \cdot C_0 / \beta$     | 7.2                       |                           | 34.9               |
| $\sqrt{Z/p}$              | 38.6                      |                           | 33.5               |
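The last two rows of the table can be approximately recomputed from the rows above them. This is a sketch under assumptions: the function names are hypothetical, the constant hidden by the $\mathcal{O}(\cdot)$ is taken as 1, and $Z$ is assumed to be counted in 4-byte words, which is the interpretation that brings $\sqrt{Z/p}$ close to the tabulated values. Because the table's inputs are rounded, the results land near, not exactly on, the printed figures.

```python
import math

BYTES_PER_WORD = 4  # assumption: Z is counted in single-precision words


def machine_balance(peak_flops, bandwidth_bytes):
    """Left side of the matrix-multiply condition: p*C_0 / beta (flop/byte)."""
    return peak_flops / bandwidth_bytes


def matmul_limit(fast_mem_bytes, cores):
    """Right side, up to a constant factor: sqrt(Z / p) with Z in words."""
    return math.sqrt(fast_mem_bytes / BYTES_PER_WORD / cores)


# (peak flop/s, bandwidth B/s, fast memory bytes, cores), copied from the table.
today = (1.03e12, 144e9, 2.7e6, 448)
in_10_years = (59e12, 1.7e12, 83e6, 18_000)

for label, (peak, beta, z, p) in (("t = 0", today), ("t = 10", in_10_years)):
    lhs = machine_balance(peak, beta)
    rhs = matmul_limit(z, p)
    print(f"{label}: p*C0/beta = {lhs:.1f}, sqrt(Z/p) = {rhs:.1f}, balanced = {lhs <= rhs}")
```

With these inputs the condition holds at t = 0 and fails for the 10-year projection, which is the imbalance the table is meant to show.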


## Consequences (Stacked Memory)

If the number of pins between memory and the processor scales with the *surface area* of the chip rather than its perimeter, $\beta$ scales at a higher dimension.

$$\frac{p \cdot C_0}{\beta} \leq \mathcal{O}\left(\sqrt{\frac{Z}{p}}\right)$$

Example: Matrix-multiply + work-stealing

$$\frac{p \cdot C_0}{\beta} \leq \mathcal{O}\left(\log \frac{Z}{p}\right)$$

Example: Cache-oblivious comparison-based sorting\* + work-stealing


### Limitations

#### **Big-Oh Notation**

Existing analysis is often ($\approx$ always) in "Big-Oh" notation, so $W$, $D$, $Q$ are often of the form $O(f(n))$. For large $n$, $O(f(n)) \approx C \cdot f(n)$.

$C$ can sometimes be determined from principles, from static/dynamic analysis, or simply from benchmarking.

E.g., for FFT, $W(n) = \#\text{flops} = 5 n \log n$.
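One simple way to estimate $C$ from benchmarking is a least-squares fit through the origin. The sketch below is illustrative only: `fit_constant` and `n_log_n` are hypothetical names, the logarithm is assumed to be base 2, and the "measurements" are synthetic values generated from $5 n \log_2 n$, not real counter data.

```python
import math


def fit_constant(sizes, counts, f):
    """Least-squares estimate of C in counts ~= C * f(n), fit through the origin."""
    num = sum(f(n) * y for n, y in zip(sizes, counts))
    den = sum(f(n) ** 2 for n in sizes)
    return num / den


def n_log_n(n):
    return n * math.log2(n)


# Synthetic "measurements" generated from 5 n log2(n); real data would come
# from hardware counters or instrumentation.
sizes = [2 ** k for k in range(10, 16)]
counts = [5 * n_log_n(n) for n in sizes]
print(fit_constant(sizes, counts, n_log_n))
```

With real measurements the fitted value would also absorb lower-order terms, so it is best estimated at the problem sizes of interest.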


Every model has limitations. We use the DAG model and External Memory model.

$T_{comp}$ and $T_{mem}$ can be changed to *any* model that aims to represent memory and compute time independently, e.g., if there is a more suitable or predictable model for a particular architecture or algorithm. Example: increasingly heterogeneous chips (many more degrees of freedom).

We believe that *balance* is an ideal frame from which to focus this principled analysis:  $T_{mem} \leq T_{comp}$ 


How can we bring other metrics into play?

- 1. Power: $Power_{alg}(n; Z, L, p) \propto Q(n; Z, L, p)$? Power efficiency is necessary for exascale.
- 2. A more general cost metric (e.g., a cluster of iPads would probably be balanced).

## Bounds

Lower bounds (a merged cell spanning the rows in the source table): bandwidth $\Omega\left(\frac{n^2}{\sqrt{P}}\right)$, latency $\Omega\left(\frac{n^3}{P M^{3/2}}\right) = \Omega\left(\sqrt{P}\right)$.

| Algorithm                              | Upper bound: bandwidth                      | Upper bound: latency             | Reference |
|----------------------------------------|---------------------------------------------|----------------------------------|-----------|
| Matrix-Multiplication                  | $O\left(\frac{n^2}{\sqrt{P}}\right)$        | $O\left(\sqrt{P}\right)$         |           |
| Cholesky                               | $O\left(\frac{n^2 \log P}{\sqrt{P}}\right)$ | $O\left(\sqrt{P}\log P\right)$   | [BCC+97]  |
| LU                                     | $O\left(\frac{n^2 \log P}{\sqrt{P}}\right)$ | $O\left(\sqrt{P}\log P\right)$   | [DGX08]   |
| QR                                     | $O\left(\frac{n^2 \log P}{\sqrt{P}}\right)$ | $O\left(\sqrt{P}\log^3 P\right)$ | [DGHL08a] |
| Symmetric Eigenvalues                  | $O\left(\frac{n^2 \log P}{\sqrt{P}}\right)$ | $O\left(\sqrt{P}\log^3 P\right)$ | [BDD09]   |
| SVD                                    | $O\left(\frac{n^2 \log P}{\sqrt{P}}\right)$ | $O\left(\sqrt{P}\log^3 P\right)$ | [BDD09]   |
| (Generalized) Nonsymmetric Eigenvalues | $O\left(\frac{n^2 \log P}{\sqrt{P}}\right)$ | $O\left(\sqrt{P}\log^3 P\right)$ | [BDD09]   |

Figure: Established bounds on communication in linear algebra. $M = \Theta\left(\frac{n^2}{P}\right)$ (Ballard et al., 2009)

## Machine Balance




## Projections (CPU vs GPU)

| Parameter                     | Keeneland values | Doubling time (years) | 10-year increase factor |
|-------------------------------|------------------|-----------------------|-------------------------|
| Cores: $p_{cpu}$              | 12               | 1.87                  | 40.7×                   |
| $p_{gpu}$                     | 448              |                       |                         |
| Peak: $p_{cpu} \cdot C_{cpu}$ | 268 Gflop/s      | 1.7                   | 59.0×                   |
| $p_{gpu} \cdot C_{gpu}$       | 1 Tflop/s        |                       |                         |
| Memory BW: $\beta_{cpu}$      | 25.6 GB/s        | 3.0                   | 9.7×                    |
| $\beta_{gpu}$                 | 144 GB/s         |                       |                         |
| Fast memory: $Z_{cpu}$        | 12 MB            | 2.0                   | 32.0×                   |
| $Z_{gpu}$                     | 2.7 MB           |                       |                         |
| I/O device: $\beta_{I/O}$     | 8 GB/s           | 2.39                  | 18.1×                   |
| Network BW: $\beta_{link}$    | 10 GB/s          | 2.25                  | 21.8×                   |
Table: Using the hardware trends we can make predictions about relative performance of future hardware. (BW = bandwidth)
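The increase factors follow from the doubling times by steady exponential growth, $2^{10/t_{double}}$. A minimal sketch, assuming growth continues at the historical rate (the function name `growth_factor` is hypothetical):

```python
def growth_factor(doubling_time_years, horizon_years=10.0):
    """Multiplicative increase after `horizon_years` of steady doubling."""
    if doubling_time_years <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (horizon_years / doubling_time_years)


# Doubling times (years) copied from the table above.
doubling_times = {
    "cores": 1.87,
    "peak flops": 1.7,
    "fast memory": 2.0,
    "I/O device": 2.39,
    "network BW": 2.25,
}
for name, years in doubling_times.items():
    print(f"{name}: {growth_factor(years):.1f}x")
```

Memory bandwidth is left out of the dictionary because the rounded doubling time of 3.0 years gives about 10.1×, slightly above the tabulated 9.7×; the table presumably used an unrounded doubling time.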

### Contact

Kent Czechowski kentcz (at) gatech

Casey Battaglino cbattaglino3 (at) gatech

Questions?
