



# Generating Configurable Hardware From Parallel Patterns

**Raghu Prabhakar**, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun

> Stanford University, ASPLOS 2016





### Increasing interest in using FPGAs as accelerators

Key advantage: Performance/Watt





# Key domains:

 Big data analytics, image processing, financial analytics, scientific computing, search





- Verilog and VHDL too low level for software developers
- High level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, pragmas requiring hardware knowledge
  - Limited in exploiting data locality
  - Difficult to synthesize complex data paths with nested parallelism







#### Add 512 integers stored in external DRAM

```
// Sum mem[0..511] into mem[512]; every access goes to external DRAM.
void add512(int* mem) {
    mem[512] = 0;
    for (int i = 0; i < 512; i++) {
        mem[512] += mem[i];
    }
}
```

27,236 clock cycles for computation. Two orders of magnitude too long!







302 clock cycles for computation
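
The code behind the faster figure is not preserved in this text. As a hedged illustration of the locality problem only, the sketch below restructures the loop so that DRAM is read once in a contiguous block and written once, with the accumulation kept local. The name `add512_buffered` and the use of `memcpy` as a stand-in for a burst read are assumptions, and no cycle count is claimed for it.

```c
#include <assert.h>
#include <string.h>

#define N 512

/* Hypothetical restructuring: read the input once into a local buffer,
   reduce locally, then write the result back with a single store. */
void add512_buffered(int *mem) {
    int local[N];                       /* stands in for an on-chip scratchpad */
    assert(mem != NULL);
    memcpy(local, mem, sizeof local);   /* one contiguous read, burst-friendly */

    int acc = 0;                        /* accumulator stays in a register */
    for (int i = 0; i < N; i++) {
        acc += local[i];
    }
    mem[N] = acc;                       /* single write to external memory */
}
```

An HLS tool would still need tool-specific directives to pipeline or unroll the inner loop; those are omitted here.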





# Use Higher-level Abstractions

- Productivity: Developer focuses on application
- Performance:
  - Capture Locality to reduce off-chip memory traffic
  - **Exploit Parallelism** at multiple nesting levels

Smart compiler generates efficient hardware





 Constructs with special properties with respect to parallelism and memory access







- Concise
- Can express a large class of workloads in the machine learning and data analytics domains
- Captures rich semantic information about parallelism and memory access patterns
- Enables powerful transformations using pattern matching and re-write rules
- Enables generating efficient code for different architectures





- A data-parallel language that supports parallel patterns
- Example application: k-means

```
// Compute closest mean for each 'sample'
val clusters = samples groupBy { sample =>
  // 1. Compute distance with each mean
  val dists = kMeans map { mean =>
    mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
  }
  // 2. Select the mean with shortest distance
  Range(0, dists.length) reduce { (i,j) =>
    if (dists(i) < dists(j)) i else j
  }
}
// Compute average of each cluster
val newKmeans = clusters map { e =>
  // 1. Compute sum of all assigned points
  val sum = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
  // 2. Compute number of assigned points
  val count = e map { v => 1 } reduce { (a,b) => a + b }
  // 3. Divide each dimension of sum by count
  sum map { a => a / count }
}
```
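
For readers more used to loop code, here is a rough C equivalent of one k-means iteration. It is a sketch, not the compiler's output: the function name `kmeans_step`, the row-major layout, and the handling of empty clusters (left at zero) are all assumptions made for illustration.

```c
#include <assert.h>
#include <float.h>

/* One k-means iteration over n samples of dimension d with k means.
   Arrays are row-major; new_means and counts are outputs. */
void kmeans_step(int n, int d, int k,
                 const double *samples, const double *means,
                 double *new_means, int *counts) {
    assert(n > 0 && d > 0 && k > 0);
    for (int c = 0; c < k; c++) {
        counts[c] = 0;
        for (int j = 0; j < d; j++) new_means[c * d + j] = 0.0;
    }
    for (int s = 0; s < n; s++) {
        /* groupBy key: index of the closest mean (squared distance) */
        int best = 0;
        double best_dist = DBL_MAX;
        for (int c = 0; c < k; c++) {
            double dist = 0.0;
            for (int j = 0; j < d; j++) {
                double diff = means[c * d + j] - samples[s * d + j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        /* accumulate the per-cluster sum and count */
        counts[best]++;
        for (int j = 0; j < d; j++)
            new_means[best * d + j] += samples[s * d + j];
    }
    /* divide each dimension of the sum by the count */
    for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue;  /* assumption: empty cluster stays at zero */
        for (int j = 0; j < d; j++) new_means[c * d + j] /= counts[c];
    }
}
```

The loop version fixes one traversal order and hides which loops are independent; the pattern version keeps that information explicit for the compiler.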





























- Tiling using polyhedral analysis limits data access patterns to affine functions of loop indices
- Current parallel patterns cannot represent tiling
- New parallel pattern describes tiled computation



















- Transform parallel pattern → nested patterns
  - Strip mined patterns enable computation reordering
- Insert copies to enhance locality
  - Copies guide creation of on-chip buffers

| Parallel Patterns       | Strip Mined Patterns             |
|-------------------------|----------------------------------|
| `map(d){ i => 2*x(i) }` | `multiFold(d/b){ ii =>`          |
|                         | `xTile = x.copy(b + ii)`         |
|                         | `(i, map(b){ i => 2*xTile(i) })` |
|                         | `}`                              |
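
The strip-mined form in the table corresponds roughly to the loop nest below. This is a minimal sketch under stated assumptions: `TILE` plays the role of `b`, `d` is assumed to be a multiple of the tile size, and the local array `x_tile` stands in for the on-chip buffer that the `copy` would create. The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TILE 4   /* tile size b; a hypothetical value for illustration */

/* out[i] = 2 * x[i], computed tile by tile. Assumes d is a multiple of TILE. */
void map_double_tiled(const int *x, int *out, int d) {
    assert(d >= 0 && d % TILE == 0);
    for (int ii = 0; ii < d; ii += TILE) {       /* outer pattern over d/b tiles */
        int x_tile[TILE];
        memcpy(x_tile, &x[ii], sizeof x_tile);   /* x.copy: the on-chip buffer */
        for (int i = 0; i < TILE; i++) {         /* inner map over one tile */
            out[ii + i] = 2 * x_tile[i];
        }
    }
}
```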











- Reorder nested patterns
  - Move 'copy' operations out toward outer pattern(s)
  - Improves locality and reuse of on-chip memory

| Strip Mined Patterns              | Interchanged Patterns                |
|-----------------------------------|--------------------------------------|
| `multiFold(m/b0,n/b1){ ii,jj =>`  | `multiFold(m/b0,n/b1){ ii,jj =>`     |
| `xTl = x.copy(b0+ii, b1+jj)`      | `xTl = x.copy(b0+ii, b1+jj)`         |
| `((ii,jj), map(b0,b1){ i,j =>`    | `((ii,jj), multiFold(p/b2){ kk =>`   |
| `multiFold(p/b2){ kk =>`          | `yTl = y.copy(b1+jj, b2+kk)`         |
| `yTl = y.copy(b1+jj, b2+kk)`      | `(0, map(b0,b1){ i,j =>`             |
| `(0, multiFold(b2){ k =>`         | `(0, multiFold(b2){ k =>`            |
| `(0, xTl(i,j) * yTl(j,k))`        | `(0, xTl(i,j) * yTl(j,k))`           |
| `}{(a,b) => a + b}`               | `}{(a,b) => a + b}`                  |
| `}{(a,b) => a + b}`               | `})`                                 |
| `})`                              | `}{(a,b) =>`                         |
| `}`                               | `map(b0,b1){ i,j =>`                 |
|                                   | `a(i,j) + b(i,j) }`                  |
|                                   | `})`                                 |
|                                   | `}`                                  |
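
To make the benefit of the interchange concrete, the sketch below runs both loop orders from the table on small arrays and counts how often a tile of `y` is copied. The sizes, the counter `tile_copies`, and all function names are assumptions for illustration. Both versions compute `out(i,j) = sum over k of x(i,j) * y(j,k)`, as in the table.

```c
#include <assert.h>

enum { M = 4, N = 4, P = 8, B0 = 2, B1 = 2, B2 = 4 };  /* hypothetical sizes */

static long tile_copies;  /* number of y tiles fetched so far */

/* Stand-in for y.copy(...): fetch a B1 x B2 tile of y into a local buffer. */
static void copy_y_tile(int y[N][P], int jj, int kk, int y_tl[B1][B2]) {
    assert(jj + B1 <= N && kk + B2 <= P);
    for (int j = 0; j < B1; j++)
        for (int k = 0; k < B2; k++)
            y_tl[j][k] = y[jj + j][kk + k];
    tile_copies++;
}

/* Strip-mined order: the y tile is re-copied for every (i, j) element. */
long strip_mined(int x[M][N], int y[N][P], int out[M][N]) {
    tile_copies = 0;
    for (int ii = 0; ii < M; ii += B0)
        for (int jj = 0; jj < N; jj += B1)
            for (int i = 0; i < B0; i++)
                for (int j = 0; j < B1; j++) {
                    int acc = 0;
                    for (int kk = 0; kk < P; kk += B2) {
                        int y_tl[B1][B2];
                        copy_y_tile(y, jj, kk, y_tl);
                        for (int k = 0; k < B2; k++)
                            acc += x[ii + i][jj + j] * y_tl[j][k];
                    }
                    out[ii + i][jj + j] = acc;
                }
    return tile_copies;
}

/* Interchanged order: the copy moves outside the (i, j) map, so each
   y tile is fetched once per (ii, jj, kk) and reused by the whole tile. */
long interchanged(int x[M][N], int y[N][P], int out[M][N]) {
    tile_copies = 0;
    for (int ii = 0; ii < M; ii += B0)
        for (int jj = 0; jj < N; jj += B1) {
            int acc[B0][B1] = {{0}};
            for (int kk = 0; kk < P; kk += B2) {
                int y_tl[B1][B2];
                copy_y_tile(y, jj, kk, y_tl);
                for (int i = 0; i < B0; i++)
                    for (int j = 0; j < B1; j++)
                        for (int k = 0; k < B2; k++)
                            acc[i][j] += x[ii + i][jj + j] * y_tl[j][k];
            }
            for (int i = 0; i < B0; i++)
                for (int j = 0; j < B1; j++)
                    out[ii + i][jj + j] = acc[i][j];
        }
    return tile_copies;
}
```

With these sizes the interchanged order performs `B0 * B1` times fewer tile copies, at the cost of a `B0 x B1` accumulator held on chip.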

















| Pipelined Execution Units | Description                                  | IR Construct             |
|---------------------------|----------------------------------------------|--------------------------|
| Vector                    | SIMD parallelism                             | Map over scalars         |
| Reduction tree            | Parallel reduction of associative operations | MultiFold over scalars   |
| Parallel FIFO             | Buffer ordered outputs of dynamic size       | FlatMap over scalars     |
| CAM                       | Fully associative key-value store            | GroupByFold over scalars |
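
As a software illustration of the reduction tree row, the sketch below sums values pairwise, level by level, the way an adder tree would. It is only a sequential model: the function name `tree_reduce_add` is hypothetical, and in hardware all additions within a level would happen in parallel. Reordering the additions this way is valid because `+` is associative.

```c
#include <assert.h>

/* Sum n values pairwise, level by level, as a hardware adder tree would.
   Works in place on v and returns the total. */
int tree_reduce_add(int *v, int n) {
    assert(n > 0);
    for (int width = n; width > 1; width = (width + 1) / 2) {
        int half = width / 2;
        for (int i = 0; i < half; i++)
            v[i] = v[2 * i] + v[2 * i + 1];      /* one adder per pair */
        if (width % 2) v[half] = v[width - 1];   /* odd element passes through */
    }
    return v[0];
}
```

For `n` inputs the loop runs about `log2(n)` levels, which is the tree depth.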

| Memories      | Description                                        | IR Construct           |
|---------------|----------------------------------------------------|------------------------|
| Buffer        | Scratchpad memory                                  | Statically sized array |
| Double buffer | Buffer coupling two stages in a metapipeline       | Metapipeline           |
| Cache         | Tagged memory exploits locality in random accesses | Non-affine accesses    |

| Controllers  | Description                                             | IR Construct                                        |
|--------------|---------------------------------------------------------|-----------------------------------------------------|
| Sequential   | Coordinates sequential execution                        | Sequential IR node                                  |
| Parallel     | Coordinates parallel execution                          | Independent IR nodes                                |
| Metapipeline | Execute nested parallel patterns in a pipelined fashion | Outer parallel pattern with multiple inner patterns |
| Tile memory  | Fetch tiles of data from off-chip memory                | Transformer-inserted array copy                     |





- Hierarchical pipeline: A "pipeline of pipelines"
  - Exploits nested parallelism
- Inner stages could be other nested patterns or combinational logic
  - Does not require iteration space to be known statically
  - Does not require complete unrolling of inner patterns
- Intermediate data from each stage automatically stored in double buffers
  - Allows stages to have variable execution times
- No need to calculate initiation interval (II)
  - Use asynchronous control signals to begin next iteration
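
A two-stage metapipeline with a double buffer can be modeled in software as below. This is a sketch under assumptions: `TILE`, the function name `metapipeline_sum`, and the choice of a load stage feeding a reduce stage are illustrative. In each step, stage 0 fills one half of the buffer for tile `t` while stage 1 consumes the other half for tile `t-1`. In hardware the two stages run concurrently; here they simply execute one after the other within a step.

```c
#include <assert.h>
#include <string.h>

#define TILE 4   /* hypothetical tile size */

/* Two-stage metapipeline over n_tiles tiles of TILE ints each.
   Stage 0 loads tile t into buf[t % 2]; stage 1 sums tile t-1 from the
   other half. Writes the sum to *result and returns the step count. */
int metapipeline_sum(const int *dram, int n_tiles, long *result) {
    int buf[2][TILE];                    /* the double buffer */
    long sum = 0;
    int steps = 0;
    assert(n_tiles > 0 && result != NULL);
    for (int t = 0; t <= n_tiles; t++, steps++) {
        if (t < n_tiles)                 /* stage 0: fetch tile t */
            memcpy(buf[t % 2], &dram[t * TILE], sizeof buf[0]);
        if (t >= 1) {                    /* stage 1: reduce tile t-1 */
            const int *tile = buf[(t - 1) % 2];
            for (int i = 0; i < TILE; i++) sum += tile[i];
        }
    }
    *result = sum;
    return steps;
}
```

The pipeline takes `n_tiles + 1` steps instead of `2 * n_tiles` stage executions in sequence, and neither stage needs to know how long the other takes.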











- Detects Metapipelines in the tiled parallel pattern IR
- Detection
  - Chain of producer-consumer parallel patterns within the body of another parallel pattern
- Scheduling
  - Topological sort of IR of parallel pattern body
  - List of stages, where each stage consists of one or more independent parallel patterns
  - Promote intermediate buffers to double buffers
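
The scheduling step can be sketched as a level-by-level topological sort. The code below is an illustration, not the compiler's implementation: the function name `schedule_stages` and the edge-list encoding (`src[e]` produces data consumed by `dst[e]`) are assumptions. Each pattern gets the earliest stage after all of its producers, so patterns sharing a stage index are independent of each other.

```c
#include <assert.h>
#include <stdlib.h>

/* Assign each of n patterns a stage index from producer -> consumer edges.
   Returns the number of stages, or -1 on a cycle or allocation failure. */
int schedule_stages(int n, int n_edges, const int *src, const int *dst,
                    int *stage) {
    assert(n > 0 && n_edges >= 0);
    int *indeg = calloc((size_t)n, sizeof *indeg);
    int *queue = malloc((size_t)n * sizeof *queue);
    if (!indeg || !queue) { free(indeg); free(queue); return -1; }

    for (int e = 0; e < n_edges; e++) indeg[dst[e]]++;

    int head = 0, tail = 0, n_stages = 0;
    for (int v = 0; v < n; v++) {
        stage[v] = 0;
        if (indeg[v] == 0) queue[tail++] = v;   /* no producers: stage 0 */
    }
    while (head < tail) {
        int u = queue[head++];                  /* stage[u] is final here */
        if (stage[u] + 1 > n_stages) n_stages = stage[u] + 1;
        for (int e = 0; e < n_edges; e++) {
            if (src[e] != u) continue;
            int v = dst[e];
            if (stage[v] < stage[u] + 1) stage[v] = stage[u] + 1;
            if (--indeg[v] == 0) queue[tail++] = v;
        }
    }
    int ok = (tail == n);                       /* unvisited nodes mean a cycle */
    free(indeg);
    free(queue);
    return ok ? n_stages : -1;
}
```

Buffers on edges that cross from one stage to the next are the candidates for promotion to double buffers.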







Similar to (and more general than) hand-written designs<sup>1</sup>

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data," AHS 2011.





# Board:

- Altera Stratix V
- 48 GB DDR3 off-chip DRAM, 6 memory channels
- Board connected to host via PCI-e
- Execution time reported = FPGA execution time
  - CPU ↔ FPGA communication and FPGA configuration time not included
- Goal: How beneficial are *tiling* and *metapipelining*?





- Baseline
  - Auto-generated MaxJ
  - Representative of state-of-the-art HLS tools
- Baseline Optimizations
  - Pipelined execution of innermost loops
  - Parallelized (unrolled) inner loops
    - Parallelism factor chosen by hand
  - Data locality captured at the level of a DRAM burst (384 bytes)
- Parallelism factors are kept consistent across baseline and optimized versions from our flow



















- Speedup with tiling: up to 15.5x
- Speedup with tiling + metapipelining: up to 39.4x
- Minimal (often positive!) impact on resource usage
  - Tiled designs have fewer off-chip data loaders and storers





- Two key optimizations, tiling and metapipelining, to generate efficient FPGA designs from parallel patterns
- Automatic tiling transformations that place fewer restrictions on memory access patterns
- Analysis to automatically infer designs with metapipelines and double buffers
- Significant speedups of up to 39.4x with minimal impact on FPGA resource utilization