| Outline | Motivation | Language & Semantics | FPGA Designs | Conclusions |
|---------|------------|----------------------|--------------|-------------|
|         |            |                      |              |             |

# Towards Highly Parallel Event Processing through Reconfigurable Hardware

## Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen

# University of Toronto

June 13, 2011






## Outline

1. Real-time Event Processing Scenario
2. Matching Problem
3. An Overview of Our FPGA Designs
4. Experimental Framework
5. Conclusions










## Algorithmic Trading

Algorithmic trading is a computer-based approach to executing buy and sell orders on financial instruments such as securities (e.g., stocks and bonds).

### **Algorithmic Trading Challenges**

1. Sustain a high event rate, because algorithmic trading dominates financial markets and accounts for over 70% of all trading.
2. Minimize matching time, because every 1-millisecond advantage is worth a staggering \$100 million annually.





## Real-time Event Processing Challenges

#### **Event Processing Requirements**

An event processing platform must efficiently find all patterns or specifications (subscriptions) that match incoming events at a rate up to a million events per second.

### Our Solution

We propose a novel FPGA-based event processing platform to significantly accelerate event processing computations, namely, event parsing and event matching against patterns or specifications.



## NetFPGA Card

(Figure: the NetFPGA card.)

## Verilog Snippet

```
case (CurrentState)
    IDLE: begin
        if (Go)
            NextState = SELECT_CLUSTER_ID;
        else
            NextState = IDLE;
    end
    SELECT_CLUSTER_ID: begin                 // Select a valid cluster index
        if (curCluster > LAST_CLUSTER)       // Finished reading last cluster
            NextState = WAIT;                // When all clusters are invalid
        else
            NextState = START_ADDRESS;
    end
    START_ADDRESS: begin                     // Select cluster address
        if (curCluster > LAST_CLUSTER)       // Finished reading last cluster
            NextState = WAIT;                // When all clusters are invalid
        else if (can_take_more_requests)
            NextState = MEM_BURST_WAIT;
        else
            NextState = START_ADDRESS;       // If 'can_take_more_requests' is low, wait
    end
    NEXT_ADDRESS: begin                      // Check data from cluster terminator
        if (clusterEndFound)
            NextState = SELECT_CLUSTER_ID;
        else if (can_take_more_requests)
            NextState = MEM_BURST_WAIT;      // Increment curAddr by 16 bytes
        else
            NextState = NEXT_ADDRESS;        // If 'can_take_more_requests' is low, wait
    end
    default:
        NextState = IDLE;
endcase
```






## FPGA Challenges

1. The latest FPGAs (e.g., the 800 MHz Xilinx Virtex 6) operate at significantly lower clock speeds than CPUs (e.g., the 3.46 GHz Intel i7).
2. The accelerated application functionality has to be amenable to parallel processing.
3. The memory bandwidth must keep up with chip processing speeds to realize a speedup by keeping the custom-built processing pipeline busy.

#### **Open Problems**

The true success of FPGAs is rooted in three distinctive features: hardware reconfigurability, hardware parallelism, and onboard packet processing.



## Why FPGAs

1. *Hardware reconfigurability:* reconfiguring the application on demand into highly parallel custom processors
2. *Hardware parallelism:* eliminating the inter-processor signalling and message-passing overhead associated with concurrency management at the program and OS levels
3. *Onboard packet processing:* using multiple high-bandwidth (gigabit) I/O pins to eliminate the OS-layer latency overhead of moving data between input and output ports
4. *Cost-effective and energy-efficient*





## Language and Data Model

- An *event* is modeled as a value assignment to attributes.
- A *subscription* is modeled as a Boolean expression.
- A predicate $P$ is a triple consisting of an attribute (uniquely representing a dimension in $n$-dimensional space), an operator, and either a set of values or another attribute, denoted by $P^{(\texttt{attr},\texttt{opt},\texttt{val})}(x)$ or $P^{(\texttt{attr},\texttt{opt},\texttt{attr})}(x)$, respectively.
- A predicate $P(x)$ either accepts or rejects an input $x$, i.e., $P(x) : x \longrightarrow \{ \text{True}, \text{False} \}$, where $x \in \text{Dom}(P)$.
- Formally, a Boolean expression $e$ is defined over an $n$-dimensional space as follows:

#### Definition

$$e = \left\{ P_1^{(\texttt{attr}_i,\texttt{opt},\texttt{val})} \wedge \dots \wedge P_k^{(\texttt{attr}_j,\texttt{opt},\texttt{attr}_1)} \right\}$$
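To make the data model concrete, the following is a minimal Python sketch of an event, the two predicate forms, and a conjunctive Boolean expression. The class name `Predicate`, the `OPERATORS` table, and the choice to reject events that lack a referenced attribute are assumptions made for illustration; the slides do not prescribe an operator set or an encoding.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# An event is a value assignment to attributes, e.g. {"symbol": "IBM", "price": 120}.
Event = Dict[str, Any]

# Hypothetical operator table; the slides do not list the supported operators.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda x, values: x in values,
}


@dataclass(frozen=True)
class Predicate:
    """A triple (attr, opt, val) or, if operand_is_attr is set, (attr, opt, attr)."""
    attr: str
    opt: str
    operand: Any
    operand_is_attr: bool = False

    def __call__(self, event: Event) -> bool:
        # Assumption: an event outside Dom(P) (missing attribute) is rejected.
        if self.attr not in event:
            return False
        if self.operand_is_attr:
            if self.operand not in event:
                return False
            rhs = event[self.operand]
        else:
            rhs = self.operand
        return OPERATORS[self.opt](event[self.attr], rhs)


def satisfies(expression: Tuple[Predicate, ...], event: Event) -> bool:
    """A Boolean expression is a conjunction: every predicate must accept."""
    return all(p(event) for p in expression)


# Example: symbol = "IBM" AND price < limit (an attribute-to-attribute predicate).
expr = (
    Predicate("symbol", "=", "IBM"),
    Predicate("price", "<", "limit", operand_is_attr=True),
)
```

Calling `satisfies(expr, {"symbol": "IBM", "price": 120, "limit": 125})` evaluates both predicates against the same event, which mirrors the conjunction in the definition.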





## Matching Semantics

#### Matching Problem

Given an event $e$ and a set of subscriptions $\mathbf{s}$, find all subscriptions $s_i \in \mathbf{s}$ satisfied by $e$.
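As a reference point for the matching problem, here is a minimal Python sketch of the simplest possible matcher: a linear scan over all subscriptions. This is only a baseline to pin down the semantics, not one of the designs presented in this talk; the subscription identifiers and the example predicates are hypothetical.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
# A subscription is represented here as any Boolean function of an event.
Subscription = Callable[[Event], bool]


def match(event: Event, subscriptions: Dict[str, Subscription]) -> List[str]:
    """Return the ids of all subscriptions satisfied by the event (linear scan)."""
    return [sub_id for sub_id, sub in subscriptions.items() if sub(event)]


# Hypothetical subscriptions over stock-quote events.
subscriptions: Dict[str, Subscription] = {
    "s1": lambda ev: ev.get("symbol") == "IBM" and ev.get("price", 0) > 100,
    "s2": lambda ev: ev.get("symbol") == "MSFT",
    "s3": lambda ev: ev.get("price", float("inf")) < 150,
}

matched = match({"symbol": "IBM", "price": 120}, subscriptions)
```

The cost of this baseline grows linearly with the number of subscriptions per event, which is the cost that clustering and parallel matching units aim to reduce.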





## An Abstract View of Propagation Data Structure



Propagation (Fabret, Jacobsen, Llirbat, Pereira, Ross, and Shasha, SIGMOD'01)

1. Distribute subscriptions into disjoint clusters to achieve a high degree of parallelism
2. Store subscriptions as contiguous blocks of memory, which enables fast sequential access and improves memory locality
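The two properties above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the Propagation algorithm itself: predicates are restricted to equality, every subscription has at least one predicate, and the first predicate of each subscription is used as its cluster key. The class name `ClusteredIndex` is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]
# Assumption: a subscription is a non-empty conjunction of equality predicates.
Conjunction = List[Tuple[str, Any]]


class ClusteredIndex:
    def __init__(self) -> None:
        # Each cluster is one list, so its subscriptions are scanned sequentially.
        self.clusters: Dict[Tuple[str, Any], List[Tuple[str, Conjunction]]] = defaultdict(list)

    def add(self, sub_id: str, predicates: Conjunction) -> None:
        # Assumption: the first predicate is the cluster key, so every
        # subscription lands in exactly one cluster (clusters are disjoint).
        key = predicates[0]
        self.clusters[key].append((sub_id, predicates[1:]))

    def match(self, event: Event) -> List[str]:
        matched: List[str] = []
        # Only clusters whose key is satisfied by the event are scanned.
        for key in event.items():
            for sub_id, rest in self.clusters.get(key, ()):
                if all(event.get(attr) == val for attr, val in rest):
                    matched.append(sub_id)
        return matched


index = ClusteredIndex()
index.add("s1", [("symbol", "IBM"), ("side", "buy")])
index.add("s2", [("symbol", "IBM"), ("side", "sell")])
index.add("s3", [("symbol", "MSFT")])
```

Because clusters are disjoint, different clusters can in principle be scanned by different matching units without coordination.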



## An Overview of Our Design Space

Four key designs:

1. Flexibility: easing the development and deployment cycle
2. Adaptability: supporting subscription updates
3. Scalability: relying on horizontal data partitioning
4. Performance: achieving the highest level of parallelism





## Degrees of Offered Parallelism







## Tuning for Flexibility

1. Eliminate software-to-hardware porting effort or the need for specialized hardware knowledge
2. Compile and execute the original PC source code on FPGA soft-core processors





## Tuning for Adaptability

1. Employ a shared memory model to store the Propagation data structure
2. Support up to four matching units (custom processors) in parallel (parallelism is limited by the off-chip memory-to-processor bandwidth)
3. Scale up to hundreds of thousands of subscriptions and support subscription updates



## Tuning for Scalability

1. Inherit most of the benefits of the *tuning for adaptability* design
2. Obtain an unprecedented level of parallelism through horizontal data partitioning
3. Assign each matching unit a dedicated on-chip memory to minimize processor idleness
4. Realize a full memory-to-processor bandwidth match via direct interconnect to dedicated memory units
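The idea behind horizontal data partitioning can be sketched in Python as follows. This is a software analogy under stated assumptions, not the hardware design: each partition stands in for the dedicated memory of one matching unit, threads stand in for matching units (Python threads do not give true hardware parallelism), and the round-robin assignment and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Subscription = Tuple[str, Callable[[Event], bool]]  # (id, predicate function)


def partition(subscriptions: List[Subscription], num_units: int) -> List[List[Subscription]]:
    """Split the subscription set row-wise into num_units disjoint partitions."""
    # Assumption: round-robin assignment; any disjoint split would do.
    return [subscriptions[i::num_units] for i in range(num_units)]


def match_partition(part: List[Subscription], event: Event) -> List[str]:
    """Work done by one matching unit over its own dedicated partition."""
    return [sub_id for sub_id, pred in part if pred(event)]


def parallel_match(partitions: List[List[Subscription]], event: Event) -> List[str]:
    """Broadcast the event to every unit and merge the per-unit results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(pool.map(lambda part: match_partition(part, event), partitions))
    return sorted(sub_id for result in results for sub_id in result)


# Ten hypothetical subscriptions "price > i", spread over four matching units.
subs: List[Subscription] = [(f"s{i}", (lambda ev, i=i: ev["price"] > i)) for i in range(10)]
parts = partition(subs, 4)
matched = parallel_match(parts, {"price": 3})
```

Since no partition is shared, each unit reads only its own memory, which is the property that lets memory bandwidth grow with the number of matching units.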



## Horizontal Data Partitioning

(Figure: horizontal data partitioning.)

## Tuning for Performance

1. Encode each subscription as a matching unit (a custom processor)
2. Sustain a high rate of matching due to the absence of memory accesses
3. Achieve the highest degree of parallelism, since all subscriptions are evaluated in parallel
4. Scale up to hundreds of subscriptions



#### **Different Approaches**

1. **PC:** PC solution
2. **Flexibility:** FPGA embedded system (soft-core)
3. **Adaptability:** FPGA matching units (processors) + off-chip main memory
4. **Scalability:** FPGA matching units (processors) + on-chip main memory + horizontal data partitioning
5. **Performance:** Hardware-encoded data + no on-/off-chip memory





#### **Evaluation Testbed**



1. **Throughput** is the maximum sustainable input packet rate at which no packets are dropped, determined through a bisection search.
2. **Latency** is the interval between the time an event packet leaves the Event Monitor output queue and the time the action is received.
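The bisection search used for the throughput measurement can be sketched in Python as below. The probe `drops_packets(rate)` is a hypothetical stand-in for one testbed run at a given input rate, and the sketch assumes that packet loss is monotone in the rate (if a rate drops packets, every higher rate does too).

```python
from typing import Callable


def max_sustainable_rate(drops_packets: Callable[[int], bool], low: int, high: int) -> int:
    """Largest rate in [low, high] (events/sec) at which no packets are dropped."""
    if drops_packets(low):
        raise ValueError("even the lowest rate drops packets")
    if not drops_packets(high):
        return high
    # Invariant: `low` is sustainable, `high` drops packets.
    while high - low > 1:
        mid = (low + high) // 2
        if drops_packets(mid):
            high = mid
        else:
            low = mid
    return low


# Hypothetical system that starts dropping packets above 740,740 events/sec.
rate = max_sustainable_rate(lambda r: r > 740_740, 1, 2_000_000)
```

Each probe halves the search interval, so about 21 runs suffice to pin down the rate to a single event per second over a range of two million.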





## Effect of the Number of Matching Units (MUs) on Latency ($\mu s$)

Scalability design results:

| Workload |       |      |      | 128x MUs |
|----------|-------|------|------|----------|
| 250      | 7.5   | 5.5  | 5.0  | 3.6      |
| 1K       | 9.3   | 6.1  | 4.3  | 4.3      |
| 10K      | 64.0  | 19.0 | 6.8  | 5.4      |
| 50K      | 223.5 | 59.9 | 12.3 | 7.3      |





## Effect of Workload Size on End-to-end Latency ($\mu s$)

| Workload | PC      | Flexibility | Adaptability | Scalability | Performance |
|----------|---------|-------------|--------------|-------------|-------------|
| 250      | 53.9    | 71.0        | 6.4          | 3.6         | 3.2         |
| 1K       | 60.7    | 199.4       | 7.5          | 4.3         | N/A         |
| 10K      | 150.0   | 1,617.8     | 87.8         | 5.4         | N/A         |
| 100K     | 2,001.2 | 16,422.8    | 1,307.3      | N/A         | N/A         |





## System Throughput (events/sec)

| Workload | PC      | Flexibility | Adaptability | Scalability | Performance |
|----------|---------|-------------|--------------|-------------|-------------|
| 250      | 122,654 | 14,671      | 282,142      | 740,740     | 1,024,590   |
| 1K       | 66,760  | 5,089       | 202,500      | 487,804     | N/A         |
| 10K      | 9,594   | 619         | 11,779       | 317,460     | N/A         |
| 100K     | 511     | 60          | 766          | N/A         | N/A         |





## Percentage (%) of Line-rate Utilization

| Workload | PC    | Flexibility | Adaptability | Scalability | Performance |
|----------|-------|-------------|--------------|-------------|-------------|
| 250      | 7.45  | 0.89        | 17.15        | 45.04       | 62.29       |
| 1K       | 4.01  | 0.31        | 12.31        | 29.66       | N/A         |
| 10K      | 0.58  | 0.04        | 0.72         | 19.30       | N/A         |
| 100K     | 0.031 | 0.01        | 0.05         | N/A         | N/A         |
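The utilization figures relate to the throughput figures through the line rate, which the slides do not state explicitly. As a hedged worked example, the Python sketch below back-computes the line rate implied by the 250-subscription row of the throughput and utilization tables; the resulting value is an inference from the reported numbers, not a figure given by the authors.

```python
# Values for the 250-subscription workload, copied from the throughput
# (events/sec) and line-rate utilization (%) tables.
throughput = {
    "PC": 122_654,
    "Adaptability": 282_142,
    "Scalability": 740_740,
    "Performance": 1_024_590,
}
utilization_pct = {
    "PC": 7.45,
    "Adaptability": 17.15,
    "Scalability": 45.04,
    "Performance": 62.29,
}


def implied_line_rate(events_per_sec: float, pct: float) -> float:
    """Line rate (events/sec) such that events_per_sec is pct percent of it."""
    return events_per_sec / (pct / 100.0)


implied = {
    design: implied_line_rate(throughput[design], utilization_pct[design])
    for design in throughput
}

for design, rate in implied.items():
    print(f"{design}: {rate:,.0f} events/sec")
```

All four designs yield an implied line rate between 1.64 and 1.65 million events per second, so utilization can be read as throughput divided by roughly that rate.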





## Conclusions

1. Reconfigurable hardware (FPGA)
   - accelerate using custom logic circuits
   - utilize hardware parallelism
2. Line-rate event processing
   - eliminate OS layer latency
   - leverage on-board packet processing
3. Effective data placement of subscriptions
   - horizontally partition the data (Propagation data structure)
   - increase the memory bandwidth
   - maximize the level of parallelism
4. Other FPGA benefits
   - cost-effective
   - energy-efficient





Thank You