



## Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs

Mohammad Abdel-Majeed\* **Daniel Wong**\* Murali Annavaram

Ming Hsieh Department of Electrical Engineering University of Southern California

USC School of Engineering \* Equal Contribution

MICRO-2013 University of Southern California

### **Problem Overview**

Execution units account for the majority of energy consumption in GPGPUs, even more than memory and the register file!

Leakage energy is becoming a greater concern with technology scaling



Traditional microprocessor power gating techniques are ineffective in GPGPUs

[1] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: enabling energy optimizations in GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), 2013.




### **GPGPU Overview (GTX480)**





SP accounts for 98% of execution unit leakage energy

Execution units account for 68% of total on-chip area




### **Power Gating Overview**



- Cuts off leakage current that flows through a circuit block
- Power gate at SP granularity
- **Important Parameters:**
  - Wakeup Delay: time to return to Vdd (3 cycles)
  - Breakeven Time: # of consecutive power gated cycles required to compensate PG energy overhead (9-24 cycles)
  - Idle Detect: # of idle cycles before power gating<sup>[2]</sup>
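How idle detect and break-even time interact for a single idle period can be sketched as below. This is a minimal model, not the authors' implementation: it assumes a unit is gated once it has been idle for `idle_detect` cycles and that a gating event only pays off if the unit then stays gated for at least `breakeven` cycles. The function name and the three labels are hypothetical, and wakeup delay is left out of the sketch.

```python
def classify_idle_period(length, idle_detect=5, breakeven=14):
    """Classify one idle period (in cycles) of an execution unit.

    Assumed model: the unit is power gated after `idle_detect` idle
    cycles; the event saves energy only if the unit then stays gated
    for at least `breakeven` cycles.
    """
    if length <= idle_detect:
        return "not_gated"      # period ends before gating takes effect
    gated_cycles = length - idle_detect
    if gated_cycles < breakeven:
        return "negative"       # gated, but the PG overhead is not recovered
    return "positive"           # gated long enough to save energy


# Example idle-period lengths (made up for illustration).
labels = [classify_idle_period(n) for n in (3, 10, 18, 19, 40)]
```

With the default values (5-cycle idle detect, 14-cycle break-even time), only idle periods of 19 cycles or more count as positive events in this model.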






### **Power Gating Challenges in GPGPUs**






- Traditional microprocessors experience idle periods many tens of cycles long<sup>[3]</sup>
- Integer unit idle period length distribution for hotspot
  - Assumes an idle detect of 5 cycles and a break-even time (BET) of 14 cycles








### Warp Scheduler Effect on Power Gating

Need to coalesce warp issues by resource type

Idle periods interrupted by instructions that are greedily scheduled
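The effect of issue order on idle-period length can be illustrated with a toy trace. The sketch below assumes one instruction issues per cycle and only two unit types, `INT` and `FP`; the traces and the helper name `idle_periods` are made up for illustration.

```python
from itertools import groupby


def idle_periods(issue_trace, unit):
    """Lengths of consecutive cycles in which `unit` receives no instruction."""
    return [
        len(list(run))
        for busy, run in groupby(issue_trace, key=lambda t: t == unit)
        if not busy
    ]


# The same 16 instructions, issued in two different orders.
interleaved = ["INT", "FP"] * 8          # types alternate every cycle
coalesced = ["INT"] * 8 + ["FP"] * 8     # issues grouped by type

int_idle_interleaved = idle_periods(interleaved, "INT")
int_idle_coalesced = idle_periods(coalesced, "INT")
```

In the interleaved trace the integer unit sees eight one-cycle idle periods, none long enough to gate. In the coalesced trace it sees a single eight-cycle idle period.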










### **GATES: Gating Aware Two-level Scheduler**

Issue warps based on execution unit resource type




### **Gating Aware Two-level Scheduler (GATES)**










- Per-instruction-type subsets of active warps
- Instruction Issue Priority
- Dynamic priority switching
  - Switch the highest priority when its instruction type runs out of ready warps
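The bullets above can be sketched as a small scheduler model. This is an illustrative sketch under stated assumptions, not the hardware design: it tracks only two instruction types (`INT`, `FP`), keeps one queue of ready warps per type, and rotates the priority list when the highest-priority type has no ready warps. The class and method names are hypothetical.

```python
from collections import deque


class GatesScheduler:
    """Toy model of type-aware warp issue with dynamic priority switching."""

    def __init__(self, types=("INT", "FP")):
        # One queue of ready warps per instruction type (assumed structure).
        self.ready = {t: deque() for t in types}
        # priority[0] is the current highest-priority type.
        self.priority = list(types)

    def add_ready(self, warp_id, instr_type):
        self.ready[instr_type].append(warp_id)

    def issue(self):
        """Issue one warp; return (warp_id, type) or None if nothing is ready."""
        # Switch the highest priority while that type is out of ready warps.
        for _ in range(len(self.priority)):
            if self.ready[self.priority[0]]:
                break
            self.priority.append(self.priority.pop(0))
        top = self.priority[0]
        if not self.ready[top]:
            return None
        return self.ready[top].popleft(), top


sched = GatesScheduler()
for warp, kind in [(0, "INT"), (1, "FP"), (2, "INT"), (3, "FP")]:
    sched.add_ready(warp, kind)
order = [sched.issue() for _ in range(4)]
```

Here the two `INT` warps issue back to back before the scheduler switches to `FP`, which leaves each unit type with one longer idle stretch instead of several short ones.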











### **Blackout Power Gating**

#### Forced idleness of execution units to meet BET






Force idleness until break even time has passed

- Even when there are pending instructions
- Would this not cause performance loss?
  - No, because GPGPUs have a large, heterogeneous set of execution units and workloads have a good mix of instruction types
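The difference between conventional gating and Blackout can be sketched as a per-unit state machine. This is a simplified model with assumed state names (`ON`, `GATED`, `WAKING`) and the parameter values quoted on the slides (idle detect 5, break-even time 14, wakeup delay 3); the class and helper names are hypothetical.

```python
class GatedUnit:
    """Toy power-gating controller for one execution unit (assumed model)."""

    def __init__(self, idle_detect=5, breakeven=14, wakeup=3, blackout=True):
        self.idle_detect = idle_detect
        self.breakeven = breakeven
        self.wakeup = wakeup
        self.blackout = blackout
        self.state = "ON"
        self.idle = 0      # consecutive idle cycles while ON
        self.gated = 0     # cycles spent gated in the current gating event
        self.waking = 0    # wakeup cycles still to wait

    def tick(self, pending):
        """Advance one cycle; return True if an instruction executes."""
        if self.state == "ON":
            if pending:
                self.idle = 0
                return True
            self.idle += 1
            if self.idle >= self.idle_detect:
                self.state, self.gated = "GATED", 0
            return False
        if self.state == "GATED":
            self.gated += 1
            # Blackout ignores pending work until the break-even time has
            # passed; conventional gating wakes as soon as work arrives.
            if pending and (not self.blackout or self.gated >= self.breakeven):
                self.state, self.waking = "WAKING", self.wakeup
            return False
        # WAKING: wait out the wakeup delay, then turn the unit back on.
        self.waking -= 1
        if self.waking <= 0:
            self.state, self.idle = "ON", 0
        return False


def first_executed_cycle(unit, work_arrives_at, cycles=40):
    """Cycle (1-based) of the first executed instruction, or None."""
    for cycle in range(1, cycles + 1):
        if unit.tick(pending=cycle >= work_arrives_at):
            return cycle
    return None


conventional = first_executed_cycle(GatedUnit(blackout=False), work_arrives_at=8)
with_blackout = first_executed_cycle(GatedUnit(blackout=True), work_arrives_at=8)
```

In this trace the conventional unit starts waking after only 3 gated cycles, short of the 14-cycle break-even time. The Blackout unit stays gated for the full 14 cycles, at the cost of a later first issue.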










~2.4x increase in positive PG events over GATES (GATES ~3x w.r.t. baseline)





### **Naïve Blackout**

GATES and Blackout are independent

Can lead to overaggressive power gating

























### **Coordinated Blackout**

Warp Scheduler (GATES)

Dynamic priority switching is Blackout aware






Power gate a unit only when the active warp count for its instruction type is 0
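One possible reading of the coordination between the scheduler and the gating logic is sketched below. Both functions are hypothetical: the first applies the "active warps count = 0" rule on top of idle detect, and the second shows a priority switch that skips instruction types whose unit is currently in its blackout period. How the two conditions combine in the real design is an assumption here.

```python
def may_power_gate(idle_cycles, active_warps_of_type, idle_detect=5):
    """Gate a unit only if it has been idle long enough AND no active
    warp is waiting to use that unit type (assumed combination)."""
    return idle_cycles >= idle_detect and active_warps_of_type == 0


def pick_priority_type(ready_counts, in_blackout, current):
    """Blackout-aware priority switch (illustrative).

    ready_counts: dict mapping type -> number of ready warps
    in_blackout:  dict mapping type -> True if that unit is in blackout
    current:      current highest-priority type
    """
    def usable(t):
        return ready_counts[t] > 0 and not in_blackout[t]

    if usable(current):
        return current
    for t in ready_counts:      # fall back to any other usable type
        if usable(t):
            return t
    return current              # nothing usable: keep the current priority


choice = pick_priority_type(
    ready_counts={"INT": 2, "FP": 1},
    in_blackout={"INT": True, "FP": False},
    current="INT",
)
```

In the example, `INT` has ready warps but its unit is in blackout, so the sketch moves the priority to `FP` instead of stalling on a unit that cannot wake yet.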












### **Impact of Blackout**





Some benchmarks still show poor performance

Not enough active warps to hide forced idleness

Goal is as close to 0% overhead as possible





### **Adaptive Idle Detect**

**Reducing Worst Case Blackout Impact** 




Dynamically change idle detect to avoid aggressive PG

Infer performance loss due to Blackout

"Critical Wakeup": a wakeup that occurs the moment the blackout period ends








- Independent idle detect values for INT and FP pipelines
- Break execution time into epochs (1000 cycles each)
- If critical wakeups in an epoch > threshold, idleDetect++
- Conservatively decrement idleDetect every 4 epochs
- Bound idle detect between 5 and 10 cycles
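The update rule in the list above can be sketched as a small per-pipeline controller. The slides do not give the threshold value, so it is a required parameter here and the value used in the example is made up. Skipping the decrement in an epoch that has just incremented is also an assumption of this sketch.

```python
class AdaptiveIdleDetect:
    """Toy epoch-based idle-detect controller for one pipeline."""

    def __init__(self, threshold, lo=5, hi=10, decay_epochs=4):
        self.threshold = threshold        # critical wakeups tolerated per epoch
        self.lo, self.hi = lo, hi         # idle detect bounds, in cycles
        self.decay_epochs = decay_epochs  # decrement once every N epochs
        self.idle_detect = lo
        self.epochs_seen = 0

    def end_of_epoch(self, critical_wakeups):
        """Call once per epoch with that epoch's critical wakeup count."""
        self.epochs_seen += 1
        if critical_wakeups > self.threshold:
            # Too many critical wakeups: gate less aggressively.
            self.idle_detect = min(self.hi, self.idle_detect + 1)
        elif self.epochs_seen % self.decay_epochs == 0:
            # Conservative decay back toward the lower bound.
            self.idle_detect = max(self.lo, self.idle_detect - 1)
        return self.idle_detect


# Independent controllers per pipeline; threshold=5 is a hypothetical value.
controllers = {"INT": AdaptiveIdleDetect(threshold=5),
               "FP": AdaptiveIdleDetect(threshold=5)}
history = [controllers["INT"].end_of_epoch(n) for n in (8, 9, 0, 0, 0, 0, 0, 0)]
```

In this example the integer pipeline's idle detect rises from 5 to 7 over two bad epochs, then decays by one at every fourth epoch until it is back at the lower bound.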






### **Architectural Support**










### **Evaluation**




### **Evaluation Methodology**



- GPGPU-Sim v3.0.2
  - Nvidia GTX480
- GPUWattch and McPAT for energy and area estimation
- 18 benchmarks from ISPASS, Rodinia, Parboil
- Power gating parameters
  - Wakeup delay: 3 cycles
  - Breakeven time: 14 cycles
  - Idle detect: 5 cycles



### **Power Gating Wakeups / Overhead**



Coalescing idle periods: fewer, but longer, idle periods

Blackout reduces PG overhead by 26%

Warped Gates reduces PG overhead by 46%



### **Integer Unit Static Energy Savings**



Blackout/Warped Gates is able to save energy when ConvPG cannot

Warped Gates saves ~1.5x static energy w.r.t. ConvPG



### **FP Unit Static Energy Savings**






### **Performance Impact**



Naïve Blackout has high overhead due to aggressive PG

Both ConvPG and Warped Gates have ~1% overhead





### Conclusion



- Execution units are the largest energy consumers in GPGPUs
- Static energy is becoming increasingly important
- Traditional microprocessor power gating techniques are ineffective in GPGPUs due to short idle periods
- GATES: scheduler-level technique to lengthen idle periods by coalescing instruction issue by type
- Blackout: forced idleness of execution units to avoid negative power gating events
- Adaptive Idle Detect: limits performance impact
- Warped Gates is able to save 1.5x more static power than traditional microprocessor techniques, with negligible performance loss





## Thank you!

### Questions?


