# Using Compression to Improve Chip Multiprocessor Performance

## Alaa R. Alameldeen

Dissertation Defense
Wisconsin Multifacet Project
University of Wisconsin-Madison
http://www.cs.wisc.edu/multifacet

#### Motivation

- Architectural trends
  - Multi-threaded workloads
  - Memory wall
  - Pin bandwidth bottleneck
- CMP design trade-offs
  - Number of Cores
  - Cache Size
  - Pin Bandwidth
- Are these trade-offs zero-sum?
  - No, compression helps cache size and pin bandwidth
- ➡ However, hardware compression raises a few questions

#### Thesis Contributions

- Question: Is compression's overhead too high for caches?
- Contribution #1: Simple compressed cache design
  - Compression Scheme: Frequent Pattern Compression
  - Cache Design: Decoupled Variable-Segment Cache
- Question: Can cache compression hurt performance?
  - + Reduces miss rate
  - − Increases hit latency
- Contribution #2: Adaptive compression
  - Adapt to program behavior
  - Cache compression only when it helps

#### Thesis Contributions (Cont.)

- Question: Does compression help CMP performance?
- Contribution #3: Evaluate CMP cache and link compression
  - Cache compression improves CMP throughput
  - Link compression reduces pin bandwidth demand
- Question: How do compression and prefetching interact?
- Contribution #4: Compression interacts positively with prefetching
  - Speedup (Compr, Pref) > Speedup (Compr) x Speedup (Pref)
- Question: How do we balance CMP cores and caches?
- Contribution #5: Model CMP cache and link compression
  - Compression improves optimal CMP configuration

#### **Outline**

- Background
  - Technology and Software Trends
  - Compression Addresses CMP Design Challenges
- Compressed Cache Design
- Adaptive Compression
- CMP Cache and Link Compression
- Interactions with Hardware Prefetching
- Balanced CMP Design
- Conclusions

#### Technology and Software Trends

- Technology trends:
  - Memory Wall: Increasing gap between processor and memory speeds
  - Pin Bottleneck: Bandwidth demand > Bandwidth supply

#### **Using Compression**

- On-chip Compression
  - Cache Compression: Increases effective cache size
  - Link Compression: Increases effective pin bandwidth
- Compression Requirements
  - Lossless
  - Low decompression (compression) overhead
  - Efficient for small block sizes
  - Minimal additional complexity
- Thesis addresses CMP design with compression support

#### **Outline**

- Background
- Compressed Cache Design
  - Compressed Cache Hierarchy
  - Compression Scheme: FPC
  - Decoupled Variable-Segment Cache
- Adaptive Compression
- CMP Cache and Link Compression
- Interactions with Hardware Prefetching
- Balanced CMP Design
- Conclusions

## Frequent Pattern Compression (FPC)

- A significance-based compression algorithm
- Compresses each 32-bit word separately
- Suitable for short (32-256 byte) cache lines
- Compressible Patterns: zeros, sign-ext.
4, 8, or 16 bits, zero-padded half-word, two sign-extended (SE) half-words, repeated byte
- Pattern detected ⇒ Store pattern prefix + significant bits
- A 64-byte line is decompressed in a five-stage pipeline

### Decoupled Variable-Segment Cache

- Each set contains twice as many tags as uncompressed lines
- Data area divided into 8-byte segments
- Each tag is composed of:
  - Address tag and permissions: same as uncompressed cache
  - CStatus: 1 if the line is compressed, 0 otherwise
  - CSize: Size of compressed line in segments
  - LRU/replacement bits

#### **Outline**

- Background
- Compressed Cache Design
- Adaptive Compression
  - Key Insight
  - Classification of Cache Accesses
  - Performance Evaluation
- CMP Cache and Link Compression
- Interactions with Hardware Prefetching
- Balanced CMP Design
- Conclusions

#### **Adaptive Compression**

Use past to predict future

- Key Insight:
  - LRU Stack [Mattson, et al., 1970] indicates for each reference whether compression helps or hurts

#### **Compression Predictor**

- Estimate: Benefit(Compression) − Cost(Compression)
- Single counter: Global Compression Predictor (GCP)
  - Saturating up/down 19-bit counter
- GCP updated on each cache access
  - Benefit: Increment by memory latency
  - Cost: Decrement by decompression latency
  - Optimization: Normalize the increment to memory\_lat / decompression\_lat and the decrement to 1
- Cache Allocation
  - Allocate compressed line if GCP ≥ 0
  - Allocate uncompressed lines if GCP < 0

#### Simulation Setup

- Workloads:
  - Commercial workloads [Computer'03, CAECW'02]:
    - OLTP: IBM DB2 running a TPC-C like workload
    - SPECjbb
    - Static Web serving: Apache and Zeus
  - SPEC2000 benchmarks:
    - SPECint: bzip, gcc, mcf, twolf
    - SPECfp: ammp, applu, equake, swim
- Simulator:
  - Simics full system simulator; augmented with:
  - Multifacet General Execution-driven Multiprocessor Simulator (GEMS) [Martin, et al., 2005, http://www.cs.wisc.edu/gems/]

### System configuration

#### Configuration parameters:

| Component | Configuration |
|-----------|---------------|
| L1 Cache | Split I&D, 64KB each, 4-way SA, 64B line, 3 cycles/access |
| L2 Cache | Unified 4MB, <i>8-way</i> SA, 64B line, access latency 15 cycles + <i>5-cycle decompression latency</i> (if needed) |
| Memory | 4GB DRAM, 400-cycle access time, 16 outstanding requests |
| Processor | Dynamically scheduled SPARC V9, 4-wide superscalar, 64-entry Instruction Window, 128-entry reorder buffer |

### Simulated Cache Configurations

- Always: All compressible lines are stored in compressed format
  - Decompression penalty for all compressed lines
- Never: All cache lines are stored in uncompressed format
  - Cache is 8-way set associative with half the number of sets
  - Does not incur decompression penalty
- Adaptive: Adaptive compression scheme

### **Optimal Adaptive Compression?**

- Optimal: Always with no decompression penalty

#### Adaptive Compression: Summary

- Cache compression increases cache capacity but slows down cache hit time
  - Helps some benchmarks (e.g., apache, mcf)
  - Hurts other benchmarks (e.g., gcc, ammp)
- Adaptive compression
  - Uses (LRU) replacement stack to determine whether compression helps or hurts
  - Updates a single global saturating counter on cache accesses
- Adaptive compression performs similarly to the better of Always Compress and Never Compress

#### **Outline**

- Background
- Compressed Cache Design
- Adaptive Compression
- CMP Cache and Link Compression
- Interactions with Hardware Prefetching
- Balanced CMP Design
- Conclusions

#### **Link Compression**

- On-chip L3/Memory Controller transfers compressed messages
- Data Messages
  - 1-8 sub-messages (flits), 8 bytes each
- Off-chip memory controller combines flits and stores to memory

#### Hardware Stride-Based Prefetching

- L2 Prefetching
  - + Hides memory latency
  - − Increases pin bandwidth demand
- L1 Prefetching
  - + Hides L2 latency
  - − Increases L2 contention and on-chip bandwidth demand
  - Triggers L2 fill requests ⇒
Increases pin bandwidth demand
- Questions:
  - Does compression interact positively or negatively with hardware prefetching?
  - How does a system with both compare to a system with only compression or only prefetching?

#### Interactions Terminology

Assume a base system S with two architectural enhancements A and B. All systems run program P.

$$Speedup(A) = Runtime(P, S) / Runtime(P, A)$$

$$Speedup(A, B) = Speedup(A) \times Speedup(B) \times (1 + Interaction(A, B))$$

## Compression and Prefetching Interactions

#### Positive Interactions:

- + L1 prefetching hides part of decompression overhead
- + Link compression reduces the extra bandwidth demand caused by prefetching
- + Cache compression increases effective L2 size, L2 prefetching increases working set size

#### Negative Interactions:

- − L2 prefetching and L2 compression can eliminate the same misses

⇒ Is Interaction(Compression, Prefetching) positive or negative?

#### Evaluation

- 8-core CMP
- Cores: single-threaded, out-of-order superscalar with a 64-entry IW, 128-entry ROB, 5 GHz clock frequency
- L1 Caches: 64K instruction, 64K data, 4-way SA, 320 GB/sec total on-chip bandwidth (to/from L1), 3-cycle latency
- Shared L2 Cache: 4 MB, 8-way SA (uncompressed), 15-cycle uncompressed latency, 128 outstanding misses
- Memory: 400 cycles access latency, 20 GB/sec memory bandwidth
- Prefetching:
  - Similar to prefetching in IBM's Power4 and Power5
  - 8 unit/negative/non-unit stride streams for L1 and L2 for each processor
  - Issue 6 L1 prefetches on L1 miss
  - Issue 25 L2 prefetches on L2 miss

# Compression and Prefetching: Summary

- More cores on a CMP increase demand for:
  - On-chip (shared) caches
  - Off-chip pin bandwidth
- Prefetching further increases demand on both resources
- Cache and link compression alleviate such demand
- Compression interacts positively with hardware prefetching

#### **Outline**

- Background
- Compressed Cache Design
- Adaptive Compression
- CMP Cache and Link Compression
- Interactions with Hardware
Prefetching
- Balanced CMP Design
  - Analytical Model
  - Simulation
- Conclusions

#### Simple Analytical Model

- Provides intuition on core vs. cache trade-off
- Model simplifying assumptions:
  - Pin bandwidth demand follows an M/D/1 model
  - Miss rate decreases with the square root of the increase in cache size
  - Blocking in-order processor
  - Some parameters are fixed with change in #processors
  - Uses IPC instead of a work-related metric

### Balanced CMP Design: Summary

- Analytical model can qualitatively predict throughput
  - Can provide intuition into trade-off
  - Quickly analyzes sensitivity to CMP parameters
  - Not accurate enough to estimate throughput
- Compression improves throughput across all configurations
  - Larger improvement for "optimal" configuration
- Compression can shift balance towards more cores
- Compression interacts positively with prefetching for most configurations

#### Related Work (1/2)

- Memory Compression
  - IBM MXT technology
  - Compression schemes: X-Match, X-RL
  - Significance-based compression: Ekman and Stenstrom
- Virtual Memory Compression
  - Wilson et al.: varying compression cache size
- Cache Compression
  - Selective compressed cache: compress blocks to half size
  - Frequent value cache: frequent L1 values stored in cache
  - Hallnor and Reinhardt: Use indirect indexed cache for compression

#### Related Work (2/2)

- Link Compression
  - Farrens and Park: address compaction
  - Citron and Rudolph: table-based approach for address & data
- Prefetching in CMPs
  - IBM's Power4 and Power5 stride-based prefetching
  - Beckmann and Wood: prefetching improves 8-core performance
  - Ganusov and Burtscher: One CMP core dedicated to prefetching
- Balanced CMP Design
  - Huh et al.: Pin bandwidth a first-order constraint
  - Davis et al.: Simple Chip multi-threaded cores maximize throughput

#### Conclusions

- CMPs increase demand on caches and pin bandwidth
  - Prefetching further increases such demand
- Cache Compression
  - + Increases effective cache size
  - − Increases cache
access time
- Link Compression decreases bandwidth demand
- Adaptive Compression
  - Helps programs that benefit from compression
  - Does not hurt programs that are hurt by compression
- CMP Cache and Link Compression
  - Improve CMP throughput
  - Interact positively with hardware prefetching
- Compression improves CMP performance

## Backup Slides

- Moore's Law: CPU vs. Memory S
- Moore's Law (1965)
- Software Trends
- Decoupled Variable-Segment Cache
- Classification of L2 Accesses
- Compression Ratios
- Seg. Compr. Ratios: SPECint, SPECfp, Commercial
- Frequent Pattern Histogram
- Segment Histogram
- (LRU) Stack Replacement
- Cache Bits Read or Written
- Sensitivity to L2 Associativity
- Sensitivity to Memory Latency
- Sensitivity to Decompression Latency
- Sensitivity to Cache Line Size
- Phase Behavior
- CMP Compression: Sensitivity to L2 Size
- CMP Compression: Sensitivity to Memory Latency
- CMP Compression: Sensitivity to Pin Bandwidth
- Sensitivity to #Cores
- Sensitivity to #Cores: OLTP
- Sensitivity to #Cores: Apache
- Analytical Model: IPC
- Model Parameters
- Model Sensitivity to Memory Latency
- Model Sensitivity to Pin Bandwidth
- Model Sensitivity to L2 Miss rate
- Model Sensitivity to Compression Ratio
- Model Sensitivity to Decompression Penalty
- Model Sensitivity to Perfect CPI
- Simulation (20 GB/sec bandwidth): apache
- Simulation (20 GB/sec bandwidth): oltp
- Simulation (20 GB/sec bandwidth): jbb
- Simulation (10 GB/sec bandwidth): zeus
- Simulation (10 GB/sec bandwidth): apache
- Online Transaction Processing (OLTP)
- Java Server Workload (SPECjbb)
- Static Web Content Serving: Apache

#### Classification of L2 Accesses

- Cache hits:
  - Unpenalized hit: Hit to an uncompressed line that would have hit without compression
  - Penalized hit: Hit to a compressed line that would have hit without compression
  - + Avoided
miss: Hit to a line that would NOT have hit without compression
- Cache misses:
  - + Avoidable miss: Miss to a line that would have hit with compression
  - Unavoidable miss: Miss to a line that would have missed even with compression

#### Seg. Compression Ratios - SPECint

#### Seg. Compression Ratios - SPECfp

# Seg. Compression Ratios - Commercial

#### Frequent Pattern Histogram

[Figure: frequent pattern histogram. Y-axis: Pattern %. Categories: Zeros, Compr. 4-bits, Compr. 8-bits, Compr. 16-bits, Uncompressible. Benchmarks: bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, jbb.]

#### (LRU) Stack Replacement

- Differentiate penalized hits and avoided misses?
  - Only hits to top half of the tags in the LRU stack are penalized hits
- Differentiate avoidable and unavoidable misses?

$$Avoidable\_Miss(k) \Leftrightarrow \sum_{LRU(i)=1}^{LRU(i)=LRU(k)} CSize(i) \leq 16$$

- Is not dependent on LRU replacement
  - Any replacement algorithm for top half of tags
  - Any stack algorithm for the remaining tags

# Cache Bits Read or Written

# Sensitivity to L2 Associativity

# Sensitivity to Memory Latency

# Sensitivity to Decompression Latency

### **Phase Behavior**

### Commercial CMP Designs

- IBM Power5 Chip:
  - Two processor cores, each 2-way multi-threaded
  - ~1.9 MB on-chip L2 cache
    - < 0.5 MB per thread with no sharing
    - Compare with 0.75 MB per thread in Power4+
  - Est. ~16 GB/sec. max. pin bandwidth
- Sun's Niagara Chip:
  - Eight processor cores, each 4-way multi-threaded
  - 3 MB L2 cache
    - < 0.4 MB per core, < 0.1 MB per thread with no sharing
  - Est. ~22 GB/sec.
pin bandwidth

#### CMP Compression – Miss Rates

[Figure: normalized L2 miss rate, No Compression vs. L2 Compression. Benchmarks: apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid.]

# CMP Compression: Pin Bandwidth Demand

# CMP Compression: Sensitivity to L2 Size

# CMP Compression: Sensitivity to Memory Latency

# CMP Compression: Sensitivity to Pin Bandwidth

| Benchmark | L1 I PF rate | L1 I coverage | L1 I accuracy | L1 D PF rate | L1 D coverage | L1 D accuracy | L2 PF rate | L2 coverage | L2 accuracy |
|-----------|--------------|---------------|---------------|--------------|---------------|---------------|------------|-------------|-------------|
| apache | 4.9 | 16.4 | 42.0 | 6.1 | 8.8 | 55.5 | 10.5 | 37.7 | 57.9 |
| zeus | 7.1 | 14.5 | 38.9 | 5.5 | 17.7 | 79.2 | 8.2 | 44.4 | 56.0 |
| oltp | 13.5 | 20.9 | 44.8 | 2.0 | 6.6 | 58.0 | 2.4 | 26.4 | 41.5 |
| jbb | 1.8 | 24.6 | 49.6 | 4.2 | 23.1 | 60.3 | 5.5 | 34.2 | 32.4 |
| art | 0.05 | 9.4 | 24.1 | 56.3 | 30.9 | 81.3 | 49.7 | 56.0 | 85.0 |
| apsi | 0.04 | 15.7 | 30.7 | 8.5 | 25.5 | 96.9 | 4.6 | 95.8 | 97.6 |
| fma3d | 0.06 | 7.5 | 14.4 | 7.3 | 27.5 | 80.9 | 8.8 | 44.6 | 73.5 |

# Sensitivity to #Cores - OLTP

### Sensitivity to #Cores - Apache

# Analytical Model: IPC

$$IPC(N) = \frac{N}{CPI_{PerfectL2} + dp + MissPenalty_{L2} \cdot \alpha \cdot \sqrt{\frac{N - sharers_{av}(N) + 1}{c \cdot (m - k_p \cdot N)}}}$$

$$MissPenalty_{L2} = MemoryLatency + LinkLatency$$

$$LinkLatency = \overline{X} + \frac{IPC(N) \cdot Missrate(S_{L2p}) \cdot \overline{X}^{2}}{2 \cdot (1 - IPC(N) \cdot Missrate(S_{L2p}) \cdot \overline{X})}$$

#### **Model Parameters**

- Divide chip area between cores and caches
  - Area of one (in-order) core = 0.5 MB L2 cache
  - Total chip area = 16 cores, or 8 MB cache
  - Core frequency = 5 GHz
  - Available bandwidth = 20 GB/sec.
- Model Parameters (hypothetical benchmark)
  - Compression Ratio = 1.75
  - Decompression penalty = 0.4 cycles per instruction
  - Miss rate = 10 misses per 1000 instructions for one processor, 8 MB cache
  - IPC for one processor, perfect cache = 1
  - Average #sharers per block = 1.3 (for #proc > 1)

#### Model - Sensitivity to Memory Latency

- Compression's impact similar on both extremes
- **Compression can shift optimal configuration towards more cores (though not significantly)**

### Model - Sensitivity to Pin Bandwidth

# Model - Sensitivity to L2 Miss rate

### Model - Sensitivity to Compression Ratio

## Model - Sensitivity to Decompression Penalty

### Model - Sensitivity to Perfect CPI

# Simulation (20 GB/sec bandwidth) - apache

### Simulation (20 GB/sec bandwidth) - oltp

# Simulation (20 GB/sec bandwidth) - jbb

# Simulation (10 GB/sec bandwidth) - apache

### Simulation (10 GB/sec bandwidth) - oltp

### Simulation (10 GB/sec bandwidth) - jbb

# Compression & Prefetching Interaction – 10 GB/sec pin bandwidth

- Interaction is positive for most configurations (and all "optimal" configurations)

#### Model Error – oltp, jbb

# Online Transaction Processing (OLTP)

#### DB2 with a TPC-C-like workload.

- Based on the TPC-C v3.0 benchmark.
- We use IBM's DB2 V7.2 EEE database management system and an IBM benchmark kit to build the database and emulate users.
- 5 GB 25000-warehouse database on eight raw disks and an additional dedicated database log disk.
- We scaled down the size of each warehouse by maintaining the reduced ratios of 3 sales districts per warehouse, 30 customers per district, and 100 items per warehouse (compared to 10, 30,000 and 100,000 required by the TPC-C specification).
- Think and keying times for users are set to zero.
- 16 users per processor
- Warmup interval: 100,000 transactions

#### Java Server Workload (SPECjbb)

#### SpecJBB.
- We used Sun's HotSpot 1.4.0 Server JVM and Solaris's native thread implementation
- The benchmark includes driver threads to generate transactions
- We set the system heap size to 1.8 GB and the new object heap size to 256 MB to reduce the frequency of garbage collection
- 24 warehouses, with a data size of approximately 500 MB.

# Static Web Content Serving: Apache

#### Apache.

- We use Apache 2.0.39 for SPARC/Solaris 9 configured to use pthread locks and minimal logging at the web server
- We use the Scalable URL Request Generator (SURGE) as the client.
- SURGE generates a sequence of static URL requests that exhibit representative distributions for document popularity, document sizes, request sizes, temporal and spatial locality, and embedded document count
- We use a repository of 20,000 files (totaling ~500 MB)
- Clients have zero think time
- We compiled both Apache and SURGE using Sun's WorkShop C 6.1 with aggressive optimization
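
The Global Compression Predictor from the "Compression Predictor" slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation. The class and method names are hypothetical, the signed two's-complement range for the 19-bit counter is an assumption, and so is the mapping of access classes to updates (avoided and avoidable misses count as benefit, penalized hits as cost), which is one reading of the "Classification of L2 Accesses" slide. The default latencies are the 400-cycle memory access and 5-cycle decompression values from the system configuration.

```python
from enum import Enum, auto


class Access(Enum):
    """L2 access classes named on the "Classification of L2 Accesses" slide."""
    UNPENALIZED_HIT = auto()
    PENALIZED_HIT = auto()
    AVOIDED_MISS = auto()
    AVOIDABLE_MISS = auto()
    UNAVOIDABLE_MISS = auto()


class GlobalCompressionPredictor:
    """Single saturating up/down counter (hypothetical sketch of the GCP)."""

    def __init__(self, memory_latency=400, decompression_latency=5, bits=19):
        self.memory_latency = memory_latency
        self.decompression_latency = decompression_latency
        # Assumption: the counter is signed and saturates at the
        # two's-complement limits of a `bits`-wide register.
        self.max_value = (1 << (bits - 1)) - 1
        self.min_value = -(1 << (bits - 1))
        self.counter = 0

    def update(self, access):
        """Update the counter for one classified L2 access."""
        if access in (Access.AVOIDED_MISS, Access.AVOIDABLE_MISS):
            # Compression saved (or would have saved) a memory access.
            self.counter = min(self.max_value,
                               self.counter + self.memory_latency)
        elif access is Access.PENALIZED_HIT:
            # Compression only added decompression latency to this hit.
            self.counter = max(self.min_value,
                               self.counter - self.decompression_latency)
        # Unpenalized hits and unavoidable misses leave the counter unchanged.

    def should_compress(self):
        """Allocate a compressed line when GCP >= 0, as on the slide."""
        return self.counter >= 0
```

A cache model would call `update` on every L2 access and consult `should_compress` when allocating a line. The normalization optimization on the slide would correspond to passing `memory_latency=80, decompression_latency=1` for the configuration above.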
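
The question posed on the "Compression and Prefetching Interactions" slide, whether Interaction(Compression, Prefetching) is positive or negative, can be answered from four measured runtimes. The sketch below assumes the multiplicative definition Speedup(A, B) = Speedup(A) × Speedup(B) × (1 + Interaction(A, B)) and solves it for the interaction term. The function names are hypothetical and the runtimes in the usage note are made-up numbers, not results from the thesis.

```python
def speedup(base_runtime, enhanced_runtime):
    """Speedup of an enhanced system over the base system S."""
    return base_runtime / enhanced_runtime


def interaction(base, with_a, with_b, with_both):
    """Interaction(A, B), assuming
    Speedup(A, B) = Speedup(A) * Speedup(B) * (1 + Interaction(A, B)).

    All arguments are runtimes of the same program P.
    A positive result means A and B together beat the product of
    their individual speedups.
    """
    combined = speedup(base, with_both)
    independent = speedup(base, with_a) * speedup(base, with_b)
    return combined / independent - 1.0
```

For illustrative runtimes of 100 (base), 80 (A only), 90 (B only), and 70 (both), `interaction(100, 80, 90, 70)` is slightly positive, while a combined runtime of 72 gives an interaction of zero.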
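
The LinkLatency expression on the "Analytical Model: IPC" slide has the form of the mean time in an M/D/1 queue: service time plus queueing delay. The sketch below evaluates it for a fixed arrival rate. Two simplifications are assumptions of this sketch: the arrival rate is taken as a given number rather than solved jointly with IPC(N), and the service time is derived as line size divided by link bytes per cycle. The function name is hypothetical.

```python
def link_latency(service_time, arrival_rate):
    """Mean M/D/1 latency in cycles.

    service_time: cycles to transfer one line over the pins (X-bar).
    arrival_rate: misses per cycle, i.e. IPC(N) * Missrate in the model.
    """
    utilization = arrival_rate * service_time
    if not 0 <= utilization < 1:
        raise ValueError("link is saturated: utilization must be in [0, 1)")
    queueing_delay = arrival_rate * service_time ** 2 / (2 * (1 - utilization))
    return service_time + queueing_delay


# Model parameters from the slides: 20 GB/sec at 5 GHz is 4 bytes per cycle,
# so an uncompressed 64-byte line takes 16 cycles (assumed service time).
uncompressed = link_latency(64 / 4, 1 * 10 / 1000)
# Assumption: a 1.75 compression ratio shortens the transfer proportionally.
compressed = link_latency(64 / 4 / 1.75, 1 * 10 / 1000)
```

With one miss per 100 cycles the link is lightly loaded, so most of the difference between `uncompressed` and `compressed` comes from the shorter transfer; the queueing term grows quickly as utilization approaches 1, which is where link compression matters most in this model.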