Managing Wire Delay in Large Chip-Multiprocessor Caches
Citations
Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Reactive NUCA: near-optimal block placement and replication in distributed caches
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory
References
The SPLASH-2 programs: characterization and methodological considerations
Simics: A full system simulation platform
The future of wires
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
Niagara: a 32-way multithreaded Sparc processor
Frequently Asked Questions (13)
Q2. What are the contributions of "Managing wire delay in large chip-multiprocessor caches"?
CMPs often share the on-chip L2 cache, which requires multiple ports to provide sufficient bandwidth. In this paper, the authors develop L2 cache designs for CMPs that incorporate three latency management techniques: block migration, on-chip transmission lines, and stride-based prefetching. They use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, they demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, they observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, they show that stride-based prefetching between the L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, they present a hybrid design that combines all three techniques and improves performance by an additional 2% to 19% over prefetching alone.
Q3. What is the role of wire delay in the design of a CMP?
Design partitioning, along with the integration of more metal layers, allows wire dimensions to decrease more slowly than transistor dimensions, keeping wire delay manageable for short distances [20, 42].
Q4. What is the reason for the increase in resistance of wires?
Wire resistance increases due to the smaller cross-sectional area, and sidewall capacitance increases due to the greater surface area exposed to adjacent wires.
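This scaling trend can be illustrated with a first-order RC model. The constants below are hypothetical, chosen only to show the trend, not to match any real process:

```python
# Illustrative first-order model of why wire RC delay worsens with scaling.
# rho and eps_sidewall are hypothetical coefficients; only the trend matters.

def wire_rc_per_mm(width_um, thickness_um, spacing_um,
                   rho=0.022, eps_sidewall=0.16):
    """Return (resistance, capacitance) per mm of wire."""
    r = rho / (width_um * thickness_um)           # R grows as cross-section shrinks
    c = eps_sidewall * thickness_um / spacing_um  # C grows as neighbors get closer
    return r, c

# Scale all dimensions by 0.7 (roughly one technology generation):
r0, c0 = wire_rc_per_mm(0.5, 1.0, 0.5)
r1, c1 = wire_rc_per_mm(0.35, 0.7, 0.35)
print(r1 * c1 / (r0 * c0))  # ~2.04: RC delay per mm roughly doubles
```

Because the sidewall term depends on the thickness-to-spacing ratio, uniform scaling leaves C per mm unchanged while R per mm grows as 1/s², so delay per unit length grows quadratically as dimensions shrink.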
Q5. How can data be communicated in the LC range?
In the LC range, data can be communicated by propagating an incident wave across the transmission line instead of charging the capacitance across a series of wire segments.
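The advantage of wave propagation can be sketched numerically: an LC transmission line's flight time grows linearly with length (at velocity 1/sqrt(LC)), whereas an unrepeated distributed RC wire's delay grows quadratically. The per-unit-length values below are hypothetical, chosen only to illustrate the contrast:

```python
import math

# Hypothetical per-mm inductance/capacitance/resistance values; the point
# is the linear-vs-quadratic growth, not the absolute numbers.

def lc_delay_ps(length_mm, L_per_mm=0.5e-9, C_per_mm=0.2e-12):
    v = 1.0 / math.sqrt(L_per_mm * C_per_mm)  # wave velocity, mm per second
    return length_mm / v * 1e12               # one-way flight time in ps

def rc_delay_ps(length_mm, r_per_mm=50.0, c_per_mm=0.2e-12):
    # Distributed RC (Elmore) delay ~ 0.5 * R * C * length^2, in ps
    return 0.5 * r_per_mm * c_per_mm * length_mm**2 * 1e12

for mm in (5, 10, 20):
    print(mm, round(lc_delay_ps(mm), 1), round(rc_delay_ps(mm), 1))
```

Doubling the wire length doubles the transmission-line delay but quadruples the RC delay, which is why transmission lines are attractive for long cross-chip distances.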
Q6. How many banks are used to control the latency of the L2 cache?
The 16 MB L2 storage array is partitioned into 256 banks to control bank access latency [1] and to provide sufficient bandwidth to support up to 128 simultaneous on-chip processor requests.
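The arithmetic of this partitioning can be sketched as follows. The block size and the choice of index bits for bank selection are assumptions for illustration; the paper's exact indexing scheme is not stated in this answer:

```python
# Sketch of bank interleaving for a 16 MB L2 split into 256 banks.
# BLOCK_BYTES and the low-order-bit bank selection are assumptions.

CACHE_BYTES = 16 * 2**20
NUM_BANKS = 256
BLOCK_BYTES = 64  # assumed cache block size

BANK_BYTES = CACHE_BYTES // NUM_BANKS  # 64 KB of storage per bank

def bank_of(paddr):
    """Map a physical address to a bank, interleaving consecutive blocks."""
    block = paddr // BLOCK_BYTES
    return block % NUM_BANKS

print(BANK_BYTES)                                  # 65536
print(bank_of(0), bank_of(64), bank_of(64 * 256))  # 0 1 0
```

Interleaving consecutive blocks across banks spreads concurrent requests over many small, fast banks, which is how the design supports up to 128 simultaneous requests.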
Q7. Why will architects turn to additional techniques for managing on-chip delay?
As wire delays continue to increase, architects will turn to additional techniques such as block migration or transmission lines to manage on-chip delay.
Q8. How many MB of L2 cache would be used?
The authors estimate eight 4-wide superscalar processors would occupy 120 mm2 [29] and 16 MB of L2 cache storage would occupy 64 mm2 [16].
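As a back-of-envelope check, the per-core and per-MB figures below are derived from the quoted totals (they are not stated separately in this answer), giving the implied die budget:

```python
# Area budget implied by the estimates above; per-unit figures are
# derived from the quoted totals of 120 mm^2 [29] and 64 mm^2 [16].
cores, core_area = 8, 120 / 8      # 15 mm^2 per 4-wide superscalar core
l2_mb, area_per_mb = 16, 64 / 16   # 4 mm^2 per MB of L2 storage
print(cores * core_area + l2_mb * area_per_mb)  # 184.0 mm^2 combined
```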
Q9. Why do barnes, apsi, and fma3d encounter many hits in distant bankclusters?
Due to a lack of frequently repeatable requests, barnes, apsi, and fma3d encounter 30% to 62% of L2 hits in the distant 10 bankclusters.
Q10. What is the drawback of separate partial tag structures?
More importantly, separate partial tag structures require a complex coherence scheme that updates address location state in the partial tags with block migrations.
Q11. What is the effect of hardware-directed strided prefetching?
While current systems perform hardware-directed strided prefetching [19, 21, 43], its effectiveness is workload dependent [10, 22, 46, 49].
Q12. How does separating L2 miss streams by processor improve prefetcher performance?
Similar to separating branch predictor histories per thread [39], separating the L2 miss streams by processor significantly improves prefetcher performance (up to 14 times for the workload ocean).
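A minimal sketch of this idea, assuming a simple last-address stride detector (not the paper's exact prefetcher): when each processor's miss stream gets its own detector, per-CPU strides stay visible even though the combined L2 stream interleaves them.

```python
from collections import defaultdict

# Minimal per-processor stride detector (illustrative, not the paper's design).

class StrideDetector:
    def __init__(self):
        self.last = None
        self.stride = None

    def observe(self, addr):
        """Return a prefetch address once the same stride repeats."""
        pred = None
        if self.last is not None:
            s = addr - self.last
            if s == self.stride:  # stride confirmed -> prefetch next block
                pred = addr + s
            self.stride = s
        self.last = addr
        return pred

per_cpu = defaultdict(StrideDetector)  # one detector per processor

def on_l2_miss(cpu, addr):
    return per_cpu[cpu].observe(addr)

# Two CPUs streaming with strides 64 and 128. A single shared detector would
# see the interleaved sequence 0, 1000, 64, 1128, ... and never lock on.
hits = [on_l2_miss(c, a) for c, a in
        [(0, 0), (1, 1000), (0, 64), (1, 1128), (0, 128), (1, 1256)]]
print(hits)  # [None, None, None, None, 192, 1384]
```

With separate tables, both streams trigger useful prefetches (192 for CPU 0, 1384 for CPU 1); merged into one stream, the alternating strides would keep invalidating each other.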
Q13. What is the potential benefit of block migration in a CMP cache?
The potential benefit of block migration in a CMP cache is fundamentally limited by the large amount of inter-processor sharing that exists in some workloads.