Reducing design complexity of the load/store queue

doi:10.5555/956417.956555

Open Access, Proceedings Article

Il Park, +2 more

- pp 411-422

TLDR

This study introduces three techniques to scale the load/store queue: two, a store-load pair predictor and a load buffer, reduce the search bandwidth requirement, and one, segmentation, scales the size.

Abstract:

With faster CPU clocks and wider pipelines, all relevant microarchitecture components should scale accordingly. There have been many proposals for scaling the issue queue, register file, and cache hierarchy. However, nothing has been done for scaling the load/store queue, despite the increasing pressure on the load/store queue in terms of capacity and search bandwidth. The load/store queue is a CAM structure which holds in-flight memory instructions and supports simultaneous searches to honor memory dependencies and memory consistency models. Therefore, it is difficult to scale the load/store queue. In this study, we introduce novel techniques to scale the load/store queue. We propose two techniques, store-load pair predictor and load buffer, to reduce the search bandwidth requirement; and one technique, segmentation, to scale the size. We show that a load/store queue using our predictor and load buffer needs only one port to outperform a conventional two-ported load/store queue. Compared to the same base case, segmentation alone achieves speedups of 5% for integer benchmarks and 19% for floating-point benchmarks. A one-ported load/store queue using all of our techniques improves performance on average by 6% and 23%, and up to 15% and 59%, for integer and floating-point benchmarks, respectively, over a two-ported conventional load/store queue.
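To make the search behavior described in the abstract concrete, here is a minimal software sketch of the associative lookup a load performs against older in-flight stores. It is an illustration only, not the paper's design: the names `StoreEntry`, `StoreQueue`, `seq`, and `forward` are hypothetical, the linear scan stands in for what a CAM does in parallel, and the proposed predictor, load buffer, and segmentation are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int    # program-order sequence number (hypothetical field)
    addr: int   # effective address written by the store
    value: int  # data the store will write


class StoreQueue:
    """Toy store queue; the loop in forward() models one CAM search."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[StoreEntry] = []

    def insert(self, entry: StoreEntry) -> bool:
        """Add an in-flight store; return False if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(entry)
        return True

    def forward(self, load_seq: int, load_addr: int) -> Optional[int]:
        """Return the value of the youngest store that is older than the
        load and matches its address, or None if no such store exists."""
        best: Optional[StoreEntry] = None
        for e in self.entries:
            if e.addr == load_addr and e.seq < load_seq:
                if best is None or e.seq > best.seq:
                    best = e
        return None if best is None else best.value


if __name__ == "__main__":
    sq = StoreQueue(capacity=4)
    sq.insert(StoreEntry(seq=1, addr=0x100, value=7))
    sq.insert(StoreEntry(seq=3, addr=0x100, value=9))
    sq.insert(StoreEntry(seq=5, addr=0x200, value=4))
    print(sq.forward(load_seq=4, load_addr=0x100))  # 9: youngest older match
    print(sq.forward(load_seq=4, load_addr=0x200))  # None: that store is younger
```

Every load triggers one such search, and in hardware each simultaneous search needs its own port, which is why the abstract treats search bandwidth and capacity as the two scaling pressures.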
