Reducing design complexity of the load/store queue
Il Park, Chong-Liang Ooi, T. N. Vijaykumar +2 more
- pp 411-422
TLDR
This study introduces novel techniques to scale the load/store queue, and proposes two techniques, store-load pair predictor and load buffer, to reduce the search bandwidth requirement; and one technique, segmentation, to scale the size.
Abstract:
With faster CPU clocks and wider pipelines, all relevant microarchitecture components should scale accordingly. There have been many proposals for scaling the issue queue, register file, and cache hierarchy. However, nothing has been done for scaling the load/store queue, despite the increasing pressure on the load/store queue in terms of capacity and search bandwidth. The load/store queue is a CAM structure which holds in-flight memory instructions and supports simultaneous searches to honor memory dependencies and memory consistency models. Therefore, it is difficult to scale the load/store queue. In this study, we introduce novel techniques to scale the load/store queue. We propose two techniques, store-load pair predictor and load buffer, to reduce the search bandwidth requirement; and one technique, segmentation, to scale the size. We show that a load/store queue using our predictor and load buffer needs only one port to outperform a conventional two-ported load/store queue. Compared to the same base case, segmentation alone achieves speedups of 5% for integer benchmarks and 19% for floating-point benchmarks. A one-ported load/store queue using all of our techniques improves performance on average by 6% and 23%, and up to 15% and 59%, for integer and floating-point benchmarks, respectively, over a two-ported conventional load/store queue.
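To make the search requirement described in the abstract concrete, the following is a minimal Python sketch of the two associative searches a conventional load/store queue is generally understood to perform: a load looking up older stores for forwarding, and a store looking up younger, already-executed loads for ordering violations. It is not the paper's design. The entry fields (`seq`, `addr`, `value`, `executed`) and both function names are hypothetical, and the sketch assumes exact-address matches with equal access sizes. The loops stand in for what hardware would do as parallel CAM comparisons.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int    # program-order sequence number (hypothetical field)
    addr: int   # memory address written
    value: int  # data to be written


@dataclass
class LoadEntry:
    seq: int                # program-order sequence number (hypothetical field)
    addr: int               # memory address read
    executed: bool = False  # True once the load has obtained its value


def forward_from_store_queue(stores: List[StoreEntry],
                             load: LoadEntry) -> Optional[int]:
    """Return the value of the youngest store that is older than `load`
    and writes the same address, or None if no such store is in flight."""
    best: Optional[StoreEntry] = None
    for s in stores:  # models one associative search of the store queue
        if s.seq < load.seq and s.addr == load.addr:
            if best is None or s.seq > best.seq:
                best = s
    return best.value if best is not None else None


def find_violating_loads(loads: List[LoadEntry],
                         store: StoreEntry) -> List[int]:
    """Return sequence numbers of younger loads that already executed
    to the address `store` writes (candidates for re-execution)."""
    return [l.seq for l in loads  # models one search of the load queue
            if l.seq > store.seq and l.executed and l.addr == store.addr]


if __name__ == "__main__":
    stores = [StoreEntry(1, 0x100, 11), StoreEntry(3, 0x100, 33),
              StoreEntry(5, 0x200, 55), StoreEntry(9, 0x100, 99)]
    print(forward_from_store_queue(stores, LoadEntry(7, 0x100)))  # 33
    print(forward_from_store_queue(stores, LoadEntry(7, 0x300)))  # None

    loads = [LoadEntry(4, 0x100, True), LoadEntry(6, 0x100, False),
             LoadEntry(8, 0x200, True)]
    print(find_violating_loads(loads, StoreEntry(2, 0x100, 0)))   # [4]
```

In this model every load and every store triggers a search over all entries of the opposite queue, which is one way to see why search bandwidth (ports) and queue size both become scaling concerns.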
Citations
Proceedings Article
Mechanisms for store-wait-free multiprocessors
TL;DR: This work proposes the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers, and atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses.
Proceedings Article
Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation
TL;DR: A new, software-based, defect detection and diagnosis technique, called access-control extension (ACE), that can access and control the microprocessor's internal state and can diagnose and locate a hardware defect, and then activate system repair through resource reconfiguration.
Proceedings Article
Scalable hardware memory disambiguation for high ILP processors
TL;DR: A new class of solutions yields an order-of-magnitude reduction in the energy required to properly order loads and stores for windows of hundreds to thousands of in-flight instructions.
Proceedings Article
Dual-core execution: building a highly scalable single-thread instruction window
TL;DR: Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
Proceedings Article
Scalable Store-Load Forwarding via Store Queue Index Prediction
TL;DR: This work improves SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search, and uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward.
References
Journal Article
Shared memory consistency models: a tutorial
TL;DR: This work describes an alternative, programmer-centric view of relaxed consistency models that describes them in terms of program behavior, not system optimizations, and most of these models emphasize the system optimizations they support.
Journal Article
POWER4 system microarchitecture
TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Proceedings Article
Memory dependence prediction using store sets
George Z. Chrysos, Joel Emer +1 more
TL;DR: It is shown that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Proceedings Article
Energy-effective issue logic
TL;DR: It is shown that on average the effective instruction queue size can be reduced by 26% with minimal impact on performance, and this reduction, together with the energy saved for empty and ready entries, results in about a 90.7% reduction in the energy consumed.
Proceedings Article
Dynamic speculation and synchronization of data dependences
TL;DR: This paper proposes dynamic data dependence speculation techniques to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and to provide the synchronization needed to avoid a mis-speculation.