Open Access | Proceedings Article

Reducing design complexity of the load/store queue

TLDR
This study introduces novel techniques to scale the load/store queue: two techniques, a store-load pair predictor and a load buffer, to reduce the search bandwidth requirement, and one technique, segmentation, to scale the size.
Abstract
With faster CPU clocks and wider pipelines, all relevant microarchitecture components should scale accordingly. There have been many proposals for scaling the issue queue, register file, and cache hierarchy. However, nothing has been done for scaling the load/store queue, despite the increasing pressure on the load/store queue in terms of capacity and search bandwidth. The load/store queue is a CAM structure which holds in-flight memory instructions and supports simultaneous searches to honor memory dependencies and memory consistency models. Therefore, it is difficult to scale the load/store queue. In this study, we introduce novel techniques to scale the load/store queue. We propose two techniques, store-load pair predictor and load buffer, to reduce the search bandwidth requirement; and one technique, segmentation, to scale the size. We show that a load/store queue using our predictor and load buffer needs only one port to outperform a conventional two-ported load/store queue. Compared to the same base case, segmentation alone achieves speedups of 5% for integer benchmarks and 19% for floating-point benchmarks. A one-ported load/store queue using all of our techniques improves performance on average by 6% and 23%, and up to 15% and 59%, for integer and floating-point benchmarks, respectively, over a two-ported conventional load/store queue.
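To illustrate why search bandwidth is the bottleneck the abstract describes, here is a minimal sketch of the associative search a conventional store queue performs on behalf of each load. It is not the paper's predictor, load buffer, or segmentation scheme. The class and field names (`StoreEntry`, `StoreQueue`, `seq`, `searches`) are hypothetical, and the sketch assumes store addresses are already resolved and ignores access sizes and consistency-model checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int    # program-order sequence number (hypothetical field)
    addr: int   # resolved store address
    value: int  # data the store will write


class StoreQueue:
    """Toy model of the store half of a conventional load/store queue."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[StoreEntry] = []
        self.searches = 0  # number of associative searches issued so far

    def insert(self, entry: StoreEntry) -> bool:
        """Allocate an entry; return False when the queue is full."""
        if len(self.entries) >= self.capacity:
            return False  # a real pipeline would stall here
        self.entries.append(entry)
        return True

    def forward(self, load_seq: int, load_addr: int) -> Optional[int]:
        """Return the value of the youngest older store to load_addr, if any.

        Hardware compares every entry in parallel (a CAM lookup); the loop
        below models that comparison sequentially.
        """
        self.searches += 1
        match: Optional[StoreEntry] = None
        for entry in self.entries:
            if entry.addr == load_addr and entry.seq < load_seq:
                if match is None or entry.seq > match.seq:
                    match = entry
        return match.value if match is not None else None


sq = StoreQueue(capacity=4)
sq.insert(StoreEntry(seq=1, addr=0x100, value=11))
sq.insert(StoreEntry(seq=3, addr=0x100, value=33))
sq.insert(StoreEntry(seq=5, addr=0x200, value=55))

print(sq.forward(load_seq=4, load_addr=0x100))  # youngest older store wins
print(sq.forward(load_seq=4, load_addr=0x200))  # only a younger store matches
print(sq.searches)
```

In this model every load issues one full search, so the number of search ports bounds how many loads can be checked per cycle, and the capacity bounds how many entries each search must compare against.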


Citations
Proceedings ArticleDOI

Mechanisms for store-wait-free multiprocessors

TL;DR: This work proposes the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers, and atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses.
Proceedings ArticleDOI

Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation

TL;DR: A new, software-based, defect detection and diagnosis technique, called access-control extension (ACE), that can access and control the microprocessor's internal state and can diagnose and locate a hardware defect, and then activate system repair through resource reconfiguration.
Proceedings ArticleDOI

Scalable hardware memory disambiguation for high ILP processors

TL;DR: A new class of solutions yields an order-of-magnitude reduction in the energy required to properly order loads and stores for windows of hundreds to thousands of in-flight instructions.
Proceedings ArticleDOI

Dual-core execution: building a highly scalable single-thread instruction window

TL;DR: Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
Proceedings ArticleDOI

Scalable Store-Load Forwarding via Store Queue Index Prediction

TL;DR: This work improves SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search, and uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward.
References
Journal ArticleDOI

Shared memory consistency models: a tutorial

TL;DR: This work describes an alternative, programmer-centric view of relaxed consistency models that characterizes them in terms of program behavior rather than the system optimizations that most of these models emphasize.
Journal ArticleDOI

POWER4 system microarchitecture

TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Proceedings ArticleDOI

Memory dependence prediction using store sets

TL;DR: It is shown that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Proceedings ArticleDOI

Energy-effective issue logic

TL;DR: It is shown that on average the effective instruction queue size can be reduced by 26% with minimal impact on performance, and that this reduction, together with the energy saved for empty and ready entries, results in about a 90.7% reduction in the energy consumed.
Proceedings ArticleDOI

Dynamic speculation and synchronization of data dependences

TL;DR: This paper proposes dynamic data dependence speculation techniques to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and to provide the synchronization needed to avoid a mis-speculation.