Proceedings ArticleDOI

Fire-and-Forget: Load/Store Scheduling with No Store Queue at All

TLDR
The original goal for FnF was to design a more scalable memory-scheduling microarchitecture than previously proposed approaches without degrading performance; in fact, the relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking combine to enable FnF to slightly out-perform the competition.
Abstract
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly to the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load uses a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly out-perform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
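The core mechanism described above can be illustrated with a minimal sketch: a predictor maps a store's PC to the LQ index of the load it last fed, the store "fires" its value directly into that predicted LQ slot with no associative search, and a pre-commit re-execution check compares the forwarded value against memory to catch misforwardings. This is a simplified single-value model for illustration only (the class name, table structure, and squash handling are assumptions, not the paper's exact microarchitecture).

```python
class FireAndForgetSketch:
    """Toy model of SQ-less store-to-load forwarding via LQ index prediction."""

    def __init__(self, lq_size=8):
        self.lq = [None] * lq_size   # values forwarded into LQ entries
        self.predictor = {}          # store PC -> predicted LQ index (trained on past pairings)
        self.memory = {}             # simplified architectural memory state

    def train(self, store_pc, lq_index):
        # Remember which LQ entry this store last forwarded to.
        self.predictor[store_pc] = lq_index

    def execute_store(self, pc, addr, value):
        # Update memory (commit is collapsed into execute for simplicity),
        # then fire-and-forget: write the value into the predicted LQ entry,
        # with no associative SQ lookup of any kind.
        self.memory[addr] = value
        idx = self.predictor.get(pc)
        if idx is not None:
            self.lq[idx] = value

    def execute_load(self, lq_index, addr):
        # Use the forwarded value if a store deposited one; otherwise read memory.
        forwarded = self.lq[lq_index]
        value = forwarded if forwarded is not None else self.memory.get(addr, 0)
        # Pre-commit re-execution check: re-read memory and flag a misforward
        # if the speculatively forwarded value was wrong (would trigger a squash).
        correct = self.memory.get(addr, 0)
        misforward = value != correct
        return (correct, misforward)
```

For example, after `train(0x10, 3)`, a store at PC 0x10 writes its value straight into LQ entry 3, and the dependent load at that entry consumes it without any CAM search; if a different store later overwrites entry 3 with a stale pairing, the re-execution check reports a misforward.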


Citations
Proceedings ArticleDOI

Dynamically Specialized Datapaths for energy efficient computing

TL;DR: Dynamically Specialized Datapaths are proposed to improve the energy efficiency of general-purpose programmable processors; the results show that in most cases two DySER blocks can achieve the same performance as having a specialized hardware module for each path-tree.
Proceedings ArticleDOI

Mechanisms for store-wait-free multiprocessors

TL;DR: This work proposes the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers, and atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses.
Proceedings ArticleDOI

Late-binding: enabling unordered load-store queues

TL;DR: It is shown how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses, and that late-binding, unordered LSQs work well for small-window superscalar processors but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks.
Proceedings ArticleDOI

NoSQ: Store-Load Communication without a Store Queue

TL;DR: NoSQ is a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine and slightly outperforms a conventional store-queue based design on most benchmarks.
Patent

Store-to-load forwarding based on load/store address computation source information comparisons

TL;DR: In this patent, the authors propose control logic coupled to a queue that, upon encountering a load instruction, detects that the load's information matches the store information held in a valid queue entry and responsively predicts that the microprocessor should forward to the load the store data specified by the matching store instruction.
References
Proceedings ArticleDOI

MiBench: A free, commercially representative embedded benchmark suite

TL;DR: A new version of SimpleScalar that has been adapted to the ARM instruction set is used to characterize the performance of the benchmarks using configurations similar to current and next generation embedded processors.
Proceedings ArticleDOI

MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

TL;DR: MediaBench is a benchmark suite designed to fill the gap between the compiler community and embedded applications developers; it was constructed through a three-step process: intuition- and market-driven initial selection, experimental measurement, and integration with system synthesis algorithms to establish usefulness.
Journal ArticleDOI

SimpleScalar: an infrastructure for computer system modeling

TL;DR: The SimpleScalar tool set provides an infrastructure for simulation and architectural modeling that can model a variety of platforms ranging from simple unpipelined processors to detailed dynamically scheduled microarchitectures with multiple-level memory hierarchies.
Proceedings ArticleDOI

Efficient detection of all pointer and array access errors

TL;DR: This work presents a pointer and array access checking technique that provides complete error coverage through a simple set of program transformations, and is the first technique that detects all spatial and temporal access errors.
Book

Modern Processor Design: Fundamentals of Superscalar Processors

TL;DR: This book brings together the numerous microarchitectural techniques for harvesting more instruction-level parallelism (ILP) to achieve better processor performance that have been proposed and implemented in real machines.