Proceedings ArticleDOI

Fire-and-Forget: Load/Store Scheduling with No Store Queue at All

TLDR
The original goal for FnF was to design a more scalable memory-scheduling microarchitecture than previously proposed approaches without degrading performance; in fact, the relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking combine to enable FnF to slightly out-perform the competition.
Abstract
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly to the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load uses a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly out-perform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
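The core mechanism described above can be illustrated with a minimal sketch: a predictor maps a store's PC to the LQ index of the load it last fed, the store "fires" its value directly into that predicted LQ slot with no associative search, and a pre-commit re-execution check compares the forwarded value against memory to catch misforwardings. This is a simplified single-value model for illustration only (the class name, table structure, and squash handling are assumptions, not the paper's exact microarchitecture).

```python
class FireAndForgetSketch:
    """Toy model of SQ-less store-to-load forwarding via LQ index prediction."""

    def __init__(self, lq_size=8):
        self.lq = [None] * lq_size   # values forwarded into LQ entries
        self.predictor = {}          # store PC -> predicted LQ index (trained on past pairings)
        self.memory = {}             # simplified architectural memory state

    def train(self, store_pc, lq_index):
        # Remember which LQ entry this store last forwarded to.
        self.predictor[store_pc] = lq_index

    def execute_store(self, pc, addr, value):
        # Update memory (commit is collapsed into execute for simplicity),
        # then fire-and-forget: write the value into the predicted LQ entry,
        # with no associative SQ lookup of any kind.
        self.memory[addr] = value
        idx = self.predictor.get(pc)
        if idx is not None:
            self.lq[idx] = value

    def execute_load(self, lq_index, addr):
        # Use the forwarded value if a store deposited one; otherwise read memory.
        forwarded = self.lq[lq_index]
        value = forwarded if forwarded is not None else self.memory.get(addr, 0)
        # Pre-commit re-execution check: re-read memory and flag a misforward
        # if the speculatively forwarded value was wrong (would trigger a squash).
        correct = self.memory.get(addr, 0)
        misforward = value != correct
        return (correct, misforward)
```

For example, after `train(0x10, 3)`, a store at PC 0x10 writes its value straight into LQ entry 3, and the dependent load at that entry consumes it without any CAM search; if a different store later overwrites entry 3 with a stale pairing, the re-execution check reports a misforward.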


Citations
Proceedings ArticleDOI

Dynamically Specialized Datapaths for energy efficient computing

TL;DR: Dynamically Specialized Datapaths are proposed to improve the energy efficiency of general-purpose programmable processors; the results show that in most cases two DySER blocks can achieve the same performance as having a specialized hardware module for each path-tree.
Proceedings ArticleDOI

Mechanisms for store-wait-free multiprocessors

TL;DR: This work proposes the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers, and atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses.
Proceedings ArticleDOI

Late-binding: enabling unordered load-store queues

TL;DR: It is shown how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses, and that late-binding, unordered LSQs work well for small-window superscalar processors but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks.
Proceedings ArticleDOI

NoSQ: Store-Load Communication without a Store Queue

TL;DR: NoSQ is a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine and slightly outperforms a conventional store-queue based design on most benchmarks.
Patent

Store-to-load forwarding based on load/store address computation source information comparisons

TL;DR: In this patent, the authors propose control logic coupled to a queue that, upon encountering a load instruction, detects that the load's information matches the store information held in a valid queue entry and responsively predicts that the microprocessor should forward to the load the store data specified by the matching store instruction.
References
Proceedings ArticleDOI

MiBench: A free, commercially representative embedded benchmark suite

TL;DR: A new version of SimpleScalar that has been adapted to the ARM instruction set is used to characterize the performance of the benchmarks using configurations similar to current and next generation embedded processors.
Proceedings ArticleDOI

MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

TL;DR: MediaBench is a benchmark suite designed to fill the gap between the compiler community and embedded applications developers; it was constructed through a three-step process: intuition- and market-driven initial selection, experimental measurement, and integration with system synthesis algorithms to establish usefulness.
Journal ArticleDOI

SimpleScalar: an infrastructure for computer system modeling

TL;DR: The SimpleScalar tool set provides an infrastructure for simulation and architectural modeling that can model a variety of platforms ranging from simple unpipelined processors to detailed dynamically scheduled microarchitectures with multiple-level memory hierarchies.
Proceedings ArticleDOI

Efficient detection of all pointer and array access errors

TL;DR: This work presents a pointer and array access checking technique that provides complete error coverage through a simple set of program transformations, and is the first technique that detects all spatial and temporal access errors.
Book

Modern Processor Design: Fundamentals of Superscalar Processors

TL;DR: This book brings together the numerous microarchitectural techniques for harvesting more instruction-level parallelism (ILP) to achieve better processor performance that have been proposed and implemented in real machines.