Load-store queue management: an energy-efficient design based on a state-filtering mechanism
Summary
1 Introduction
- Modern out-of-order processors usually employ an array of sophisticated techniques to allow early execution of loads to improve performance.
- To ensure safe out-of-order execution of memory instructions, conventional implementations employ extensive buffering and cross-checking through what is referred to as the load-store queue, often implemented as two separate queues, the load queue (LQ) and the store queue (SQ).
- The authors also explore hardware support and runtime policies in this design.
- After factoring in the energy lost due to the slowdown, the processor as a whole still makes net energy savings.
2.1 Highlight of Conventional Design
- High-performance microprocessors typically employ very aggressive strategies for out-of-order memory instruction execution.
- Loads access memory before prior stores have committed their data (buffered in the SQ) to the memory subsystem.
- This involves associative searching of the SQ to match the address and finding out the closest producer store through priority encoding.
- Such speculative load execution may be incorrect if the unresolved stores turn out to access the same location.
- In typical implementations [11, 18], a replay squashes and reexecutes all instructions younger than the offending load.
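The SQ search described above can be sketched in a few lines. This is a behavioral illustration only, not the paper's hardware: a load associatively matches its address against older stores buffered in the SQ, and a priority encoder picks the closest (youngest older) producer store to forward from. The names and data layout here are assumptions.

```python
# Behavioral sketch of the associative SQ search at load issue.
# sq_entries: list of (store_age, addr, data); ages increase in program order.

def sq_search(sq_entries, load_addr, load_age):
    """Return forwarded data from the closest older matching store,
    or None if the load must read from the memory subsystem."""
    candidates = [(age, data) for age, addr, data in sq_entries
                  if addr == load_addr and age < load_age]
    if not candidates:
        return None
    # Priority encoding: the youngest matching older store wins.
    return max(candidates)[1]

sq = [(1, 0x40, 'A'), (3, 0x40, 'B'), (5, 0x80, 'C')]
assert sq_search(sq, 0x40, 4) == 'B'   # forwards from the store at age 3
assert sq_search(sq, 0x80, 4) is None  # the only producer is younger
```

In hardware this match and priority encoding happen in parallel across all SQ entries every cycle a load issues, which is exactly the energy cost the paper targets.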
2.2 Rationale
- While load forwarding, bypassing, and speculative execution improve performance, they also require a large amount of hardware that increases energy consumption.
- This new approach is based on the following observations: 1. Memory-based dependences are relatively infrequent.
- The authors' experiments indicate that only around 12% of dynamic load instances need forwarding.
- Therefore, loads contribute more to the dynamic energy spent by the disambiguation hardware than stores do.
- 2. Loads are bimodal in their behavior: they either frequently communicate with an in-flight store or almost never do so.
2.3 Overall Structure
- Based on the above observations, the authors propose a design where they identify those loads that rarely communicate with in-flight stores and handle these loads using a different queue specially optimized for them.
- The ALQ provides fields for the address and the load data, as well as control logic to perform associative searches.
- The BNLQ is a nonassociative buffer that holds the load address and the load data.
- The banking of the BNLQ is purely for energy savings and not based on address.
- Though independent loads rarely communicate with in-flight stores, the authors still need to detect any communication and ensure that a load gets the correct data.
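The detection mechanism for BNLQ loads is a counting Bloom filter (the EBF). A minimal sketch follows, with the table size and hash function chosen purely for illustration: loads increment a counter hashed from their address, decrement it when they leave the pipeline, and a store that probes a nonzero counter sees a (possibly false) dependence.

```python
# Minimal counting Bloom filter in the spirit of the EBF.
# Size and hash are illustrative assumptions, not the paper's parameters.

class CountingBloomFilter:
    def __init__(self, size=256):
        self.size = size
        self.counters = [0] * size

    def _index(self, addr):
        return hash(addr) % self.size

    def insert(self, addr):      # a BNLQ load issues
        self.counters[self._index(addr)] += 1

    def remove(self, addr):      # the load commits or is flushed
        self.counters[self._index(addr)] -= 1

    def probe(self, addr):       # a store checks for a dependence
        return self.counters[self._index(addr)] > 0

ebf = CountingBloomFilter()
ebf.insert(0x1000)
assert ebf.probe(0x1000)     # store to the same address sees a match
ebf.remove(0x1000)           # undoing updates avoids "residue" clogging
assert not ebf.probe(0x1000)
```

Because different addresses can hash to the same counter, a nonzero probe is conservative: it may be a false hit, which is exactly the case Section 2.6 addresses.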
2.4 Using BNLQ and EBF to Handle Independent Loads
- According to its dependency prediction (Section 2.5), a load is sent to the ALQ or the BNLQ.
- When a load in the BNLQ is issued, it accesses the EBF based on its address and increments the corresponding counter.
- The authors study a design that mitigates the impact of such false hits in Section 2.6.
- Additionally, when wrong-path instructions or replayed instructions are flushed from the system, their modifications to the EBF should be undone, otherwise the “residue” will quickly “clog” the filter.
- Though this “upgrade” increases the energy expenditure for that load instruction, it avoids stalling dispatch, which would not only slow down the program but also increase energy consumption.
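The routing policy in this subsection can be summarized as a short sketch. The queue shapes, bank count, and the round-robin bank choice are assumptions for illustration (the summary only says banking is not address-based); the key behaviors are that predicted-dependent loads go to the ALQ, and that a full BNLQ bank upgrades the load to the ALQ rather than stalling dispatch.

```python
# Illustrative load-dispatch policy for the split ALQ/BNLQ design.
from collections import deque

BANKS, BANK_CAP = 4, 8   # assumed sizes, not the paper's configuration

def dispatch_load(pc, seq, predict_dependent, alq, bnlq_banks):
    """Route one load at dispatch; returns which queue accepted it."""
    if predict_dependent(pc):
        alq.append((pc, seq))
        return "ALQ"
    bank = bnlq_banks[seq % BANKS]      # banking is not address-based
    if len(bank) < BANK_CAP:
        bank.append((pc, seq))
        return "BNLQ"
    alq.append((pc, seq))               # "upgrade": costs energy, avoids a stall
    return "ALQ(upgraded)"

alq = deque()
banks = [deque() for _ in range(BANKS)]
routes = [dispatch_load(0x400, i, lambda pc: False, alq, banks)
          for i in range(3)]
assert routes == ["BNLQ", "BNLQ", "BNLQ"]
```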
2.5.1 Profiling-Based Predictor
- This way, load dependence prediction is tied to the static instructions.
- The dynamic instances of these instructions constitute a slightly lower 88% of all dynamic load instructions.
- To explore the optimal threshold Th for each application, in this paper, the authors simply perform brute-force search in a region and find the one that best balances energy savings with performance degradation.
- Once an optimal threshold is selected for an application, static loads are marked in the program binary as dependent or independent based on the threshold.
- This, of course, implies ISA (instruction set architecture) support.
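The profiling pass can be sketched as follows. The profile format and names are assumptions: for each static load, compute the fraction of its dynamic instances that communicated with an in-flight store, then mark the load as dependent iff that fraction exceeds the per-application threshold Th.

```python
# Sketch of the profiling-based classification of static loads.
# profile: {pc: (dependent_instances, total_instances)} — assumed format.

def classify_loads(profile, th):
    """Return {pc: 'dependent' | 'independent'}, as would be encoded
    in the program binary (requiring ISA support)."""
    return {pc: ('dependent' if dep / total > th else 'independent')
            for pc, (dep, total) in profile.items()}

profile = {0x400: (1, 1000), 0x404: (300, 1000)}
marks = classify_loads(profile, th=0.05)
assert marks[0x400] == 'independent'   # only 0.1% of instances depend
assert marks[0x404] == 'dependent'     # 30% of instances depend
```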
2.5.2 Dynamic Predictor
- In their approach, the prediction information is stored in a PC-indexed table similar to the one used in Alpha 21264 [8] to delay certain loads to reduce replay frequency.
- A second policy includes periodic refreshing of the table, restoring all predictions to “independent” [7].
- The idea behind this policy is that even loads that rarely communicate with an in-flight store will be predicted as dependent after the first instance of an EBF match, forcing all subsequent instances of the loads to the ALQ.
- After the squash, load instructions are inspected at commit time in order to find out those that triggered the squash.
- The DPU mode terminates when the number of matching loads reaches the saved count.
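A minimal sketch of the dynamic predictor with periodic refreshing is shown below. The table size and refresh period are illustrative assumptions: a load's entry is set to "dependent" when it triggers a squash, and all entries are periodically restored to "independent" so occasionally-dependent loads are not exiled to the ALQ forever.

```python
# Sketch of a PC-indexed dependence predictor with periodic refreshing.

class DependencePredictor:
    def __init__(self, entries=1024, refresh_period=100_000):
        self.entries = entries
        self.refresh_period = refresh_period
        self.table = [False] * entries        # False = "independent"

    def predict(self, pc):
        return self.table[pc % self.entries]

    def train_dependent(self, pc):            # load found to have caused a squash
        self.table[pc % self.entries] = True

    def tick(self, cycle):
        if cycle % self.refresh_period == 0:  # periodic refresh
            self.table = [False] * self.entries

p = DependencePredictor()
p.train_dependent(0x400)
assert p.predict(0x400)          # subsequent instances go to the ALQ
p.tick(100_000)                  # refresh: prediction reverts to independent
assert not p.predict(0x400)
```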
2.6 Handling of EBF False Hits
- The price for the simple and fast membership test using the bloom filter is the existence of false positives: an EBF match does not necessarily suggest a true data dependence.
- In their base design described above, a false dependence is treated just like a true dependence.
- If an address match happens, the authors squash the offending load and all subsequent instructions.
- Assuming the authors can only read out one load address from one bank of the BNLQ, the bandwidth of this searching is only a few loads per cycle.
- This search happens in the background and during this period, normal processing continues.
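The background walk that filters false EBF hits might look like the sketch below. The per-cycle bandwidth (one address read per bank) and the data layout are assumptions; the essential point is that the scan proceeds a few loads per cycle while normal processing continues, and a squash is triggered only on a true address match.

```python
# Sketch of the background BNLQ address scan on an EBF match.
# bnlq_banks: list of banks, each a list of buffered load addresses.

def background_search(bnlq_banks, store_addr):
    """Return True iff some BNLQ load truly matches the store's address.
    Reads one entry per bank per 'cycle' (the outer loop iteration)."""
    depth = max(len(bank) for bank in bnlq_banks)
    for slot in range(depth):                  # one cycle per slot
        for bank in bnlq_banks:                # one read port per bank
            if slot < len(bank) and bank[slot] == store_addr:
                return True                    # true hit: squash
    return False                               # false EBF hit: no squash

banks = [[0x10, 0x20], [0x30], [0x40, 0x50], []]
assert background_search(banks, 0x50) is True
assert background_search(banks, 0x99) is False
```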
3 Experimental Framework
- The authors evaluate their proposed load-store queue design on a simulated, generic out-of-order processor.
- The main parameters used in the simulations as well as the applications used from SPEC CPU2000 suite are summarized in Table 2.
- As the evaluation tool, the authors use a heavily modified version of SimpleScalar [4] that incorporates their LQ-SQ model and a Wattch framework [3] that models the energy consumption throughout the processor.
- Profiling is performed using the train inputs from the SPEC CPU2000 distribution, whereas the production runs are performed using ref input and single sim-point regions [17] of one hundred million instructions.
4.1 Main Results
- The authors first present a broad-brush comparison of their LQ-SQ design versus the conventional design.
- Indeed, although infrequent, memory-based dependences still exist and require efficient handling.
- The authors' results also show that in option A with a dynamic predictor, even when all falsely-dependent loads are moved into the ALQ, the average false EBF hit rate is still about 28%.
- The authors see that there is variation across different applications as expected, but the variation is not significant.
- The authors can see that indeed even this moderate increase in the size of BNLQ can reduce the performance penalty and thus further improve processor-wide energy savings.
4.2 Dependence Prediction
- In order to compare the profile-based and the dynamic dependence predictors, and to study the effect of thresholds used in the predictor, the authors follow Grunwald et al. and employ the following metrics used in confidence estimation [10]: Predictive Value of a Positive test (PVP).
- It identifies the probability that a load dependence prediction is correct.
- This reduction translates into higher energy savings.
- Therefore, in their design, only very low PVN values are acceptable.
- When refreshing is applied, the PVP for the dynamic predictor is improved significantly, while the PVN does not degrade noticeably.
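As a worked example of these metrics, the sketch below uses the standard confusion-matrix formulation, treating "dependent" as the positive prediction: PVP = TP / (TP + FP), the probability that a load-dependence prediction is correct, and PVN the analogous value for "independent" predictions. The paper may use a different sign convention for PVN; the counts here are made up for illustration.

```python
# Standard predictive-value metrics over a prediction confusion matrix.

def pvp_pvn(tp, fp, tn, fn):
    """tp: predicted dependent, actually dependent;
    fp: predicted dependent, actually independent;
    tn/fn: likewise for 'independent' predictions."""
    pvp = tp / (tp + fp)
    pvn = tn / (tn + fn)
    return pvp, pvn

# Illustrative counts only (not the paper's data).
pvp, pvn = pvp_pvn(tp=90, fp=10, tn=880, fn=20)
assert round(pvp, 2) == 0.90
assert round(pvn, 2) == 0.98
```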
4.3 Threshold Exploration in Profiling
- From Figure 5 the authors can also see that when the threshold is above 0.1, the result of a static predictor quickly deteriorates: PVP remains much the same (or even reduces a little bit) and PVN sharply increases.
- Hence, in the profiling stage, the authors only need to explore the range between 0 and 0.1 to find the best threshold.
- The authors stop when the ratio between processor energy savings and performance degradation starts to reduce.
- In general, as the authors increase the threshold, they are putting more loads into the BNLQ.
- Figure 6 shows a typical example that visualizes the above discussion.
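The brute-force search over the [0, 0.1] range can be sketched as below. Here `evaluate` is a stand-in for a full simulation run (an assumption, since the paper measures real energy savings and slowdown); the search stops as soon as the savings-to-degradation ratio starts to fall.

```python
# Sketch of the brute-force threshold (Th) search over [0, 0.1].

def find_threshold(evaluate, steps=10):
    """evaluate(th) -> (energy_savings, perf_degradation), both > 0.
    Sweep th = 0.01 .. 0.10 and stop once the ratio begins to reduce."""
    best_th, best_ratio = 0.0, 0.0
    for i in range(1, steps + 1):
        th = i / 100
        savings, slowdown = evaluate(th)
        ratio = savings / slowdown
        if ratio < best_ratio:
            break                      # ratio started to reduce: stop
        best_th, best_ratio = th, ratio
    return best_th

# Toy evaluation whose ratio peaks at Th = 0.05.
assert find_threshold(lambda th: (th * (0.1 - th), 0.01)) == 0.05
```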
4.4 Refreshing Period Exploration
- As explained before, in the design using a dynamic predictor without refreshing [6], once a prediction changes to “dependent” for a load instruction, it remains that way for the duration of the program.
- On the one hand, the sooner (more frequent) refreshing happens, the fewer instances of those “occasionally-dependent” loads are sent to the ALQ.
- On the other hand, the more frequent refreshing is, the more squashes there are due to retraining.
- From this figure, the authors can see that a period of 100,000 cycles tends to be a good choice for all configurations.
- This is the setting the authors use in the results shown earlier in this section.
6 Conclusions
- The authors have proposed a split-LQ design where the conventional associative load queue (LQ) is replaced with a smaller associative LQ (ALQ) and a banked non-associative LQ (BNLQ).
- Loads are processed differently and accommodated in different queues based on the prediction whether they are dependent on an in-flight store.
- Dependence enforcement for the ALQ is the same as in the conventional design, whereas that for the BNLQ is done through a bloom filter that is inexact and conservative but energy-efficient for the common case where there is no dependence.
- A profile-based predictor is able to fine-tune the prediction threshold based on each application’s characteristics and this leads to better results than a basic dynamic predictor where a load is always predicted as dependent if a past instance has been dependent on an in-flight store.
- Overall, the several design options all achieve significant energy savings in the LQ-SQ (about 35-50%) with a negligible average performance penalty of about 1%.