Proceedings ArticleDOI
Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation
Akihiro Yamamoto, Yusuke Tanaka, Hideki Ando, Toshio Shimada +3 more
- pp 33-40
TLDR
The evaluation results show that the combined scheme significantly improves performance over a processor with an automatic prefetcher, and that its strength in prefetching data with irregular access patterns makes the best use of the scheme.
Abstract:
This paper proposes an instruction pre-execution scheme that reduces load latency and enables early scheduling of loads in a high-performance processor. Our scheme exploits the difference between the amount of instruction-level parallelism available with an unlimited number of physical registers and that available with the actual number of physical registers. We introduce a scheme called two-step physical register deallocation. As a first step, our scheme deallocates physical registers at the renaming stage, eliminating pipeline stalls caused by a physical register shortage. As a second step, instructions wait in the instruction window for the final deallocation. While waiting, instructions are allowed to pre-execute, which enables prefetching of load data and early calculation of memory effective addresses. In particular, our execution-based scheme is strong at prefetching data with irregular access patterns. Since an automatic prefetcher is strong for regular access patterns, combining it with our scheme makes the best use of both. The evaluation results show that the combined scheme significantly improves performance over a processor with an automatic prefetcher.
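The two-step deallocation idea can be sketched in software as a toy renamer (illustrative only; the class and method names are my own, not the paper's hardware design). Step one pushes the previous physical mapping onto a pending list at rename time, so renaming itself never waits on the final reclamation; step two later moves pending registers back to the free list.

```python
from collections import deque

class TwoStepRenamer:
    """Toy model of two-step physical register deallocation (illustrative)."""

    def __init__(self, num_phys):
        self.free = deque(range(num_phys))  # physical registers ready to allocate
        self.pending = deque()              # step 1: freed at rename, not yet reclaimed
        self.rat = {}                       # register alias table: arch -> phys

    def rename(self, arch_reg):
        """Allocate a physical register for a new value of arch_reg."""
        phys = self.free.popleft()          # would stall here only if truly empty
        old = self.rat.get(arch_reg)
        if old is not None:
            self.pending.append(old)        # step 1: deallocate at the rename stage
        self.rat[arch_reg] = phys
        return phys

    def finalize(self):
        """Step 2: final deallocation, performed later once it is safe."""
        while self.pending:
            self.free.append(self.pending.popleft())
```

In the toy model, instructions whose old mappings sit on the pending list correspond to instructions waiting in the window for final deallocation; the paper's contribution is that they may pre-execute during this wait.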
Citations
Posted Content
Design Space Exploration to Find the Optimum Cache and Register File Size for Embedded Applications
TL;DR: Experimental results show that although larger caches and register files are one way to improve performance in embedded processors, increasing their sizes beyond a threshold saturates and then degrades performance.
Proceedings ArticleDOI
Reducing register file size through instruction pre-execution enhanced by value prediction
Yusuke Tanaka, Hideki Ando +1 more
TL;DR: The use of value prediction is proposed to preserve the MLP that is exploitable with an unlimited number of physical registers; unlike the conventional use of value prediction for enhancing ILP, this usage requires no recovery from misspeculation.
Journal ArticleDOI
Two-Step Physical Register Deallocation for Data Prefetching and Address Pre-Calculation
TL;DR: The two-step physical register deallocation scheme is introduced, which deallocates physical registers at the renaming stage as a first step, and eliminates pipeline stalls caused by a shortage of physical registers.
Journal ArticleDOI
MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism
TL;DR: A dynamic scheme is proposed that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP, achieving nearly the best performance possible with fixed-size resources.
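A toy resizing policy illustrating the idea (the thresholds, sizes, and function name are hypothetical, not taken from the paper): when predicted memory-level parallelism is high, a larger window exposes more independent misses that can overlap, so the window grows; otherwise it shrinks to save resources.

```python
def resize_window(current, predicted_mlp, min_size=32, max_size=256):
    """Illustrative window-resizing policy; thresholds are hypothetical."""
    if predicted_mlp > 2:               # many overlappable misses predicted
        return min(current * 2, max_size)
    return max(current // 2, min_size)  # little MLP: shrink toward minimum
```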
Journal ArticleDOI
Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications
Mehdi Alipour, Hojjat Taghdisi +1 more
TL;DR: In this paper, the authors perform a comprehensive design space exploration for an optimum single-threaded embedded processor under limited area and power budgets, and then run multiple threads on this architecture to find the maximum thread-level parallelism (TLP) based on performance per power and area.
References
Journal ArticleDOI
Hitting the memory wall: implications of the obvious
William A. Wulf, Sally A. McKee +1 more
TL;DR: Observes that processor performance has been improving far faster than memory latency, and argues that unless this divergence is addressed, overall system performance will become bounded by memory access time, the so-called memory wall.
Proceedings ArticleDOI
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
TL;DR: In this article, a hardware technique to improve cache performance is presented: a small fully-associative cache placed between a cache and its refill path holds prefetched data, keeping it out of the cache itself.
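One component of this approach, a small fully-associative buffer alongside a direct-mapped cache, can be sketched as a tiny LRU store (a minimal sketch with assumed names and capacity; the actual hardware organization differs):

```python
from collections import OrderedDict

class SmallFullyAssociativeBuffer:
    """Toy fully-associative LRU buffer, e.g. for victims or prefetched
    blocks kept next to a direct-mapped cache (illustrative only)."""

    def __init__(self, capacity=4):
        self.blocks = OrderedDict()  # tag -> data, ordered oldest-first
        self.capacity = capacity

    def insert(self, tag, data):
        self.blocks[tag] = data
        self.blocks.move_to_end(tag)            # newest at the end
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the LRU block

    def lookup(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)        # refresh LRU position on hit
            return self.blocks[tag]
        return None                             # miss in the buffer
```

Because the buffer is fully associative, blocks that would conflict in the direct-mapped cache can coexist here, which is the source of the technique's benefit.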
Proceedings ArticleDOI
Value locality and load value prediction
TL;DR: This paper introduces the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describes how to effectively capture and exploit it in order to perform load value prediction.
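The simplest form of load value prediction exploiting value locality is a last-value predictor; a minimal sketch (class and method names are my own, not from the paper):

```python
class LastValuePredictor:
    """Toy last-value predictor: predicts a load returns the same value
    it returned the previous time (illustrative sketch)."""

    def __init__(self):
        self.table = {}  # load PC -> last observed value

    def predict(self, pc):
        return self.table.get(pc)  # None means no prediction available

    def update(self, pc, actual):
        """Record the actual value; return whether the prediction was correct."""
        correct = self.table.get(pc) == actual
        self.table[pc] = actual
        return correct
```

Loads that repeatedly fetch the same value (a common case the paper calls value locality) hit in such a table, letting dependent instructions proceed before the load completes.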
Proceedings ArticleDOI
Prefetching using Markov predictors
Doug Joseph,Dirk Grunwald +1 more
TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two-thirds the memory of a demand-fetch cache organization.
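A first-order Markov prefetcher can be sketched as a table of miss-address successors (a minimal sketch; the hardware uses fixed-size tables rather than unbounded dictionaries, and the prefetch degree here is an assumed parameter):

```python
from collections import defaultdict, Counter

class MarkovPrefetcher:
    """Toy first-order Markov prefetcher: on each miss, record it as a
    successor of the previous miss, then predict the most frequent
    successors of the current miss address (illustrative sketch)."""

    def __init__(self, degree=2):
        self.succ = defaultdict(Counter)  # miss addr -> counts of next misses
        self.prev = None                  # previous miss address
        self.degree = degree              # max prefetches issued per miss

    def on_miss(self, addr):
        if self.prev is not None:
            self.succ[self.prev][addr] += 1   # learn the observed transition
        self.prev = addr
        # predict the most frequently observed successors of this address
        return [a for a, _ in self.succ[addr].most_common(self.degree)]
```

Because predictions come from observed miss-to-miss transitions rather than address arithmetic, this style of prefetcher can follow irregular (e.g. pointer-chasing) access patterns that stride prefetchers miss.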
Proceedings ArticleDOI
Evaluating stream buffers as a secondary cache replacement
TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates comparable to typical hit rates of secondary caches, and that as the data-set size of the scientific workload increases, the performance of stream buffers typically improves relative to secondary cache performance, showing that stream buffers scale better to large data-set sizes.
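A single sequential stream buffer can be sketched as a FIFO of prefetched block addresses (a minimal sketch under assumed behavior: on a miss the buffer is reallocated to the new stream, and on a hit it drains to the referenced block and refills its tail):

```python
from collections import deque

class StreamBuffer:
    """Toy sequential stream buffer holding prefetched block addresses
    (illustrative sketch, one stream, unit-stride only)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()  # prefetched block addresses, in order

    def access(self, block):
        """Return True on a stream-buffer hit, False on a miss."""
        if block in self.entries:
            # hit: drain entries up to and including the referenced block
            while self.entries[0] != block:
                self.entries.popleft()
            self.entries.popleft()
            # refill the tail with the next sequential blocks
            last = block + len(self.entries)
            while len(self.entries) < self.depth:
                last += 1
                self.entries.append(last)
            return True
        # miss: reallocate the buffer to the new stream
        self.entries = deque(block + i for i in range(1, self.depth + 1))
        return False
```

Real designs use several such buffers in parallel so that multiple interleaved streams can each keep a buffer allocated.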