Proceedings ArticleDOI

Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

TLDR
The evaluation results show that the combined scheme significantly improves performance over a processor with an automatic prefetcher, and its strength in prefetching data with irregular access patterns offers the best use of the scheme.
Abstract
This paper proposes an instruction pre-execution scheme that reduces load latency through early scheduling of loads in a high-performance processor. Our scheme exploits the difference between the amount of instruction-level parallelism available with an unlimited number of physical registers and that available with the actual number of physical registers. We introduce a scheme called two-step physical register deallocation. Our scheme deallocates physical registers at the renaming stage as a first step, eliminating pipeline stalls caused by a physical register shortage. Instructions wait for the final deallocation, the second step, in the instruction window. While waiting, the scheme allows pre-execution of instructions. This enables prefetching of load data and early calculation of memory effective addresses. In particular, our execution-based scheme is effective at prefetching data with irregular access patterns. Since an automatic prefetcher handles regular access patterns well, combining it with our scheme makes the best use of both. The evaluation results show that the combined scheme significantly improves performance over a processor with an automatic prefetcher.
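The two-step deallocation described above can be sketched as a toy free-list model. This is a minimal illustrative Python sketch, not the authors' design: the class and method names are my own, and real hardware would track a mapping table and recover early-released registers on misspeculation.

```python
from collections import deque

class TwoStepRegisterFile:
    """Toy model of two-step physical register deallocation
    (illustrative only; names and structure are assumptions)."""

    def __init__(self, num_phys_regs):
        self.free_list = deque(range(num_phys_regs))
        self.pending_final = set()  # early-released, awaiting step 2

    def allocate(self):
        # Rename stage: grab a free physical register.
        if not self.free_list:
            raise RuntimeError("stall: no free physical registers")
        return self.free_list.popleft()

    def early_deallocate(self, reg):
        # Step 1: release at rename, so later renames do not stall.
        # Instructions holding only early-released registers may
        # pre-execute (e.g. to prefetch) but do not commit results.
        self.pending_final.add(reg)
        self.free_list.append(reg)

    def final_deallocate(self, reg):
        # Step 2: the old value is truly dead; reuse is now safe.
        self.pending_final.discard(reg)

rf = TwoStepRegisterFile(4)
regs = [rf.allocate() for _ in range(4)]  # exhaust the free list
rf.early_deallocate(regs[0])              # step 1 frees it immediately
assert rf.allocate() == regs[0]           # rename no longer stalls
```

The point of the sketch is the ordering: the register returns to the free list at step 1, before the step-2 confirmation, which is what removes the rename stall.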


Citations
Posted Content

Design Space Exploration to Find the Optimum Cache and Register File Size for Embedded Applications

TL;DR: Experimental results show that although larger caches and register files are one way to improve performance in embedded processors, increasing these parameters beyond a threshold saturates the performance improvement and then degrades it.
Proceedings ArticleDOI

Reducing register file size through instruction pre-execution enhanced by value prediction

TL;DR: The use of value prediction is proposed to exploit the MLP that would be available with an unlimited number of physical registers; this usage has the advantage over the conventional use of value prediction for enhancing ILP that no recovery from misspeculation is needed.
Journal ArticleDOI

Two-Step Physical Register Deallocation for Data Prefetching and Address Pre-Calculation

TL;DR: The two-step physical register deallocation scheme is introduced, which deallocates physical registers at the renaming stage as a first step, and eliminates pipeline stalls caused by a shortage of physical registers.
Journal ArticleDOI

MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

TL;DR: A dynamic scheme is proposed that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP, achieving nearly the best performance possible with fixed-size resources.
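The resizing idea can be illustrated with a toy controller. This is a hypothetical sketch, not the paper's mechanism: the thresholds, size bounds, and decision rule are all assumptions chosen only to show the grow-on-MLP, shrink-on-ILP behavior.

```python
def resize_window(predicted_mlp, predicted_ilp, cur_size,
                  min_size=64, max_size=256):
    """Illustrative window-resizing policy (all constants assumed).

    Grow the window when many independent cache misses could be
    overlapped (high MLP); shrink toward a small, fast window when
    the code is ILP-bound and a large window buys little.
    """
    if predicted_mlp >= 2.0:
        return min(cur_size * 2, max_size)   # memory-bound: go wide
    if predicted_ilp >= 2.0:
        return max(cur_size // 2, min_size)  # ILP-bound: stay small
    return cur_size                          # no clear signal: hold

assert resize_window(3.0, 1.0, 128) == 256   # high MLP doubles the window
assert resize_window(0.5, 3.0, 128) == 64    # high ILP halves it
```

A real implementation would also have to account for the latency cost of a larger window, which is why the sketch clamps the size to a fixed range.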
Journal ArticleDOI

Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications

TL;DR: In this paper, the authors perform a comprehensive design space exploration for an optimum uni-thread embedded processor under limited area and power budgets, then run multiple threads on this architecture to find the maximum thread-level parallelism (TLP) attainable on the performance-per-power and area-optimum uni-thread architecture.
References
Journal ArticleDOI

Hitting the memory wall: implications of the obvious

TL;DR: This work observes that processor speed improves far faster than DRAM speed, so average memory access time will increasingly dominate execution time; without architectural change, systems will eventually hit a "memory wall" where performance is bounded by memory latency.
Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

TL;DR: In this article, a hardware technique to improve cache performance is presented: a small fully-associative cache between a cache and its refill path holds prefetched data so that it is not placed directly in the cache.
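The buffer described in that blurb can be modeled as a tiny fully associative FIFO probed on a cache miss. This is a minimal sketch under assumed sizes and policies, not Jouppi's exact design:

```python
from collections import OrderedDict

class PrefetchBuffer:
    """Small fully associative FIFO buffer between a cache and its
    refill path (illustrative sketch; entry count is an assumption)."""

    def __init__(self, entries=4):
        self.entries = entries
        self.buf = OrderedDict()          # addr -> line data

    def insert(self, addr, data):
        # Prefetched lines go here instead of polluting the cache.
        self.buf[addr] = data
        self.buf.move_to_end(addr)
        if len(self.buf) > self.entries:
            self.buf.popitem(last=False)  # evict the oldest entry

    def lookup(self, addr):
        # Probed on a cache miss; a hit would be promoted into the
        # cache, so the entry is removed from the buffer.
        return self.buf.pop(addr, None)

pb = PrefetchBuffer(entries=2)
pb.insert(0x100, "A")
pb.insert(0x140, "B")
pb.insert(0x180, "C")                     # 0x100 is evicted (FIFO)
assert pb.lookup(0x100) is None
assert pb.lookup(0x140) == "B"
```

Keeping prefetched lines out of the cache proper is the key design choice: a wrong prefetch then displaces only a buffer entry, not a useful cache line.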
Proceedings ArticleDOI

Value locality and load value prediction

TL;DR: This paper introduces the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describes how to effectively capture and exploit it in order to perform load value prediction.
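A common way to exploit value locality is a last-value predictor indexed by load PC with a confidence counter. The sketch below is illustrative only; the table organization and thresholds are assumptions, not taken from the paper:

```python
class LastValuePredictor:
    """Last-value load predictor with a saturating confidence
    counter (minimal sketch; constants are assumptions)."""

    def __init__(self):
        self.table = {}                       # load PC -> (value, confidence)

    def predict(self, pc):
        value, conf = self.table.get(pc, (None, 0))
        return value if conf >= 2 else None   # predict only when confident

    def train(self, pc, actual):
        value, conf = self.table.get(pc, (actual, 0))
        if value == actual:
            conf = min(conf + 1, 3)           # repeated value: gain confidence
        else:
            value, conf = actual, 0           # value changed: reset
        self.table[pc] = (value, conf)

lvp = LastValuePredictor()
for _ in range(3):
    lvp.train(0x400, 7)        # this load keeps returning 7
assert lvp.predict(0x400) == 7
lvp.train(0x400, 9)            # value changed: confidence resets
assert lvp.predict(0x400) is None
```

The confidence gate matters because a mispredicted load value forces a pipeline recovery, so the predictor should stay silent unless the value has repeated.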
Proceedings ArticleDOI

Prefetching using Markov predictors

TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two-thirds the memory of a demand-fetch cache organization.
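The core of such a prefetcher is a table mapping each miss address to the addresses that most often missed next, trained on the miss stream. This is a first-order sketch under assumed parameters; the real design bounds table size and supports multiple predictions per entry in hardware:

```python
from collections import defaultdict, Counter

class MarkovPrefetcher:
    """First-order Markov miss predictor (illustrative sketch;
    fanout and unbounded table are assumptions)."""

    def __init__(self, fanout=2):
        self.successors = defaultdict(Counter)  # miss addr -> next-miss counts
        self.prev_miss = None
        self.fanout = fanout

    def observe_miss(self, addr):
        # Learn the transition from the previous miss to this one.
        if self.prev_miss is not None:
            self.successors[self.prev_miss][addr] += 1
        self.prev_miss = addr

    def prefetch_candidates(self, addr):
        # On a miss, prefetch the most frequent next misses.
        return [a for a, _ in self.successors[addr].most_common(self.fanout)]

mp = MarkovPrefetcher(fanout=2)
for addr in [0xA, 0xB, 0xA, 0xB, 0xA, 0xC]:
    mp.observe_miss(addr)
assert mp.prefetch_candidates(0xA) == [0xB, 0xC]  # B followed A twice, C once
```

Because the table records observed miss-to-miss transitions rather than strides, this style of prefetcher can follow irregular (e.g. pointer-chasing) patterns that stride prefetchers miss.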
Proceedings ArticleDOI

Evaluating stream buffers as a secondary cache replacement

TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates comparable to typical secondary-cache hit rates, and that as the data-set size of the scientific workload increases, stream-buffer performance typically improves relative to secondary-cache performance, showing that stream buffers scale better to large data sets.