Proceedings ArticleDOI

Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

TLDR
The evaluation results show that the combined scheme significantly improves performance over a processor with an automatic prefetcher, and its strength in prefetching data with irregular access patterns offers the best use of the scheme.
Abstract
This paper proposes an instruction pre-execution scheme that reduces load latency through early scheduling of loads in a high-performance processor. Our scheme exploits the difference between the amount of instruction-level parallelism available with an unlimited number of physical registers and that available with the actual number of physical registers. We introduce a scheme called two-step physical register deallocation. Our scheme deallocates physical registers at the renaming stage as a first step, eliminating pipeline stalls caused by a physical register shortage. Instructions wait for the final deallocation, the second step, in the instruction window. While waiting, the scheme allows pre-execution of instructions. This enables prefetching of load data and early calculation of memory effective addresses. In particular, our execution-based scheme is effective at prefetching data with irregular access patterns. Since an automatic prefetcher handles regular access patterns well, combining it with our scheme makes the best use of both. The evaluation results show that the combined scheme significantly improves performance over a processor with an automatic prefetcher.
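The two-step deallocation described above can be sketched as a toy free-list model. This is a minimal illustrative Python sketch, not the authors' design: the class and method names are my own, and real hardware would track a mapping table and recover early-released registers on misspeculation.

```python
from collections import deque

class TwoStepRegisterFile:
    """Toy model of two-step physical register deallocation
    (illustrative only; names and structure are assumptions)."""

    def __init__(self, num_phys_regs):
        self.free_list = deque(range(num_phys_regs))
        self.pending_final = set()  # early-released, awaiting step 2

    def allocate(self):
        # Rename stage: grab a free physical register.
        if not self.free_list:
            raise RuntimeError("stall: no free physical registers")
        return self.free_list.popleft()

    def early_deallocate(self, reg):
        # Step 1: release at rename, so later renames do not stall.
        # Instructions holding only early-released registers may
        # pre-execute (e.g. to prefetch) but do not commit results.
        self.pending_final.add(reg)
        self.free_list.append(reg)

    def final_deallocate(self, reg):
        # Step 2: the old value is truly dead; reuse is now safe.
        self.pending_final.discard(reg)

rf = TwoStepRegisterFile(4)
regs = [rf.allocate() for _ in range(4)]  # exhaust the free list
rf.early_deallocate(regs[0])              # step 1 frees it immediately
assert rf.allocate() == regs[0]           # rename no longer stalls
```

The point of the sketch is the ordering: the register returns to the free list at step 1, before the step-2 confirmation, which is what removes the rename stall.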


Citations
Posted Content

Design Space Exploration to Find the Optimum Cache and Register File Size for Embedded Applications

TL;DR: Experimental results show that although larger caches and register files are one way to improve performance in embedded processors, increasing these parameters beyond a threshold saturates the performance improvement and then degrades it.
Proceedings ArticleDOI

Reducing register file size through instruction pre-execution enhanced by value prediction

TL;DR: The use of value prediction is proposed to exploit the MLP that would be available with an unlimited number of physical registers; this usage has the advantage over the conventional use of value prediction for enhancing ILP that no recovery from misspeculation is needed.
Journal ArticleDOI

Two-Step Physical Register Deallocation for Data Prefetching and Address Pre-Calculation

TL;DR: The two-step physical register deallocation scheme is introduced, which deallocates physical registers at the renaming stage as a first step, and eliminates pipeline stalls caused by a shortage of physical registers.
Journal ArticleDOI

MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

TL;DR: A dynamic scheme is proposed that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP, achieving nearly the best performance possible with fixed-size resources.
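The resizing idea can be illustrated with a toy controller. This is a hypothetical sketch, not the paper's mechanism: the thresholds, size bounds, and decision rule are all assumptions chosen only to show the grow-on-MLP, shrink-on-ILP behavior.

```python
def resize_window(predicted_mlp, predicted_ilp, cur_size,
                  min_size=64, max_size=256):
    """Illustrative window-resizing policy (all constants assumed).

    Grow the window when many independent cache misses could be
    overlapped (high MLP); shrink toward a small, fast window when
    the code is ILP-bound and a large window buys little.
    """
    if predicted_mlp >= 2.0:
        return min(cur_size * 2, max_size)   # memory-bound: go wide
    if predicted_ilp >= 2.0:
        return max(cur_size // 2, min_size)  # ILP-bound: stay small
    return cur_size                          # no clear signal: hold

assert resize_window(3.0, 1.0, 128) == 256   # high MLP doubles the window
assert resize_window(0.5, 3.0, 128) == 64    # high ILP halves it
```

A real implementation would also have to account for the latency cost of a larger window, which is why the sketch clamps the size to a fixed range.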
Journal ArticleDOI

Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications

TL;DR: In this paper, the authors perform a comprehensive design space exploration for an optimum uni-thread embedded processor under limited area and power budgets, then run multiple threads on this architecture to find the maximum thread-level parallelism (TLP) attainable on the performance-per-power and area-optimum uni-thread architecture.
References
Journal ArticleDOI

Hitting the memory wall: implications of the obvious

TL;DR: This work observes that processor speed improves far faster than DRAM speed, so average memory access time will increasingly dominate execution time; without architectural change, systems will eventually hit a "memory wall" where performance is bounded by memory latency.
Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

TL;DR: In this article, a hardware technique to improve cache performance is presented: a small fully-associative cache between a cache and its refill path holds prefetched data so that it is not placed directly in the cache.
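The buffer described in that blurb can be modeled as a tiny fully associative FIFO probed on a cache miss. This is a minimal sketch under assumed sizes and policies, not Jouppi's exact design:

```python
from collections import OrderedDict

class PrefetchBuffer:
    """Small fully associative FIFO buffer between a cache and its
    refill path (illustrative sketch; entry count is an assumption)."""

    def __init__(self, entries=4):
        self.entries = entries
        self.buf = OrderedDict()          # addr -> line data

    def insert(self, addr, data):
        # Prefetched lines go here instead of polluting the cache.
        self.buf[addr] = data
        self.buf.move_to_end(addr)
        if len(self.buf) > self.entries:
            self.buf.popitem(last=False)  # evict the oldest entry

    def lookup(self, addr):
        # Probed on a cache miss; a hit would be promoted into the
        # cache, so the entry is removed from the buffer.
        return self.buf.pop(addr, None)

pb = PrefetchBuffer(entries=2)
pb.insert(0x100, "A")
pb.insert(0x140, "B")
pb.insert(0x180, "C")                     # 0x100 is evicted (FIFO)
assert pb.lookup(0x100) is None
assert pb.lookup(0x140) == "B"
```

Keeping prefetched lines out of the cache proper is the key design choice: a wrong prefetch then displaces only a buffer entry, not a useful cache line.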
Proceedings ArticleDOI

Value locality and load value prediction

TL;DR: This paper introduces the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describes how to effectively capture and exploit it in order to perform load value prediction.
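A common way to exploit value locality is a last-value predictor indexed by load PC with a confidence counter. The sketch below is illustrative only; the table organization and thresholds are assumptions, not taken from the paper:

```python
class LastValuePredictor:
    """Last-value load predictor with a saturating confidence
    counter (minimal sketch; constants are assumptions)."""

    def __init__(self):
        self.table = {}                       # load PC -> (value, confidence)

    def predict(self, pc):
        value, conf = self.table.get(pc, (None, 0))
        return value if conf >= 2 else None   # predict only when confident

    def train(self, pc, actual):
        value, conf = self.table.get(pc, (actual, 0))
        if value == actual:
            conf = min(conf + 1, 3)           # repeated value: gain confidence
        else:
            value, conf = actual, 0           # value changed: reset
        self.table[pc] = (value, conf)

lvp = LastValuePredictor()
for _ in range(3):
    lvp.train(0x400, 7)        # this load keeps returning 7
assert lvp.predict(0x400) == 7
lvp.train(0x400, 9)            # value changed: confidence resets
assert lvp.predict(0x400) is None
```

The confidence gate matters because a mispredicted load value forces a pipeline recovery, so the predictor should stay silent unless the value has repeated.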
Proceedings ArticleDOI

Prefetching using Markov predictors

TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two-thirds the memory of a demand-fetch cache organization.
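The core of such a prefetcher is a table mapping each miss address to the addresses that most often missed next, trained on the miss stream. This is a first-order sketch under assumed parameters; the real design bounds table size and supports multiple predictions per entry in hardware:

```python
from collections import defaultdict, Counter

class MarkovPrefetcher:
    """First-order Markov miss predictor (illustrative sketch;
    fanout and unbounded table are assumptions)."""

    def __init__(self, fanout=2):
        self.successors = defaultdict(Counter)  # miss addr -> next-miss counts
        self.prev_miss = None
        self.fanout = fanout

    def observe_miss(self, addr):
        # Learn the transition from the previous miss to this one.
        if self.prev_miss is not None:
            self.successors[self.prev_miss][addr] += 1
        self.prev_miss = addr

    def prefetch_candidates(self, addr):
        # On a miss, prefetch the most frequent next misses.
        return [a for a, _ in self.successors[addr].most_common(self.fanout)]

mp = MarkovPrefetcher(fanout=2)
for addr in [0xA, 0xB, 0xA, 0xB, 0xA, 0xC]:
    mp.observe_miss(addr)
assert mp.prefetch_candidates(0xA) == [0xB, 0xC]  # B followed A twice, C once
```

Because the table records observed miss-to-miss transitions rather than strides, this style of prefetcher can follow irregular (e.g. pointer-chasing) patterns that stride prefetchers miss.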
Proceedings ArticleDOI

Evaluating stream buffers as a secondary cache replacement

TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates comparable to typical secondary-cache hit rates, and that as the data-set size of the scientific workload increases, stream-buffer performance typically improves relative to secondary-cache performance, showing that stream buffers scale better to large data sets.