(PDF) Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers (1990) | Norman P. Jouppi

Journal Article

Micro-pages: increasing DRAM efficiency with locality-aware data placement

Kshitij Sudan¹, Niladrish Chatterjee¹, David Nellans¹, Manu Awasthi¹, Rajeev Balasubramonian¹, Al Davis¹

University of Utah¹

13 Mar 2010

TL;DR: The schemes presented here are motivated by the observation that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks; co-locating such chunks in a row buffer improves the overall utilization of the row buffer contents, and consequently reduces memory energy consumption and access time.

Abstract: Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small fraction of these bits are ever returned back to the CPU. This ends up wasting energy and time to read (and subsequently write back) bits which are used rarely. Traditionally, an open-page policy has been used for uni-processor systems and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, thus destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems. The schemes presented here are motivated by our observations that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row-buffer will improve the overall utilization of the row buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably involving a reduction in OS page size and software or hardware assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved along with energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).

170 citations

Proceedings Article

Efficient simulation of caches under optimal replacement with applications to miss characterization

Rabin A. Sugumar, Santosh G. Abraham

01 Jun 1993

TL;DR: The OPT model is proposed that uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model, and three new techniques for optimal cache simulation are presented.

Abstract: Cache miss characterization models such as the three Cs model are useful in developing schemes to reduce cache misses and their penalty. In this paper we propose the OPT model that uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model. However, current methods for optimal cache simulation are slow and difficult to use. We present three new techniques for optimal cache simulation. First, we propose a limited lookahead strategy with error fixing, which allows one pass simulation of multiple optimal caches. Second, we propose a scheme to group entries in the OPT stack, which allows efficient tree based fully-associative cache simulation under OPT. Third, we propose a scheme for exploiting partial inclusion in set-associative cache simulation under OPT. Simulators based on these algorithms were used to obtain cache miss characterizations using the OPT model for nine SPEC benchmarks. The results indicate that miss ratios under OPT are substantially lower than those under LRU replacement, by up to 70% in fully-associative caches, and up to 32% in two-way set-associative caches.
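
To make the LRU versus OPT comparison concrete, here is a minimal sketch that counts misses in a fully-associative cache under both policies. It uses the textbook form of optimal replacement (evict the block whose next use lies furthest in the future) with a precomputed next-use index; it is not the limited-lookahead, stack-grouping, or partial-inclusion technique the paper proposes. The function names `lru_misses` and `opt_misses` and the block-number trace format are assumptions made for illustration.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for a fully-associative cache with LRU replacement."""
    cache = OrderedDict()  # keys are blocks, ordered from LRU to MRU
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses


def opt_misses(trace, capacity):
    """Count misses under optimal replacement (naive two-pass version)."""
    never = float("inf")
    # Backward pass: next_use[i] is the index of the next reference to trace[i].
    next_use = [never] * len(trace)
    upcoming = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = upcoming.get(trace[i], never)
        upcoming[trace[i]] = i

    cache = {}  # block -> index of its next reference
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the block whose next reference is furthest away.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = next_use[i]
    return misses


if __name__ == "__main__":
    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print("LRU misses:", lru_misses(trace, 3))
    print("OPT misses:", opt_misses(trace, 3))
```

The naive OPT pass needs the whole trace in memory and simulates one cache size per run, which is the kind of cost the abstract's faster techniques are meant to avoid.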

168 citations

Journal Article

Going the distance for TLB prefetching: an application-driven study

Gokul B. Kandiraju¹, Anand Sivasubramaniam¹

Pennsylvania State University¹

01 May 2002

TL;DR: A novel prefetching mechanism is proposed, called Distance Prefetching, that attempts to capture patterns in the reference behavior in a smaller space than earlier proposals, and is demonstrated with detailed simulations of a wide variety of applications.

Abstract: The importance of the Translation Lookaside Buffer (TLB) on system performance is well known. There have been numerous prior efforts addressing TLB design issues for cutting down access times and lowering miss rates. However, it was only recently that the first exploration [26] on prefetching TLB entries ahead of their need was undertaken and a mechanism called Recency Prefetching was proposed. There is a large body of literature on prefetching for caches, and it is not clear how they can be adapted (or if the issues are different) for TLBs, how well suited they are for TLB prefetching, and how they compare with the recency prefetching mechanism. This paper presents the first detailed comparison of different prefetching mechanisms (previously proposed for caches) - arbitrary stride prefetching, and Markov prefetching - for TLB entries, and evaluates their pros and cons. In addition, this paper proposes a novel prefetching mechanism, called Distance Prefetching, that attempts to capture patterns in the reference behavior in a smaller space than earlier proposals. Using detailed simulations of a wide variety of applications (56 in all) from different benchmark suites and all the SPEC CPU2000 applications, this paper demonstrates the benefits of distance prefetching.

167 citations

Journal Article

T-CREST

Martin Schoeberl¹, Sahar Abbaspour¹, Benny Akesson², Neil Audsley³, Raffaele Capasso, Jamie Garside³, Kees Goossens⁴, Sven Goossens⁴, Scott Hansen⁵, Reinhold Heckmann, Stefan Hepp⁶, Benedikt Huber⁶, Alexander Jordan¹, Evangelia Kasapaki¹, Jens Knoop⁶, Yonghui Li⁴, Daniel Prokesch⁶, Wolfgang Puffitsch¹, Peter Puschner⁶, Andre Rocha, Claudio Silva, Jens Sparsø¹, Alessandro Tocchi

Technical University of Denmark¹, Czech Technical University in Prague², University of York³, Eindhoven University of Technology⁴, Open Group⁵, Vienna University of Technology⁶

01 Oct 2015

TL;DR: Within the T-CREST project the authors propose novel solutions for time-predictable multi-core architectures that are optimized for the WCET instead of the average-case execution time.

Abstract: Real-time systems need time-predictable platforms to allow static analysis of the worst-case execution time (WCET). Standard multi-core processors are optimized for the average case and are hardly analyzable. Within the T-CREST project we propose novel solutions for time-predictable multi-core architectures that are optimized for the WCET instead of the average-case execution time. The resulting time-predictable resources (processors, interconnect, memory arbiter, and memory controller) and tools (compiler, WCET analysis) are designed to ease WCET analysis and to optimize WCET performance. Compared to other processors the WCET performance is outstanding. The T-CREST platform is evaluated with two industrial use cases. An application from the avionic domain demonstrates that tasks executing on different cores do not interfere with respect to their WCET. A signal processing application from the railway domain shows that the WCET can be reduced for computation-intensive tasks when distributing the tasks on several cores and using the network-on-chip for communication. With three cores the WCET is improved by a factor of 1.8 and with 15 cores by a factor of 5.7. The T-CREST project is the result of a collaborative research and development project executed by eight partners from academia and industry. The European Commission funded T-CREST.

166 citations

Journal Article

Using a user-level memory thread for correlation prefetching

Yan Solihin¹, Jaejin Lee², Josep Torrellas¹

University of Illinois at Urbana–Champaign¹, Michigan State University²

01 May 2002

TL;DR: This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching, and shows that the scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46.

Abstract: This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide usability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
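
As a rough illustration of a correlation table kept as an ordinary software data structure, the sketch below records, for each miss address, the addresses that most recently missed right after it, and returns them as prefetch candidates when that address misses again. This is a generic pairwise correlation scheme; the abstract mentions a new table design and prefetching algorithm but does not describe them, so nothing here should be read as the paper's design. The class name `CorrelationPrefetcher`, the `num_succ` parameter, and the `on_miss` method are hypothetical.

```python
class CorrelationPrefetcher:
    """Generic pairwise correlation prefetcher (illustrative sketch only)."""

    def __init__(self, num_succ=2):
        self.num_succ = num_succ  # assumed limit on successors kept per entry
        self.table = {}           # miss address -> successors, most recent first
        self.last_miss = None

    def on_miss(self, addr):
        """Record the miss and return the addresses to prefetch."""
        if self.last_miss is not None:
            # Learn: addr followed last_miss, so store it as a successor.
            successors = self.table.setdefault(self.last_miss, [])
            if addr in successors:
                successors.remove(addr)
            successors.insert(0, addr)
            del successors[self.num_succ:]  # keep only the newest entries
        self.last_miss = addr
        # Predict: prefetch whatever followed this address in the past.
        return list(self.table.get(addr, []))


if __name__ == "__main__":
    prefetcher = CorrelationPrefetcher(num_succ=2)
    for miss in ["A", "B", "A", "C", "A"]:
        print(miss, "->", prefetcher.on_miss(miss))
```

Because the table is plain data, the prediction policy (how many successors to keep, how to order them) can be changed per application, which matches the flexibility argument in the abstract.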

164 citations
