Proceedings ArticleDOI

DMDC: Delayed Memory Dependence Checking through Age-Based Filtering

TL;DR: This paper introduces two new management schemes: a filtering scheme based on simple age-tracking, which avoids 95-98% of associative load queue (LQ) searches using only a few registers, and Delayed Memory Dependence Checking (DMDC), which eliminates the associative LQ and cuts the energy it consumes by an average of 95%.
Abstract: One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations as a result of out-of-order execution of memory instructions. Traditional CAM-based associative queues can be very slow and energy hungry. In this paper we introduce two new management schemes. The first one is a filtering scheme based on simple age-tracking. This scheme can easily avoid 95-98% of associative load queue (LQ) searches using only a few registers. This translates into significant power savings. More importantly, however, this filtering makes our second scheme, Delayed Memory Dependence Checking (DMDC), practical. With a small hash table, DMDC completely avoids the need for an associative LQ and relies on indexing-based checking at the commit phase and hence cuts the energy spent on LQ by an average of 95%. At an average of about 0.3%, the performance impact is negligible. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-8%, depending on the configuration and the applications.
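
To make the mechanism concrete, the following is a minimal sketch, under stated assumptions, of the two ideas: an age register that lets most stores skip the associative LQ search, and a small address-hashed table that is checked at commit instead of an associative LQ. The structure names, the 64-entry table size, and the exact commit-time rule are illustrative assumptions; they are not taken from the paper.

```cpp
#include <array>
#include <cstdint>

using Age = std::uint64_t;  // monotonically increasing program-order sequence number

struct AgeFilter {
    Age youngest_executed_load = 0;  // 0 means no load has executed yet

    void on_load_executed(Age load_age) {
        if (load_age > youngest_executed_load) youngest_executed_load = load_age;
    }

    // A store's LQ search can only hit a load that is *younger* than the store
    // and has already executed.  If no such load exists, the associative search
    // is guaranteed to miss and can be skipped.
    bool store_must_search(Age store_age) const {
        return youngest_executed_load > store_age;
    }
};

struct DmdcTable {
    // Small direct-mapped table indexed by a hash of the data address; each
    // entry holds the age of the youngest store that has committed a write to
    // an address mapping to that entry.
    static constexpr std::size_t kEntries = 64;   // assumed, illustrative size
    std::array<Age, kEntries> youngest_store{};   // zero-initialized: no store yet

    static std::size_t index(std::uint64_t addr) { return (addr >> 3) % kEntries; }

    // Stores update the table in program order as they commit, so entries only grow.
    void on_store_commit(std::uint64_t addr, Age store_age) {
        youngest_store[index(addr)] = store_age;
    }

    // When a load executes, it remembers the age of the youngest older store it is
    // known to have ordered correctly against (here `safe_up_to`).  At load commit,
    // if a store younger than `safe_up_to` has since committed a write to this
    // hashed address, the load may have read stale data and must replay.  Hash
    // aliasing can only cause extra replays, never missed violations.
    bool load_must_replay(std::uint64_t addr, Age safe_up_to) const {
        return youngest_store[index(addr)] > safe_up_to;
    }
};
```
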
Citations
Proceedings ArticleDOI
09 Jun 2007
TL;DR: It is shown how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses, and it is shown that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks.
Abstract: Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. In this paper, we show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms from bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By applying a Bloom filter as well, this design achieves full hardware memory disambiguation for a 1,024-instruction window while requiring low average power per load and store access of 8 and 12 CAM entries, respectively.
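
The sketch below illustrates, under stated assumptions, how address-interleaved banking and a Bloom filter front-end can work together: a load or store is steered to one of four 48-entry banks (the sizes quoted in the abstract), and a per-bank Bloom filter lets non-matching addresses skip the CAM search entirely. The bank-selection granularity, Bloom filter size, and hash functions are assumptions made for the example, not the paper's design.

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

struct LsqEntry { std::uint64_t addr; std::uint64_t age; bool is_store; };

struct LsqBank {
    static constexpr std::size_t kEntries = 48;  // per-bank capacity from the abstract
    std::vector<LsqEntry> entries;               // unordered, late-bound entries
    std::bitset<1024> bloom;                     // assumed 1024-bit Bloom filter

    bool full() const { return entries.size() >= kEntries; }

    static std::size_t h1(std::uint64_t a) { return (a >> 3) % 1024; }
    static std::size_t h2(std::uint64_t a) { return ((a >> 3) * 0x9E3779B97F4A7C15ull >> 54) % 1024; }

    void insert(const LsqEntry& e) {
        entries.push_back(e);  // allocated at issue ("late binding"), not at dispatch
        bloom.set(h1(e.addr));
        bloom.set(h2(e.addr));
        // A real design would periodically clear or rebuild the filter as entries retire.
    }

    // If the Bloom filter misses, no entry in this bank can match the address,
    // so the associative (CAM) search is skipped.
    bool may_match(std::uint64_t addr) const {
        return bloom.test(h1(addr)) && bloom.test(h2(addr));
    }
};

struct BankedLsq {
    static constexpr std::size_t kBanks = 4;
    std::array<LsqBank, kBanks> banks;

    // Address-interleaved bank selection at cache-line granularity (assumed).
    static std::size_t bank_of(std::uint64_t addr) { return (addr >> 6) % kBanks; }

    // Returns false when the bank is full and the request must fall back to one of
    // the overflow mechanisms discussed above (replay, skid buffer, virtual channel).
    bool try_insert(const LsqEntry& e) {
        LsqBank& b = banks[bank_of(e.addr)];
        if (b.full()) return false;
        b.insert(e);
        return true;
    }
};
```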

73 citations

Patent
03 Apr 2009
TL;DR: In this article, the issue queue (IQ) stores instructions that may issue out-of-order in an internal data store (IDS), and an age matrix in the IQ maintains a record of relative instruction age for the instructions within the IDS.
Abstract: An information handling system includes a processor with an instruction issue queue (IQ) that may perform age tracking operations. The issue queue IQ maintains or stores instructions that may issue out-of-order in an internal data store IDS. The IDS organizes instructions in a queue position (QPOS) addressing arrangement. An age matrix of the IQ maintains a record of relative instruction aging for those instructions within the IDS. The age matrix updates latches or other memory cell data to reflect the changes in IDS instruction ages during a dispatch operation into the IQ. During dispatch of one or more instructions, the age matrix may update only those latches that require data change to reflect changing IDS instruction ages. The age matrix employs row and column data and clock controls to individually update those latches requiring update. The issue queue may selectively clock a row and a column of cells of the age matrix that correspond to a dispatched instruction's queue position while leaving other cells unclocked to conserve power.
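
A small software analogue of the selective age-matrix update described above: when an instruction dispatches into queue position q, only row q and column q of the matrix change, which is what allows the hardware to leave every other latch unclocked. The matrix convention, queue size, and helper names are assumptions for illustration, not the patent's circuitry.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

class AgeMatrix {
public:
    static constexpr std::size_t kSlots = 32;  // assumed issue-queue size

    // matrix_[i][q] == true means the instruction in slot i is OLDER than the
    // instruction in slot q (assumed convention).
    void dispatch(std::size_t q) {
        // Every currently valid instruction is older than the new arrival:
        // set column q in each valid row and clear row q.  Only these cells
        // change, so only their latches would be clocked in hardware.
        for (std::size_t i = 0; i < kSlots; ++i) {
            matrix_[i][q] = valid_[i];
        }
        matrix_[q].reset();
        valid_[q] = true;
    }

    void retire(std::size_t q) { valid_[q] = false; }

    // An instruction is the oldest ready one if no other valid, ready
    // instruction is recorded as older than it.
    bool is_oldest_ready(std::size_t q, const std::bitset<kSlots>& ready) const {
        for (std::size_t i = 0; i < kSlots; ++i) {
            if (i != q && valid_[i] && ready[i] && matrix_[i][q]) return false;
        }
        return true;
    }

private:
    std::array<std::bitset<kSlots>, kSlots> matrix_{};
    std::bitset<kSlots> valid_{};
};
```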

29 citations

Proceedings ArticleDOI
24 Oct 2008
TL;DR: This paper proposes an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetching from an excluded thread so that its instructions reach the execution core just as the original dependence resolves.
Abstract: Simultaneous multithreading (SMT) attempts to keep a dynamically scheduled processor's resources busy with work from multiple independent threads. Threads with long-latency stalls, however, can lead to a reduction in overall throughput because they occupy many of the critical processor resources. In this work, we first study the interaction between stalls caused by ambiguous memory dependences and SMT processing. We then propose the technique of proactive exclusion (PE) where the SMT fetch unit stops fetching from a thread when a memory dependence is predicted to exist. However, after the dependence has been resolved, the thread is delayed waiting for new instructions to be fetched and delivered down the front-end pipeline. So we introduce an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves. We show that combining these two techniques (PEEP) yields a 16.9% throughput improvement on a 4-way SMT processor that supports speculative memory disambiguation. These strong results indicate that a fetch policy that is cognizant of future stalls considerably improves the throughput of an SMT machine.
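
A hedged sketch of the fetch-gating decision implied by the abstract: a thread is excluded while a predicted memory dependence is unresolved (PE), and re-admitted one front-end-depth's worth of cycles before the predicted resolution so that refilled instructions arrive roughly as the dependence resolves (EP). The per-thread fields, cycle bookkeeping, and class names are assumptions, not the paper's implementation.

```cpp
#include <cstdint>

struct ThreadState {
    bool excluded = false;                       // set when a memory dependence is predicted
    std::uint64_t predicted_resolve_cycle = 0;   // when that dependence should resolve
};

class PeepFetchPolicy {
public:
    explicit PeepFetchPolicy(std::uint64_t frontend_depth_cycles)
        : frontend_depth_(frontend_depth_cycles) {}

    // Proactive exclusion: stop fetching from a thread whose next load is
    // predicted to depend on an unresolved older store.
    void on_predicted_dependence(ThreadState& t, std::uint64_t now,
                                 std::uint64_t predicted_delay_cycles) {
        t.excluded = true;
        t.predicted_resolve_cycle = now + predicted_delay_cycles;
    }

    void on_dependence_resolved(ThreadState& t) { t.excluded = false; }

    // Early parole: re-admit the thread early enough that newly fetched
    // instructions traverse the front end and reach the execution core just
    // as the dependence is predicted to resolve.
    bool may_fetch(const ThreadState& t, std::uint64_t now) const {
        if (!t.excluded) return true;
        return now + frontend_depth_ >= t.predicted_resolve_cycle;
    }

private:
    std::uint64_t frontend_depth_;  // cycles from fetch to execute (assumed known)
};
```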

16 citations


Cites background from "DMDC: Delayed Memory Dependence Che..."

  • ...Prior work on memory dependence prediction has demonstrated that the relationships between loads and stores are highly predictable, and the predictability of these patterns has been exploited in other memory scheduling research [3, 17, 18, 22]....


Journal Article
TL;DR: The most significant proposals in this research field are reviewed, focusing on the authors' own contributions to optimizing address-based memory disambiguation logic, namely the load-store queue.
Abstract: One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.

11 citations


Cites methods from "DMDC: Delayed Memory Dependence Che..."

  • ...This design [23][24] replaces the LQ with an address-based table into which only a very small fraction of all stores write....


Patent
31 Jan 2014
TL;DR: In this article, the load and store buffers are used to determine, for a load instruction, whether there are any unresolved store instructions in the store buffer that are older than the load instruction.
Abstract: A method and load and store buffer for issuing a load instruction to a data cache. The method includes determining whether there are any unresolved store instructions in the store buffer that are older than the load instruction. If there is at least one unresolved store instruction in the store buffer older than the load instruction, it is determined whether the oldest unresolved store instruction in the store buffer is within a speculation window for the load instruction. If the oldest unresolved store instruction is within the speculation window for the load instruction, the load instruction is speculatively issued to the data cache. Otherwise, the load instruction is stalled until any unresolved store instructions outside the speculation window are resolved. The speculation window is a short window that defines a number of instructions or store instructions that immediately precede the load instruction.
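
A minimal sketch of the decision rule described in the abstract: a load issues normally when no older store is unresolved, issues speculatively when every unresolved older store lies within a short window immediately preceding it, and stalls otherwise. The use of sequence numbers as ages and the structure names are assumptions for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct StoreEntry {
    std::uint64_t seq;        // program-order sequence number
    bool address_resolved;    // true once the store's address is known
};

enum class LoadAction { IssueNow, IssueSpeculatively, Stall };

LoadAction decide_load_issue(const std::vector<StoreEntry>& store_buffer,
                             std::uint64_t load_seq,
                             std::uint64_t speculation_window) {
    // Find the oldest unresolved store that is older than the load.
    std::optional<std::uint64_t> oldest_unresolved;
    for (const StoreEntry& s : store_buffer) {
        if (!s.address_resolved && s.seq < load_seq) {
            if (!oldest_unresolved || s.seq < *oldest_unresolved)
                oldest_unresolved = s.seq;
        }
    }
    if (!oldest_unresolved)
        return LoadAction::IssueNow;            // nothing older is unresolved
    if (load_seq - *oldest_unresolved <= speculation_window)
        return LoadAction::IssueSpeculatively;  // all unresolved older stores are "recent"
    return LoadAction::Stall;                   // wait for stores outside the window
}
```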

11 citations

References
Journal ArticleDOI
TL;DR: This document describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors.
Abstract: This document describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, better documentation, easier installation, improved portability, and higher performance. This paper contains a complete description of the tool set, including retrieval and installation instructions, a description of how to use the tools, a description of the target SimpleScalar architecture, and many details about the internals of the tools and how to customize them. With this guide, the tool set can be brought up and generating results in under an hour (on supported platforms).

3,079 citations

Proceedings ArticleDOI
01 May 2000
TL;DR: Wattch is presented, a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level and opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
Abstract: Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities. This paper presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level. Wattch is 1000X or more faster than existing layout-level power tools, and yet maintains accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. This paper presents several validations of Wattch's accuracy. In addition, we present three examples that demonstrate how architects or compiler writers might use Wattch to evaluate power consumption in their design process. We see Wattch as a complement to existing lower-level tools; it allows architects to explore and cull the design space early on, using faster, higher-level tools. It also opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
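
As a toy illustration of the architecture-level approach described here, the snippet below multiplies per-structure, per-access energy estimates (which a tool like Wattch derives from capacitance models) by access counts collected during simulation to obtain average dynamic power. All names and constants are invented for the example; this is not Wattch's actual model or API.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct StructureModel {
    double energy_per_access_nj;   // assumed per-access dynamic energy (nJ)
    std::uint64_t accesses = 0;    // counted by the performance simulator
};

// Average dynamic power over a simulated interval: sum of (energy/access * accesses)
// for each modeled structure, divided by the simulated wall-clock time.
double total_dynamic_power_watts(const std::map<std::string, StructureModel>& units,
                                 std::uint64_t cycles, double clock_hz) {
    double total_nj = 0.0;
    for (const auto& [name, unit] : units) {
        (void)name;  // key kept only for readability of the map's contents
        total_nj += unit.energy_per_access_nj * static_cast<double>(unit.accesses);
    }
    double seconds = static_cast<double>(cycles) / clock_hz;
    return (total_nj * 1e-9) / seconds;
}

// Example use (all numbers invented):
//   std::map<std::string, StructureModel> units;
//   units["regfile"] = {0.05, 1'200'000};
//   units["dcache"]  = {0.40,   800'000};
//   double watts = total_dynamic_power_watts(units, 2'000'000, 2e9);
```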

2,848 citations

Proceedings ArticleDOI
01 Oct 2002
TL;DR: This work quantifies the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explores the large scale behavior of several programs, and develops a set of algorithms based on clustering capable of analyzing this behavior.
Abstract: Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution. Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.
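
The sketch below shows, under stated assumptions, the Basic Block Vector idea: count how many instructions each basic block contributes within fixed-length intervals, normalize each interval into a vector, and compare intervals with a simple distance metric. These per-interval vectors are what would then be clustered (e.g., with k-means) to choose representative simulation points. The interval length, sparse representation, and distance metric are illustrative choices, not the paper's exact parameters.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

class BbvCollector {
public:
    explicit BbvCollector(std::uint64_t interval_insns = 10'000'000)
        : interval_insns_(interval_insns) {}

    // Called by the simulator or profiler once per executed basic block.
    void on_basic_block(std::uint64_t block_id, std::uint64_t insns_in_block) {
        counts_[block_id] += insns_in_block;
        insns_in_interval_ += insns_in_block;
        if (insns_in_interval_ >= interval_insns_) finish_interval();
    }

    const std::vector<std::unordered_map<std::uint64_t, double>>& intervals() const {
        return intervals_;
    }

    // Manhattan distance between two normalized BBVs (sparse representation).
    static double distance(const std::unordered_map<std::uint64_t, double>& a,
                           const std::unordered_map<std::uint64_t, double>& b) {
        double d = 0.0;
        for (const auto& [id, w] : a) {
            auto it = b.find(id);
            d += std::abs(w - (it == b.end() ? 0.0 : it->second));
        }
        for (const auto& [id, w] : b)
            if (a.find(id) == a.end()) d += w;
        return d;
    }

private:
    void finish_interval() {
        std::unordered_map<std::uint64_t, double> bbv;
        for (const auto& [id, c] : counts_)
            bbv[id] = static_cast<double>(c) / static_cast<double>(insns_in_interval_);
        intervals_.push_back(std::move(bbv));
        counts_.clear();
        insns_in_interval_ = 0;
    }

    std::uint64_t interval_insns_;
    std::uint64_t insns_in_interval_ = 0;
    std::unordered_map<std::uint64_t, std::uint64_t> counts_;
    std::vector<std::unordered_map<std::uint64_t, double>> intervals_;
};
```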

1,702 citations

Journal ArticleDOI
J. M. Tendler, John Steven Dodson, J. S. Fields, Hung Qui Le, Balaram Sinharoy
TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Abstract: The IBM POWER4 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER4 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor.

685 citations

Proceedings ArticleDOI
16 Apr 1998
TL;DR: It is shown that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Abstract: For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store instructions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and communicate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's "store set". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the implementation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance.
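
A hedged sketch of the store-set mechanism summarized above: a Store Set ID Table (SSIT) indexed by instruction PC maps loads and stores to set IDs, and a Last Fetched Store Table (LFST) indexed by set ID tracks the most recent in-flight store of each set; a load whose set has an in-flight store waits for that store. The two-table organization follows the paper's description, but the table sizes, hashing, and the merging policy on a violation are simplified assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

class StoreSetPredictor {
public:
    StoreSetPredictor() : ssit_(kSsitSize), lfst_(kLfstSize) {}

    // On a memory-order violation, place the offending load and store in the same
    // store set (simplified merge: reuse the store's set if it already has one).
    void on_violation(std::uint64_t load_pc, std::uint64_t store_pc) {
        std::uint32_t id = ssit_[hash(store_pc)]
                               ? *ssit_[hash(store_pc)]
                               : static_cast<std::uint32_t>(next_id_++ % kLfstSize);
        ssit_[hash(load_pc)] = id;
        ssit_[hash(store_pc)] = id;
    }

    // When a store is fetched, remember it as the last fetched store of its set.
    void on_store_fetched(std::uint64_t store_pc, std::uint64_t store_seq) {
        if (auto id = ssit_[hash(store_pc)]) lfst_[*id] = store_seq;
    }

    // When the store issues or commits, clear the entry if it is still the last one.
    void on_store_done(std::uint64_t store_pc, std::uint64_t store_seq) {
        if (auto id = ssit_[hash(store_pc)])
            if (lfst_[*id] == store_seq) lfst_[*id].reset();
    }

    // A load must wait for the returned store (if any) before issuing.
    std::optional<std::uint64_t> store_to_wait_for(std::uint64_t load_pc) const {
        if (auto id = ssit_[hash(load_pc)]) return lfst_[*id];
        return std::nullopt;
    }

private:
    static constexpr std::size_t kSsitSize = 4096;  // assumed sizes
    static constexpr std::size_t kLfstSize = 128;
    static std::size_t hash(std::uint64_t pc) { return (pc >> 2) % kSsitSize; }

    std::vector<std::optional<std::uint32_t>> ssit_;  // PC -> store-set ID
    std::vector<std::optional<std::uint64_t>> lfst_;  // set ID -> last fetched store seq
    std::uint32_t next_id_ = 0;
};
```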

332 citations