Proceedings ArticleDOI

DMDC: Delayed Memory Dependence Checking through Age-Based Filtering

TL;DR: This paper introduces two new management schemes: a filtering scheme based on simple age-tracking, which avoids 95-98% of associative load queue (LQ) searches using only a few registers, and Delayed Memory Dependence Checking (DMDC), which eliminates the associative LQ and cuts the energy it consumes by an average of 95%.
Abstract: One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations as a result of out-of-order execution of memory instructions. Traditional CAM-based associative queues can be very slow and energy hungry. In this paper we introduce two new management schemes. The first one is a filtering scheme based on simple age-tracking. This scheme can easily avoid 95-98% of associative load queue (LQ) searches using only a few registers. This translates into significant power savings. More importantly, however, this filtering makes our second scheme, Delayed Memory Dependence Checking (DMDC), practical. With a small hash table, DMDC completely avoids the need for an associative LQ and relies on indexing-based checking at the commit phase and hence cuts the energy spent on LQ by an average of 95%. At an average of about 0.3%, the performance impact is negligible. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-8%, depending on the configuration and the applications.
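
To make the mechanism concrete, the following is a minimal sketch, under stated assumptions, of the two ideas: an age register that lets most stores skip the associative LQ search, and a small address-hashed table that is checked at commit instead of an associative LQ. The structure names, the 64-entry table size, and the exact commit-time rule are illustrative assumptions; they are not taken from the paper.

```cpp
#include <array>
#include <cstdint>

using Age = std::uint64_t;  // monotonically increasing program-order sequence number

struct AgeFilter {
    Age youngest_executed_load = 0;  // 0 means no load has executed yet

    void on_load_executed(Age load_age) {
        if (load_age > youngest_executed_load) youngest_executed_load = load_age;
    }

    // A store's LQ search can only hit a load that is *younger* than the store
    // and has already executed.  If no such load exists, the associative search
    // is guaranteed to miss and can be skipped.
    bool store_must_search(Age store_age) const {
        return youngest_executed_load > store_age;
    }
};

struct DmdcTable {
    // Small direct-mapped table indexed by a hash of the data address; each
    // entry holds the age of the youngest store that has committed a write to
    // an address mapping to that entry.
    static constexpr std::size_t kEntries = 64;   // assumed, illustrative size
    std::array<Age, kEntries> youngest_store{};   // zero-initialized: no store yet

    static std::size_t index(std::uint64_t addr) { return (addr >> 3) % kEntries; }

    // Stores update the table in program order as they commit, so entries only grow.
    void on_store_commit(std::uint64_t addr, Age store_age) {
        youngest_store[index(addr)] = store_age;
    }

    // When a load executes, it remembers the age of the youngest older store it is
    // known to have ordered correctly against (here `safe_up_to`).  At load commit,
    // if a store younger than `safe_up_to` has since committed a write to this
    // hashed address, the load may have read stale data and must replay.  Hash
    // aliasing can only cause extra replays, never missed violations.
    bool load_must_replay(std::uint64_t addr, Age safe_up_to) const {
        return youngest_store[index(addr)] > safe_up_to;
    }
};
```
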
Citations
Proceedings ArticleDOI
09 Jun 2007
TL;DR: It is shown how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses, and it is shown that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks.
Abstract: Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. In this paper, we show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms from bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By applying a Bloom filter as well, this design achieves full hardware memory disambiguation for a 1,024-instruction window while requiring low average power per load and store access of 8 and 12 CAM entries, respectively.
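
The sketch below illustrates, under stated assumptions, how address-interleaved banking and a Bloom filter front-end can work together: a load or store is steered to one of four 48-entry banks (the sizes quoted in the abstract), and a per-bank Bloom filter lets non-matching addresses skip the CAM search entirely. The bank-selection granularity, Bloom filter size, and hash functions are assumptions made for the example, not the paper's design.

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

struct LsqEntry { std::uint64_t addr; std::uint64_t age; bool is_store; };

struct LsqBank {
    static constexpr std::size_t kEntries = 48;  // per-bank capacity from the abstract
    std::vector<LsqEntry> entries;               // unordered, late-bound entries
    std::bitset<1024> bloom;                     // assumed 1024-bit Bloom filter

    bool full() const { return entries.size() >= kEntries; }

    static std::size_t h1(std::uint64_t a) { return (a >> 3) % 1024; }
    static std::size_t h2(std::uint64_t a) { return ((a >> 3) * 0x9E3779B97F4A7C15ull >> 54) % 1024; }

    void insert(const LsqEntry& e) {
        entries.push_back(e);  // allocated at issue ("late binding"), not at dispatch
        bloom.set(h1(e.addr));
        bloom.set(h2(e.addr));
        // A real design would periodically clear or rebuild the filter as entries retire.
    }

    // If the Bloom filter misses, no entry in this bank can match the address,
    // so the associative (CAM) search is skipped.
    bool may_match(std::uint64_t addr) const {
        return bloom.test(h1(addr)) && bloom.test(h2(addr));
    }
};

struct BankedLsq {
    static constexpr std::size_t kBanks = 4;
    std::array<LsqBank, kBanks> banks;

    // Address-interleaved bank selection at cache-line granularity (assumed).
    static std::size_t bank_of(std::uint64_t addr) { return (addr >> 6) % kBanks; }

    // Returns false when the bank is full and the request must fall back to one of
    // the overflow mechanisms discussed above (replay, skid buffer, virtual channel).
    bool try_insert(const LsqEntry& e) {
        LsqBank& b = banks[bank_of(e.addr)];
        if (b.full()) return false;
        b.insert(e);
        return true;
    }
};
```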

73 citations

Patent
03 Apr 2009
TL;DR: In this article, the issue queue (IQ) stores instructions that may issue out-of-order in an internal data store (IDS), and an age matrix in the IQ maintains a record of relative instruction age for the instructions within the IDS.
Abstract: An information handling system includes a processor with an instruction issue queue (IQ) that may perform age tracking operations. The issue queue IQ maintains or stores instructions that may issue out-of-order in an internal data store IDS. The IDS organizes instructions in a queue position (QPOS) addressing arrangement. An age matrix of the IQ maintains a record of relative instruction aging for those instructions within the IDS. The age matrix updates latches or other memory cell data to reflect the changes in IDS instruction ages during a dispatch operation into the IQ. During dispatch of one or more instructions, the age matrix may update only those latches that require data change to reflect changing IDS instruction ages. The age matrix employs row and column data and clock controls to individually update those latches requiring update. The issue queue may selectively clock a row and a column of cells of the age matrix that correspond to a dispatched instruction's queue position while leaving other cells unclocked to conserve power.
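
A small software analogue of the selective age-matrix update described above: when an instruction dispatches into queue position q, only row q and column q of the matrix change, which is what allows the hardware to leave every other latch unclocked. The matrix convention, queue size, and helper names are assumptions for illustration, not the patent's circuitry.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

class AgeMatrix {
public:
    static constexpr std::size_t kSlots = 32;  // assumed issue-queue size

    // matrix_[i][q] == true means the instruction in slot i is OLDER than the
    // instruction in slot q (assumed convention).
    void dispatch(std::size_t q) {
        // Every currently valid instruction is older than the new arrival:
        // set column q in each valid row and clear row q.  Only these cells
        // change, so only their latches would be clocked in hardware.
        for (std::size_t i = 0; i < kSlots; ++i) {
            matrix_[i][q] = valid_[i];
        }
        matrix_[q].reset();
        valid_[q] = true;
    }

    void retire(std::size_t q) { valid_[q] = false; }

    // An instruction is the oldest ready one if no other valid, ready
    // instruction is recorded as older than it.
    bool is_oldest_ready(std::size_t q, const std::bitset<kSlots>& ready) const {
        for (std::size_t i = 0; i < kSlots; ++i) {
            if (i != q && valid_[i] && ready[i] && matrix_[i][q]) return false;
        }
        return true;
    }

private:
    std::array<std::bitset<kSlots>, kSlots> matrix_{};
    std::bitset<kSlots> valid_{};
};
```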

29 citations

Proceedings ArticleDOI
24 Oct 2008
TL;DR: This paper proposes an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetching from an excluded thread so that its instructions reach the execution core just as the original dependence resolves.
Abstract: Simultaneous multithreading (SMT) attempts to keep a dynamically scheduled processor's resources busy with work from multiple independent threads. Threads with long-latency stalls, however, can lead to a reduction in overall throughput because they occupy many of the critical processor resources. In this work, we first study the interaction between stalls caused by ambiguous memory dependences and SMT processing. We then propose the technique of proactive exclusion (PE) where the SMT fetch unit stops fetching from a thread when a memory dependence is predicted to exist. However, after the dependence has been resolved, the thread is delayed waiting for new instructions to be fetched and delivered down the front-end pipeline. So we introduce an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves. We show that combining these two techniques (PEEP) yields a 16.9% throughput improvement on a 4-way SMT processor that supports speculative memory disambiguation. These strong results indicate that a fetch policy that is cognizant of future stalls considerably improves the throughput of an SMT machine.
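
A hedged sketch of the fetch-gating decision implied by the abstract: a thread is excluded while a predicted memory dependence is unresolved (PE), and re-admitted one front-end-depth's worth of cycles before the predicted resolution so that refilled instructions arrive roughly as the dependence resolves (EP). The per-thread fields, cycle bookkeeping, and class names are assumptions, not the paper's implementation.

```cpp
#include <cstdint>

struct ThreadState {
    bool excluded = false;                       // set when a memory dependence is predicted
    std::uint64_t predicted_resolve_cycle = 0;   // when that dependence should resolve
};

class PeepFetchPolicy {
public:
    explicit PeepFetchPolicy(std::uint64_t frontend_depth_cycles)
        : frontend_depth_(frontend_depth_cycles) {}

    // Proactive exclusion: stop fetching from a thread whose next load is
    // predicted to depend on an unresolved older store.
    void on_predicted_dependence(ThreadState& t, std::uint64_t now,
                                 std::uint64_t predicted_delay_cycles) {
        t.excluded = true;
        t.predicted_resolve_cycle = now + predicted_delay_cycles;
    }

    void on_dependence_resolved(ThreadState& t) { t.excluded = false; }

    // Early parole: re-admit the thread early enough that newly fetched
    // instructions traverse the front end and reach the execution core just
    // as the dependence is predicted to resolve.
    bool may_fetch(const ThreadState& t, std::uint64_t now) const {
        if (!t.excluded) return true;
        return now + frontend_depth_ >= t.predicted_resolve_cycle;
    }

private:
    std::uint64_t frontend_depth_;  // cycles from fetch to execute (assumed known)
};
```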

16 citations


Cites background from "DMDC: Delayed Memory Dependence Che..."

  • ...Prior work on memory dependence prediction has demonstrated that the relationships between loads and stores are highly predictable, and the predictability of these patterns has been exploited in other memory scheduling research [3, 17, 18, 22]....


Journal Article
TL;DR: The most significant proposals in this research field are reviewed, focusing on the authors' own contributions to optimizing address-based memory disambiguation logic, namely the load-store queue.
Abstract: One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.

11 citations


Cites methods from "DMDC: Delayed Memory Dependence Che..."

  • ...This design [23][24] replaces the LQ with an address-based table into which only a very small fraction of all stores write....


Patent
31 Jan 2014
TL;DR: In this article, the load and store buffers are used to determine, for a load instruction, whether there are any unresolved store instructions in the store buffer that are older than the load instruction.
Abstract: A method and load and store buffer for issuing a load instruction to a data cache. The method includes determining whether there are any unresolved store instructions in the store buffer that are older than the load instruction. If there is at least one unresolved store instruction in the store buffer older than the load instruction, it is determined whether the oldest unresolved store instruction in the store buffer is within a speculation window for the load instruction. If the oldest unresolved store instruction is within the speculation window for the load instruction, the load instruction is speculatively issued to the data cache. Otherwise, the load instruction is stalled until any unresolved store instructions outside the speculation window are resolved. The speculation window is a short window that defines a number of instructions or store instructions that immediately precede the load instruction.
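
A minimal sketch of the decision rule described in the abstract: a load issues normally when no older store is unresolved, issues speculatively when every unresolved older store lies within a short window immediately preceding it, and stalls otherwise. The use of sequence numbers as ages and the structure names are assumptions for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct StoreEntry {
    std::uint64_t seq;        // program-order sequence number
    bool address_resolved;    // true once the store's address is known
};

enum class LoadAction { IssueNow, IssueSpeculatively, Stall };

LoadAction decide_load_issue(const std::vector<StoreEntry>& store_buffer,
                             std::uint64_t load_seq,
                             std::uint64_t speculation_window) {
    // Find the oldest unresolved store that is older than the load.
    std::optional<std::uint64_t> oldest_unresolved;
    for (const StoreEntry& s : store_buffer) {
        if (!s.address_resolved && s.seq < load_seq) {
            if (!oldest_unresolved || s.seq < *oldest_unresolved)
                oldest_unresolved = s.seq;
        }
    }
    if (!oldest_unresolved)
        return LoadAction::IssueNow;            // nothing older is unresolved
    if (load_seq - *oldest_unresolved <= speculation_window)
        return LoadAction::IssueSpeculatively;  // all unresolved older stores are "recent"
    return LoadAction::Stall;                   // wait for stores outside the window
}
```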

11 citations

References
Journal ArticleDOI
TL;DR: This document describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors.
Abstract: This document describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, better documentation, easier installation, improved portability, and higher performance. This paper contains a complete description of the tool set, including retrieval and installation instructions, a description of how to use the tools, a description of the target SimpleScalar architecture, and many details about the internals of the tools and how to customize them. With this guide, the tool set can be brought up and generating results in under an hour (on supported platforms).

3,079 citations

Proceedings ArticleDOI
01 May 2000
TL;DR: Wattch is presented, a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level and opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
Abstract: Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities. This paper presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level. Wattch is 1000X or more faster than existing layout-level power tools, and yet maintains accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. This paper presents several validations of Wattch's accuracy. In addition, we present three examples that demonstrate how architects or compiler writers might use Wattch to evaluate power consumption in their design process. We see Wattch as a complement to existing lower-level tools; it allows architects to explore and cull the design space early on, using faster, higher-level tools. It also opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.
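
As a toy illustration of the architecture-level approach described here, the snippet below multiplies per-structure, per-access energy estimates (which a tool like Wattch derives from capacitance models) by access counts collected during simulation to obtain average dynamic power. All names and constants are invented for the example; this is not Wattch's actual model or API.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct StructureModel {
    double energy_per_access_nj;   // assumed per-access dynamic energy (nJ)
    std::uint64_t accesses = 0;    // counted by the performance simulator
};

// Average dynamic power over a simulated interval: sum of (energy/access * accesses)
// for each modeled structure, divided by the simulated wall-clock time.
double total_dynamic_power_watts(const std::map<std::string, StructureModel>& units,
                                 std::uint64_t cycles, double clock_hz) {
    double total_nj = 0.0;
    for (const auto& [name, unit] : units) {
        (void)name;  // key kept only for readability of the map's contents
        total_nj += unit.energy_per_access_nj * static_cast<double>(unit.accesses);
    }
    double seconds = static_cast<double>(cycles) / clock_hz;
    return (total_nj * 1e-9) / seconds;
}

// Example use (all numbers invented):
//   std::map<std::string, StructureModel> units;
//   units["regfile"] = {0.05, 1'200'000};
//   units["dcache"]  = {0.40,   800'000};
//   double watts = total_dynamic_power_watts(units, 2'000'000, 2e9);
```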

2,848 citations

Proceedings ArticleDOI
01 Oct 2002
TL;DR: This work quantifies the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explores the large scale behavior of several programs, and develops a set of algorithms based on clustering capable of analyzing this behavior.
Abstract: Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution. Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.
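
The sketch below shows, under stated assumptions, the Basic Block Vector idea: count how many instructions each basic block contributes within fixed-length intervals, normalize each interval into a vector, and compare intervals with a simple distance metric. These per-interval vectors are what would then be clustered (e.g., with k-means) to choose representative simulation points. The interval length, sparse representation, and distance metric are illustrative choices, not the paper's exact parameters.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

class BbvCollector {
public:
    explicit BbvCollector(std::uint64_t interval_insns = 10'000'000)
        : interval_insns_(interval_insns) {}

    // Called by the simulator or profiler once per executed basic block.
    void on_basic_block(std::uint64_t block_id, std::uint64_t insns_in_block) {
        counts_[block_id] += insns_in_block;
        insns_in_interval_ += insns_in_block;
        if (insns_in_interval_ >= interval_insns_) finish_interval();
    }

    const std::vector<std::unordered_map<std::uint64_t, double>>& intervals() const {
        return intervals_;
    }

    // Manhattan distance between two normalized BBVs (sparse representation).
    static double distance(const std::unordered_map<std::uint64_t, double>& a,
                           const std::unordered_map<std::uint64_t, double>& b) {
        double d = 0.0;
        for (const auto& [id, w] : a) {
            auto it = b.find(id);
            d += std::abs(w - (it == b.end() ? 0.0 : it->second));
        }
        for (const auto& [id, w] : b)
            if (a.find(id) == a.end()) d += w;
        return d;
    }

private:
    void finish_interval() {
        std::unordered_map<std::uint64_t, double> bbv;
        for (const auto& [id, c] : counts_)
            bbv[id] = static_cast<double>(c) / static_cast<double>(insns_in_interval_);
        intervals_.push_back(std::move(bbv));
        counts_.clear();
        insns_in_interval_ = 0;
    }

    std::uint64_t interval_insns_;
    std::uint64_t insns_in_interval_ = 0;
    std::unordered_map<std::uint64_t, std::uint64_t> counts_;
    std::vector<std::unordered_map<std::uint64_t, double>> intervals_;
};
```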

1,702 citations

Journal ArticleDOI
J. M. Tendler, John Steven Dodson, J. S. Fields, Hung Qui Le, Balaram Sinharoy
TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Abstract: The IBM POWER4 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER4 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor.

685 citations

Proceedings ArticleDOI
16 Apr 1998
TL;DR: It is shown that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Abstract: For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store instructions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and communicate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's "store set". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the implementation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance.
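
A hedged sketch of the store-set mechanism summarized above: a Store Set ID Table (SSIT) indexed by instruction PC maps loads and stores to set IDs, and a Last Fetched Store Table (LFST) indexed by set ID tracks the most recent in-flight store of each set; a load whose set has an in-flight store waits for that store. The two-table organization follows the paper's description, but the table sizes, hashing, and the merging policy on a violation are simplified assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

class StoreSetPredictor {
public:
    StoreSetPredictor() : ssit_(kSsitSize), lfst_(kLfstSize) {}

    // On a memory-order violation, place the offending load and store in the same
    // store set (simplified merge: reuse the store's set if it already has one).
    void on_violation(std::uint64_t load_pc, std::uint64_t store_pc) {
        std::uint32_t id = ssit_[hash(store_pc)]
                               ? *ssit_[hash(store_pc)]
                               : static_cast<std::uint32_t>(next_id_++ % kLfstSize);
        ssit_[hash(load_pc)] = id;
        ssit_[hash(store_pc)] = id;
    }

    // When a store is fetched, remember it as the last fetched store of its set.
    void on_store_fetched(std::uint64_t store_pc, std::uint64_t store_seq) {
        if (auto id = ssit_[hash(store_pc)]) lfst_[*id] = store_seq;
    }

    // When the store issues or commits, clear the entry if it is still the last one.
    void on_store_done(std::uint64_t store_pc, std::uint64_t store_seq) {
        if (auto id = ssit_[hash(store_pc)])
            if (lfst_[*id] == store_seq) lfst_[*id].reset();
    }

    // A load must wait for the returned store (if any) before issuing.
    std::optional<std::uint64_t> store_to_wait_for(std::uint64_t load_pc) const {
        if (auto id = ssit_[hash(load_pc)]) return lfst_[*id];
        return std::nullopt;
    }

private:
    static constexpr std::size_t kSsitSize = 4096;  // assumed sizes
    static constexpr std::size_t kLfstSize = 128;
    static std::size_t hash(std::uint64_t pc) { return (pc >> 2) % kSsitSize; }

    std::vector<std::optional<std::uint32_t>> ssit_;  // PC -> store-set ID
    std::vector<std::optional<std::uint64_t>> lfst_;  // set ID -> last fetched store seq
    std::uint32_t next_id_ = 0;
};
```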

332 citations