Journal Article

Memory Disambiguation Hardware: a Review

01 Oct 2008 - Journal of Computer Science and Technology (Facultad de Informática) - Vol. 8, Iss. 3, pp. 132-138
TL;DR: The most significant proposals in this research field are reviewed, focusing on the authors' own contributions to optimizing address-based memory disambiguation logic, namely the load-store queue.
Abstract: One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.


Citations
Patent
10 Jun 2013
TL;DR: In this article, a disambiguation-free out-of-order load store queue method is proposed, which includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, implementing a store retirement buffer, and implementing speculative execution, wherein results of speculative execution can be saved in the store reorder buffer as a speculative state.
Abstract: In a processor, a disambiguation-free out of order load store queue method. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores; implementing a store retirement buffer, wherein stores from a store queue have entries in the store retirement buffer in original program order; and implementing speculative execution, wherein results of speculative execution can be saved in the store retirement/reorder buffer as a speculative state. The method further includes, upon dispatch of a subsequent load from a load queue, searching the store retirement buffer for address matching; and, in cases where there are a plurality of address matches, locating a correct forwarding entry by scanning the store retirement buffer for a first match, and forwarding data from the first match to the subsequent load. Once speculative outcomes are known, the speculative state is retired to memory.
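The forwarding search described above can be sketched in a few lines. This is an illustrative model, not the patented mechanism: the names (`StoreEntry`, `forward_to_load`) are assumptions, and the "first match" is interpreted as the youngest store older than the load, which is the standard store-to-load forwarding rule.

```python
# Hypothetical sketch: stores sit in a retirement buffer in original program
# order; a dispatching load scans for the youngest older store to the same
# address and forwards its data. Names are illustrative, not from the patent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    seq: int    # program-order sequence number
    addr: int   # store address
    data: int   # store data (speculative until retired)

def forward_to_load(retire_buf: list, load_addr: int,
                    load_seq: int) -> Optional[int]:
    """Scan the buffer (kept in program order) youngest-first and forward
    data from the first matching store that is older than the load."""
    for st in reversed(retire_buf):
        if st.seq < load_seq and st.addr == load_addr:
            return st.data          # first (youngest older) match wins
    return None                     # no match: the load reads memory

buf = [StoreEntry(1, 0x100, 11), StoreEntry(3, 0x100, 22), StoreEntry(5, 0x200, 33)]
print(forward_to_load(buf, 0x100, 4))  # 22: store seq 3 is the youngest older match
```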

16 citations

Patent
30 Mar 2012
TL;DR: In this paper, a method of memory disambiguation hardware to support software binary translation is provided, which includes unrolling a set of instructions to be executed within a processor, the set of instructions having a number of memory operations.
Abstract: A method of memory disambiguation hardware to support software binary translation is provided. This method includes unrolling a set of instructions to be executed within a processor, the set of instructions having a number of memory operations. An original relative order of the memory operations is determined. Possible reordering problems are then detected and identified in software. A reordering problem arises when a first memory operation has been reordered prior to, and aliases with, a second memory operation with respect to the original order of memory operations. The reordering problem is addressed and the relative order of memory operations is communicated to the processor.
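The alias check the abstract describes, detecting an operation hoisted above an earlier operation it aliases with, can be sketched as a comparison between the original and reordered schedules. All names here are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch (not the patented mechanism): report every pair where an
# operation executes earlier than in program order and aliases with the
# operation it bypassed.
def find_violations(original, reordered):
    """original/reordered: lists of (op_id, addr). Returns pairs
    (moved, bypassed) where `moved` runs before `bypassed` in the new
    schedule but after it in program order, and both touch the same address."""
    orig_pos = {op: i for i, (op, _) in enumerate(original)}
    violations = []
    for i, (op_a, addr_a) in enumerate(reordered):
        for op_b, addr_b in reordered[i + 1:]:
            if orig_pos[op_a] > orig_pos[op_b] and addr_a == addr_b:
                violations.append((op_a, op_b))
    return violations

orig = [("st1", 0x40), ("ld1", 0x40), ("ld2", 0x80)]
new  = [("ld1", 0x40), ("st1", 0x40), ("ld2", 0x80)]
print(find_violations(orig, new))  # [('ld1', 'st1')]: load hoisted past an aliasing store
```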

15 citations

Patent
10 Jun 2013
TL;DR: In this article, a thread agnostic unified store queue and a unified load queue method for out-order loads in a memory consistency model using shared memory resources is proposed. But it does not support out-of-order caches.
Abstract: In a processor, a thread agnostic unified store queue and a unified load queue method for out of order loads in a memory consistency model using shared memory resources. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, wherein the plurality of cores share a unified store queue and a unified load queue; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load, wherein the cache line includes the memory resource, wherein the load sets a mask bit within the access mask when accessing a word of the cache line, and wherein the mask bit blocks accesses from other loads from a plurality of cores. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line, wherein stores from different threads can forward to loads of different threads while still maintaining in order memory consistency semantics; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register and a thread ID register.
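The access-mask mechanism above can be sketched as a per-word bitmap on the cache line: a load marks the word it reads, and a later store that hits a marked word signals the recorded load-queue entry for recovery. This is a minimal model under assumptions; the class and method names are illustrative, not from the patent.

```python
# Minimal sketch of the access-mask idea: one bit per word of a cache line,
# set by loads, checked by subsequent stores. Names are illustrative.
WORDS_PER_LINE = 16

class AccessMask:
    def __init__(self):
        self.mask = 0      # one bit per word of the cache line
        self.owner = {}    # word -> (load_queue_entry, thread_id) tracker

    def load_word(self, word, lq_entry, thread_id):
        """A load marks the word it accesses and records its tracker state."""
        self.mask |= 1 << word
        self.owner[word] = (lq_entry, thread_id)

    def store_word(self, word):
        """A store checks the mask; a prior load mark means that load ran
        too early, so return its (entry, thread) for flush/replay."""
        if self.mask & (1 << word):
            return self.owner[word]
        return None        # word never marked: store proceeds normally

m = AccessMask()
m.load_word(3, lq_entry=7, thread_id=1)
print(m.store_word(3))   # (7, 1): store hits a marked word -> replay load 7
print(m.store_word(4))   # None: no prior load mark on this word
```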

14 citations

Patent
14 Jun 2013
TL;DR: In this article, a disambiguation-free out-of-order load store queue method is proposed, which includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, implementing a store retirement buffer, and forwarding data from the first match to the subsequent load.
Abstract: In a processor, a disambiguation-free out of order load store queue method. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores; implementing a store retirement buffer, wherein stores from a store queue have entries in the store retirement buffer in original program order; and upon dispatch of a subsequent load from a load queue, searching the store retirement buffer for address matching. The method further includes in cases where there are a plurality of address matches, locating a correct forwarding entry by scanning the store retirement buffer for a first match; and forwarding data from the first match to the subsequent load.

9 citations

Patent
12 Jun 2013
TL;DR: In this article, a lock-based method for out-of-order loads in a memory consistency model using shared memory resources is proposed, which includes implementing a memory resource that can be accessed by a plurality of cores; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load.
Abstract: In a processor, a lock-based method for out of order loads in a memory consistency model using shared memory resources. The method includes implementing a memory resource that can be accessed by a plurality of cores; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load, wherein the cache line includes the memory resource, wherein the load sets a mask bit within the access mask when accessing a word of the cache line, and wherein the mask bit blocks accesses from other loads from a plurality of cores. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register and a thread ID register.

8 citations

References
Journal ArticleDOI
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
Abstract: In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods.Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
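The hash-coding scheme analyzed above is what is now known as a Bloom filter: membership tests that may report false positives but never false negatives, trading a small error rate for a much smaller hash area. A minimal sketch (parameter choices and hashing scheme are illustrative):

```python
# Bloom filter sketch: k hash functions set k bits per member; a query is a
# possible member only if all k bits are set. Members are never rejected;
# non-members are rejected except for a small false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0

    def _indices(self, msg):
        # derive k independent indices by salting a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{msg}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, msg):
        for idx in self._indices(msg):
            self.bits |= 1 << idx

    def might_contain(self, msg):
        return all(self.bits & (1 << idx) for idx in self._indices(msg))

bf = BloomFilter()
bf.add("stored-message")
print(bf.might_contain("stored-message"))   # True (members are never rejected)
print(bf.might_contain("other-message"))    # False, with high probability
```

The reject-time/space trade-off in the paper corresponds to choosing `m_bits` and `k_hashes`: shrinking the hash area raises the tolerated false-positive ("error of commission") rate.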

7,390 citations

Journal ArticleDOI
TL;DR: A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264.
Abstract: Alpha microprocessors have been performance leaders since their introduction in 1992. The first generation 21064 and the later 21164 raised expectations for the newest generation-performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard. A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base. All Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.

828 citations


"Memory Disambiguation Hardware: a R..." refers to this work as background

  • ...To simplify implementation, processors typically replay many more instructions (such as all instructions following the store [1]), as these premature loads are rare in general and sometimes extra logic is employed to further reduce their occurrence [2]....

    [...]

Journal ArticleDOI
J. M. Tendler, John Steven Dodson, J. S. Fields, Hung Qui Le, Balaram Sinharoy
TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Abstract: The IBM POWER4 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER4 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor.

685 citations


"Memory Disambiguation Hardware: a R..." refers to this work as background

  • ...To simplify implementation, processors typically replay many more instructions (such as all instructions following the store [1]), as these premature loads are rare in general and sometimes extra logic is employed to further reduce their occurrence [2]....

    [...]

Proceedings ArticleDOI
16 Apr 1998
TL;DR: It is shown that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Abstract: For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store instructions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and communicate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's "store set". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the implementation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance.
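The store-set idea above can be sketched with a table mapping instruction PCs to store-set ids: a memory-order violation merges the offending load and store into one set, and thereafter the load waits whenever an in-flight store shares its set. This is a simplified single-id-per-PC model; the class and field names are assumptions, not the paper's SSIT/LFST tables.

```python
# Hedged sketch of a store-set predictor: trained on violations, consulted at
# load dispatch. Simplified to one store-set id per PC (names illustrative).
class StoreSetPredictor:
    def __init__(self):
        self.ssid_of = {}      # instruction PC -> store-set id
        self.next_ssid = 0

    def on_violation(self, store_pc, load_pc):
        """A load executed before an aliasing store: put both in one set."""
        ssid = self.ssid_of.get(store_pc)
        if ssid is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        self.ssid_of[store_pc] = ssid
        self.ssid_of[load_pc] = ssid

    def must_wait(self, load_pc, inflight_store_pcs):
        """Delay the load if any in-flight store shares its store set."""
        ssid = self.ssid_of.get(load_pc)
        return ssid is not None and any(
            self.ssid_of.get(pc) == ssid for pc in inflight_store_pcs)

p = StoreSetPredictor()
p.on_violation(store_pc=0x400, load_pc=0x420)   # training: a past violation
print(p.must_wait(0x420, [0x400]))  # True: predicted dependence, delay load
print(p.must_wait(0x420, [0x500]))  # False: no store from its set in flight
```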

332 citations

Proceedings ArticleDOI
03 Dec 2003
TL;DR: A novel checkpoint processing and recovery (CPR) microarchitecture is proposed, and it is shown how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency.
Abstract: Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing the large hardware structures typically required to buffer and process such instruction window sizes significantly degrades the cycle time. This paper proposes a checkpoint processing and recovery (CPR) microarchitecture, and shows how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency. We focus on four critical aspects of a microarchitecture: 1) scheduling instructions; 2) recovering from branch mispredicts; 3) buffering a large number of stores and forwarding data from stores to any dependent load; and 4) reclaiming physical registers. While scheduling window size is important, we show the performance of large instruction windows to be more sensitive to the other three design issues. Our CPR proposal incorporates novel microarchitectural schemes for addressing these design issues: a selective checkpoint mechanism for recovering from mispredicts, a hierarchical store queue organization for fast store-load forwarding, and an effective algorithm for aggressive physical register reclamation. Our proposals allow a processor to realize performance gains due to instruction windows of thousands of instructions without requiring large cycle-critical hardware structures.
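The selective-checkpoint mechanism can be illustrated with a rename-map model: instead of per-instruction recovery state, the map is snapshotted only at low-confidence branches, and a mispredict rolls back to the nearest prior checkpoint. This is a sketch under assumptions; the names and the confidence heuristic are illustrative, not the paper's design.

```python
# Illustrative sketch of selective checkpointing (names are assumptions):
# snapshot the rename map only where recovery is likely to be needed.
class CheckpointedRenamer:
    def __init__(self):
        self.rename_map = {}     # arch reg -> phys reg
        self.checkpoints = []    # list of (seq, saved map), oldest first

    def rename(self, seq, arch_reg, phys_reg, low_confidence_branch=False):
        if low_confidence_branch:
            # selective checkpoint: snapshot before this instruction renames
            self.checkpoints.append((seq, dict(self.rename_map)))
        self.rename_map[arch_reg] = phys_reg

    def recover(self, mispredict_seq):
        """Roll back to the youngest checkpoint at or before the mispredict."""
        while self.checkpoints and self.checkpoints[-1][0] > mispredict_seq:
            self.checkpoints.pop()
        if self.checkpoints:
            self.rename_map = dict(self.checkpoints[-1][1])

r = CheckpointedRenamer()
r.rename(1, "r1", "p10")
r.rename(2, "r2", "p11", low_confidence_branch=True)  # checkpoint before seq 2
r.rename(3, "r1", "p12")
r.recover(2)
print(r.rename_map)   # {'r1': 'p10'}: the seq-3 rename is undone
```

The design point this illustrates is the paper's observation that recovery state is only needed at a few places, so checkpoint storage can be far smaller than a per-instruction reorder buffer.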

317 citations