Journal Article

Memory Disambiguation Hardware: a Review

01 Oct 2008 - Journal of Computer Science and Technology (Facultad de Informática) - Vol. 8, Iss. 3, pp. 132-138
TL;DR: The most significant proposals in this research field are reviewed, focusing on the authors' own contributions to optimizing address-based memory disambiguation logic, namely the load-store queue.
Abstract: One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.


Citations
Patent
10 Jun 2013
TL;DR: In this article, a disambiguation-free out-of-order load store queue method is proposed, which includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, implementing a store retirement buffer, and implementing speculative execution, wherein results of speculative execution can be saved in the store reorder buffer as a speculative state.
Abstract: In a processor, a disambiguation-free out of order load store queue method. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores; implementing a store retirement buffer, wherein stores from a store queue have entries in the store retirement buffer in original program order; and implementing speculative execution, wherein results of speculative execution can be saved in the store retirement/reorder buffer as a speculative state. The method further includes, upon dispatch of a subsequent load from a load queue, searching the store retirement buffer for address matching; and, in cases where there are a plurality of address matches, locating a correct forwarding entry by scanning the store retirement buffer for a first match, and forwarding data from the first match to the subsequent load. Once speculative outcomes are known, the speculative state is retired to memory.
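The forwarding search described above can be sketched in a few lines. This is an illustrative model, not the patented mechanism: the names (`StoreEntry`, `forward_to_load`) are assumptions, and the "first match" is interpreted as the youngest store older than the load, which is the standard store-to-load forwarding rule.

```python
# Hypothetical sketch: stores sit in a retirement buffer in original program
# order; a dispatching load scans for the youngest older store to the same
# address and forwards its data. Names are illustrative, not from the patent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    seq: int    # program-order sequence number
    addr: int   # store address
    data: int   # store data (speculative until retired)

def forward_to_load(retire_buf: list, load_addr: int,
                    load_seq: int) -> Optional[int]:
    """Scan the buffer (kept in program order) youngest-first and forward
    data from the first matching store that is older than the load."""
    for st in reversed(retire_buf):
        if st.seq < load_seq and st.addr == load_addr:
            return st.data          # first (youngest older) match wins
    return None                     # no match: the load reads memory

buf = [StoreEntry(1, 0x100, 11), StoreEntry(3, 0x100, 22), StoreEntry(5, 0x200, 33)]
print(forward_to_load(buf, 0x100, 4))  # 22: store seq 3 is the youngest older match
```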

16 citations

Patent
30 Mar 2012
TL;DR: In this paper, a method of memory disambiguation hardware to support software binary translation is provided, which includes unrolling a set of instructions to be executed within a processor, the set of instructions having a number of memory operations.
Abstract: A method of memory disambiguation hardware to support software binary translation is provided. This method includes unrolling a set of instructions to be executed within a processor, the set of instructions having a number of memory operations. An original relative order of the memory operations is determined. Possible reordering problems are then detected and identified in software. A reordering problem arises when a first memory operation has been reordered prior to, and aliases with, a second memory operation with respect to the original order of memory operations. The reordering problem is addressed and the relative order of memory operations is communicated to the processor.
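The alias check the abstract describes, detecting an operation hoisted above an earlier operation it aliases with, can be sketched as a comparison between the original and reordered schedules. All names here are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch (not the patented mechanism): report every pair where an
# operation executes earlier than in program order and aliases with the
# operation it bypassed.
def find_violations(original, reordered):
    """original/reordered: lists of (op_id, addr). Returns pairs
    (moved, bypassed) where `moved` runs before `bypassed` in the new
    schedule but after it in program order, and both touch the same address."""
    orig_pos = {op: i for i, (op, _) in enumerate(original)}
    violations = []
    for i, (op_a, addr_a) in enumerate(reordered):
        for op_b, addr_b in reordered[i + 1:]:
            if orig_pos[op_a] > orig_pos[op_b] and addr_a == addr_b:
                violations.append((op_a, op_b))
    return violations

orig = [("st1", 0x40), ("ld1", 0x40), ("ld2", 0x80)]
new  = [("ld1", 0x40), ("st1", 0x40), ("ld2", 0x80)]
print(find_violations(orig, new))  # [('ld1', 'st1')]: load hoisted past an aliasing store
```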

15 citations

Patent
10 Jun 2013
TL;DR: In this article, a thread agnostic unified store queue and a unified load queue method for out-order loads in a memory consistency model using shared memory resources is proposed. But it does not support out-of-order caches.
Abstract: In a processor, a thread agnostic unified store queue and a unified load queue method for out of order loads in a memory consistency model using shared memory resources. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, wherein the plurality of cores share a unified store queue and a unified load queue; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load, wherein the cache line includes the memory resource, wherein the load sets a mask bit within the access mask when accessing a word of the cache line, and wherein the mask bit blocks accesses from other loads from a plurality of cores. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line, wherein stores from different threads can forward to loads of different threads while still maintaining in order memory consistency semantics; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register and a thread ID register.
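The access-mask mechanism above can be sketched as a per-word bitmap on the cache line: a load marks the word it reads, and a later store that hits a marked word signals the recorded load-queue entry for recovery. This is a minimal model under assumptions; the class and method names are illustrative, not from the patent.

```python
# Minimal sketch of the access-mask idea: one bit per word of a cache line,
# set by loads, checked by subsequent stores. Names are illustrative.
WORDS_PER_LINE = 16

class AccessMask:
    def __init__(self):
        self.mask = 0      # one bit per word of the cache line
        self.owner = {}    # word -> (load_queue_entry, thread_id) tracker

    def load_word(self, word, lq_entry, thread_id):
        """A load marks the word it accesses and records its tracker state."""
        self.mask |= 1 << word
        self.owner[word] = (lq_entry, thread_id)

    def store_word(self, word):
        """A store checks the mask; a prior load mark means that load ran
        too early, so return its (entry, thread) for flush/replay."""
        if self.mask & (1 << word):
            return self.owner[word]
        return None        # word never marked: store proceeds normally

m = AccessMask()
m.load_word(3, lq_entry=7, thread_id=1)
print(m.store_word(3))   # (7, 1): store hits a marked word -> replay load 7
print(m.store_word(4))   # None: no prior load mark on this word
```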

14 citations

Patent
14 Jun 2013
TL;DR: In this article, a disambiguation-free out-of-order load store queue method is proposed, which includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, implementing a store retirement buffer, and forwarding data from the first match to the subsequent load.
Abstract: In a processor, a disambiguation-free out of order load store queue method. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores; implementing a store retirement buffer, wherein stores from a store queue have entries in the store retirement buffer in original program order; and upon dispatch of a subsequent load from a load queue, searching the store retirement buffer for address matching. The method further includes in cases where there are a plurality of address matches, locating a correct forwarding entry by scanning the store retirement buffer for a first match; and forwarding data from the first match to the subsequent load.

9 citations

Patent
12 Jun 2013
TL;DR: In this article, a lock-based method for out-of-order loads in a memory consistency model using shared memory resources is proposed, which includes implementing a memory resource that can be accessed by a plurality of cores; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load.
Abstract: In a processor, a lock-based method for out of order loads in a memory consistency model using shared memory resources. The method includes implementing a memory resource that can be accessed by a plurality of cores; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load, wherein the cache line includes the memory resource, wherein the load sets a mask bit within the access mask when accessing a word of the cache line, and wherein the mask bit blocks accesses from other loads from a plurality of cores. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register and a thread ID register.

8 citations

References
Journal ArticleDOI
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
Abstract: In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods.Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
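The hash-coding scheme analyzed above is what is now known as a Bloom filter: membership tests that may report false positives but never false negatives, trading a small error rate for a much smaller hash area. A minimal sketch (parameter choices and hashing scheme are illustrative):

```python
# Bloom filter sketch: k hash functions set k bits per member; a query is a
# possible member only if all k bits are set. Members are never rejected;
# non-members are rejected except for a small false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0

    def _indices(self, msg):
        # derive k independent indices by salting a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{msg}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, msg):
        for idx in self._indices(msg):
            self.bits |= 1 << idx

    def might_contain(self, msg):
        return all(self.bits & (1 << idx) for idx in self._indices(msg))

bf = BloomFilter()
bf.add("stored-message")
print(bf.might_contain("stored-message"))   # True (members are never rejected)
print(bf.might_contain("other-message"))    # False, with high probability
```

The reject-time/space trade-off in the paper corresponds to choosing `m_bits` and `k_hashes`: shrinking the hash area raises the tolerated false-positive ("error of commission") rate.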

7,390 citations

Journal ArticleDOI
TL;DR: A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264.
Abstract: Alpha microprocessors have been performance leaders since their introduction in 1992. The first generation 21064 and the later 21164 raised expectations for the newest generation-performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard. A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base. All Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.

828 citations


"Memory Disambiguation Hardware: a R..." refers to this work as background

  • ...To simplify implementation, processors typically replay many more instructions (such as all instructions following the store [1]), as these premature loads are rare in general and sometimes extra logic is employed to further reduce their occurrence [2]....

    [...]

Journal ArticleDOI
J. M. Tendler, John Steven Dodson, J. S. Fields, Hung Qui Le, Balaram Sinharoy
TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Abstract: The IBM POWER4 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER4 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor.

685 citations


"Memory Disambiguation Hardware: a R..." refers to this work as background

  • ...To simplify implementation, processors typically replay many more instructions (such as all instructions following the store [1]), as these premature loads are rare in general and sometimes extra logic is employed to further reduce their occurrence [2]....

    [...]

Proceedings ArticleDOI
16 Apr 1998
TL;DR: It is shown that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Abstract: For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store instructions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and communicate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's "store set". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the implementation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance.
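The store-set idea above can be sketched with a table mapping instruction PCs to store-set ids: a memory-order violation merges the offending load and store into one set, and thereafter the load waits whenever an in-flight store shares its set. This is a simplified single-id-per-PC model; the class and field names are assumptions, not the paper's SSIT/LFST tables.

```python
# Hedged sketch of a store-set predictor: trained on violations, consulted at
# load dispatch. Simplified to one store-set id per PC (names illustrative).
class StoreSetPredictor:
    def __init__(self):
        self.ssid_of = {}      # instruction PC -> store-set id
        self.next_ssid = 0

    def on_violation(self, store_pc, load_pc):
        """A load executed before an aliasing store: put both in one set."""
        ssid = self.ssid_of.get(store_pc)
        if ssid is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        self.ssid_of[store_pc] = ssid
        self.ssid_of[load_pc] = ssid

    def must_wait(self, load_pc, inflight_store_pcs):
        """Delay the load if any in-flight store shares its store set."""
        ssid = self.ssid_of.get(load_pc)
        return ssid is not None and any(
            self.ssid_of.get(pc) == ssid for pc in inflight_store_pcs)

p = StoreSetPredictor()
p.on_violation(store_pc=0x400, load_pc=0x420)   # training: a past violation
print(p.must_wait(0x420, [0x400]))  # True: predicted dependence, delay load
print(p.must_wait(0x420, [0x500]))  # False: no store from its set in flight
```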

332 citations

Proceedings ArticleDOI
03 Dec 2003
TL;DR: A novel checkpoint processing and recovery (CPR) microarchitecture is proposed, and it is shown how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency.
Abstract: Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing the large hardware structures typically required to buffer and process such instruction window sizes significantly degrades the cycle time. This paper proposes a checkpoint processing and recovery (CPR) microarchitecture, and shows how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency. We focus on four critical aspects of a microarchitecture: 1) scheduling instructions; 2) recovering from branch mispredicts; 3) buffering a large number of stores and forwarding data from stores to any dependent load; and 4) reclaiming physical registers. While scheduling window size is important, we show the performance of large instruction windows to be more sensitive to the other three design issues. Our CPR proposal incorporates novel microarchitectural schemes for addressing these design issues: a selective checkpoint mechanism for recovering from mispredicts, a hierarchical store queue organization for fast store-load forwarding, and an effective algorithm for aggressive physical register reclamation. Our proposals allow a processor to realize performance gains due to instruction windows of thousands of instructions without requiring large cycle-critical hardware structures.
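The selective-checkpoint mechanism can be illustrated with a rename-map model: instead of per-instruction recovery state, the map is snapshotted only at low-confidence branches, and a mispredict rolls back to the nearest prior checkpoint. This is a sketch under assumptions; the names and the confidence heuristic are illustrative, not the paper's design.

```python
# Illustrative sketch of selective checkpointing (names are assumptions):
# snapshot the rename map only where recovery is likely to be needed.
class CheckpointedRenamer:
    def __init__(self):
        self.rename_map = {}     # arch reg -> phys reg
        self.checkpoints = []    # list of (seq, saved map), oldest first

    def rename(self, seq, arch_reg, phys_reg, low_confidence_branch=False):
        if low_confidence_branch:
            # selective checkpoint: snapshot before this instruction renames
            self.checkpoints.append((seq, dict(self.rename_map)))
        self.rename_map[arch_reg] = phys_reg

    def recover(self, mispredict_seq):
        """Roll back to the youngest checkpoint at or before the mispredict."""
        while self.checkpoints and self.checkpoints[-1][0] > mispredict_seq:
            self.checkpoints.pop()
        if self.checkpoints:
            self.rename_map = dict(self.checkpoints[-1][1])

r = CheckpointedRenamer()
r.rename(1, "r1", "p10")
r.rename(2, "r2", "p11", low_confidence_branch=True)  # checkpoint before seq 2
r.rename(3, "r1", "p12")
r.recover(2)
print(r.rename_map)   # {'r1': 'p10'}: the seq-3 rename is undone
```

The design point this illustrates is the paper's observation that recovery state is only needed at a few places, so checkpoint storage can be far smaller than a per-instruction reorder buffer.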

317 citations