Author

Michael Huang

Bio: Michael Huang is an academic researcher from Complutense University of Madrid. The author has contributed to research in topics: Scheduling (computing) & Queue. The author has an h-index of 1 and has co-authored 1 publication, which has received 12 citations.

Papers
Proceedings Article
04 Oct 2006
TL;DR: This paper explicitly records age information in address-indexed hash tables to detect premature loads, which eliminates associative searches and significantly reduces the energy consumption of the load queue.
Abstract: Buffering more in-flight instructions in an out-of-order microprocessor is a straightforward and effective method to help tolerate the long latencies generally associated with off-chip memory accesses. One of the main challenges of buffering a large number of instructions, however, is the implementation of a scalable and efficient mechanism to detect memory access order violations as a result of out-of-order scheduling of load and store instructions. Traditional CAM-based associative queues can be very slow and energy consuming. In this paper, instead of using the traditional age-based load queue to record load addresses, we explicitly record age information in address-indexed hash tables to achieve the same functionality of detecting premature loads. This alternative design eliminates associative searches and significantly reduces the energy consumption of the load queue. With simple techniques to reduce the number of false positives, performance degradation is kept at a minimum.

12 citations
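
The idea of recording ages in an address-indexed table, as the abstract above describes, can be illustrated with a minimal sketch. This is not the paper's design: the class name `AgeTable`, the table size, the hash function, and the convention that a larger age means a younger instruction are all assumptions made for illustration, and entry clearing at commit is left out.

```python
class AgeTable:
    """Hypothetical address-indexed table: each entry holds the age of the
    youngest load that has executed to an address hashing to that entry."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        # -1 means no load has been recorded in this entry yet.
        self.youngest_load = [-1] * num_entries

    def _index(self, addr):
        # Assumed hash: drop the two byte-offset bits, then take the modulo.
        return (addr >> 2) % self.num_entries

    def record_load(self, addr, age):
        # Called when a load executes; keep only the youngest (largest) age.
        i = self._index(addr)
        self.youngest_load[i] = max(self.youngest_load[i], age)

    def store_detects_violation(self, addr, store_age):
        # Called when a store resolves its address. A younger load already
        # recorded in the same entry may have read stale data. This is a
        # direct indexed lookup, not an associative search.
        return self.youngest_load[self._index(addr)] > store_age


table = AgeTable()
table.record_load(0x1000, age=7)  # load with age 7 executes early
print(table.store_detects_violation(0x1000, store_age=5))  # older store, same address
print(table.store_detects_violation(0x1000, store_age=9))  # younger store, same address
print(table.store_detects_violation(0x2000, store_age=5))  # different address, same entry
print(table.store_detects_violation(0x1004, store_age=5))  # different address, other entry
```

In this sketch the third lookup reports a violation even though the addresses differ, because both addresses map to the same entry. That is the kind of false positive the abstract says must be reduced to keep performance degradation low.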


Cited by
Journal Article
TL;DR: This work describes creating an out-of-order processor on-the-fly, by federating two neighboring in-order cores, allowing the architecture to efficiently support a wider range of parallelism.
Abstract: Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This maximizes throughput, but only with enough threads to keep hardware utilized. For applications or phases with more limited parallelism, we describe creating an out-of-order processor on-the-fly, by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.

15 citations

01 Jan 2007
TL;DR: Federating each pair of neighboring, scalar cores provides a scalable, energy-efficient, and area-efficient solution for limited thread counts, with the ability to boost performance across a wide range of thread counts, until thread count returns to a level at which the baseline, multithreaded, “throughput mode” can resume.
Abstract: Manycore architectures with dozens, hundreds, or thousands of threads are likely to use single-issue, in-order execution cores with simple pipelines but multiple thread contexts per core. This approach is beneficial for throughput but only with thread counts high enough to keep most thread contexts occupied. If these manycore architectures do not want to be limited to niches with embarrassing levels of parallelism, they must cope with the case when thread count is limited: too many threads for dedicated, high-performance cores (which come at high area cost), but too few to exploit the huge number of thread contexts. The only solution is to augment the simple, scalar cores. This paper describes how to create an out-of-order processor on the fly by “federating” each pair of neighboring, scalar cores. This adds a few new structures between each pair but otherwise repurposes the existing cores. It can be accomplished with less than 2KB of extra hardware per pair, nearly doubling the performance of a single, scalar core and approaching that of a traditional, dedicated 2-way out-of-order core. The key insights that make this possible are the use of the large number of registers in multi-threaded scalar cores to support out-of-order execution and the removal of large, associative structures. Federation provides a scalable, energy-efficient, and area-efficient solution for limited thread counts, with the ability to boost performance across a wide range of thread counts, until thread count returns to a level at which the baseline, multithreaded, “throughput mode” can resume.

15 citations

Proceedings Article
09 Dec 2006
TL;DR: This paper introduces two new management schemes: a filtering scheme based on simple age-tracking, which avoids 95-98% of associative load queue (LQ) searches using only a few registers, and Delayed Memory Dependence Checking (DMDC), which removes the need for an associative LQ and cuts the energy spent on the LQ by an average of 95%.
Abstract: One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations as a result of out-of-order execution of memory instructions. Traditional CAM-based associative queues can be very slow and energy hungry. In this paper we introduce two new management schemes. The first one is a filtering scheme based on simple age-tracking. This scheme can easily avoid 95-98% of associative load queue (LQ) searches using only a few registers. This translates into significant power savings. More importantly, however, this filtering makes our second scheme, Delayed Memory Dependence Checking (DMDC), practical. With a small hash table, DMDC completely avoids the need for an associative LQ and relies on indexing-based checking at the commit phase and hence cuts the energy spent on LQ by an average of 95%. At an average of about 0.3%, the performance impact is negligible. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-8%, depending on the configuration and the applications.

13 citations
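
The abstract above does not spell out how the age-tracking filter works, so the following is only one plausible reading, sketched under stated assumptions: a single register remembers the age of the youngest load issued so far, and a store younger than that load skips the associative LQ search because no younger load can have run ahead of it. The class name `AgeFilter`, its methods, and the convention that a larger age means a younger instruction are hypothetical.

```python
class AgeFilter:
    """Hypothetical filter: one register tracks the youngest issued load."""

    def __init__(self):
        self.youngest_issued_load = -1  # -1 means no load has issued yet
        self.searches = 0               # stores that still need an LQ search
        self.skipped = 0                # stores whose LQ search was filtered out

    def load_issued(self, age):
        # Remember only the youngest (largest) age among issued loads.
        self.youngest_issued_load = max(self.youngest_issued_load, age)

    def store_resolved(self, store_age):
        """Return True if the associative LQ search is still needed."""
        if store_age > self.youngest_issued_load:
            # Every issued load is older than this store, so none of them
            # can be a premature load with respect to it.
            self.skipped += 1
            return False
        self.searches += 1
        return True


f = AgeFilter()
f.load_issued(3)
print(f.store_resolved(5))  # store younger than every issued load
print(f.store_resolved(2))  # store older than the issued load with age 3
print(f.skipped, f.searches)
```

How often such a filter fires depends on how far loads run ahead of older stores in a given workload; the 95-98% figure is the paper's measurement, not something this sketch reproduces.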

Journal Article
TL;DR: This paper reviews the most significant proposals for optimizing address-based memory disambiguation logic, namely the load-store queue, with a focus on the authors' own contributions.
Abstract: One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.

11 citations