Open Access

Federation: Out-of-Order Execution using Simple In-Order Cores

TL;DR
Federating each pair of neighboring, scalar cores provides a scalable, energy-efficient, and area-efficient solution for limited thread counts, with the ability to boost performance across a wide range of thread counts, until thread count returns to a level at which the baseline, multithreaded, "throughput mode" can resume.
Abstract
Manycore architectures with dozens, hundreds, or thousands of threads are likely to use single-issue, in-order execution cores with simple pipelines but multiple thread contexts per core. This approach is beneficial for throughput, but only with thread counts high enough to keep most thread contexts occupied. If these manycore architectures are not to be limited to niches with embarrassing levels of parallelism, they must cope with the case when thread count is limited: too many threads for dedicated, high-performance cores (which come at high area cost), but too few to exploit the huge number of thread contexts. The only solution is to augment the simple, scalar cores. This paper describes how to create an out-of-order processor on the fly by "federating" each pair of neighboring, scalar cores. This adds a few new structures between each pair but otherwise repurposes the existing cores. It can be accomplished with less than 2KB of extra hardware per pair, nearly doubling the performance of a single, scalar core and approaching that of a traditional, dedicated 2-way out-of-order core. The key insights that make this possible are the use of the large number of registers in multi-threaded scalar cores to support out-of-order execution and the removal of large, associative structures. Federation provides a scalable, energy-efficient, and area-efficient solution for limited thread counts, with the ability to boost performance across a wide range of thread counts, until thread count returns to a level at which the baseline, multithreaded, "throughput mode" can resume.


Citations
Proceedings ArticleDOI

Federation: repurposing scalar cores for out-of-order instruction issue

TL;DR: Presents a way to repurpose a pair of scalar cores into a 2-way out-of-order issue core with minimal area overhead; the federated core achieves performance comparable to a dedicated out-of-order core while dissipating less power.
Journal ArticleDOI

Scaling Power and Performance via Processor Composability

TL;DR: The study shows that composing multiple dual-issue cores (up to eight) provides performance scaling that is as energy-efficient as frequency scaling in a balanced microarchitecture, and considerably more efficient than scaling the voltage to achieve additional performance once the maximum frequency at the minimum voltage is attained.
Journal ArticleDOI

Multitasking workload scheduling on flexible core chip multiprocessors

TL;DR: This paper describes a new resource allocation and scheduling problem — determining how many logical processors should be configured, how powerful each processor should be, and where/when each task should run — and examines and evaluates several algorithms appropriate for such flexible-core CMPs.
Proceedings ArticleDOI

Strategies for mapping dataflow blocks to distributed hardware

TL;DR: By choosing an appropriate runtime block-mapping strategy, average performance can be increased by 18% while simultaneously reducing average operand communication by 70%, saving energy as well as improving performance.
Proceedings ArticleDOI

Multitasking workload scheduling on flexible-core chip multiprocessors

TL;DR: Flexible-core CMPs introduce a new resource allocation and scheduling problem which must determine how many logical processors should be configured, how powerful each processor should be, and where/when each task should run.
References
Proceedings ArticleDOI

Wattch: a framework for architectural-level power analysis and optimizations

TL;DR: Presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture level; it opens up the field of power-efficient computing to a wider range of researchers by providing a power-evaluation methodology within the portable and familiar SimpleScalar framework.

The Landscape of Parallel Computing Research: A View from Berkeley

TL;DR: Frames the parallel landscape with seven questions and recommends exploring the design space rapidly: the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems, and the target should be thousands of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions Per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Journal ArticleDOI

A Survey of General-Purpose Computation on Graphics Hardware

TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general-purpose computation to graphics hardware.
Proceedings ArticleDOI

Automatically characterizing large scale program behavior

TL;DR: This work quantifies the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explores the large-scale behavior of several programs, and develops a set of clustering-based algorithms capable of analyzing this behavior.
Journal ArticleDOI

Niagara: a 32-way multithreaded Sparc processor

TL;DR: The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications; it exploits the thread-level parallelism inherent to server workloads while targeting low power consumption.