Journal Article

Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip

TLDR
This work proposes a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an “unlimited” number of dynamically created lightweight threads with very low thread management and synchronization overhead.
Abstract
Irregular and dynamic applications, such as graph problems and agent-based simulations, often require fine-grained parallelism to achieve good performance. However, current multicore processors provide architectural support only for coarse-grained parallelism, so fine-grained parallelism must be implemented through software-based multithreading environments. Although these environments have demonstrated superior performance over heavyweight, OS-level threads, they are still limited by the significant overhead of thread management and synchronization. To address this, we propose a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an “unlimited” number of dynamically created lightweight threads with very low thread management and synchronization overhead. The LCMT architecture can be implemented atop a mainstream architecture with minimal extra hardware, allowing it to leverage existing legacy software environments. We compare the LCMT architecture with a Niagara-like baseline architecture. For irregular and dynamic benchmarks, our results show up to 1.8X better scalability, 1.91X better performance, and, more importantly, 1.74X better performance per watt than the baseline. For regular benchmarks, the LCMT architecture delivers performance similar to the baseline.


Citations
Proceedings Article

System implications of memory reliability in exascale computing

TL;DR: This paper studies the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems and proposes to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical.
Patent

Methods and apparatus to perform error detection and correction

TL;DR: A disclosed example method enables a memory controller to operate in either a tagged memory mode or a non-tagged memory mode to perform error detection and correction on data.

Communication centric, multi-core, fine-grained processor architecture

TL;DR: This dissertation presents an architecture designed to enable scalable fine-grained computation that is communication aware (allowing a programmer to optimise for communication) and discusses the need for multi-core architecture and the major issues faced in their construction.
Proceedings Article

Recomposing an Irregular Algorithm Using a Novel Low-Level PGAS Model

TL;DR: This paper shows how a novel, low-level partitioned global address space (PGAS) programming model facilitated the transformation of the well-studied minimum spanning forest algorithm into a new MSF algorithm which allowed for million way well-behaved parallelism on a novel multithreaded architecture.
Journal Article

VMT: Virtualized Multi-Threading for Accelerating Graph Workloads on Commodity Processors

TL;DR: Virtualized Multi-Threading (VMT) is a low-overhead multi-threading paradigm for accelerating graph workloads on commodity processors. It maps a large number of logical software threads to a small number of physical hardware threads while maintaining the architecture state of the logical threads in the processor's cache hierarchy.
References
Journal Article

A hierarchical O(N log N) force-calculation algorithm

TL;DR: A novel method of directly calculating the force on N bodies that grows only as N log N is described, using a tree-structured hierarchical subdivision of space into cubic cells, each of which is recursively divided into eight subcells whenever more than one particle is found to occupy the same cell.
Journal Article

OpenMP: an industry standard API for shared-memory programming

L. Dagum, +1 more
TL;DR: At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism; it leaves the base language unspecified.
Proceedings Article

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that, when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
Proceedings Article

The implementation of the Cilk-5 multithreaded language

TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.
Book

Intel Threading Building Blocks

TL;DR: The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications, and the information here is subject to change without notice.