Proceedings Article

Compiler-assisted data distribution for chip multiprocessors

Abstract
Data access latency, a limiting factor in the performance of chip multiprocessors, grows significantly with the number of cores in non-uniform cache architectures with distributed cache banks. Mitigating this effect requires exploiting data access locality and choosing an optimal data placement, which is especially challenging when other constraints such as cache capacity, coherence messages, and runtime overhead must also be considered. This paper presents a compiler-based approach for analyzing data access behavior in multi-threaded applications. The proposed experimental compiler framework employs novel compilation techniques to discover and represent multi-threaded memory access patterns (MMAPs). At run time, the symbolic MMAPs are resolved and passed to a partitioning algorithm, which partitions the allocated memory blocks among the threads forked by the analyzed application. This partition is used to enforce data ownership by associating each block with the core that executes the thread owning it. We demonstrate how this information can be used in an experimental architecture to accelerate applications. In particular, our compiler-assisted approach shows a 20% speedup over shared caching and a 5% speedup over the closest runtime approximation, "first touch".
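
The abstract does not spell out the partitioning algorithm, so the following is only a minimal Python sketch of the general idea, under stated assumptions: each memory block is assigned to the thread that accesses it most often, and the result is compared with a "first touch" policy that assigns a block to whichever thread accesses it first. The function names, the trace format, and the majority-access heuristic are all hypothetical. In the paper the access information comes from resolved MMAPs rather than from a recorded trace, and the constraints the abstract mentions (cache capacity, coherence messages, runtime overhead) are ignored here.

```python
from collections import Counter

# Hypothetical access trace as (thread_id, block_id) pairs.
# Thread 0 initializes both blocks, then thread 1 does most of the work on B.
TRACE = [(0, "A"), (0, "B"), (1, "B"), (1, "B"), (1, "B"), (0, "A"), (0, "A")]


def counts_from_trace(trace):
    """Count accesses per block and per thread (stand-in for resolved access patterns)."""
    counts = {}
    for thread, block in trace:
        counts.setdefault(block, Counter())[thread] += 1
    return counts


def partition_by_access_count(access_counts):
    """Assign each block to the thread that accesses it most.

    Ties go to the lowest thread id so the result is deterministic.
    This heuristic is an assumption, not the paper's algorithm.
    """
    owners = {}
    for block, per_thread in access_counts.items():
        owners[block] = min(per_thread, key=lambda t: (-per_thread[t], t))
    return owners


def first_touch(trace):
    """Assign each block to the first thread that touches it."""
    owners = {}
    for thread, block in trace:
        owners.setdefault(block, thread)
    return owners


def remote_accesses(trace, owners):
    """Count accesses made by a thread that does not own the block."""
    return sum(1 for thread, block in trace if owners[block] != thread)


if __name__ == "__main__":
    ft = first_touch(TRACE)
    part = partition_by_access_count(counts_from_trace(TRACE))
    print("first touch:", ft, "remote accesses:", remote_accesses(TRACE, ft))
    print("access-count partition:", part, "remote accesses:", remote_accesses(TRACE, part))
```

In this toy trace, first touch gives block B to the initializing thread even though another thread uses it far more, which is the kind of case where knowing the access pattern in advance could help.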
