Showing papers in "ACM Transactions on Architecture and Code Optimization in 2017"
••
TL;DR: A tool is designed that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies, and a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels is introduced.
Abstract: Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products—buffer-on-board, LRDIMM, HMC, HBM, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems, often deviating from DDR standards and/or integrating new functionality within memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions. First, we design a tool that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models, and is integrated into version 7 of the popular CACTI package. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. We then use the tool to help craft novel specialized memory system channels. We introduce a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels. We show that this simple change to the channel topology can improve performance by 22% for DDR DRAM and lower cost by up to 65% for DDR DRAM. This new architecture does not require any changes to DIMMs, and it efficiently supports hybrid DRAM/NVM systems. Finally, as an example of a more disruptive architecture, we design a custom DIMM and parallel bus that moves away from the DDR3/DDR4 standards. To reduce energy and improve performance, the baseline data channel is split into three narrow parallel channels and the on-DIMM interconnects are operated at a lower frequency. In addition, this allows us to design a two-tier error protection strategy that reduces data transfers on the interconnect. This architecture yields a performance improvement of 18% and a memory power reduction of 23%. The cascaded channel and narrow channel architectures serve as case studies for the new tool and show the potential for benefit from re-organizing basic memory interconnects.
217 citations
••
TL;DR: The image processing language Halide is extended so users can specify which portions of their applications should become hardware accelerators, and a compiler is provided that uses this code to automatically create the accelerator along with the “glue” code needed for the user’s application to access this hardware.
Abstract: Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, “programming,” and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language Halide so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler that uses this code to automatically create the accelerator along with the “glue” code needed for the user’s application to access this hardware. Starting with Halide not only provides a very high-level functional description of the hardware but also allows our compiler to generate a complete software application, which accesses the hardware for acceleration when appropriate. Our system also provides high-level semantics to explore different mappings of applications to a heterogeneous system, including the flexibility of being able to change the throughput rate of the generated hardware. We demonstrate our approach by mapping applications to a commercial Xilinx Zynq system. Using its FPGA with two low-power ARM cores, our design achieves up to 6× higher performance and 38× lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 35× higher performance with 12× lower energy compared to the K1’s 192-core GPU.
93 citations
••
TL;DR: This article proposes an automatic optimization framework called MiCOMP, which mitigates the compiler phase-ordering problem, and performs phase ordering of the optimizations in LLVM’s highest optimization level using optimization sub-sequences and machine learning.
Abstract: Recent compilers offer a vast number of multilayered optimizations targeting different code segments of an application. Choosing among these optimizations can significantly impact the performance of the code being optimized. The selection of the right set of compiler optimizations for a particular code segment is a very hard problem, but finding the best ordering of these optimizations adds further complexity. Finding the best ordering represents a long-standing problem in compilation research, named the phase-ordering problem. The traditional approach of constructing compiler heuristics to solve this problem simply cannot cope with the enormous complexity of choosing the right ordering of optimizations for every code segment in an application. This article proposes an automatic optimization framework we call MiCOMP, which mitigates the compiler phase-ordering problem. We perform phase ordering of the optimizations in LLVM’s highest optimization level using optimization sub-sequences and machine learning. The idea is to cluster the optimization passes of LLVM’s O3 setting into different clusters to predict the speedup of a complete sequence of all the optimization clusters instead of having to deal with the ordering of more than 60 different individual optimizations. The predictive model uses (1) dynamic features, (2) an encoded version of the compiler sequence, and (3) an exploration heuristic to tackle the problem. Experimental results using the LLVM compiler framework and the Cbench suite show the effectiveness of the proposed clustering and encoding techniques for application-based reordering of passes, while using a number of predictive models. We perform statistical analysis on the results and compare against (1) random iterative compilation, (2) standard optimization levels, and (3) two recent prediction approaches. We show that MiCOMP’s iterative compilation using its sub-sequences can reach an average performance speedup of 1.31 (up to 1.51). Additionally, we demonstrate that MiCOMP’s prediction model outperforms the -O1, -O2, and -O3 optimization levels using just a few predictions and reduces the prediction error rate down to only 5%. Overall, it achieves 90% of the available speedup by exploring less than 0.001% of the optimization space.
61 citations
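The sub-sequence idea is easiest to see in a toy form. Below is a hedged Python sketch (the cluster contents, the feature encoding, and the linear weights are invented for illustration and are not MiCOMP's actual clusters or trained model): -O3 passes are grouped into a few clusters, a cluster ordering is encoded as a feature vector, and a simple predictor ranks candidate orderings instead of searching the full space of 60+ individual passes.

```python
# Illustrative sketch of phase ordering over pass clusters rather than
# individual passes; names, features, and weights are placeholders.
from itertools import permutations

# Hypothetical clusters of -O3 passes (contents are placeholders).
CLUSTERS = {
    "A": ["-simplifycfg", "-sroa", "-early-cse"],
    "B": ["-loop-rotate", "-licm", "-loop-unroll"],
    "C": ["-gvn", "-instcombine", "-dse"],
}

def encode(order):
    """One-hot encode which cluster occupies each position in the sequence."""
    names = sorted(CLUSTERS)
    vec = []
    for cluster in order:
        vec += [1.0 if cluster == name else 0.0 for name in names]
    return vec

# Toy linear model standing in for the trained speedup predictor.
WEIGHTS = [0.10, 0.02, 0.05, 0.04, 0.12, 0.03, 0.01, 0.06, 0.11]

def predicted_speedup(order, dynamic_features):
    score = sum(w * x for w, x in zip(WEIGHTS, encode(order)))
    return 1.0 + score * dynamic_features["hotness"]

if __name__ == "__main__":
    features = {"hotness": 1.5}   # stand-in for profiled dynamic features
    ranked = sorted(permutations(CLUSTERS),
                    key=lambda o: -predicted_speedup(o, features))
    best = ranked[0]
    print("best cluster order:", best)
    print("pass sequence:", [p for c in best for p in CLUSTERS[c]])
```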
••
TL;DR: This article analyzes the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and proposes CAIRO, a compiler-assisted technique and decision model for enabling instruction- level offloading of PIM without any burden on programmers.
Abstract: Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offer the benefits of bandwidth and energy savings by offloading computations to functional units inside the memory. Several memory vendors have also started to integrate computation logic into the memory, such as the Hybrid Memory Cube (HMC), the latest version of which supports up to 18 in-memory atomic instructions. Although industry prototypes have motivated studies for investigating efficient methods and architectures for PIM, researchers have not proposed a systematic way for identifying the benefits of instruction-level PIM offloading. As a result, compiler support for recognizing offloading candidates and utilizing instruction-level PIM offloading is unavailable. In this article, we analyze the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers. To develop CAIRO, we analyzed how instruction offloading enables performance gain in both CPU and GPU workloads. Our studies show that performance gain from bandwidth savings, the ratio of the number of cache misses to total cache accesses, and the overhead of host atomic instructions are the key factors in selecting an offloading candidate. Based on our analytical models, we characterize the properties of beneficial and nonbeneficial candidates for offloading. We evaluate CAIRO with 27 multithreaded CPU and 36 GPU benchmarks. In our evaluation, CAIRO not only doubles the speedup for a set of PIM-beneficial workloads by exploiting HMC-atomic instructions but also prevents slowdown caused by incorrect offloading decisions for other workloads.
61 citations
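As a rough illustration of the kind of analytical decision model described above (the latencies, bandwidth cost, and packet sizes below are invented, not CAIRO's calibrated model), the sketch compares the expected cost of executing an atomic update on the host against offloading it as an HMC-atomic operation, with the LLC miss ratio as the key input.

```python
# Hedged sketch of an instruction-level offloading decision; all constants
# are illustrative placeholders rather than measured values.
def should_offload(miss_ratio, cache_line_bytes=64, operand_bytes=16,
                   dram_latency_ns=80.0, host_atomic_penalty_ns=20.0,
                   pim_atomic_latency_ns=40.0, bandwidth_ns_per_byte=0.1):
    """Return True if instruction-level PIM offloading looks beneficial.

    miss_ratio: fraction of accesses to the updated data that miss in the LLC.
    """
    # Host execution: every miss pays the DRAM latency and moves the whole
    # cache line over the channel twice (fill + eventual write-back).
    host_cost = (miss_ratio *
                 (dram_latency_ns + 2 * cache_line_bytes * bandwidth_ns_per_byte)
                 + host_atomic_penalty_ns)
    # Offloaded execution: only a small command/operand packet crosses the
    # channel, and the read-modify-write completes inside the memory cube.
    pim_cost = operand_bytes * bandwidth_ns_per_byte + pim_atomic_latency_ns
    return pim_cost < host_cost

if __name__ == "__main__":
    for mr in (0.05, 0.5, 0.95):
        print(f"LLC miss ratio {mr:.2f}: offload = {should_offload(mr)}")
```

With these placeholder numbers, cache-friendly updates stay on the host while cache-unfriendly (high miss ratio) updates become offloading candidates, mirroring the intuition stated in the abstract.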
••
TL;DR: In this article, a low overhead scheme that synergistically integrates pattern-based compression with expansion coding to realize simultaneous energy, latency, and lifetime improvements in MLC/TLC NVMs is proposed.
Abstract: Multilevel/triple-level cell nonvolatile memories (MLC/TLC NVMs) such as phase-change memory (PCM) and resistive RAM (RRAM) are the subject of active research and development as replacement candidates for DRAM, which is limited by its high refresh power and poor scaling potential. In addition to the benefits of nonvolatility (low refresh power) and improved scalability, MLC/TLC NVMs offer high data density and memory capacity over DRAM. However, the viability of MLC/TLC NVMs is limited primarily due to the high programming energy and latency as well as the low endurance of NVM cells; these are primarily attributed to the iterative program-and-verify procedure necessary for programming the NVM cells. This article proposes compression-expansion (CompEx) coding, a low overhead scheme that synergistically integrates pattern-based compression with expansion coding to realize simultaneous energy, latency, and lifetime improvements in MLC/TLC NVMs. CompEx coding is agnostic to the choice of compression technique; in this work, we evaluate CompEx coding using both frequent pattern compression (FPC) and base-delta-immediate (BΔI) compression. CompEx coding integrates FPC/BΔI with (k,m)q “expansion” coding; expansion codes are a class of q-ary linear block codes that encode data using only the low energy states of a q-ary NVM cell. CompEx coding simultaneously reduces energy and latency and improves lifetime for negligible-to-no memory overhead and negligible logic overhead (≈10k gates).
39 citations
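To make the compression-plus-expansion idea concrete, here is a simplified Python sketch (the pattern table, payload widths, and the two-bits-per-cell expansion are placeholders, not the paper's exact FPC/BΔI patterns or (k,m)q code): a word that compresses well is re-encoded two bits per cell so that every TLC cell stays in one of its four low-energy states, while incompressible words fall back to the usual three bits per cell.

```python
# Toy compression + expansion coding sketch; patterns and code parameters
# are illustrative only.
LOW_STATES = 4          # use only states 0..3 of an 8-state (TLC) cell
BITS_PER_TLC_CELL = 3   # uncompressed data uses all 3 bits of a cell

def to_signed(word64):
    return word64 - (1 << 64) if word64 >= (1 << 63) else word64

def compress_word(word64):
    """Toy pattern-based compressor: special-case zero and sign-extended words."""
    if word64 == 0:
        return ("zero", 0, 4)                      # 4-bit payload
    if -(1 << 15) <= to_signed(word64) < (1 << 15):
        return ("sext16", word64 & 0xFFFF, 16)     # 16-bit payload
    return ("raw", word64, 64)

def expand_to_cells(payload, nbits):
    """Encode the payload 2 bits per cell so every cell stays in a low-energy state."""
    return [(payload >> shift) & 0x3 for shift in range(0, nbits, 2)]

def encode_word(word64):
    tag, payload, nbits = compress_word(word64)
    if tag == "raw":
        # Not compressible: fall back to all 8 states, 3 bits per cell.
        return tag, [(word64 >> s) & 0x7 for s in range(0, 64, BITS_PER_TLC_CELL)]
    return tag, expand_to_cells(payload, nbits)

if __name__ == "__main__":
    for w in (0, 0x1234, 0xDEADBEEFCAFEBABE):
        tag, cells = encode_word(w)
        print(f"{w:#x}: {tag}, {len(cells)} cells, max state {max(cells)}")
```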
••
TL;DR: This article proposes a generic compile-time loop hardening scheme based on the duplication of termination conditions and of the computations involved in the evaluation of such conditions, and implements this algorithm in LLVM 4.0 at the Intermediate Representation (IR) level in the backend.
Abstract: Secure elements widely used in smartphones, digital consumer electronics, and payment systems are subject to fault attacks. To thwart such attacks, software protections are manually inserted requiring experts and time. The explosion of the Internet of Things (IoT) in home, business, and public spaces motivates the hardening of a wider class of applications and the need to offer security solutions to non-experts. This article addresses the automated protection of loops at compilation time, covering the widest range of control- and data-flow patterns, in both shape and complexity. The security property we consider is that a sensitive loop must always perform the expected number of iterations; otherwise, an attack must be reported. We propose a generic compile-time loop hardening scheme based on the duplication of termination conditions and of the computations involved in the evaluation of such conditions. We also investigate how to preserve the security property along the compilation flow while enabling aggressive optimizations. We implemented this algorithm in LLVM 4.0 at the Intermediate Representation (IR) level in the backend. On average, the compiler automatically hardens 95% of the sensitive loops of typical security benchmarks, and 98% of these loops are shown to be robust to simulated faults. Performance and code size overhead remain quite affordable, at 12.5% and 14%, respectively.
31 citations
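The core transformation is small enough to sketch. The paper applies it at the LLVM IR level; the Python below is only an illustration of the principle (the function and exception names are invented): the loop bound, the induction variable, and the termination condition are duplicated, and any divergence between the copies is treated as a fault.

```python
# Illustrative loop hardening via duplicated termination conditions.
class FaultDetected(Exception):
    pass

def hardened_sum(values):
    n = len(values)
    n_dup = len(values)          # duplicated loop bound
    i, i_dup = 0, 0              # duplicated induction variable
    total = 0
    while i < n:
        if not (i_dup < n_dup):  # duplicated termination condition must agree
            raise FaultDetected("termination conditions diverged")
        total += values[i]
        i += 1
        i_dup += 1
    if i != n or i_dup != n_dup: # final check: expected iteration count reached
        raise FaultDetected("unexpected number of iterations")
    return total

if __name__ == "__main__":
    print(hardened_sum([1, 2, 3, 4]))  # prints 10; a skipped or extra iteration would raise
```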
••
TL;DR: This article provides a suite of compiler-related methods based on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing.
Abstract: Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimum human intervention. However, inserting pragmas into production code is a difficult and error-prone task, often requiring familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. Such techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas into a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
30 citations
••
TL;DR: In this paper, the authors focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing static resource analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements.
Abstract: Energy transparency is a concept that makes a program’s energy consumption visible, from hardware up to software, through the different system layers. Such transparency can enable energy optimizations at each layer and between layers, as well as help both programmers and operating systems make energy-aware decisions. In this article, we focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing static resource analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements. Our novel mapping technique enables software energy consumption estimations at a higher level than the Instruction Set Architecture (ISA), namely the LLVM intermediate representation (IR) level, and therefore introduces energy transparency directly to the LLVM optimizer. We apply our energy estimation techniques to a comprehensive set of benchmarks, including single- and multithreaded embedded programs from two commonly used concurrency patterns: task farms and pipelines. Using SRA, our LLVM IR results demonstrate a high accuracy with a deviation in the range of 1% from the ISA SRA. Our profiling technique captures the actual energy consumption at the LLVM IR level with an average error of 3%.
29 citations
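A hedged sketch of what IR-level energy accounting looks like in practice (the per-opcode costs and the block profile below are made up; the article derives its mapping from an ISA-level energy model of the target): each basic block's energy is the sum of its instructions' costs, weighted by how often the block executes.

```python
# Illustrative IR-level energy estimation; all costs and counts are placeholders.
IR_ENERGY_NJ = {"add": 0.10, "mul": 0.35, "load": 0.80, "store": 0.90, "br": 0.05}

# Each basic block: list of IR opcodes plus a profiled execution count.
BASIC_BLOCKS = {
    "entry":     (["load", "add", "br"], 1),
    "loop.body": (["load", "mul", "add", "store", "br"], 10_000),
    "exit":      (["add", "br"], 1),
}

def estimate_energy(blocks):
    total = 0.0
    for _name, (opcodes, count) in blocks.items():
        block_energy = sum(IR_ENERGY_NJ.get(op, 0.2) for op in opcodes)  # default cost for unknown ops
        total += block_energy * count
    return total

if __name__ == "__main__":
    print(f"estimated energy: {estimate_energy(BASIC_BLOCKS) / 1e3:.2f} microjoules")
```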
••
TL;DR: This article demonstrates that pattern-based parallel programming has reached a good level of maturity, providing comparable results in terms of performance with respect to both other parallel programming methodologies based on pragma-based annotations and native implementations.
Abstract: High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time to solution. Pattern-based parallel programming is based on a set of composable and customizable parallel patterns used as basic building blocks in parallel applications. In recent years, a considerable effort has been made in empowering this programming model with features able to overcome shortcomings of early approaches concerning flexibility and performance. In this article, we demonstrate that the approach is flexible and efficient enough by applying it on 12 out of 13 PARSEC applications. Our analysis, conducted on three different multicore architectures, demonstrates that pattern-based parallel programming has reached a good level of maturity, providing comparable results in terms of performance with respect to both other parallel programming methodologies based on pragma-based annotations (i.e., OpenMP and OmpSs) and native implementations (i.e., Pthreads). Regarding the programming effort, we also demonstrate a considerable reduction in lines of code and code churn compared to Pthreads and comparable results with respect to other existing implementations.
27 citations
••
TL;DR: In this article, the authors analyzed the memory capacity requirements of important HPC benchmarks and applications and found that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but also detect applications and use cases that require gigabytes per core.
Abstract: An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.
26 citations
••
TL;DR: The concept of exponentially separable mapping (ESM), which defines a set of task mapping constraints on a many-core that can be defragmented optimally in polynomial time, is introduced.
Abstract: Many-cores can execute multiple multithreaded tasks in parallel. A task performs most efficiently when it is executed over a spatially connected and compact subset of cores so that performance loss due to communication overhead imposed by the task’s threads spread across the allocated cores is minimal. Over a span of time, unallocated cores can get scattered all over the many-core, creating fragments in the task mapping. These fragments can prevent efficient contiguous mapping of incoming new tasks leading to loss of performance. This problem can be alleviated by using a task defragmenter, which consolidates smaller fragments into larger fragments wherein the incoming tasks can be efficiently executed. Optimal defragmentation of a many-core is an NP-hard problem in the general case. Therefore, we simplify the original problem to a problem that can be solved optimally in polynomial time. In this work, we introduce a concept of exponentially separable mapping (ESM), which defines a set of task mapping constraints on a many-core. We prove that an ESM enforcing many-core can be defragmented optimally in polynomial time.
••
TL;DR: ALEA overcomes the limitations of coarse-grained power-sensing instruments to associate energy information effectively with source code at a fine-grained level and achieves a worst-case error of only 2% for coarse-grained code structures and 6% for fine-grained ones, with less than 1% runtime overhead.
Abstract: Energy efficiency is becoming increasingly important, yet few developers understand how source code changes affect the energy and power consumption of their programs. To enable them to achieve energy savings, we must associate energy consumption with software structures, especially at the fine-grained level of functions and loops. Most research in the field relies on direct power/energy measurements taken from on-board sensors or performance counters. However, this coarse granularity does not directly provide the needed fine-grained measurements. This article presents ALEA, a novel fine-grained energy profiling tool based on probabilistic analysis for fine-grained energy accounting. ALEA overcomes the limitations of coarse-grained power-sensing instruments to associate energy information effectively with source code at a fine-grained level. We demonstrate and validate that ALEA can perform accurate energy profiling at various granularity levels on two different architectures: Intel Sandy Bridge and ARM big.LITTLE. ALEA achieves a worst-case error of only 2% for coarse-grained code structures and 6% for fine-grained ones, with less than 1% runtime overhead. Our use cases demonstrate that ALEA supports energy optimizations, with energy savings of up to 2.87 times for a latency-critical option pricing workload under a given power budget.
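The probabilistic accounting can be illustrated in a few lines of Python (the sampling period, power readings, and function names are invented, and ALEA's statistical treatment is more careful than this): the program counter and a coarse power sensor are sampled together, and each code region accumulates power × sampling-interval for the samples attributed to it.

```python
# Illustrative sampling-based energy attribution; samples are synthetic.
from collections import defaultdict

SAMPLE_PERIOD_S = 0.001  # 1 ms sampling interval (placeholder)

# Hypothetical samples: (function observed on the PC, instantaneous power in watts).
samples = [("parse", 3.1), ("price_option", 5.2), ("price_option", 5.4),
           ("price_option", 5.1), ("io_wait", 2.0), ("price_option", 5.3)]

def attribute_energy(samples, period_s):
    energy = defaultdict(float)
    for func, power_w in samples:
        energy[func] += power_w * period_s   # E = P * dt for each sample interval
    return dict(energy)

if __name__ == "__main__":
    profile = attribute_energy(samples, SAMPLE_PERIOD_S)
    total = sum(profile.values())
    for func, joules in sorted(profile.items(), key=lambda kv: -kv[1]):
        print(f"{func:14s} {joules * 1e3:6.2f} mJ ({100 * joules / total:4.1f}%)")
```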
••
TL;DR: A hybrid memory aware cache partitioning technique (HAP) is proposed to dynamically adjust the cache spaces for DRAM and NVM data based on TMPKI, which can exactly reflect the LLC performance on top of hybrid memories.
Abstract: Data-center servers benefit from large-capacity memory systems to run multiple processes simultaneously. Hybrid DRAM-NVM memory is attractive for increasing memory capacity by exploiting the scalability of Non-Volatile Memory (NVM). However, current LLC policies are unaware of hybrid memory. Cache misses to NVM introduce high cost due to long NVM latency. Moreover, evicting dirty NVM data suffers from long write latency. We propose hybrid memory aware cache partitioning to dynamically adjust cache spaces and give NVM dirty data more chances to reside in the LLC. Experimental results show that Hybrid-memory-Aware Partition (HAP) improves performance by 46.7% and reduces energy consumption by 21.9% on average against LRU management. Moreover, HAP on average improves performance by 9.3% and reduces energy consumption by 6.4% against a state-of-the-art cache mechanism.
••
TL;DR: The evaluation indicates significant potential for both cache and memory compression, with higher compressibility in memory due to abundance of zero blocks, and suggests that compression could be of general use both in main memory and caches.
Abstract: Recent proposals present compression as a cost-effective technique to increase cache and memory capacity and bandwidth. While these proposals show the potential of compression, there are several open questions to adopt these proposals in real systems, including the following: (1) Do these techniques work for real-world workloads running for a long time? (2) Which application domains would potentially benefit the most from compression? (3) At which level of the memory hierarchy should we apply compression: caches, main memory, or both? In this article, our goal is to shed light on some main questions on the applicability of compression. We evaluate compression in the memory hierarchy for selected examples from different application classes. We analyze real applications with real data and complete runs of several benchmarks. While simulators provide a pretty accurate framework to study potential performance/energy impacts of ideas, they mostly limit us to a small range of workloads with short runtimes. To enable studying real workloads, we introduce a fast and simple methodology to get samples of memory and cache contents of a real machine (a desktop or a server). Compared to a cycle-accurate simulator, our methodology allows us to study real workloads as well as benchmarks. Our toolset is not a replacement for simulators but mostly complements them. While we can use a simulator to measure the performance/energy impact of a particular compression proposal, here with our methodology we can study the potentials with long-running workloads in early stages of the design. Using our toolset, we evaluate a collection of workloads from different domains, such as a web server of the CS department of UW-Madison for 24h, Google Chrome (watching a 1h-long movie on YouTube), and Linux games (playing for about an hour). We also use several benchmarks from different domains, including SPEC, mobile, and big data. We run these benchmarks to completion. Using these workloads and our toolset, we analyze different compression properties for both real applications and benchmarks. We focus on eight main hypotheses on compression, derived from previous work on compression. These properties (Table 2) act as the foundation of several proposals on compression, so the performance of those proposals depends very much on these basic properties. Overall, our results suggest that compression could be of general use both in main memory and caches. On average, the compression ratio is ≥2 for 64% and 54% of workloads, respectively, for memory and cache data. Our evaluation indicates significant potential for both cache and memory compression, with higher compressibility in memory due to the abundance of zero blocks. Among the application domains we studied, servers show on average the highest compressibility, while our mobile benchmarks show the lowest compressibility. For comparing benchmarks with real workloads, we show that (1) it is critical to run benchmarks to completion or for considerably long runtimes to avoid biased conclusions, and (2) SPEC benchmarks are good representatives of real desktop applications in terms of the compressibility of their datasets. However, this does not hold for all compression properties. For example, SPEC benchmarks have much better compression locality (i.e., neighboring blocks have similar compressibility) than real workloads. Thus, it is critical for designers to consider a wider range of workloads, including real applications, to evaluate their compression techniques.
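The kind of analysis this snapshot toolset enables can be sketched as follows (zlib stands in for hardware compressors such as FPC or BΔI, and the snapshot here is synthetic rather than captured from a real machine): blocks of a memory dump are scanned for zero blocks and per-block compressibility.

```python
# Illustrative per-block compressibility analysis of a memory snapshot.
import os
import zlib

BLOCK_SIZE = 64  # bytes, matching a cache line

def analyze(snapshot: bytes):
    blocks = [snapshot[i:i + BLOCK_SIZE] for i in range(0, len(snapshot), BLOCK_SIZE)]
    zero_blocks = sum(1 for b in blocks if not any(b))
    # Per-block compression ratio, capped so the ratio is never below 1.0.
    ratios = [BLOCK_SIZE / min(BLOCK_SIZE, len(zlib.compress(b, 9))) for b in blocks]
    return zero_blocks / len(blocks), sum(ratios) / len(ratios)

if __name__ == "__main__":
    # Synthetic "memory contents": zero pages, repetitive data, incompressible noise.
    snapshot = bytes(4096) + b"ABCD" * 1024 + os.urandom(4096)
    zero_frac, mean_ratio = analyze(snapshot)
    print(f"zero blocks: {zero_frac:.0%}, mean per-block compression ratio: {mean_ratio:.2f}")
```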
••
TL;DR: The proposed iterative optimization approach, evaluated against existing iterative and model-driven optimization strategies, outperforms them by finding loop transformations that yield significantly higher performance.
Abstract: The polyhedron model is a powerful model to systematically identify and apply loop transformations that improve data locality (e.g., via tiling) and enable parallelization. In the polyhedron model, a loop transformation is, essentially, represented as an affine function. Well-established algorithms for the discovery of promising transformations are based on performance models. These algorithms have the drawback of not being easily adaptable to the characteristics of a specific program or target hardware. An iterative search for promising loop transformations is more easily adaptable and can help to learn better models. We present an iterative optimization method in the polyhedron model that targets tiling and parallelization. The method enables either a sampling of the search space of legal loop transformations at random or a more directed search via a genetic algorithm. For the latter, we propose a set of novel, tailored reproduction operators. We evaluate our approach against existing iterative and model-driven optimization strategies. We compare the convergence rate of our genetic algorithm to that of random exploration. Our approach of iterative optimization outperforms existing optimization techniques in that it finds loop transformations that yield significantly higher performance. If well configured, then random exploration turns out to be very effective and reduces the need for a genetic algorithm.
••
TL;DR: This article presents a novel data-parallel solution designed and optimized for the GPU architecture that uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution.
Abstract: Data race detection has become an important problem in GPU programming. Previous designs of CPU race-checking tools are mainly task parallel and incur high overhead on GPUs due to access instrumentation, especially when monitoring many thousands of threads routinely used by GPU programs. This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. More important, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5× for over half of the tests and 10× on average, which is orders of magnitude more efficient than the cuda-memcheck tool by Nvidia and the methods that use fine-grained access instrumentation.
••
TL;DR: A new definition of ideal energy-proportional computing is introduced, new metrics to quantify computational energy waste, and new SLA-aware OS governors that seek Pareto optimality to achieve power-efficient performance are introduced.
Abstract: The original definition of energy-proportional computing does not characterize the energy efficiency of recent reconfigurable computers, resulting in nonintuitive “super-proportional” behavior. This article introduces a new definition of ideal energy-proportional computing, new metrics to quantify computational energy waste, and new SLA-aware OS governors that seek Pareto optimality to achieve power-efficient performance.
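As a point of reference, the classic notion of energy proportionality can be turned into a simple waste metric (the sketch below follows the original "power tracks utilization" definition and is not the article's new definition or its SLA-aware governors): waste is whatever energy is consumed beyond utilization times peak power.

```python
# Illustrative energy-waste computation under the classic proportionality model.
def energy_waste(samples, peak_power_w, period_s=1.0):
    """samples: list of (utilization in [0,1], measured power in watts)."""
    actual = sum(p for _, p in samples) * period_s
    ideal = sum(u * peak_power_w for u, _ in samples) * period_s
    return actual, ideal, max(0.0, actual - ideal)

if __name__ == "__main__":
    trace = [(0.1, 60.0), (0.5, 120.0), (0.9, 180.0), (1.0, 200.0)]  # made-up trace
    actual, ideal, waste = energy_waste(trace, peak_power_w=200.0)
    print(f"actual {actual:.0f} J, ideal {ideal:.0f} J, waste {waste:.0f} J "
          f"({100 * waste / actual:.0f}% of consumed energy)")
```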
••
TL;DR: A new metric called the victim footprint (VFP) is presented; it is measured once per program in its solo execution and can be combined to compute the performance of any exclusive cache hierarchy, replacing parallel testing with theoretical analysis.
Abstract: A problem on multicore systems is cache sharing, where the cache occupancy of a program depends on the cache usage of peer programs. Exclusive cache hierarchy as used on AMD processors is an effective solution to allow processor cores to have a large private cache while still benefitting from shared cache. The shared cache stores the “victims” (i.e., data evicted from private caches). The performance depends on how victims of co-run programs interact in shared cache. This article presents a new metric called the victim footprint (VFP). It is measured once per program in its solo execution and can then be combined to compute the performance of any exclusive cache hierarchy, replacing parallel testing with theoretical analysis. The work evaluates the VFP by using it to analyze cache sharing by parallel mixes of sequential programs, comparing the accuracy of the theory to hardware counter results, and measuring the benefit of exclusivity-aware analysis and optimization.
••
TL;DR: This work proposes a set of extensions and modifications to Halide to generate code for DSP C compilers, focusing specifically on diverse SIMD target instruction sets and heterogeneous scratchpad memory hierarchies, and implements said techniques for a commercial DSP found in an Intel Image Processing Unit.
Abstract: Specialized Digital Signal Processors (DSPs), which can be found in a wide range of modern devices, play an important role in power-efficient, high-performance image processing. Applications including camera sensor post-processing and computer vision benefit from being (partially) mapped onto such DSPs. However, due to their specialized instruction sets and dependence on low-level code optimization, developing applications for DSPs is more time-consuming and error-prone than for general-purpose processors. Halide is a domain-specific language (DSL) that enables low-effort development of portable, high-performance imaging pipelines—a combination of qualities that is hard, if not impossible, to find among DSP programming models. We propose a set of extensions and modifications to Halide to generate code for DSP C compilers, focusing specifically on diverse SIMD target instruction sets and heterogeneous scratchpad memory hierarchies. We implement said techniques for a commercial DSP found in an Intel Image Processing Unit (IPU), demonstrating that this solution can be used to achieve performance within 20% of highly tuned, manually written C code, while leading to a reduction in code complexity. By comparing performance of Halide algorithms using our solution to results on CPU and GPU targets, we confirm the value of using DSP targets with Halide.
••
TL;DR: This work presents a novel method, Machine Learned Machines (MLM), by using online reinforcement learning (RL) to perform dynamic partitioning of the last level cache (LLC) along with dynamic voltage and frequency scaling (DVFS) of the core and uncore (interconnection network and LLC).
Abstract: Modern multi-core systems provide huge computational capabilities, which can be used to run multiple processes concurrently. To achieve the best possible performance within limited power budgets, the various system resources need to be allocated effectively. Any mismatch between runtime resource requirement and allocation leads to a sub-optimal energy-delay product (EDP). Different optimization techniques exist for addressing the problem of mismatch between the dynamic requirement and runtime allocation of the system resources. Choosing between multiple optimizations at runtime is complex due to the non-additive effects, making the scenario suitable for the application of machine learning techniques. We present a novel method, Machine Learned Machines (MLM), by using online reinforcement learning (RL) to perform dynamic partitioning of the last level cache (LLC), along with dynamic voltage and frequency scaling (DVFS) of the core and uncore (interconnection network and LLC). We have proposed and evaluated three different MLM co-optimization techniques based on independent and cooperative multi-agent learners. We show that the co-optimization results in a much lower system EDP than any of the techniques applied individually. We explore various RL models targeted toward optimization of different system metrics and study their effects on a system EDP, system throughput (STP), and Fairness. The various proposed techniques have been extensively evaluated with a mix of 20 workloads on a 4-core system using Spec2006 benchmarks. We have further evaluated our cooperative MLM techniques on a 16-core system. The results show an average of 20.5% and 19.1% system EDP improvement on a 4-core and 16-core system, respectively, with limited degradation of STP and Fairness.
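A toy version of the learning loop conveys the flavor of the approach (the action space, the reward shaping, and the tiny EDP model below are invented; the paper uses multi-agent learners on a real simulated system): a tabular learner repeatedly picks an (LLC ways, frequency) configuration and is rewarded with the negative energy-delay product it observes.

```python
# Illustrative single-state reinforcement-learning sketch for EDP-driven
# co-optimization; the system model and constants are placeholders.
import random
from collections import defaultdict

ACTIONS = [(ways, freq) for ways in (4, 8, 12, 16) for freq in (1.2, 1.8, 2.4)]

def simulate_edp(ways, freq_ghz):
    """Stand-in for the real system: more ways shorten delay up to a point,
    higher frequency trades energy for delay."""
    delay = 1.0 / freq_ghz + 0.5 / ways
    energy = 0.4 * freq_ghz ** 2 + 0.02 * ways
    return energy * delay

def train(episodes=2000, alpha=0.2, epsilon=0.1):
    q = defaultdict(float)                              # one Q value per action
    for _ in range(episodes):
        if random.random() < epsilon:
            action = random.choice(ACTIONS)             # explore
        else:
            action = max(ACTIONS, key=lambda a: q[a])   # exploit
        reward = -simulate_edp(*action)                 # lower EDP -> higher reward
        q[action] += alpha * (reward - q[action])
    return max(ACTIONS, key=lambda a: q[a])

if __name__ == "__main__":
    ways, freq = train()
    print(f"learned configuration: {ways} LLC ways at {freq} GHz")
```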
••
TL;DR: EXtended-ECS (XECS), a page-level error correction architecture, is proposed, which employs dynamic on-demand ECS allocation and opportunistic pattern-based data compression to improve NVM lifetime by 2×, and is also compatible with MLC/TLC NVMs, where it complements drift-induced soft error reduction techniques like ECC and incomplete data mapping to simultaneously extend NVM lifetime.
Abstract: Emerging nonvolatile memories (NVMs) suffer from low write endurance, resulting in early cell failures (hard errors), which reduce memory lifetime. It was recognized early on that conventional error-correcting codes (ECCs), which are designed for soft errors, are a poor choice for addressing hard errors in NVMs. This led to the evolution of hard error correction schemes like dynamically replicated memory (DRM), error-correcting pointers (ECPs), SAFER, FREE-p, PAYG, and Zombie memory to improve NVM lifetime. Whereas these approaches made significant inroads in addressing hard errors and low memory lifetime in NVMs, overcoming the challenges of underutilization of error-correcting resources and/or implementation overhead (e.g., codec latency, hardware support) remain areas of active research and development. This article proposes error-correcting strings (ECSs) as a high-utilization, low-latency solution for hard error correction in single-/multi-/triple-level cell (SLC/MLC/TLC) NVMs. At its core, ECS adopts a base-offset approach to store pointers to the failed memory cells; in this work, base is the address of the first failed cell in a memory block and offsets are the distances between successive failed cells in that memory block. Unlike ECP, which uses fixed-length pointers, ECS uses variable-length offsets to point to the failed cells, thereby realizing more pointers to tolerate more hard errors per memory block. Further, this article proposes eXtended-ECS (XECS), a page-level error correction architecture, which employs dynamic on-demand ECS allocation and opportunistic pattern-based data compression to improve NVM lifetime by 2× over ECP-6 for comparable overhead and negligible impact to system performance. Finally, this article demonstrates that ECS is a drop-in replacement for ECP to extend the lifetime of state-of-the-art ECP-based techniques like PAYG and Zombie memory; ECS is also compatible with MLC/TLC NVMs, where it complements drift-induced soft error reduction techniques like ECC and incomplete data mapping to simultaneously extend NVM lifetime.
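The base-offset encoding at the heart of ECS is easy to sketch (field widths and the offset encoding below are illustrative, not the paper's exact format): the first failed cell is stored as an absolute base pointer and every later failure as a variable-length distance from its predecessor, so clustered failures cost only a few bits each.

```python
# Illustrative base-offset encoding of failed-cell positions.
def encode_failures(failed_positions, block_cells=512):
    """Return (base, offsets, metadata bits used) for a block's failed cells."""
    if not failed_positions:
        return None, [], 0
    positions = sorted(failed_positions)
    base = positions[0]
    offsets = [b - a for a, b in zip(positions, positions[1:])]
    base_bits = block_cells.bit_length()                             # fixed-width base pointer
    offset_bits = sum(max(1, off.bit_length()) for off in offsets)   # variable-length offsets
    return base, offsets, base_bits + offset_bits

def decode_failures(base, offsets):
    positions, current = [base], base
    for off in offsets:
        current += off
        positions.append(current)
    return positions

if __name__ == "__main__":
    failed = [37, 41, 44, 300]           # hypothetical worn-out cell indices
    base, offsets, bits = encode_failures(failed)
    assert decode_failures(base, offsets) == sorted(failed)
    # A fixed-width scheme in the spirit of ECP would spend ~10 bits per pointer here.
    print(f"base={base}, offsets={offsets}, metadata bits={bits} "
          f"(vs {len(failed) * 10} bits with fixed 10-bit pointers)")
```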
••
TL;DR: MBZip is a synergistic mechanism that compresses multiple data blocks into one single block (called a zipped block), both at the LLC and DRAM, and improves the system performance by 21.9%, with a maximum of 90.3% on a 4-core system.
Abstract: Compression techniques at the last-level cache and the DRAM play an important role in improving system performance by increasing their effective capacities. A compressed block in DRAM also reduces the transfer time over the memory bus to the caches, reducing the latency of an LLC miss. Usually, compression is achieved by exploiting data patterns present within a block. But applications can exhibit data locality that spreads across multiple consecutive data blocks. We observe that there is significant opportunity available for compressing multiple consecutive data blocks into one single block, both at the LLC and DRAM. Our studies using 21 SPEC CPU applications show that, at the LLC, around 25% (on average) of the cache blocks can be compressed into one single cache block when grouped together in groups of 2 to 8 blocks. In DRAM, more than 30% of the columns residing in a single DRAM page can be compressed into one DRAM column, when grouped together in groups of 2 to 6. Motivated by these observations, we propose a mechanism, namely, MBZip, that compresses multiple data blocks into one single block (called a zipped block), both at the LLC and DRAM. At the cache, MBZip includes a simple tag structure to index into these zipped cache blocks and the indexing does not incur any redirectional delay. At the DRAM, MBZip does not need any changes to the address computation logic and works seamlessly with the conventional/existing logic. MBZip is a synergistic mechanism that coordinates these zipped blocks at the LLC and DRAM. Further, we also explore silent writes at the DRAM and show that certain writes need not access the memory when blocks are zipped. MBZip improves the system performance by 21.9%, with a maximum of 90.3%, on a 4-core system.
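The zipping test itself is simple, as the sketch below shows (zlib stands in for the hardware compressor and the group sizes are illustrative): a group of consecutive 64-byte blocks becomes a zipped block only if its compressed form still fits within a single 64-byte block.

```python
# Illustrative multi-block zipping check.
import os
import zlib

BLOCK = 64  # bytes

def try_zip(blocks):
    """blocks: list of consecutive BLOCK-sized byte strings."""
    compressed = zlib.compress(b"".join(blocks), 9)
    return compressed if len(compressed) <= BLOCK else None

if __name__ == "__main__":
    # Neighbouring blocks with similar contents zip well; random ones do not.
    similar = [bytes([i]) * BLOCK for i in range(4)]
    noisy = [os.urandom(BLOCK) for _ in range(4)]
    for name, group in (("similar", similar), ("noisy", noisy)):
        zipped = try_zip(group)
        print(f"{name}: {'zipped into 1 block' if zipped else 'left unzipped'} "
              f"({len(group)} blocks of {BLOCK} B)")
```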
••
TL;DR: SCALO improves throughput when other state-of-the-art approaches fail and outperforms them by up to 40% when they succeed, by including dynamic contention effects on scalability and by controlling the parallelism during the execution of parallel regions.
Abstract: Shared memory machines continue to increase in scale by adding more parallelism through additional cores and complex memory hierarchies. Often, executing multiple applications concurrently, dividing hardware threads among them, provides greater efficiency than executing a single application with a large thread count. However, contention for shared resources can limit the improvement of concurrent application execution: orchestrating the number of threads used by each application is essential. In this article, we contribute SCALO, a solution to orchestrate concurrent application execution to increase throughput. SCALO monitors co-executing applications at runtime to evaluate their scalability. Its optimizing thread allocator analyzes these scalability estimates to adapt the parallelism of each program. Unlike previous approaches, SCALO differs by including dynamic contention effects on scalability and by controlling the parallelism during the execution of parallel regions. Thus, it improves throughput when other state-of-the-art approaches fail and outperforms them by up to 40% when they succeed.
••
TL;DR: A new metric for assessing the similarity of statistical distributions of event counts is presented and it is shown that the execution profiling approach performs significantly better than Hardware Event Multiplexing.
Abstract: Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), thus only a small fraction of hardware events can be monitored simultaneously. We present new techniques to acquire counts for all available hardware events with high accuracy by multiplexing PMCs across multiple executions of the same program, then carefully reconciling and merging the multiple profiles into a single, coherent profile. We present a new metric for assessing the similarity of statistical distributions of event counts and show that our execution profiling approach performs significantly better than Hardware Event Multiplexing.
••
TL;DR: The MATOG auto-tuner abstracts the array memory access in CUDA applications and automatically optimizes the code according to the used GPUs, independent of the used GPU generation and without the need to manually tune the code.
Abstract: Optimal code performance is (besides correctness and accuracy) the most important objective in compute-intensive applications. In many of these applications, Graphics Processing Units (GPUs) are used because of their high amount of compute power. However, because of their massively parallel architecture, the code has to be specifically adjusted to the underlying hardware to achieve optimal performance and therefore has to be reoptimized for each new generation. In reality, this is usually not the case, as production code is normally at least several years old and nobody has the time to continuously adjust existing code to new hardware. In recent years, more and more approaches have emerged that automatically tune the performance of applications toward the underlying hardware. In this article, we present the MATOG auto-tuner and its concepts. It abstracts the array memory access in CUDA applications and automatically optimizes the code according to the used GPUs. MATOG only requires a few profiling runs to analyze even complex applications, while achieving significant speedups over non-optimized code, independent of the used GPU generation and without the need to manually tune the code.
••
TL;DR: In this paper, the authors present a formal framework that can determine the inherent local and global symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis, based on the mathematical theory of groups and a generalization called inverse semigroups.
Abstract: With the surge of multi- and many-core systems, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this article, we present a formal framework that can determine the inherent local and global symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for today's platforms with a handful of cores and moderate-sized benchmarks, our approach consistently yields reductions in the overall execution time of the algorithms. We obtain a speedup of more than 10× for one use case and save 10% of the execution time in another.
••
TL;DR: An automated precision-selection method and a novel GPU register file organization that can densely store floating-point register values at arbitrary precisions are proposed, enabling GPUs to keep up to twice as many threads in flight simultaneously.
Abstract: Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs support both at the compiler and at the microarchitecture level. At the compiler level, a method is needed to automate the reduction of precision of each floating-point value. At the microarchitecture level, a lower precision of each floating-point register can allow more floating-point values to be packed into a register file. This, however, calls for new register file organizations. This article proposes an automated precision-selection method and a novel GPU register file organization that can store floating-point register values at arbitrary precisions densely. The automated precision-selection method uses a data-driven approach for setting the precision level of floating-point values, given a quality threshold and a representative set of input data. By allowing a small, but acceptable, degradation in output quality, our method can remove a significant amount of the bits needed to represent floating-point values in the investigated kernels (between 28% and 60%). Our proposed register file organization exploits these lower-precision floating-point values by packing several of them into the same physical register. This reduces the register pressure per thread by up to 48%, and by 27% on average, for a negligible output-quality degradation. This can enable GPUs to keep up to twice as many threads in flight simultaneously.
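A minimal sketch of the data-driven selection step (the truncation scheme, error threshold, and packing rule are invented stand-ins for the paper's method): mantissa bits are dropped as long as a representative input set keeps the kernel's output within an error bound, and the surviving value width determines how many values could share a 32-bit register slot.

```python
# Illustrative data-driven precision selection for float32 values.
import struct

def truncate_float(x, mantissa_bits):
    """Keep only the top `mantissa_bits` of a float32 mantissa (out of 23)."""
    raw = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = ~((1 << (23 - mantissa_bits)) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", raw & mask))[0]

def select_precision(inputs, kernel, max_rel_error=0.05):
    reference = kernel(inputs)
    for bits in range(0, 24):                       # try the narrowest mantissa first
        approx = kernel([truncate_float(x, bits) for x in inputs])
        if abs(approx - reference) <= max_rel_error * abs(reference):
            return bits
    return 23

if __name__ == "__main__":
    data = [0.1 * i for i in range(1, 100)]          # representative input set
    kernel = lambda xs: sum(x * x for x in xs)       # toy compute kernel
    bits = select_precision(data, kernel)
    width = 1 + 8 + bits                             # sign + exponent + kept mantissa
    print(f"mantissa bits kept: {bits}, value width: {width} bits, "
          f"values per 32-bit register slot: {32 // width}")
```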
••
TL;DR: This article presents Transactional Correctness tool for Abstract Data Types (TxC-ADT), the first tool that can check the correctness of transactional data structures and presents a technique for defining correctness as a happens-before relation, an essential aspect for checking correctness of transactions that synchronize only for high-level semantic conflicts.
Abstract: Transactional memory simplifies multiprocessor programming by providing the guarantee that a sequential block of code in the form of a transaction will exhibit atomicity and isolation. Transactional data structures offer the same guarantee to concurrent data structures by enabling the atomic execution of a composition of operations. The concurrency control of transactional memory systems preserves atomicity and isolation by detecting read/write conflicts among multiple concurrent transactions. State-of-the-art transactional data structures improve on this concurrency control protocol by providing explicit transaction-level synchronization for only non-commutative operations. Since read/write conflicts are handled by thread-level concurrency control, the correctness of transactional data structures cannot be evaluated according to the read/write histories. This presents a challenge for existing correctness verification techniques for transactional memory, because correctness is determined according to the transitions taken by the transactions in the presence of read/write conflicts. In this article, we present Transactional Correctness tool for Abstract Data Types (TxC-ADT), the first tool that can check the correctness of transactional data structures. TxC-ADT elevates the standard definitions of transactional correctness to be in terms of an abstract data type, an essential aspect for checking correctness of transactions that synchronize only for high-level semantic conflicts. To accommodate a diverse assortment of transactional correctness conditions, we present a technique for defining correctness as a happens-before relation. Defining a correctness condition in this manner enables an automated approach in which correctness is evaluated by generating and analyzing a transactional happens-before graph during model checking. A transactional happens-before graph is maintained on a per-thread basis, making our approach applicable to transactional correctness conditions that do not enforce a total order on a transactional execution. We demonstrate the practical applications of TxC-ADT by checking Lock Free Transactional Transformation and Transactional Data Structure Libraries for serializability, strict serializability, opacity, and causal consistency.
••
TL;DR: A significance-centric programming model and runtime support are introduced that set the supply voltage in a multicore CPU to sub-nominal values to reduce the energy footprint and provide mechanisms to control output quality.
Abstract: This article introduces a significance-centric programming model and runtime support that set the supply voltage in a multicore CPU to sub-nominal values to reduce the energy footprint and provide mechanisms to control output quality. The developers specify the significance of application tasks according to their contribution to the output quality and provide check and repair functions for handling faults. On a multicore system, we evaluate five benchmarks using an energy model that quantifies the energy reduction. When executing the least-significant tasks unreliably, our approach leads to 20% CPU energy reduction with respect to a reliable execution and has minimal quality degradation.
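The programming model can be sketched in a few lines (the API names, the fault-injection stub, and the fault rate are invented for illustration): each task carries a significance flag plus check and repair hooks, and only the insignificant tasks are executed on the unreliably powered cores.

```python
# Illustrative significance-driven task execution with check/repair hooks.
import random

def run_unreliably(task, *args, fault_rate=0.05):
    """Stand-in for execution at sub-nominal voltage: occasionally corrupt the result."""
    result = task(*args)
    return None if random.random() < fault_rate else result

def run_task(task, args, significant, check, repair):
    if significant:
        return task(*args)                    # significant tasks run reliably
    result = run_unreliably(task, *args)
    if not check(result):
        result = repair(*args)                # cheap approximation replaces a faulty result
    return result

if __name__ == "__main__":
    dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))
    approx_dot = lambda xs, ys: 0.0           # trivial repair: contribute nothing
    chunks = [(list(range(8)), list(range(8))) for _ in range(100)]
    total = sum(run_task(dot, chunk, significant=(i < 10),   # first 10 chunks marked significant
                         check=lambda r: r is not None,
                         repair=approx_dot)
                for i, chunk in enumerate(chunks))
    print("dot product with selective reliability:", total)
```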
••
TL;DR: A lightweight progress-tracking methodology based on the outer loops of application kernels is proposed; it builds on online history to estimate the total execution time and can reduce energy consumption by more than 20% without missing any computational deadlines.
Abstract: Most systems allocate computational resources to each executing task without any actual knowledge of the application’s Quality-of-Service (QoS) requirements. Such best-effort policies lead to overprovisioning of the resources and increase energy loss. This work assumes applications with soft QoS requirements and exploits the inherent timing slack to minimize the allocated computational resources to reduce energy consumption. We propose a lightweight progress-tracking methodology based on the outer loops of application kernels. It builds on online history and uses it to estimate the total execution time. The prediction of the execution time and the QoS requirements are then used to schedule the application on a heterogeneous architecture with big out-of-order cores and small (LITTLE) in-order cores and select the minimum operating frequency, using DVFS, that meets the deadline. Our scheme is effective in exploiting the timing slack of each application. We show that it can reduce the energy consumption by more than 20% without missing any computational deadlines.
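The frequency-selection step can be sketched compactly (the operating points, the linear 1/f scaling assumption, and the numbers are illustrative): the average time of the outer-loop iterations seen so far predicts the remaining work, and the lowest frequency that still meets the deadline is chosen.

```python
# Illustrative progress-based DVFS selection.
FREQS_GHZ = [0.6, 1.0, 1.4, 2.0]   # available DVFS operating points (placeholders)
NOMINAL_GHZ = 2.0                  # frequency at which iteration times were measured

def pick_frequency(iteration_times_s, total_iterations, time_left_s):
    """Choose the minimum frequency that finishes the remaining iterations in time."""
    done = len(iteration_times_s)
    if done == 0:
        return max(FREQS_GHZ)                       # no history yet: be conservative
    avg_iter = sum(iteration_times_s) / done
    remaining = total_iterations - done
    for f in sorted(FREQS_GHZ):                     # assume runtime scales with 1/frequency
        predicted = remaining * avg_iter * (NOMINAL_GHZ / f)
        if predicted <= time_left_s:
            return f
    return max(FREQS_GHZ)                           # deadline at risk: run flat out

if __name__ == "__main__":
    history = [0.010] * 25                          # 25 iterations, 10 ms each at 2.0 GHz
    print("chosen frequency:",
          pick_frequency(history, total_iterations=100, time_left_s=2.0), "GHz")
```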