
Showing papers in "ACM Transactions on Architecture and Code Optimization in 2016"


Journal ArticleDOI
TL;DR: A new 3D-stacked DRAM architecture, called Simultaneous Multi-Layer Access (SMLA), is proposed, which increases the internal DRAM bandwidth by accessing multiple DRAM layers concurrently, thus making much greater use of the bandwidth that the TSVs offer.
Abstract: 3D-stacked DRAM alleviates the limited memory bandwidth bottleneck that exists in modern systems by leveraging through silicon vias (TSVs) to deliver higher external memory channel bandwidth. Today’s systems, however, cannot fully utilize the higher bandwidth offered by TSVs, due to the limited internal bandwidth within each layer of the 3D-stacked DRAM. We identify that the bottleneck to enabling higher bandwidth in 3D-stacked DRAM is now the global bitline interface, the connection between the DRAM row buffer and the peripheral IO circuits. The global bitline interface consists of a limited and expensive set of wires and structures, called global bitlines and global sense amplifiers, whose high cost makes it difficult to simply scale up the bandwidth of the interface within a single DRAM layer in the 3D stack. We alleviate this bandwidth bottleneck by exploiting the observation that several global bitline interfaces already exist across the multiple DRAM layers in current 3D-stacked designs, but only a fraction of them are enabled at the same time. We propose a new 3D-stacked DRAM architecture, called Simultaneous Multi-Layer Access (SMLA), which increases the internal DRAM bandwidth by accessing multiple DRAM layers concurrently, thus making much greater use of the bandwidth that the TSVs offer. To avoid channel contention, the DRAM layers must coordinate with each other when simultaneously transferring data. We propose two approaches to coordination, both of which deliver four times the bandwidth for a four-layer DRAM, over a baseline that accesses only one layer at a time. Our first approach, Dedicated-IO, statically partitions the TSVs by assigning each layer to a dedicated set of TSVs that operate at a higher frequency. Unfortunately, Dedicated-IO requires a nonuniform design for each layer (increasing manufacturing costs), and its DRAM energy consumption scales linearly with the number of layers. Our second approach, Cascaded-IO, solves both issues by instead time multiplexing all of the TSVs across layers. Cascaded-IO reduces DRAM energy consumption by lowering the operating frequency of higher layers. Our evaluations show that SMLA provides significant performance improvement and energy reduction across a variety of workloads (55%/18% on average for multiprogrammed workloads, respectively) over a baseline 3D-stacked DRAM, with low overhead.
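
The four-times bandwidth claim can be checked with back-of-the-envelope arithmetic. The sketch below uses hypothetical TSV width and frequency values (not figures from the paper) to contrast the baseline, Dedicated-IO, and a simplified view of Cascaded-IO.

```python
# Back-of-the-envelope comparison of 3D-stacked DRAM IO organizations.
# All numbers are illustrative assumptions, not figures from the paper.

LAYERS = 4          # DRAM layers in the stack
TSV_WIDTH = 128     # total TSV data width in bits, shared by all layers
BASE_FREQ = 0.8     # baseline IO frequency in GHz (assumed)

def bandwidth_gbps(width_bits, freq_ghz, active_layers=1):
    """Aggregate transfer rate in Gb/s for the layers driving the bus."""
    return width_bits * freq_ghz * active_layers

# Baseline: only one layer drives the full TSV bus at a time.
baseline = bandwidth_gbps(TSV_WIDTH, BASE_FREQ, active_layers=1)

# Dedicated-IO: TSVs are statically partitioned; each layer gets
# TSV_WIDTH / LAYERS wires but drives them LAYERS times faster.
dedicated = bandwidth_gbps(TSV_WIDTH // LAYERS, BASE_FREQ * LAYERS,
                           active_layers=LAYERS)

# Cascaded-IO (simplified): all TSVs are time multiplexed across layers;
# the IO interface at the bottom of the stack streams LAYERS bursts back
# to back while upper layers keep lower operating frequencies.
cascaded = bandwidth_gbps(TSV_WIDTH, BASE_FREQ * LAYERS, active_layers=1)

print(f"baseline  : {baseline:6.1f} Gb/s")
print(f"dedicated : {dedicated:6.1f} Gb/s ({dedicated / baseline:.0f}x)")
print(f"cascaded  : {cascaded:6.1f} Gb/s ({cascaded / baseline:.0f}x)")
```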

150 citations


Journal ArticleDOI
TL;DR: A Deadline-Aware memory Scheduler for Heterogeneous systems (DASH), which overcomes problems using three key ideas, with the goal of meeting HWAs’ deadlines while providing high CPU performance.
Abstract: Modern SoCs integrate multiple CPU cores and hardware accelerators (HWAs) that share the same main memory system, causing interference among memory requests from different agents. The result of this interference, if it is not controlled well, is missed deadlines for HWAs and low CPU performance. Few previous works have tackled this problem. State-of-the-art mechanisms designed for CPU-GPU systems strive to meet a target frame rate for GPUs by prioritizing the GPU close to the time when it has to complete a frame. We observe two major problems when such an approach is adapted to a heterogeneous CPU-HWA system. First, HWAs miss deadlines because they are prioritized only when close to their deadlines. Second, such an approach does not consider the diverse memory access characteristics of different applications running on CPUs and HWAs, leading to low performance for latency-sensitive CPU applications and deadline misses for some HWAs, including GPUs. In this article, we propose a Deadline-Aware memory Scheduler for Heterogeneous systems (DASH), which overcomes these problems using three key ideas, with the goal of meeting HWAs’ deadlines while providing high CPU performance. First, DASH prioritizes an HWA when it is not on track to meet its deadline any time during a deadline period, instead of prioritizing it only when close to a deadline. Second, DASH prioritizes HWAs over memory-intensive CPU applications based on the observation that memory-intensive applications’ performance is not sensitive to memory latency. Third, DASH treats short-deadline HWAs differently as they are more likely to miss their deadlines and schedules their requests based on worst-case memory access time estimates. Extensive evaluations across a wide variety of different workloads and systems show that DASH achieves significantly better CPU performance than the best previous scheduler while always meeting the deadlines for all HWAs, including GPUs, thereby largely improving frame rates.
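
A minimal sketch of how the three ideas could translate into a request-priority function, with hypothetical classes and fields (nothing here is taken from the DASH implementation):

```python
from dataclasses import dataclass

@dataclass
class Hwa:
    """One hardware accelerator's state for the current deadline period."""
    name: str
    total_work: int              # requests it must finish this period
    deadline: int                # period length in cycles
    done: int = 0                # requests serviced so far
    short_deadline: bool = False # scheduled conservatively (idea 3)

    def on_track(self, now):
        # Expected progress grows linearly with time; falling below it means
        # the HWA is prioritized immediately (idea 1), not only when the
        # deadline is imminent.
        return self.done >= self.total_work * now / self.deadline

@dataclass
class CpuApp:
    name: str
    memory_intensive: bool       # relatively insensitive to extra latency (idea 2)

def priority(agent, now):
    """Smaller value = scheduled earlier. Illustrative ordering only."""
    if isinstance(agent, Hwa):
        if agent.short_deadline or not agent.on_track(now):
            return 0             # urgent accelerator traffic
        return 2                 # accelerator comfortably on track
    return 1 if not agent.memory_intensive else 3

# Example: at cycle 500 of a 1000-cycle period, an HWA that has finished only
# 30% of its work jumps ahead of memory-intensive CPU requests.
agents = [Hwa("hwa0", total_work=100, deadline=1000, done=30),
          CpuApp("mcf", memory_intensive=True),
          CpuApp("gcc", memory_intensive=False)]
for a in sorted(agents, key=lambda a: priority(a, now=500)):
    print(a.name, priority(a, now=500))
```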

117 citations


Journal ArticleDOI
TL;DR: The proposed framework is based on the application characterization done dynamically by using independent microarchitecture features and Bayesian networks and demonstrates 4 × and 3 × speedup, respectively, on cBench and Polybench in terms of exploration efficiency given the same quality of the solutions generated by the random iterative compilation model.
Abstract: The variety of today’s architectures forces programmers to spend a great deal of time porting and tuning application codes across different platforms. Compilers themselves need additional tuning, which has considerable complexity as the standard optimization levels, usually designed for the average case and the specific target architecture, often fail to bring the best results. This article proposes COBAYN: Compiler autotuning framework using BAYesian Networks, an approach for a compiler autotuning methodology using machine learning to speed up application performance and to reduce the cost of the compiler optimization phases. The proposed framework is based on the application characterization done dynamically by using independent microarchitecture features and Bayesian networks. The article also presents an evaluation based on using static analysis and hybrid feature collection approaches. In addition, the article compares Bayesian networks with respect to several state-of-the-art machine-learning models. Experiments were carried out on an ARM embedded platform and GCC compiler by considering two benchmark suites with 39 applications. The set of compiler configurations, selected by the model (less than 7% of the search space), demonstrated an application performance speedup of up to 4.6× on Polybench (1.85× on average) and 3.1× on cBench (1.54× on average) with respect to standard optimization levels. Moreover, the comparison of the proposed technique with (i) random iterative compilation, (ii) machine learning-based iterative compilation, and (iii) noniterative predictive modeling techniques shows, on average, 1.2×, 1.37×, and 1.48× speedup, respectively. Finally, the proposed method demonstrates 4× and 3× speedup, respectively, on cBench and Polybench in terms of exploration efficiency given the same quality of the solutions generated by the random iterative compilation model.

72 citations


Journal ArticleDOI
TL;DR: RFVP mitigates the memory wall by enabling the execution to continue without stalling for long-latency memory accesses, and drops a fraction of load requests that miss in the cache after predicting their values, which reduces memory bandwidth contention by removing them from the system.
Abstract: This article aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique, called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP does not check for or recover from load-value mispredictions, hence, avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling the execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops a fraction of load requests that miss in the cache after predicting their values. Dropping requests reduces memory bandwidth contention by removing them from the system. The drop rate is a knob to control the trade-off between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvement and energy reduction for a wide range of quality-loss levels. We also evaluate RFVP’s latency benefits for a single core CPU. The results show performance improvement and energy reduction for a wide variety of applications with less than 1% loss in quality.
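
A toy sketch of the rollback-free policy, with an assumed last-value predictor and a probabilistic drop knob; the paper's GPU implementation is a hardware mechanism, not code like this:

```python
import random

class RollbackFreeValuePredictor:
    """Toy sketch of RFVP-style approximation for safe-to-approximate loads.

    On a cache miss we return a predicted value immediately (no stall) and,
    with probability `drop_rate`, drop the memory request entirely so it never
    consumes off-chip bandwidth. There is deliberately no misprediction check
    or rollback. Structures and policy details here are assumptions.
    """

    def __init__(self, drop_rate=0.5):
        self.drop_rate = drop_rate   # quality vs. bandwidth knob
        self.last_value = {}         # simple last-value predictor per address
        self.dropped = 0
        self.fetched = 0

    def load(self, addr, cache, memory):
        if addr in cache:                        # hit: exact value
            return cache[addr]
        predicted = self.last_value.get(addr, 0)
        if random.random() < self.drop_rate:     # drop: never fetch
            self.dropped += 1
        else:                                    # fetch so future accesses hit
            self.fetched += 1
            cache[addr] = memory[addr]
            self.last_value[addr] = memory[addr]
        return predicted                         # execution continues immediately

# Example: a small "memory" of squares, an empty cache, 30% drop rate.
random.seed(0)
mem = {a: a * a for a in range(16)}
rfvp = RollbackFreeValuePredictor(drop_rate=0.3)
cache = {}
for a in range(64):
    rfvp.load(a % 16, cache, mem)
print("dropped:", rfvp.dropped, "fetched:", rfvp.fetched)
```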

69 citations


Journal ArticleDOI
TL;DR: This article classify, analyze, and compare covert channels through dynamic branch prediction units in modern processors, and estimates the capacity of the branch predictor covert channels and describes a software-only mitigation technique based on randomizing the state of the predictor tables on context switches.
Abstract: Covert channels through shared processor resources provide secret communication between two malicious processes: the trojan and the spy. In this article, we classify, analyze, and compare covert channels through dynamic branch prediction units in modern processors. Through experiments on a real hardware platform, we compare the contention-based channel and the channel that is based on exploiting the branch predictor’s residual state. We analyze these channels in SMT and single-threaded environments under both clean and noisy conditions. Our results show that the residual state-based channel provides a cleaner signal and is effective even in noisy execution environments with another application sharing the same physical core with the trojan and the spy. We also estimate the capacity of the branch predictor covert channels and describe a software-only mitigation technique that is based on randomizing the state of the predictor tables on context switches. We show that this protection eliminates all covert channels through the branch prediction unit with minimal impact on performance.
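
A minimal simulation of the mitigation idea, assuming a simple table of 2-bit counters; the point is only that rewriting the table with random values on a context switch destroys the residual state a spy could observe:

```python
import random

PHT_SIZE = 4096   # assumed number of 2-bit counters in the simulated table

class TwoBitPredictor:
    """Simulated pattern-history table of 2-bit saturating counters."""
    def __init__(self):
        self.table = [1] * PHT_SIZE   # weakly not-taken

    def predict(self, pc):
        return self.table[pc % PHT_SIZE] >= 2

    def update(self, pc, taken):
        i = pc % PHT_SIZE
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

    def randomize(self):
        # Mitigation: on a context switch the OS rewrites the predictor state
        # with random values, destroying any residual pattern a spy could
        # measure through its own mispredictions.
        self.table = [random.randint(0, 3) for _ in range(PHT_SIZE)]

# The trojan trains a distinctive state...
bp = TwoBitPredictor()
for _ in range(100):
    bp.update(0x40, taken=True)
print("spy sees trained state :", bp.predict(0x40))

# ...then a context switch with randomization removes the residual state.
bp.randomize()
print("spy sees random state  :", bp.predict(0x40))
```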

62 citations


Journal ArticleDOI
TL;DR: Experimental results reveal that the use of the clustering-based DSE approach achieves a significant reduction in the total exploration time of the search space while also obtaining considerable performance speedups with the optimized codes.
Abstract: A large number of compiler optimizations are nowadays available to users. These optimizations interact with each other and with the input code in several and complex ways. The sequence of application of optimization passes can have a significant impact on the performance achieved. The effect of the optimizations is both platform and application dependent. The exhaustive exploration of all viable sequences of compiler optimizations for a given code fragment is not feasible. As this exploration is a complex and time-consuming task, several researchers have focused on Design Space Exploration (DSE) strategies both to select optimization sequences to improve the performance of each function of the application and to reduce the exploration time. In this article, we present a DSE scheme based on a clustering approach for grouping functions with similarities and exploration of a reduced search space resulting from the combination of optimizations previously suggested for the functions in each group. The identification of similarities between functions uses a data mining method that is applied to a symbolic code representation. The data mining process combines three algorithms to generate clusters: the Normalized Compression Distance, the Neighbor Joining, and a new ambiguity-based clustering algorithm. Our experiments for evaluating the effectiveness of the proposed approach address the exploration of optimization sequences in the context of the ReflectC compiler, considering 49 compilation passes while targeting a Xilinx MicroBlaze processor, and aiming at performance improvements for 51 functions and four applications. Experimental results reveal that the use of our clustering-based DSE approach achieves a significant reduction in the total exploration time of the search space (20× over a Genetic Algorithm approach) while also obtaining considerable performance speedups (41% over the baseline) with the optimized codes. Additional experiments were performed with the LLVM compiler, considering 124 compilation passes and targeting a LEON3 processor. The results show that our approach achieved geometric mean speedups of 1.49×, 1.32×, and 1.24× for the best 10, 20, and 30 functions, respectively, and a global improvement of 7% over the performance obtained when compiling with -O2.
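
The Normalized Compression Distance used in the clustering step has a standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); a small sketch using zlib as a stand-in compressor and toy symbolic function representations:

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size of `data` (zlib stands in for the actual compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for very similar inputs,
    near 1 for unrelated inputs. Applied here to symbolic code strings."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy symbolic representations of three functions: two similar loop nests and
# one unrelated routine. The resulting distances feed the clustering step
# (Neighbor Joining plus the ambiguity-based algorithm in the article).
f1 = b"loop(load add store) loop(load add store) call(memcpy)"
f2 = b"loop(load add store) loop(load mul store) call(memcpy)"
f3 = b"switch(case case case) call(printf) return"

print("ncd(f1, f2) =", round(ncd(f1, f2), 3))   # small: same cluster
print("ncd(f1, f3) =", round(ncd(f1, f3), 3))   # larger: different cluster
```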

61 citations


Journal ArticleDOI
TL;DR: Nornir is proposed, an algorithm to automatically derive, without relying on historical data about previous executions, performance and power consumption models of an application in different configurations, and to use these models to select a close-to-optimal configuration for the given user requirement, either performance or power consumption.
Abstract: In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet maintain an acceptable level, in order to reduce their power consumption. To provide such guarantees, a possible solution consists in changing the number of cores assigned to the application, their clock frequency, and the placement of application threads over the cores. However, power consumption and performance have different trends depending on the application considered and on its input. Finding a configuration of resources satisfying user requirements is, in the general case, a challenging task. In this article, we propose Nornir, an algorithm to automatically derive, without relying on historical data about previous executions, performance and power consumption models of an application in different configurations. By using these models, we are able to select a close-to-optimal configuration for the given user requirement, either performance or power consumption. The configuration of the application will be changed on-the-fly throughout the execution to adapt to workload fluctuations, external interferences, and/or application’s phase changes. We validate the algorithm by simulating it over the applications of the Parsec benchmark suite. Then, we implement our algorithm and we analyse its accuracy and overhead over some of these applications in a real execution environment. Finally, we compare the quality of our proposal with that of the optimal algorithm and of some state-of-the-art solutions.

52 citations


Journal ArticleDOI
TL;DR: This work presents a “sample-locally-analyze-remotely” technique to reduce the overhead in the monitored system which has limited storage and computing resources, and demonstrates an 80% I/O bandwidth reduction after applying Compressive Sensing.
Abstract: Hardware Performance Counter-based (HPC) runtime checking is an effective way to identify malicious behaviors of malware and detect malicious modifications to a legitimate program’s control flow. To reduce the overhead in the monitored system, which has limited storage and computing resources, we present a “sample-locally-analyze-remotely” technique. The sampled HPC data are sent to a remote server for further analysis. To minimize the I/O bandwidth required for transmission, the fine-grained HPC profiles are compressed into much smaller vectors with Compressive Sensing. The experimental results demonstrate an 80% I/O bandwidth reduction after applying Compressive Sensing, without compromising the detection and identification capabilities.
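
A sketch of the "sample locally, analyze remotely" data path: the fine-grained profile is projected onto a much smaller random measurement vector before transmission. The dimensions and the Gaussian measurement matrix below are assumptions, not the paper's exact sensing setup:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1024   # fine-grained HPC profile length (assumed)
M = 128    # measurements actually transmitted (~8x less I/O)

# A sparse-ish profile: most sampling windows contribute little.
profile = np.zeros(N)
profile[rng.choice(N, size=20, replace=False)] = rng.integers(1, 1000, size=20)

# Random Gaussian measurement matrix, known to both the monitored node
# and the remote analysis server.
phi = rng.normal(size=(M, N)) / np.sqrt(M)

# The monitored system sends only y (M values) instead of the full profile.
y = phi @ profile
print("raw profile values:", N, "-> transmitted measurements:", M)

# The remote server then recovers or classifies from y, e.g., via sparse
# recovery or a model trained on compressed measurements (not shown here).
```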

50 citations


Journal ArticleDOI
TL;DR: A technique for amortizing the fetch cost by merging fetching of values for different snapshots of the same vertex and a technique for feeding values computed by earlier snapshots into later snapshots to address inefficiencies are proposed.
Abstract: Evolving graph processing involves repeating analyses, which are often iterative, over multiple snapshots of the graph corresponding to different points in time. Since the snapshots of an evolving graph share a great number of vertices and edges, traditional approaches that process these snapshots one at a time without exploiting this overlap contain much wasted effort on both data loading and computation, making them extremely inefficient. In this article, we identify major sources of inefficiencies and present two optimization techniques to address them. First, we propose a technique for amortizing the fetch cost by merging fetching of values for different snapshots of the same vertex. Second, we propose a technique for amortizing the processing cost by feeding values computed by earlier snapshots into later snapshots. We have implemented these optimizations in two distributed graph processing systems, namely, GraphLab and ASPIRE. Our experiments with multiple real evolving graphs and algorithms show that, on average, fetch amortization speeds up execution of GraphLab and ASPIRE by 5.2× and 4.1×, respectively. Amortizing the processing cost yields additional average speedups of 2× and 7.9×, respectively.

40 citations


Journal ArticleDOI
TL;DR: This article proposes a lightweight runtime approach that can exploit the properties of the power profile specific to a processor, outperforming classical Linux governors such as powersave or on-demand for computational kernels and demonstrates that it systematically outperforms the powersave Linux governor while also improving overall performance.
Abstract: Dynamic Voltage and Frequency Scaling (DVFS) typically adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical DVFS approaches include using default strategies such as running at the lowest or the highest frequency or reacting to the CPU’s runtime load to reduce or increase frequency based on the CPU usage. In this article, we argue that a compile-time approach to CPU frequency selection is achievable for affine program regions and can significantly outperform runtime-based approaches. We first propose a lightweight runtime approach that can exploit the properties of the power profile specific to a processor, outperforming classical Linux governors such as powersave or on-demand for computational kernels. We then demonstrate that, for affine kernels in the application, a purely compile-time approach to CPU frequency and core count selection is achievable, providing significant additional benefits over the runtime approach. Our framework relies on a one-time profiling of the target CPU, along with a compile-time categorization of loop-based code segments in the application. These are combined to determine at compile-time the frequency and the number of cores to use to execute each affine region to optimize energy or energy-delay product. Extensive evaluation on 60 benchmarks and 5 multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor while also improving overall performance.
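
A minimal sketch of the per-region decision such a compile-time approach makes: given a one-time profile of the target CPU (all numbers below are hypothetical), pick the frequency and core count that minimize energy or energy-delay product for the region's category:

```python
# One-time profiling of the target CPU: for each (frequency GHz, core count)
# configuration, relative execution time and average power for two kernel
# categories. All numbers are made up for illustration.
PROFILE = {
    (1.2, 2): {"memory-bound": (2.0, 12.0), "compute-bound": (3.5, 11.0)},
    (1.2, 4): {"memory-bound": (1.9, 18.0), "compute-bound": (1.9, 17.0)},
    (2.4, 2): {"memory-bound": (1.8, 28.0), "compute-bound": (1.8, 26.0)},
    (2.4, 4): {"memory-bound": (1.7, 45.0), "compute-bound": (1.0, 42.0)},
}

def choose_config(category, objective="edp"):
    """Pick the (freq, cores) pair minimizing energy or energy-delay product."""
    def cost(cfg):
        t, p = PROFILE[cfg][category]
        energy = p * t
        return energy if objective == "energy" else energy * t
    return min(PROFILE, key=cost)

# The compiler categorizes each affine region (e.g., memory- vs compute-bound)
# and emits the chosen DVFS/core-count setting ahead of the region.
for cat in ("memory-bound", "compute-bound"):
    print(cat, "->", choose_config(cat, "edp"))
```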

38 citations


Journal ArticleDOI
TL;DR: This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory that adopts the heterogeneous-race-free (HRF) memory model, which simplifies the coherency protocol and the graphics processing unit (GPU) memory management unit (MMU).
Abstract: This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Research proposals implement coherence protocols for sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. The use of HRF simplifies the coherency protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes CPU and GPU demands separately, with the GPU part being simpler while the CPU is more elaborate and latency aware. We achieve an average 45% speedup and 45% energy-delay product reduction (20% energy) over the corresponding SC implementation.

Journal ArticleDOI
TL;DR: It is found that PiM architectures and, specifically, GP-SIMD benefit more than other many-core architectures from using resistive memory.
Abstract: GP-SIMD, a novel hybrid general-purpose SIMD architecture, addresses the challenge of data synchronization by in-memory computing, through combining data storage and massive parallel processing. In this article, we explore a resistive implementation of the GP-SIMD architecture. In resistive GP-SIMD, a novel resistive row and column addressable 4F² crossbar is utilized, replacing the modified CMOS 190F² SRAM storage previously proposed for the GP-SIMD architecture. The use of the resistive crossbar allows scaling the GP-SIMD from a few million to a few hundred million processing units on a single silicon die. The performance, power consumption, and power efficiency of a resistive GP-SIMD are compared with the CMOS version. We find that PiM architectures and, specifically, GP-SIMD benefit more than other many-core architectures from using resistive memory. A framework for in-place arithmetic operation on a single multivalued resistive cell is explored, demonstrating a potential to become a building block for next-generation PiM architectures.

Journal ArticleDOI
TL;DR: A simple yet effective SLP compiler technique called Paver (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops.
Abstract: Existing vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP). We show that effectively vectorizing such loops requires partial vector operations to be executed correctly and efficiently, where the degree of partial SIMD parallelism is smaller than the SIMD datapath width. We present a simple yet effective SLP compiler technique called Paver (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening vector instructions used while minimizing the overheads caused by memory access, packing/unpacking, and/or masking operations, without introducing new memory errors or new numeric exceptions. For a set of 9 C/C++/Fortran applications with partial SIMD parallelism, Paver achieves significantly better kernel and whole-program speedups than LLVM on both Intel’s AVX and ARM’s NEON.

Journal ArticleDOI
TL;DR: This work presents a new power estimation method that allows understanding the power breakdown of an application when running on modern processor architecture such as the newly released Intel Skylake processor.
Abstract: A detailed analysis of power consumption at low system levels becomes important as a means for reducing the overall power consumption of a system and its thermal hot spots. This work presents a new power estimation method that allows understanding the power breakdown of an application when running on a modern processor architecture such as the newly released Intel Skylake processor. This work also provides a detailed power and performance characterization report for the SPEC CPU2006 benchmarks, analysis of the data using side-by-side power and performance breakdowns, as well as a few interesting case studies.

Journal ArticleDOI
TL;DR: The general design of MAMBO is described, a new DBM tool for ARM is proposed, and novel optimisations to handle indirect branches are introduced, and scenarios in which it may be possible to relax the transparency offered by DBM tools to allow extra optimisation to be applied are explored.
Abstract: As the ARM architecture expands beyond its traditional embedded domain, there is a growing interest in dynamic binary modification (DBM) tools for general-purpose multicore processors that are part of the ARM family. Existing DBM tools for ARM suffer from introducing large overheads in the execution of applications. The specific questions that this article addresses are (i) how to develop such DBM tools for the ARM architecture and (ii) whether new optimisations are plausible and needed. We describe the general design of MAMBO, a new DBM tool for ARM, which we release together with this publication, and introduce novel optimisations to handle indirect branches. In addition, we explore scenarios in which it may be possible to relax the transparency offered by DBM tools to allow extra optimisations to be applied. These scenarios arise from analysing the most typical usages: for example, application binaries without handcrafted assembly. The performance evaluation shows that MAMBO introduces small overheads for SPEC CPU2006 and PARSEC 3.0 when comparing with the execution times of the unmodified programs: a geometric mean overhead of 28% on a Cortex-A9 and of 34% on a Cortex-A15 for CPU2006, and between 27% and 32%, depending on the number of threads, for PARSEC on a Cortex-A15.

Journal ArticleDOI
TL;DR: Yet Another Compressed Cache (YACC) is proposed, a new compressed cache design that targets improving effective cache capacity with a simple design; it performs comparably to DCC and SCC but is much simpler to implement.
Abstract: Cache memories play a critical role in bridging the latency, bandwidth, and energy gaps between cores and off-chip memory. However, caches frequently consume a significant fraction of a multicore chip's area and thus account for a significant fraction of its cost. Compression has the potential to improve the effective capacity of a cache, providing the performance and energy benefits of a larger cache while using less area. The design of a compressed cache must address two important issues: (i) a low-latency, low-overhead compression algorithm that can represent a fixed-size cache block using fewer bits and (ii) a cache organization that can efficiently store the resulting variable-size compressed blocks. This article focuses on the latter issue. Here, we propose Yet Another Compressed Cache (YACC), a new compressed cache design that targets improving effective cache capacity with a simple design. YACC uses super-blocks to reduce tag overheads while packing variable-size compressed blocks to reduce internal fragmentation. YACC achieves the benefits of two state-of-the-art compressed caches—Decoupled Compressed Cache (DCC) [Sardashti and Wood 2013a, 2013b] and Skewed Compressed Cache (SCC) [Sardashti et al. 2014]—with a more practical and simpler design. YACC's cache layout is similar to conventional caches, with a largely unmodified tag array and unmodified data array. Compared to DCC and SCC, YACC requires neither the significant extra metadata (i.e., back pointers) needed by DCC to track blocks nor the complexity and overhead of skewed associativity (i.e., indexing ways differently) needed by SCC. An additional advantage over previous work is that YACC enables modern replacement mechanisms, such as RRIP. For our benchmark set, compared to a conventional uncompressed 8MB LLC, YACC improves performance by 8% on average and up to 26%, and reduces total energy by 6% on average and up to 20%. An 8MB YACC achieves approximately the same performance and energy improvements as a 16MB conventional cache at a much smaller silicon footprint, with only 1.6% greater area than an 8MB conventional cache. YACC performs comparably to DCC and SCC but is much simpler to implement.

Journal ArticleDOI
TL;DR: Using UMH with NMOESI improves performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches and allows the CPU to access GPUs' modified data at least 13× faster.
Abstract: In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphic Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses the CPU memory only if it does not find its required data in the directories associated with its high-bandwidth memory, or the NMOESI coherency protocol limits the access to that data. Using UMH with NMOESI improves performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches. It also allows the CPU to access GPUs' modified data at least 13× faster.

Journal ArticleDOI
TL;DR: This article identifies significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s, and shows how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2.
Abstract: The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies get exposed due to a lack of active compute threads to mask such high latencies. In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.
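
A toy model of the lookup order the CCN introduces: on a local L1 miss, the request walks the ring of peer L1s before consuming shared L2 bandwidth. The latencies and probe protocol are illustrative, not the paper's hardware:

```python
class CooperativeL1Ring:
    """Toy model of cooperative caching across private GPU L1s on a ring."""

    def __init__(self, l1_contents, hop_latency=4, l2_latency=150):
        self.l1s = l1_contents          # list of per-core sets of cached lines
        self.hop_latency = hop_latency  # assumed per-hop ring latency (cycles)
        self.l2_latency = l2_latency    # assumed congested L2 latency (cycles)
        self.l2_accesses = 0

    def load(self, core, line):
        if line in self.l1s[core]:                  # local hit
            return 1
        n = len(self.l1s)
        for hops in range(1, n):                    # walk the ring of peers
            peer = (core + hops) % n
            if line in self.l1s[peer]:              # remote L1 hit: no L2 traffic
                self.l1s[core].add(line)
                return hops * self.hop_latency
        self.l2_accesses += 1                       # nobody had it: go to L2
        self.l1s[core].add(line)
        return self.l2_latency

# Four cores; line 0xA (replicated, reusable data) is cached by core 2 only.
ccn = CooperativeL1Ring([{0x1}, {0x2}, {0xA}, set()])
print("core 0 loads 0xA:", ccn.load(0, 0xA), "cycles")   # served by a peer L1
print("core 3 loads 0xF:", ccn.load(3, 0xF), "cycles")   # falls through to L2
print("L2 accesses:", ccn.l2_accesses)
```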

Journal ArticleDOI
TL;DR: This article presents three array-based applications from the financial domain that are suitable for gpgpu execution and provides useful insight into the language constructs and compiler infrastructure capable of expressing and optimizing such applications.
Abstract: Commodity many-core hardware is now mainstream, but parallel programming models are still lagging behind in efficiently utilizing the application parallelism. There are (at least) two principal reasons for this. First, real-world programs often take the form of a deeply nested composition of parallel operators, but mapping the available parallelism to the hardware requires a set of transformations that are tedious to do by hand and beyond the capability of the common user. Second, the best optimization strategy, such as what to parallelize and what to efficiently sequentialize, is often sensitive to the input dataset and therefore requires multiple code versions that are optimized differently, which also raises maintainability problems. This article presents three array-based applications from the financial domain that are suitable for gpgpu execution. Common benchmark-design practice has been to provide the same code for the sequential and parallel versions that are optimized for only one class of datasets. In comparison, we document (1) all available parallelism via nested map-reduce functional combinators, in a simple Haskell implementation that closely resembles the original code structure, (2) the invariants and code transformations that govern the main trade-offs of a data-sensitive optimization space, and (3) report target cpu and multiversion gpgpu code together with an evaluation that demonstrates optimization trade-offs and other difficulties. We believe that this work provides useful insight into the language constructs and compiler infrastructure capable of expressing and optimizing such applications, and we report in-progress work in this direction.

Journal ArticleDOI
TL;DR: This article proposes intense pages mapping, a mechanism that analyzes the memory access behavior using information about the time the entry of each page resides in the translation lookaside buffer, which provides accurate information with a very low overhead.
Abstract: The performance and energy efficiency of modern architectures depend on memory locality, which can be improved by thread and data mappings considering the memory access behavior of parallel applications. In this article, we propose intense pages mapping, a mechanism that analyzes the memory access behavior using information about the time the entry of each page resides in the translation lookaside buffer. It provides accurate information with a very low overhead. We present experimental results with simulation and real machines, with average performance improvements of 13.7% and energy savings of 4.4%, which come from reductions in cache misses and interconnection traffic.

Journal ArticleDOI
TL;DR: This paper applies autotuning on runtime specialization of Sparse Matrix-Vector Multiplication to predict a best specialization method among several, and achieves average speedups that are very close to the speedups achievable when only the best methods are used.
Abstract: Runtime specialization is used for optimizing programs based on partial information available only at runtime. In this paper we apply autotuning on runtime specialization of Sparse Matrix-Vector Multiplication to predict a best specialization method among several. In 91% to 96% of the predictions, either the best or the second-best method is chosen. Predictions achieve average speedups that are very close to the speedups achievable when only the best methods are used. By using an efficient code generator and a carefully designed set of matrix features, we show the runtime costs can be amortized to bring performance benefits for many real-world cases.

Journal ArticleDOI
TL;DR: An explicit formula is obtained giving OPT hits and misses as a function of past references; several facts follow from it, including a proof that OPT miss curves are always convex, and a new algorithm, called OPT tokens, is presented for reasoning about optimal replacement.
Abstract: This article exposes and proves some mathematical facts about optimal cache replacement that were previously unknown or not proved rigorously. An explicit formula is obtained, giving OPT hits and misses as a function of past references. Several mathematical facts are derived from this formula, including a proof that OPT miss curves are always convex, and a new algorithm called OPT tokens, for reasoning about optimal replacement.
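
For reference, OPT (Belady's MIN) can be simulated by brute force by evicting the resident block whose next use is farthest in the future; the article's contribution is a closed-form characterization and the OPT-tokens algorithm, not this simulation:

```python
def opt_misses(trace, capacity):
    """Count misses of Belady's optimal replacement (OPT) on `trace`.

    Brute-force simulation: on a miss with a full cache, evict the resident
    block whose next reference is farthest in the future (or never occurs).
    """
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(b):
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            cache.discard(max(cache, key=next_use))
        cache.add(block)
    return misses

# OPT miss counts are non-increasing in cache size; the article additionally
# proves that the resulting miss curve is always convex.
trace = list("abcabdacbdeabdce")
print([opt_misses(trace, c) for c in range(1, 6)])
```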

Journal ArticleDOI
TL;DR: This article first introduces a low-cost WSS tracking system with effective optimizations on data structures, as well as an efficient mechanism to decrease the frequency of monitoring, and proposes a Memory Balancer (MEB), which dynamically reallocates guest memory based on the predicted WSS.
Abstract: Allocating memory dynamically for virtual machines (VMs) according to their demands provides significant benefits as well as great challenges. Efficient memory resource management requires knowledge of the memory demands of applications or systems at runtime. A widely proposed approach is to construct a miss ratio curve (MRC) for a VM, which not only summarizes the current working set size (WSS) of the VM but also models the relationship between its performance and the target memory allocation size. Unfortunately, the cost of monitoring and maintaining the MRC structures is nontrivial. This article first introduces a low-cost WSS tracking system with effective optimizations on data structures, as well as an efficient mechanism to decrease the frequency of monitoring. We also propose a Memory Balancer (MEB), which dynamically reallocates guest memory based on the predicted WSS. Our experimental results show that our prediction schemes yield a high accuracy of 95.2% and low overhead of 2%. Furthermore, the overall system throughput can be significantly improved with MEB, which brings a speedup of up to 7.4 for two to four VMs and 4.54 for an overcommitted system with 16 VMs.
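
A sketch of the kind of balancing policy a WSS-driven memory balancer performs; the allocation rule below is a simplified stand-in, not MEB's actual algorithm:

```python
def balance_memory(predicted_wss, total_mem, floor=256):
    """Split `total_mem` (MB) across VMs from their predicted WSS (MB).

    Each VM gets at least `floor` MB and at least its predicted working set
    when the host has enough memory; any surplus is shared in proportion to
    WSS. If the host is overcommitted, allocations are scaled down uniformly
    (but never below the floor). A simplified policy for illustration only.
    """
    want = {vm: max(floor, wss) for vm, wss in predicted_wss.items()}
    need = sum(want.values())
    if need <= total_mem:                      # surplus: share it by WSS
        spare = total_mem - need
        total_wss = sum(predicted_wss.values()) or 1
        return {vm: want[vm] + spare * predicted_wss[vm] / total_wss
                for vm in want}
    scale = total_mem / need                   # overcommitted: shrink uniformly
    return {vm: max(floor, want[vm] * scale) for vm in want}

# Example: three VMs, 8GB of guest-allocatable memory, WSS from the tracker.
wss = {"vm1": 3000, "vm2": 1200, "vm3": 500}   # MB
print(balance_memory(wss, total_mem=8192))
```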

Journal ArticleDOI
TL;DR: This article analyzes experimental results obtained executing HOG on embedded GPUs from two different vendors, exposed for about 100 hours to a controlled neutron beam at Los Alamos National Laboratory, and performs a fault-injection campaign to identify HOG critical procedures.
Abstract: Pedestrian detection reliability is a key problem for autonomous or aided driving, and methods that use Histogram of Oriented Gradients (HOG) are very popular. Embedded Graphics Processing Units (GPUs) are exploited to run HOG in a very efficient manner. Unfortunately, the GPU architecture has been shown to be particularly vulnerable to radiation-induced failures. This article presents an experimental evaluation and analytical study of HOG reliability. We aim at quantifying and qualifying the radiation-induced errors on pedestrian detection applications executed in embedded GPUs. We analyze experimental results obtained executing HOG on embedded GPUs from two different vendors, exposed for about 100 hours to a controlled neutron beam at Los Alamos National Laboratory. We consider the number and position of detected objects as well as precision and recall to discriminate critical erroneous computations. The reported analysis shows that, while being intrinsically resilient (65% to 85% of output errors only slightly impact detection), HOG experienced some particularly critical errors that could result in undetected pedestrians or unnecessary vehicle stops. Additionally, we perform a fault-injection campaign to identify HOG critical procedures. We observe that Resize and Normalize are the most sensitive and critical phases, as about 20% of injections generate an output error that significantly impacts HOG detection. With our insights, we are able to find those limited portions of HOG that, if hardened, are more likely to increase reliability without introducing unnecessary overhead.

Journal ArticleDOI
TL;DR: NoCAlert is proposed, a comprehensive online and real-time fault detection and localization mechanism that demonstrates 0% false negatives within the interconnect for the fault models and stimulus set used in this study.
Abstract: Networks-on-Chip (NoC) are becoming increasingly susceptible to emerging reliability threats. The need to detect and localize the occurrence of faults at runtime is steadily becoming imperative. In this work, we propose NoCAlert, a comprehensive online and real-time fault detection and localization mechanism that demonstrates 0% false negatives within the interconnect for the fault models and stimulus set used in this study. Based on the concept of invariance checking, NoCAlert employs a group of lightweight microchecker modules that collectively implement real-time hardware assertions. The checkers operate concurrently with normal NoC operation, thus eliminating the need for periodic, or trigger-based, self-testing. Based on the pattern/signature of asserted checkers, NoCAlert can pinpoint the location of the fault at various granularity levels. Most important, 97% of the transient and 90% of the permanent faults are detected instantaneously, within a single clock cycle upon fault manifestation. The fault localization accuracy ranges from 90% to 100%, depending on the desired localization granularity. Extensive cycle-accurate simulations in a 64-node CMP and analysis at the RTL netlist-level demonstrate the efficacy of the proposed technique.

Journal ArticleDOI
TL;DR: MAMBO-X64, a dynamic binary translator that translates 32-bit ARM (AArch32) code to 64-bit ARM (AArch64) code, uses three novel techniques to improve the performance of indirect branch translation, which allows the performance to be on par with thread-private hash tables while having superior memory scalability.
Abstract: Dynamic binary translation is a technology for transparently translating and modifying a program at the machine code level as it is running. A significant factor in the performance of a dynamic binary translator is its handling of indirect branches. Unlike direct branches, which have a known target at translation time, an indirect branch requires translating a source program counter address to a translated program counter address every time the branch is executed. This translation can impose a serious runtime penalty if it is not handled efficiently. MAMBO-X64, a dynamic binary translator that translates 32-bit ARM (AArch32) code to 64-bit ARM (AArch64) code, uses three novel techniques to improve the performance of indirect branch translation. Together, these techniques allow MAMBO-X64 to achieve a very low performance overhead of only 10% on average compared to native execution of 32-bit programs. Hardware-assisted function returns use a software return address stack to predict the targets of function returns, making use of several novel optimizations while also exploiting hardware return address prediction. This technique has a significant impact on most benchmarks, reducing binary translation overhead compared to native execution by 40% on average and by 90% on some benchmarks. Branch table inference, an algorithm for detecting and translating branch tables, can reduce the overhead of translated code by up to 40% on some SPEC CPU2006 benchmarks. The remaining indirect branches are handled using a fast atomic hash table, which is optimized to work with multiple threads. This last technique translates indirect branches using a single shared hash table while avoiding expensive synchronization in performance-critical lookup code. This allows the performance to be on par with thread-private hash tables while having superior memory scalability.
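
A sketch of the core operation behind the third technique: mapping a source PC to a translated PC through a single shared, open-addressed hash table, falling back to the translator on a miss. The table size, hash function, and missing synchronization are simplifications of the real lock-free machine-code implementation:

```python
TABLE_BITS = 16
TABLE_SIZE = 1 << TABLE_BITS

# Shared SPC -> TPC table used by all threads; each slot holds a
# (source_pc, translated_pc) pair. Open addressing with linear probing.
table = [None] * TABLE_SIZE

def slot(spc):
    # Cheap mixing hash; a real translator picks a hash that compiles to a
    # handful of ALU instructions on the hot lookup path.
    return ((spc >> 2) * 2654435761) & (TABLE_SIZE - 1)

def lookup(spc, translate):
    """Return the translated PC for `spc`, translating on first use."""
    i = slot(spc)
    while True:
        entry = table[i]
        if entry is not None and entry[0] == spc:
            return entry[1]                   # fast path: hit
        if entry is None:                     # miss: call into the translator
            tpc = translate(spc)              # (the multithreaded case would
            table[i] = (spc, tpc)             #  also need atomic insertion)
            return tpc
        i = (i + 1) & (TABLE_SIZE - 1)        # probe the next slot

# Example with a fake translator that just offsets addresses.
fake_translate = lambda spc: 0x7f0000000000 + spc
print(hex(lookup(0x10324, fake_translate)))
print(hex(lookup(0x10324, fake_translate)))   # second lookup hits the table
```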

Journal ArticleDOI
TL;DR: Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank, enables a high-performance and low-power memory system and also efficiently protects the stacked memory system from large-granularity failures.
Abstract: Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (e.g., due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data-striping reduces memory-level parallelism, causing significant slowdown and higher power consumption. This article proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank. By retaining cache lines within banks, Citadel enables a high-performance and low-power memory system and also efficiently protects the stacked memory system from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data-TSVs and faulty address-TSVs; Tri-Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions either at a row granularity or at a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, and yet provides 700× higher reliability than ChipKill-like ECC codes.
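
The parity idea behind 3DP can be illustrated with XOR reductions over a small three-dimensional memory model; the geometry and word size below are assumptions, and Citadel's actual scheme and its ECC integration are more involved:

```python
import numpy as np

rng = np.random.default_rng(1)

BANKS, ROWS, COLS = 4, 8, 8          # tiny stacked-memory model (assumed)
data = rng.integers(0, 256, size=(BANKS, ROWS, COLS), dtype=np.uint8)

# Parity along each dimension (XOR reductions). Losing an entire bank can be
# repaired from the bank-axis parity; the row- and column-axis parities help
# localize and repair smaller faults such as a failed row or column.
parity_bank = np.bitwise_xor.reduce(data, axis=0)   # shape: ROWS x COLS
parity_row  = np.bitwise_xor.reduce(data, axis=1)   # shape: BANKS x COLS
parity_col  = np.bitwise_xor.reduce(data, axis=2)   # shape: BANKS x ROWS

# Simulate a large-granularity failure: bank 2 is lost (e.g., faulty TSVs).
faulty = data.copy()
faulty[2] = 0

# Rebuild bank 2 by XOR-ing the surviving banks with the bank-axis parity.
rebuilt = parity_bank.copy()
for b in range(BANKS):
    if b != 2:
        rebuilt ^= faulty[b]

print("bank 2 recovered exactly:", bool(np.array_equal(rebuilt, data[2])))
```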

Journal ArticleDOI
TL;DR: A highly efficient matrix-based segmented sum/scan algorithm for SpMV, which eliminates global synchronization, is proposed, and an autotuning framework to choose optimization parameters is introduced.
Abstract: Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth remain the critical performance bottlenecks. We present our novel solutions to these problems, for both GPUs and Intel MIC many-core architectures. First, we devise a new SpMV format, called Blocked Compressed Common Coordinate (BCCOO). BCCOO extends the blocked Common Coordinate (COO) by using bit flags to store the row indices to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices for better data locality. Then, to address the load imbalance problem, we propose a highly efficient matrix-based segmented sum/scan algorithm for SpMV, which eliminates global synchronization. Finally, we introduce an autotuning framework to choose optimization parameters. Experimental results show that our proposed framework has a significant advantage over the existing SpMV libraries. In single precision, our proposed scheme outperforms the clSpMV COCKTAIL format by 255% on average on AMD FirePro W8000, and outperforms CUSPARSE V7.0 by 73.7% on average and outperforms CSR5 by 53.6% on average on GeForce Titan X; in double precision, our proposed scheme outperforms CUSPARSE V7.0 by 34.0% on average and outperforms CSR5 by 16.2% on average on Tesla K20, and has equivalent performance compared with CSR5 on Intel MIC.
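
A small sketch of the bit-flag idea behind BCCOO: instead of a full row index per nonzero (as in COO), store one bit marking the start of each row and recover rows with a segmented sum. Blocking, vertical slicing, and the GPU segmented-scan kernel are omitted, and the sketch assumes no empty rows:

```python
import numpy as np

# A tiny sparse matrix in COO order (row-major sorted).
rows = np.array([0, 0, 1, 2, 2, 2, 3])
cols = np.array([0, 2, 1, 0, 1, 3, 3])
vals = np.array([4.0, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
x    = np.array([1.0, 2.0, 3.0, 4.0])

# BCCOO-style row encoding: one bit per nonzero marking the first element of
# each row, instead of a full row index per element (a bandwidth saving).
new_row_flag = np.empty(len(rows), dtype=bool)
new_row_flag[0] = True
new_row_flag[1:] = rows[1:] != rows[:-1]

def spmv_bitflag(flags, cols, vals, x, n_rows):
    """Segmented-sum SpMV driven only by the new-row bit flags.
    Assumes every row has at least one nonzero (empty rows need extra care)."""
    y = np.zeros(n_rows)
    row, acc = -1, 0.0
    for flag, c, v in zip(flags, cols, vals):
        if flag:                      # a set bit closes the previous segment
            if row >= 0:
                y[row] = acc
            row, acc = row + 1, 0.0
        acc += v * x[c]
    if row >= 0:
        y[row] = acc
    return y

print(spmv_bitflag(new_row_flag, cols, vals, x, n_rows=4))
print("reference:", np.array([4*1 + 1*3, 2*2, 3*1 + 5*2 + 6*4, 7*4]))
```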

Journal ArticleDOI
TL;DR: This article proposes an optimized implementation of the classical Combine-Brightness-Gradient model on the Xilinx ZYNQ FPGA-SoC, by taking advantage of the inherent algorithmic parallelism and ZynQ architecture.
Abstract: High-quality optical flow computation algorithms are computationally intensive. The low computational speed of such algorithms causes difficulties for real-world applications. In this article, we propose an optimized implementation of the classical Combine-Brightness-Gradient (CBG) model on the Xilinx ZYNQ FPGA-SoC, by taking advantage of the inherent algorithmic parallelism and the ZYNQ architecture. The execution time decreases to 0.82 seconds with a lower power consumption (1.881W), better than a software implementation on a PC (Intel i7-3520M, 2.9GHz), which takes 2.635 seconds and consumes 35W. We use C rather than HDLs to describe the algorithm for rapid prototyping.

Journal ArticleDOI
TL;DR: A compile-time optimization method aimed at reducing WCET in real-time embedded systems is proposed, based on the predicated execution capability of embedded processors, where program code blocks that are in the worst-case paths of the program are merged to increase instruction-level parallelism and opportunity for WCET reduction.
Abstract: Compile-time optimizations play an important role in the efficient design of real-time embedded systems. Usually, compile-time optimizations are designed to reduce average-case execution time (ACET). While ACET is a main concern in high-performance computing systems, in real-time embedded systems, concerns are different and worst-case execution time (WCET) is much more important than ACET. Therefore, WCET reduction is more desirable than ACET reduction in many real-time embedded systems. In this article, we propose a compile-time optimization method aimed at reducing WCET in real-time embedded systems. In the proposed method, based on the predicated execution capability of embedded processors, program code blocks that are in the worst-case paths of the program are merged to increase instruction-level parallelism and opportunity for WCET reduction. The use of predicated execution enables merging code blocks from different worst-case paths, which can be very effective in WCET reduction. The experimental results show that the proposed method can reduce WCET by up to 45% as compared to previous compile-time block formation methods. It is noteworthy that compared to previous works, while the proposed method usually achieves more WCET reduction, it has considerably less negative impact on ACET and code size.