
Showing papers on "Performance per watt" published in 2017


Journal ArticleDOI
TL;DR: This paper proposes a novel methodology that can find the Pareto-optimal configurations at runtime as a function of the workload, and uses an extensive offline characterization to find classifiers that map performance counters to optimal configurations.
Abstract: Modern multiprocessor systems-on-chip (MPSoCs) offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. As the workload phases change at runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is infeasible due to the large number of workloads and configurations. This paper proposes a novel methodology that can find the Pareto-optimal configurations at runtime as a function of the workload. To achieve this, we perform an extensive offline characterization to find classifiers that map performance counters to optimal configurations. Then, we use these classifiers and performance counters at runtime to choose Pareto-optimal configurations. We evaluate the proposed methodology by maximizing the performance per watt for 18 single- and multi-threaded applications. Our experiments demonstrate an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively.

51 citations
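The two-phase approach described above (offline characterization to find Pareto-optimal configurations, then a runtime classifier over performance counters) can be sketched as follows. The counter names, signatures, and configurations are hypothetical illustrations, not the paper's actual feature set or classifier:

```python
# Hedged sketch: offline, keep only Pareto-optimal (perf, power) configs per
# workload signature; online, match live performance counters to the nearest
# characterized signature. All names and numbers are illustrative.

def pareto_front(configs):
    """Keep configurations not dominated in (higher perf, lower power)."""
    front = []
    for c in configs:
        dominated = any(
            o["perf"] >= c["perf"] and o["power"] <= c["power"]
            and (o["perf"] > c["perf"] or o["power"] < c["power"])
            for o in configs)
        if not dominated:
            front.append(c)
    return front

def choose_config(counters, table):
    """Nearest-neighbor 'classifier': map live counters (e.g. IPC,
    memory-access rate) to the closest offline signature's config."""
    def dist(sig):
        return sum((counters[k] - sig[k]) ** 2 for k in counters)
    return min(table, key=lambda e: dist(e["signature"]))["config"]

offline = [
    {"signature": {"ipc": 2.0, "mem": 0.1},
     "config": {"freq_mhz": 2000, "cores": 4}},   # compute-bound phase
    {"signature": {"ipc": 0.5, "mem": 0.9},
     "config": {"freq_mhz": 1000, "cores": 2}},   # memory-bound phase
]
print(choose_config({"ipc": 1.8, "mem": 0.2}, offline))
```

A real deployment would sample hardware counters each scheduling epoch and apply the chosen (voltage, frequency, cores) configuration through the platform's governor interface.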


Proceedings ArticleDOI
14 Oct 2017
TL;DR: This work presents the Data Processing Unit or DPU, a shared memory many-core that is specifically designed for high bandwidth analytics workloads and provides acceleration for core to core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine.
Abstract: For many years, the highest energy cost in processing has been data movement rather than computation, and energy is the limiting factor in processor design [21]. As the data needed for a single application grows to exabytes [56], there is clearly an opportunity to design a bandwidth-optimized architecture for big data computation by specializing hardware for data movement. We present the Data Processing Unit or DPU, a shared memory many-core that is specifically designed for high bandwidth analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and partitioning operations at the memory controller that is sufficient to keep up with DDR bandwidth. The DPU also provides acceleration for core to core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine. Comparison of a DPU chip fabricated in 40 nm with a Xeon processor on a variety of data processing applications shows a 3× to 15× performance per watt advantage.

31 citations


Proceedings ArticleDOI
01 Dec 2017
TL;DR: This work evaluates different design alternatives for Mellanox InfiniBand adapters in CUDA, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU, and implements a 2D stencil application kernel using NVSHMEM.
Abstract: GPUs have become an essential component for building compute clusters with high compute density and high performance per watt. As such clusters scale to have 1000s of GPUs, efficiently moving data between the GPUs becomes imperative to get maximum performance. NVSHMEM is an implementation of the OpenSHMEM standard for NVIDIA GPU clusters which allows communication to be issued from inside GPU kernels. In earlier work, we have shown how NVSHMEM can be used to achieve better application performance on GPUs connected through PCIe or NVLink. As part of this effort, we implement IB verbs for Mellanox InfiniBand adapters in CUDA. We evaluate different design alternatives, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU. We also consider correctness issues that arise in these designs. We take advantage of these designs transparently or through API extensions in NVSHMEM. With micro-benchmarks, we show that an NVIDIA Pascal P100 GPU is able to saturate the network bandwidth using only one or two of its 56 available streaming multiprocessors (SMs). On a single GPU using a single IB EDR adapter, we achieve a throughput of around 90 million messages per second. In addition, we implement a 2D stencil application kernel using NVSHMEM and compare its performance with a CUDA-aware MPI-based implementation that uses GPUDirect RDMA. Speedups in the range of 23% to 42% are seen for input sizes large enough to fill the occupancy of NVIDIA Pascal P100 GPUs on 2 to 4 nodes, indicating that there are gains to be had by eliminating the CPU from the communication path when all computation runs on the GPU.

23 citations


Journal ArticleDOI
TL;DR: Evaluation on real AMP hardware and using scheduler implementations in the Linux kernel demonstrates that ACFS achieves an average 23% fairness improvement over two state-of-the-art schemes, while providing higher system throughput.

21 citations


Book ChapterDOI
28 Aug 2017
TL;DR: This paper characterizes the NVIDIA Jetson TK1 and TX1 platforms' performance using Roofline models obtained through an empirical measurement-based approach and through a case study of a heterogeneous application (matrix multiplication).
Abstract: This study characterizes the NVIDIA Jetson TK1 and TX1 platforms, both built on an NVIDIA Tegra System on Chip and combining a quad-core ARM CPU and an NVIDIA GPU. Their heterogeneous nature, as well as their wide operating frequency range, makes it hard for application developers to reason about performance and determine which optimizations are worth pursuing. This paper attempts to inform developers' choices by characterizing the platforms' performance using Roofline models obtained through an empirical measurement-based approach as well as through a case study of a heterogeneous application (matrix multiplication). Our results highlight a difference of more than an order of magnitude in compute performance between the CPU and GPU on both platforms. Given that the CPU and GPU share the same memory bus, their Roofline models' balance points are also more than an order of magnitude apart. We also explore the impact of frequency scaling: we build CPU and GPU Roofline profiles and characterize both platforms' balance point variation, power consumption, and performance per watt as frequency is scaled.

18 citations
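The Roofline model used in this study has a simple closed form: attainable performance is capped by either peak compute or memory bandwidth times arithmetic intensity, and the balance point is where the two roofs meet. A minimal sketch with hypothetical peak numbers (not the measured TK1/TX1 roofs):

```python
# Roofline model sketch. Peak GFLOP/s and GB/s below are illustrative
# placeholders, not measured Jetson values.

def attainable_gflops(intensity, peak_gflops, bw_gbs):
    """Attainable performance for a kernel with the given arithmetic
    intensity (flop/byte): bandwidth-bound until the compute roof."""
    return min(peak_gflops, bw_gbs * intensity)

def balance_point(peak_gflops, bw_gbs):
    """Intensity (flop/byte) above which a kernel becomes compute-bound."""
    return peak_gflops / bw_gbs

# A GPU-like roof and a CPU-like roof sharing the same memory bus:
gpu_peak, bus_bw = 300.0, 15.0   # hypothetical
cpu_peak = 20.0
print(balance_point(gpu_peak, bus_bw))   # 20.0 flop/byte
print(balance_point(cpu_peak, bus_bw))   # ~1.33 flop/byte
```

With a shared bus, the order-of-magnitude gap in peak compute translates directly into an order-of-magnitude gap in balance points, which matches the paper's observation.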


Proceedings ArticleDOI
12 Jul 2017
TL;DR: This paper reviews some of the latest works using the Intel FPGA SDK for OpenCL and their strategies for optimization, evaluating the framework for the design of a hyperspectral image spatial-spectral classifier accelerator, and shows how reasonable speedups are obtained in a device with scarce computing and embedded memory resources.
Abstract: Current computational demands require increasing designers' efficiency and system performance per watt. A broadly accepted solution for the efficient implementation of accelerators is reconfigurable computing. However, typical HDL methodologies require very specific skills and a considerable amount of designer time. Despite new approaches to high-level synthesis like OpenCL, given the large heterogeneity in today's devices (manycores, CPUs, GPUs, FPGAs), there is no one-size-fits-all solution, so to maximize performance, platform-driven optimization is needed. This paper reviews some of the latest works using the Intel FPGA SDK for OpenCL and their strategies for optimization, evaluating the framework for the design of a hyperspectral image spatial-spectral classifier accelerator. Results are reported for a Cyclone V SoC using the Intel FPGA OpenCL Offline Compiler 16.0 out of the box. From a common baseline C implementation running on the embedded ARM® Cortex®-A9, OpenCL-based synthesis is evaluated applying different generic and vendor-specific optimizations. Results show how reasonable speedups are obtained in a device with scarce computing and embedded memory resources. A great step has been taken toward effectively raising the abstraction level, but a considerable amount of hardware design skill is still needed.

16 citations


Journal ArticleDOI
TL;DR: A new definition of ideal energy-proportional computing is introduced, new metrics to quantify computational energy waste, and new SLA-aware OS governors that seek Pareto optimality to achieve power-efficient performance are introduced.
Abstract: The original definition of energy-proportional computing does not characterize the energy efficiency of recent reconfigurable computers, resulting in nonintuitive “super-proportional” behavior. This article introduces a new definition of ideal energy-proportional computing, new metrics to quantify computational energy waste, and new SLA-aware OS governors that seek Pareto optimality to achieve power-efficient performance.

13 citations
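The notion of ideal energy-proportional computing can be made concrete with a toy metric. The following is a generic illustration of quantifying the gap to linear power scaling, not the article's exact SLA-aware formulation or its metrics:

```python
# Hedged sketch: the classic ideal is power growing linearly with utilization
# from zero watts. One simple summary of "waste" averages the excess of
# measured power over that ideal across utilization levels.

def proportionality_gap(utils, powers, p_peak):
    """Average excess of measured power over the ideal u * p_peak,
    as a fraction of peak power (0 = perfectly energy-proportional)."""
    gaps = [(p - u * p_peak) / p_peak for u, p in zip(utils, powers)]
    return sum(gaps) / len(gaps)

# A server with a 50 W idle floor and 100 W peak wastes most at low load:
utils  = [0.0, 0.25, 0.5, 0.75, 1.0]
powers = [50.0, 62.5, 75.0, 87.5, 100.0]
print(proportionality_gap(utils, powers, 100.0))  # 0.25
```

"Super-proportional" behavior, in this toy framing, would show up as a negative gap at some operating points, which is why the article argues for a revised definition of the ideal.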


Journal ArticleDOI
TL;DR: The results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems are presented and the methodology used to measure energy performance and the tools implemented to monitor the power drained by applications while running are described.
Abstract: Energy consumption is today one of the most relevant issues in operating HPC systems for scientific applications. The use of unconventional computing systems is therefore of great interest for several scientific communities looking for a better tradeoff between time-to-solution and energy-to-solution. In this context, the performance assessment of processors with a high ratio of performance per watt is necessary to understand how to realize energy-efficient computing systems for scientific applications, using this class of processors. Computing On SOC Architecture (COSA) is a three-year project (2015–2017) funded by the Scientific Commission V of the Italian Institute for Nuclear Physics (INFN), which aims to investigate the performance and the total cost of ownership offered by computing systems based on commodity low-power Systems on Chip (SoCs) and high energy-efficient systems based on GP-GPUs. In this work, we present the results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems. We also describe the methodology we have used to measure energy performance and the tools we have implemented to monitor the power drained by applications while running.

13 citations


Journal ArticleDOI
TL;DR: This work modify user space libraries and device drivers of GPUs and the InfiniBand network device in a way to enable the GPU to control an Infini Band network device to independently source and sink communication requests without any involvement of the CPU.
Abstract: Due to their massive parallelism and high performance per Watt, GPUs have gained high popularity in high-performance computing and are a strong candidate for future exascale systems. But communicat...

12 citations


Proceedings ArticleDOI
01 May 2017
TL;DR: The final design outperforms state-of-the-art CPU and FPGA implementations in terms of pure performance by a factor of up to 10x, and high-end GPUs in terms of performance per watt by a factor of 1.84x.
Abstract: N-Body simulation simulates the evolution of a system composed of N particles, where each element receives a force due to the interaction with all the other elements within the system. Usually, the influence of external physical forces, such as gravity, is involved too. This methodology is widely used in different fields that range from astrophysics, where it is used to study the interaction of celestial objects, to molecular dynamics, where the bodies are represented by molecules. Despite its wide range of applicability, the algorithm presents a high computational complexity that requires the use of powerful and power-hungry computers. An acceleration on a reconfigurable device, such as an FPGA, would benefit both performance and power consumption. In this work we present a scalable, high-performance and highly efficient implementation of an N-Body simulation algorithm on FPGA. The final design outperforms state-of-the-art CPU and FPGA implementations in terms of pure performance by a factor of up to 10x, and high-end GPUs in terms of performance per watt by a factor of 1.84x.

11 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: A direct memory-access scheme is developed to take advantage of the complex KeyStone architecture for FFTs and shows that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53.
Abstract: Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II has eight high-performance DSP cores and is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.

Journal ArticleDOI
TL;DR: A multilevel NoSQL cache architecture that utilizes both FPGA-based hardware cache and in-kernel software cache in a complementary style is proposed, which reduces the cache miss ratio and improves the throughput compared to the nonhierarchical design.
Abstract: Key-value store accelerators based on field-programmable gate arrays (FPGAs) have been proposed to achieve higher performance per watt than software-based processing. However, because their cache capacity is strictly limited by DRAMs implemented on FPGA boards, their application domains are also limited. To address this issue, the authors propose a multilevel NoSQL cache architecture that utilizes both FPGA-based hardware cache and in-kernel software cache in a complementary style. This motivates them to explore various design options. Simulation results show that their design reduces the cache miss ratio and improves the throughput compared to the nonhierarchical design.

Journal ArticleDOI
TL;DR: Analytical models based on scaled power metrics are presented to analyze the impact of various architectural design choices on scaled performance and power savings and show that by choosing the optimal chip configuration, energy efficiency and energy savings can be increased considerably.
Abstract: Many-core processors are accelerating the performance of contemporary high-performance systems. Managing power consumption within these systems demands low-power architectures to increase power savings. One of the promising solutions offered today by microprocessor architects is asymmetric microprocessors that integrate different core architectures on a single die. This paper presents analytical models based on scaled power metrics to analyze the impact of various architectural design choices on scaled performance and power savings. The power consumption implications of different processing schemes and various chip configurations were also analyzed. Analysis shows that by choosing the optimal chip configuration, energy efficiency and energy savings can be increased considerably.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper evaluates energy consumption of data types, operators, control statements, exception, and object in Java at a granular level to help in standardizing the energy consumption traits of Java which can be leveraged by software developers to generate energy efficient code in future.
Abstract: There has been a 10,000-fold increase in the performance of supercomputers since 1992, but only a 300-fold improvement in performance per watt. Dynamic adaptation of hardware techniques, such as fine-grain clock gating, power gating and dynamic voltage/frequency scaling, has been used for many years to improve computers' energy efficiency. However, recent demands of exascale computation, as well as the increasing carbon footprint, require new breakthroughs to make ICT systems more energy efficient. Energy-efficient software has not been well studied in the last decade. In this paper, we take an early step to investigate the energy efficiency of Java, which is one of the most common languages used in ICT systems. We evaluate the energy consumption of data types, operators, control statements, exceptions, and objects in Java at a granular level. Intel Running Average Power Limit (RAPL) technology is applied to measure the relative power consumption of small code snippets. Several observations are made, and these results will help in standardizing the energy consumption traits of Java, which can be leveraged by software developers to generate energy-efficient code in the future.
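RAPL, used here for the measurements, exposes a monotonically increasing energy counter in microjoules that wraps around at a platform-specific maximum, so snippet-level measurements must handle the wrap when taking deltas. A sketch of that delta logic (the counter values are illustrative, and the hardware-specific sysfs/MSR reading itself is omitted):

```python
# Hedged sketch of RAPL-style energy accounting. On Linux the counter and its
# range appear under /sys/class/powercap/intel-rapl:*/energy_uj and
# max_energy_range_uj; here we only model the wrap-tolerant arithmetic.

def rapl_delta_uj(start_uj, end_uj, max_range_uj):
    """Energy consumed between two counter samples, tolerating one wrap."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    return (max_range_uj - start_uj) + end_uj

def avg_power_w(start_uj, end_uj, max_range_uj, seconds):
    """Average power over the interval, in watts (1 W = 1e6 uJ/s)."""
    return rapl_delta_uj(start_uj, end_uj, max_range_uj) / seconds / 1e6

print(rapl_delta_uj(100, 400, 1000))   # 300
print(rapl_delta_uj(900, 200, 1000))   # 300 (counter wrapped)
print(avg_power_w(0, 5_000_000, 262_143_328_850, 0.5))  # 10.0 W
```

Because RAPL updates at roughly millisecond granularity, measuring very small code snippets typically requires running them in a loop and dividing, which is consistent with the paper's "relative" power comparison.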

Proceedings ArticleDOI
01 May 2017
TL;DR: This work proposes a Field Programmable Gate Array (FPGA) implementation of the Pearson Correlation Coefficient (PCC) algorithm, applied to a Brain Network (BN) case study, and shows that the proposed implementation can achieve up to a 10x speedup with respect to a single-threaded Central Processing Unit (CPU) implementation, while guaranteeing a 2x performance per watt ratio in comparison to a Graphics Processing Unit (GPU) implementation.
Abstract: Thanks to the availability of new biomedical technologies and analysis methodologies, the quality of clinical exams and medical research is increasing. These improvements have given the opportunity to analyze large amounts of data with a higher level of accuracy. Therefore, processors able to handle compute-intensive algorithms and large datasets are needed, and the use of homogeneous processors is becoming inefficient for this purpose. As a result, heterogeneous architectures are becoming the key technology to improve the efficiency of these computations, by allowing a concurrent elaboration of data. In this work, we propose a Field Programmable Gate Array (FPGA) implementation of the Pearson Correlation Coefficient (PCC) algorithm, applied to a Brain Network (BN) case study. It will be shown that the proposed implementation can achieve up to a 10x speedup with respect to a single-threaded Central Processing Unit (CPU) implementation, while guaranteeing a 2x performance per watt ratio in comparison to a Graphics Processing Unit (GPU) implementation. These considerations open up the possibility of using FPGA architectures in application fields, such as data centers and biomedical embedded systems, where power capping and heat are relevant issues to be considered.
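The accelerated kernel, the Pearson correlation coefficient between pairs of signals (e.g. brain-region time series), has a compact reference form. A plain Python version of the kind a single-threaded CPU baseline might use (this is a generic reference implementation, not the paper's FPGA design):

```python
# Reference Pearson correlation coefficient: covariance of the two signals
# normalized by the product of their standard deviations, giving a value
# in [-1, 1].
import math

def pcc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pcc([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfectly correlated)
print(pcc([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (anti-correlated)
```

For a brain network with R regions, the kernel is evaluated for all R*(R-1)/2 region pairs, which is the embarrassingly parallel structure the FPGA and GPU implementations exploit.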

Book ChapterDOI
26 Apr 2017
TL;DR: A performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method demonstrates a good correlation between the performance attained and the extra energy required.
Abstract: We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored at different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration and minimize power consumption. Experimental results using CUDA on a set of GeForce GTX 980 GPUs illustrate their capabilities as high-performance and low-power devices, with an energy cost that becomes more attractive as the number of GPUs increases. Overall, our results demonstrate a good correlation between the performance attained and the extra energy required, even in scenarios where multi-GPUs do not show great scalability.

Proceedings ArticleDOI
14 Jun 2017
TL;DR: OptiBook is presented, a system that improves energy proportionality and/or resource utilization to optimize performance and energy efficiency and applies performance isolation techniques such as CPU pinning and quota enforcement as well as online resource tuning to effectively improve energy efficiency.
Abstract: A lack of energy proportionality, low resource utilization, and interference in virtualized infrastructure make the cloud a challenging target environment for improving energy efficiency. In this paper we present OptiBook, a system that improves energy proportionality and/or resource utilization to optimize performance and energy efficiency. OptiBook shares servers between latency-sensitive services and batch jobs, overbooks the system in a controllable manner, uses vertical (CPU and DVFS) scaling for prioritized virtual machines, and applies performance isolation techniques such as CPU pinning and quota enforcement as well as online resource tuning to effectively improve energy efficiency. Our evaluations show that on average, OptiBook improves performance per watt by 20% and reduces energy consumption by 9% while minimizing SLO violations.

Journal ArticleDOI
TL;DR: It is shown that application performance does not grow linearly with network power in an NoC-based CMP, and a new figure of merit called Marginal Performance (MP) is proposed which evaluates the incremental performance per power increment after the inertial region.
Abstract: In network-on-chip (NoC) based CMPs, DVFS is commonly used to co-optimize performance and power. To achieve optimal efficiency, it is important to gain proportional performance growth with power. However, power over/under provisioning often exists. To properly evaluate and guide NoC DVFS techniques, it is highly desirable to formalize and quantify power over/under provisioning. In this paper, we first show that application performance does not grow linearly with network power in an NoC-based CMP. Instead, their relationship is non-linear and can be captured using performance-power characteristics curve (PPCC) with three distinct regions: an inertial region, a linear region, and a saturation region. We note that conventional DVFS metrics such as Performance Per Watt (PPW) cannot accurately evaluate such non-linear relationship. Based on PPCC, we propose a new figure of merit called Marginal Performance (MP) which evaluates the incremental performance per power increment after the inertial region. The MP concept enables to formally define power over- and under-provisioning with reference to the linear region in which an efficient NoC DVFS should operate. Applying the PPCC and MP concepts in full-system simulations with PARSEC and SPEC OMP2012 benchmarks, we are able to identify power over/under provisioning occurrences, measure and compare their statistics in two latest NoC DVFS techniques. Moreover, we show evidences that MP can accurately and consistently evaluate the NoC DVFS techniques, avoiding the misjudgement and inconsistency of PPW-based evaluations.
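The Marginal Performance idea above can be illustrated directly: sample points on the performance-power characteristics curve (PPCC) and take the incremental performance per incremental watt. The curve values below are made up purely to show the inertial/linear/saturation shape the paper describes:

```python
# Hedged sketch of Marginal Performance (MP) over a sampled PPCC.
# MP = delta(performance) / delta(power) between consecutive samples;
# low MP at the low end marks the inertial region, low MP at the high
# end marks saturation (power over-provisioning).

def marginal_performance(ppcc):
    """ppcc: list of (power_w, perf) points sorted by increasing power."""
    return [(f2 - f1) / (p2 - p1)
            for (p1, f1), (p2, f2) in zip(ppcc, ppcc[1:])]

# Illustrative curve: inertial, then linear, then saturated.
ppcc = [(2, 10), (4, 12), (6, 30), (8, 48), (10, 50)]
print(marginal_performance(ppcc))   # [1.0, 9.0, 9.0, 1.0]
# Running at 10 W here is over-provisioning: the last 2 W buy almost nothing,
# even though plain PPW (perf/power) still looks reasonable at that point.
```

This is exactly the failure mode of PPW the paper points out: PPW averages over the whole curve, while MP isolates where each additional watt stops paying off.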

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper presents DENA, a DVFS-capable, heterogeneous NoC design, encompassing the coherence protocol, the application behavior and the need to minimize the energy budget, with results on a 64-core, 2D-mesh architecture executing the SPLASH2 benchmark suite.
Abstract: The current design drivers for multi-cores, namely performance per watt, scalability and flexibility, make Networks-on-Chip (NoCs) the de-facto on-chip interconnect. State-of-the-art NoCs can exploit heterogeneous solutions and complex DVFS techniques to accommodate the variability of application requirements. Relevant showstoppers to the design of a truly flexible NoC fitting all possible traffic conditions are the burstiness of the traffic generated by modern applications, magnified by the unbalanced usage of interconnect resources due to the implemented coherence protocol. This paper presents DENA, a DVFS-capable, heterogeneous NoC design, encompassing the coherence protocol, the application behavior and the need to minimize the energy budget. Simulation results on a 64-core, 2D-mesh architecture executing the SPLASH2 benchmark suite testify to the advantages of DENA from both the performance and energy viewpoints, with an average 34.3% energy-performance improvement over the state-of-the-art DVFS-capable NoC design.

Proceedings ArticleDOI
23 Jan 2017
TL;DR: In this article, a per-row channel interleaving scheme was proposed to improve the power efficiency of DRAM and investigate the memory access pattern of a BFS implementation using a cycle-accurate processor simulator.
Abstract: Graph analysis applications have been widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications; therefore, many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, it is necessary to improve power efficiency (i.e., performance per watt) when executing BFS. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not efficiently exploit row buffers in DRAM. Thus, we propose a new scheme called per-row channel interleaving and improve the DRAM power efficiency by 30.3% compared to a conventional scheme for a certain simulator setting. Moreover, we demonstrate that this proposed scheme is effective for various configurations of memory controllers.
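The proposed per-row channel interleaving differs from a conventional mapping mainly in which address bits select the channel. A sketch with illustrative bit widths (not the paper's actual controller parameters): picking channel bits just above the cache-line offset spreads consecutive lines across channels and keeps reopening rows, while picking them just above the row offset streams each DRAM row from a single open row buffer.

```python
# Hedged sketch of the two address mappings. Bit widths are illustrative.
LINE_BITS = 6    # 64 B cache line
ROW_BITS = 11    # 2 KiB DRAM row
CH_BITS = 2      # 4 channels

def channel_line_interleaved(addr):
    """Conventional mapping: channel bits just above the line offset."""
    return (addr >> LINE_BITS) & ((1 << CH_BITS) - 1)

def channel_row_interleaved(addr):
    """Per-row mapping: channel bits just above the row offset."""
    return (addr >> ROW_BITS) & ((1 << CH_BITS) - 1)

# A stream of consecutive 64 B lines inside one 2 KiB row:
addrs = [i * 64 for i in range(8)]
print([channel_line_interleaved(a) for a in addrs])  # [0, 1, 2, 3, 0, 1, 2, 3]
print([channel_row_interleaved(a) for a in addrs])   # [0, 0, 0, 0, 0, 0, 0, 0]
```

Keeping a whole row on one channel improves row-buffer hit rates for the row-local access bursts BFS generates, which is the source of the reported DRAM power-efficiency gain.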

Proceedings ArticleDOI
01 Nov 2017
TL;DR: The Memory Architecture using a Ring-based Scheme (MARS) can effectively trade off power, throughput, and latency to improve system performance for different application spaces and shows that its MARS variants can deliver better latency, power, and performance per watt over HBM when averaged over 11 SPEC CPU 2006 benchmarks.
Abstract: As computer memory increases in size and processors continue to get faster, the memory subsystem becomes an increasing bottleneck to system performance. To mitigate the relatively slow DRAM memory chip speeds, a new generation of 3D stacked DRAM is being developed, with lower power consumption and higher bandwidth. This paper proposes the use of 3D ring-based data fabrics for fast data transfer between these chips. The ring-based data fabric uses a fast standing wave oscillator to clock its transactions. With a fast clocking scheme, and multiple channels sharing the same bus, more channels are utilized while significantly reducing the number of through-silicon vias (TSVs). Experimental results show that our ring-based data fabric can reduce read latencies by almost 4X compared to traditional stacked memory chips. Variations of our scheme can also reduce power consumption compared to traditional memory stacks. Our Memory Architecture using a Ring-based Scheme (MARS) can effectively trade off power, throughput, and latency to improve system performance for different application spaces. We show that our MARS variants can deliver better latency (up to ~4X), power (up to ~8X), and performance per watt (up to ~4X) over HBM, when averaged over 11 SPEC CPU 2006 benchmarks. Other MARS variants provide higher throughput with similar power consumption compared to Wide I/O memory.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper introduces a novel scheme called OptiMatch to optimize the match between the power supply and the user-workload demand for massive storage systems that are mostly powered by renewable energy sources.
Abstract: To reduce energy consumption and carbon emissions, many data centers have deployed (or anticipate building) their own renewable-energy power plants. However, renewable energy (such as wind, tide, and solar energy) has serious issues of intermittency and variability that prevent green energy from being utilized effectively in practice. To cope with these issues, new power-supply management policies and workload scheduling algorithms have been designed. However, most existing work focuses on power optimization for computation only. In this paper, we introduce a novel scheme called OptiMatch to optimize the match between the power supply and the user-workload demand for massive storage systems that are mostly powered by renewable energy sources. OptiMatch has a hierarchical architecture, which consists of a number of heterogeneous storage devices. OptiMatch systematically utilizes the performance disparities between heterogeneous storage devices (i.e., performance per watt, IOPS/watt) to split the processing of every write request into two stages: an on-line stage and a deferred off-line stage. The deferred off-line requests are used to match the green energy supplies. To maximize green energy utilization and minimize the power budget without sacrificing quality of service, the fundamental methodology is to make the aggregate power supplies proportional to the I/O workload demand at any time. To this end, OptiMatch employs novel co-design optimizations. (1) We propose a dual-drive power control approach that makes the number of active nodes proportional to the workload demand when the green power supply is insufficient, and proportional to the green power supply when green power is sufficient. (2) During periods of insufficient green supplies, we exploit virtualization consolidation schemes which enable fine-grained power control to minimize grid budgets. (3) During periods of sufficient green supplies, we design an intelligent workload scheduling scheme which enables a near-optimal off-line request assignment to maximize green utilization. The experimental results demonstrate that the new OptiMatch framework can achieve high green utilization (up to 94.9%) with minor performance degradation (less than 9.8%).

Proceedings ArticleDOI
20 Jun 2017
TL;DR: It is shown that it is much harder to save energy by balancing workloads between the heterogeneous cores of the Tegra K1 than commonly assumed: only a 5% energy saving is demonstrated by offloading 10% of the DCT workload from the GPU to the CPU, while significantly more energy can be saved by using the appropriate processor for different workloads.
Abstract: Energy efficiency is a timely topic for modern mobile computing. Reducing the energy consumption of devices not only increases their battery lifetime, but also reduces the risk of hardware failure. Many researchers strive to understand the relationship between software activity and hardware power usage. A recurring strategy for saving power is to reduce operating frequencies. It is widely acknowledged that standard frequency scaling algorithms generally overreact to changes in hardware utilisation. More recent and original efforts attempt to balance software workloads on heterogeneous multicore architectures, such as the Tegra K1, which includes a quad-core CPU and a CUDA-capable GPU. However, it is not known whether it is possible to utilise these processor elements in parallel to save energy. Research into these types of systems is unfortunately often evaluated with the Performance Per Watt (PPW) metric, which is an inaccurate method because it ignores constant power usage from idle components. We show that this metric can end up increasing energy usage on the Tegra K1, and can give a false impression of how such systems consume energy. In reality, we show that it is much harder to save energy by balancing workloads between the heterogeneous cores of the Tegra K1, where we demonstrate only a 5% energy saving by offloading 10% of the DCT workload from the GPU to the CPU. Significantly more energy can be saved (up to 50%) by using the appropriate processor for different workloads.
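The paper's core criticism of PPW can be reproduced with toy numbers: a processor that looks better on active-power PPW can still cost more total energy once the constant idle power of the other component is counted. All wattages and runtimes below are illustrative, not Tegra K1 measurements:

```python
# Hedged sketch: total energy for running one job on one component while the
# other component sits idle, versus a PPW view that ignores idle power.

def energy_j(active_w, idle_w_other, seconds):
    """Total platform energy: busy component plus the other one idling."""
    return (active_w + idle_w_other) * seconds

# GPU finishes the job in 2 s at 10 W while the CPU idles at 3 W:
gpu_run = energy_j(10, 3, 2)   # (10 + 3) * 2 = 26 J
# CPU finishes the same job in 4 s at 4 W while the GPU idles at 5 W:
cpu_run = energy_j(4, 5, 4)    # (4 + 5) * 4 = 36 J
print(gpu_run, cpu_run)

# Naive PPW on active power only: GPU = 1 / (10 * 2) = 0.0500 jobs per joule
# of active energy; CPU = 1 / (4 * 4) = 0.0625. The CPU "wins" on PPW yet
# costs 10 J more end to end once idle power is included.
```

This is the same inversion the paper demonstrates on real hardware: the metric rewards low active power while the platform as a whole burns more energy.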

Journal Article
TL;DR: Results reveal that TMAs exploit data locality effectively and are more suitable for MCEWSN applications that require integer manipulation of sensor data, such as information fusion, and have little or no communication between the parallelized tasks.
Abstract: Technological advancements in the silicon industry, as predicted by Moore's law, have enabled the integration of billions of transistors on a single chip. To exploit this high transistor density for high performance, embedded systems are undergoing a transition from single-core to multi-core. Although a majority of embedded wireless sensor networks (EWSNs) consist of single-core embedded sensor nodes, multi-core embedded sensor nodes are envisioned to burgeon in selected application domains that require complex in-network processing of the sensed data. In this project, an architecture for heterogeneous hierarchical multi-core embedded wireless sensor networks (MCEWSNs), as well as an architecture for the multi-core embedded sensor nodes used in MCEWSNs, is proposed. This project also investigates the feasibility of two multi-core architectural paradigms, symmetric multiprocessors (SMPs) and tiled many-core architectures (TMAs), for MCEWSNs. This work compares and analyzes the performance of an SMP (an Intel-based SMP) and a TMA (Tilera's TILEPro64) based on a parallelized information fusion application for various performance metrics (e.g., runtime, speedup, efficiency, cost, and performance per watt). Results reveal that TMAs exploit data locality effectively and are more suitable for MCEWSN applications that require integer manipulation of sensor data, such as information fusion, and have little or no communication between the parallelized tasks. To demonstrate the practical relevance of MCEWSNs, this project also discusses several state-of-the-art multi-core embedded sensor node prototypes developed in academia and industry.

Book ChapterDOI
05 Sep 2017
TL;DR: Today's IT services are provided by centralized infrastructure referred to as datacenters, which consist of low-cost servers for high-volume data processing, communication and storage.
Abstract: Today's IT services are provided by centralized infrastructure referred to as datacenters. In contrast to supercomputers aimed at the high-cost/high-performance scientific domain, datacenters consist of low-cost servers for high-volume data processing, communication and storage. Datacenter owners prioritize capital and operating costs (often measured in performance per watt) over ultimate performance.