
Showing papers on "Performance per watt" published in 2017


Journal ArticleDOI
TL;DR: This paper proposes a novel methodology that can find the Pareto-optimal configurations at runtime as a function of the workload, and uses an extensive offline characterization to find classifiers that map performance counters to optimal configurations.
Abstract: Modern multiprocessor systems-on-chip (MPSoCs) offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. As the workload phases change at runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is infeasible due to the large number of workloads and configurations. This paper proposes a novel methodology that can find the Pareto-optimal configurations at runtime as a function of the workload. To achieve this, we perform an extensive offline characterization to find classifiers that map performance counters to optimal configurations. Then, we use these classifiers and performance counters at runtime to choose Pareto-optimal configurations. We evaluate the proposed methodology by maximizing the performance per watt for 18 single- and multi-threaded applications. Our experiments demonstrate an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively.

51 citations
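The two-phase approach described above (offline characterization to find Pareto-optimal configurations, then a runtime classifier over performance counters) can be sketched as follows. The counter names, signatures, and configurations are hypothetical illustrations, not the paper's actual feature set or classifier:

```python
# Hedged sketch: offline, keep only Pareto-optimal (perf, power) configs per
# workload signature; online, match live performance counters to the nearest
# characterized signature. All names and numbers are illustrative.

def pareto_front(configs):
    """Keep configurations not dominated in (higher perf, lower power)."""
    front = []
    for c in configs:
        dominated = any(
            o["perf"] >= c["perf"] and o["power"] <= c["power"]
            and (o["perf"] > c["perf"] or o["power"] < c["power"])
            for o in configs)
        if not dominated:
            front.append(c)
    return front

def choose_config(counters, table):
    """Nearest-neighbor 'classifier': map live counters (e.g. IPC,
    memory-access rate) to the closest offline signature's config."""
    def dist(sig):
        return sum((counters[k] - sig[k]) ** 2 for k in counters)
    return min(table, key=lambda e: dist(e["signature"]))["config"]

offline = [
    {"signature": {"ipc": 2.0, "mem": 0.1},
     "config": {"freq_mhz": 2000, "cores": 4}},   # compute-bound phase
    {"signature": {"ipc": 0.5, "mem": 0.9},
     "config": {"freq_mhz": 1000, "cores": 2}},   # memory-bound phase
]
print(choose_config({"ipc": 1.8, "mem": 0.2}, offline))
```

A real deployment would sample hardware counters each scheduling epoch and apply the chosen (voltage, frequency, cores) configuration through the platform's governor interface.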


Proceedings ArticleDOI
14 Oct 2017
TL;DR: This work presents the Data Processing Unit or DPU, a shared memory many-core that is specifically designed for high bandwidth analytics workloads and provides acceleration for core to core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine.
Abstract: For many years, the highest energy cost in processing has been data movement rather than computation, and energy is the limiting factor in processor design [21]. As the data needed for a single application grows to exabytes [56], there is clearly an opportunity to design a bandwidth-optimized architecture for big data computation by specializing hardware for data movement. We present the Data Processing Unit or DPU, a shared memory many-core that is specifically designed for high bandwidth analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and partitioning operations at the memory controller that is sufficient to keep up with DDR bandwidth. The DPU also provides acceleration for core to core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine. Comparison of a DPU chip fabricated in 40 nm with a Xeon processor on a variety of data processing applications shows a 3× to 15× performance per watt advantage.

31 citations


Proceedings ArticleDOI
01 Dec 2017
TL;DR: This work evaluates different design alternatives for Mellanox InfiniBand adapters in CUDA, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU, and implements a 2D stencil application kernel using NVSHMEM.
Abstract: GPUs have become an essential component for building compute clusters with high compute density and high performance per watt. As such clusters scale to have 1000s of GPUs, efficiently moving data between the GPUs becomes imperative to get maximum performance. NVSHMEM is an implementation of the OpenSHMEM standard for NVIDIA GPU clusters which allows communication to be issued from inside GPU kernels. In earlier work, we have shown how NVSHMEM can be used to achieve better application performance on GPUs connected through PCIe or NVLink. As part of this effort, we implement IB verbs for Mellanox InfiniBand adapters in CUDA. We evaluate different design alternatives, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU. We also consider correctness issues that arise in these designs. We take advantage of these designs transparently or through API extensions in NVSHMEM. With micro-benchmarks, we show that an NVIDIA Pascal P100 GPU is able to saturate the network bandwidth using only one or two of its 56 available streaming multiprocessors (SMs). On a single GPU using a single IB EDR adapter, we achieve a throughput of around 90 million messages per second. In addition, we implement a 2D stencil application kernel using NVSHMEM and compare its performance with a CUDA-aware MPI-based implementation that uses GPUDirect RDMA. Speedups in the range of 23% to 42% are seen for input sizes large enough to fill the occupancy of NVIDIA Pascal P100 GPUs on 2 to 4 nodes, indicating that there are gains to be had by eliminating the CPU from the communication path when all computation runs on the GPU.

23 citations


Journal ArticleDOI
TL;DR: Evaluation on real AMP hardware and using scheduler implementations in the Linux kernel demonstrates that ACFS achieves an average 23% fairness improvement over two state-of-the-art schemes, while providing higher system throughput.

21 citations


Book ChapterDOI
28 Aug 2017
TL;DR: This paper characterizes the NVIDIA Jetson TK1 and TX1 platforms' performance using Roofline models obtained through an empirical measurement-based approach and through a case study of a heterogeneous application (matrix multiplication).
Abstract: This study characterizes the NVIDIA Jetson TK1 and TX1 platforms, both built on an NVIDIA Tegra System on Chip and combining a quad-core ARM CPU and an NVIDIA GPU. Their heterogeneous nature, as well as their wide operating frequency range, makes it hard for application developers to reason about performance and determine which optimizations are worth pursuing. This paper attempts to inform developers' choices by characterizing the platforms' performance using Roofline models obtained through an empirical measurement-based approach as well as through a case study of a heterogeneous application (matrix multiplication). Our results highlight a difference of more than an order of magnitude in compute performance between the CPU and GPU on both platforms. Given that the CPU and GPU share the same memory bus, their Roofline models' balance points are also more than an order of magnitude apart. We also explore the impact of frequency scaling: we build CPU and GPU Roofline profiles and characterize both platforms' balance point variation, power consumption, and performance per watt as frequency is scaled.

18 citations
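The Roofline model used in this study has a simple closed form: attainable performance is capped by either peak compute or memory bandwidth times arithmetic intensity, and the balance point is where the two roofs meet. A minimal sketch with hypothetical peak numbers (not the measured TK1/TX1 roofs):

```python
# Roofline model sketch. Peak GFLOP/s and GB/s below are illustrative
# placeholders, not measured Jetson values.

def attainable_gflops(intensity, peak_gflops, bw_gbs):
    """Attainable performance for a kernel with the given arithmetic
    intensity (flop/byte): bandwidth-bound until the compute roof."""
    return min(peak_gflops, bw_gbs * intensity)

def balance_point(peak_gflops, bw_gbs):
    """Intensity (flop/byte) above which a kernel becomes compute-bound."""
    return peak_gflops / bw_gbs

# A GPU-like roof and a CPU-like roof sharing the same memory bus:
gpu_peak, bus_bw = 300.0, 15.0   # hypothetical
cpu_peak = 20.0
print(balance_point(gpu_peak, bus_bw))   # 20.0 flop/byte
print(balance_point(cpu_peak, bus_bw))   # ~1.33 flop/byte
```

With a shared bus, the order-of-magnitude gap in peak compute translates directly into an order-of-magnitude gap in balance points, which matches the paper's observation.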


Proceedings ArticleDOI
12 Jul 2017
TL;DR: This paper reviews some of the latest works using the Intel FPGA SDK for OpenCL and their strategies for optimization, evaluating the framework for the design of a hyperspectral image spatial-spectral classifier accelerator, and shows how reasonable speedups are obtained in a device with scarce computing and embedded memory resources.
Abstract: Current computational demands require increasing designers' efficiency and system performance per watt. A broadly accepted solution for the efficient implementation of accelerators is reconfigurable computing. However, typical HDL methodologies require very specific skills and a considerable amount of designer time. Despite new approaches to high-level synthesis like OpenCL, given the large heterogeneity in today's devices (manycores, CPUs, GPUs, FPGAs), there is no one-size-fits-all solution, so to maximize performance, platform-driven optimization is needed. This paper reviews some of the latest works using the Intel FPGA SDK for OpenCL and their strategies for optimization, evaluating the framework for the design of a hyperspectral image spatial-spectral classifier accelerator. Results are reported for a Cyclone V SoC using the Intel FPGA OpenCL Offline Compiler 16.0 out of the box. From a common baseline C implementation running on the embedded ARM® Cortex®-A9, OpenCL-based synthesis is evaluated applying different generic and vendor-specific optimizations. Results show how reasonable speedups are obtained in a device with scarce computing and embedded memory resources. A great step has been taken toward effectively raising the abstraction level, but a considerable amount of hardware design skill is still needed.

16 citations


Journal ArticleDOI
TL;DR: A new definition of ideal energy-proportional computing is introduced, new metrics to quantify computational energy waste, and new SLA-aware OS governors that seek Pareto optimality to achieve power-efficient performance are introduced.
Abstract: The original definition of energy-proportional computing does not characterize the energy efficiency of recent reconfigurable computers, resulting in nonintuitive “super-proportional” behavior. This article introduces a new definition of ideal energy-proportional computing, new metrics to quantify computational energy waste, and new SLA-aware OS governors that seek Pareto optimality to achieve power-efficient performance.

13 citations
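The notion of ideal energy-proportional computing can be made concrete with a toy metric. The following is a generic illustration of quantifying the gap to linear power scaling, not the article's exact SLA-aware formulation or its metrics:

```python
# Hedged sketch: the classic ideal is power growing linearly with utilization
# from zero watts. One simple summary of "waste" averages the excess of
# measured power over that ideal across utilization levels.

def proportionality_gap(utils, powers, p_peak):
    """Average excess of measured power over the ideal u * p_peak,
    as a fraction of peak power (0 = perfectly energy-proportional)."""
    gaps = [(p - u * p_peak) / p_peak for u, p in zip(utils, powers)]
    return sum(gaps) / len(gaps)

# A server with a 50 W idle floor and 100 W peak wastes most at low load:
utils  = [0.0, 0.25, 0.5, 0.75, 1.0]
powers = [50.0, 62.5, 75.0, 87.5, 100.0]
print(proportionality_gap(utils, powers, 100.0))  # 0.25
```

"Super-proportional" behavior, in this toy framing, would show up as a negative gap at some operating points, which is why the article argues for a revised definition of the ideal.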


Journal ArticleDOI
TL;DR: The results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems are presented and the methodology used to measure energy performance and the tools implemented to monitor the power drained by applications while running are described.
Abstract: Energy consumption is today one of the most relevant issues in operating HPC systems for scientific applications. The use of unconventional computing systems is therefore of great interest for several scientific communities looking for a better tradeoff between time-to-solution and energy-to-solution. In this context, the performance assessment of processors with a high ratio of performance per watt is necessary to understand how to realize energy-efficient computing systems for scientific applications, using this class of processors. Computing On SOC Architecture (COSA) is a three-year project (2015–2017) funded by the Scientific Commission V of the Italian Institute for Nuclear Physics (INFN), which aims to investigate the performance and the total cost of ownership offered by computing systems based on commodity low-power Systems on Chip (SoCs) and high energy-efficient systems based on GP-GPUs. In this work, we present the results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems. We also describe the methodology we have used to measure energy performance and the tools we have implemented to monitor the power drained by applications while running.

13 citations


Journal ArticleDOI
TL;DR: This work modify user space libraries and device drivers of GPUs and the InfiniBand network device in a way to enable the GPU to control an Infini Band network device to independently source and sink communication requests without any involvement of the CPU.
Abstract: Due to their massive parallelism and high performance per Watt, GPUs have gained high popularity in high-performance computing and are a strong candidate for future exascale systems. But communicat...

12 citations


Proceedings ArticleDOI
01 May 2017
TL;DR: The final design outperforms state-of-the-art CPU and FPGA implementations in terms of pure performance by a factor of up to 10x, and high-end GPUs in terms of performance per watt by a factor of 1.84x.
Abstract: N-Body simulation simulates the evolution of a system composed of N particles, where each element receives a force due to the interaction with all the other elements within the system. Usually, the influence of external physical forces, such as gravity, is involved too. This methodology is widely used in different fields that range from astrophysics, where it is used to study the interaction of celestial objects, to molecular dynamics, where the bodies are represented by molecules. Despite its wide range of applicability, the algorithm presents a high computational complexity that requires the use of powerful and power-hungry computers. An acceleration on a reconfigurable device, such as an FPGA, would benefit both performance and power consumption. In this work we present a scalable, high-performance and highly efficient implementation of an N-Body simulation algorithm on FPGA. The final design outperforms state-of-the-art CPU and FPGA implementations in terms of pure performance by a factor of up to 10x, and high-end GPUs in terms of performance per watt by a factor of 1.84x.

11 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: A direct memory-access scheme is developed to take advantage of the complex KeyStone architecture for FFTs and shows that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53.
Abstract: Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II has eight high-performance DSP cores and is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.

Journal ArticleDOI
TL;DR: A multilevel NoSQL cache architecture that utilizes both FPGA-based hardware cache and in-kernel software cache in a complementary style is proposed, which reduces the cache miss ratio and improves the throughput compared to the nonhierarchical design.
Abstract: Key-value store accelerators based on field-programmable gate arrays (FPGAs) have been proposed to achieve higher performance per watt than software-based processing. However, because their cache capacity is strictly limited by DRAMs implemented on FPGA boards, their application domains are also limited. To address this issue, the authors propose a multilevel NoSQL cache architecture that utilizes both FPGA-based hardware cache and in-kernel software cache in a complementary style. This motivates them to explore various design options. Simulation results show that their design reduces the cache miss ratio and improves the throughput compared to the nonhierarchical design.

Journal ArticleDOI
TL;DR: Analytical models based on scaled power metrics are presented to analyze the impact of various architectural design choices on scaled performance and power savings and show that by choosing the optimal chip configuration, energy efficiency and energy savings can be increased considerably.
Abstract: Many-core processors are accelerating the performance of contemporary high-performance systems. Managing power consumption within these systems demands low-power architectures to increase power savings. One of the promising solutions offered today by microprocessor architects is asymmetric microprocessors that integrate different core architectures on a single die. This paper presents analytical models based on scaled power metrics to analyze the impact of various architectural design choices on scaled performance and power savings. The power consumption implications of different processing schemes and various chip configurations were also analyzed. Analysis shows that by choosing the optimal chip configuration, energy efficiency and energy savings can be increased considerably.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper evaluates energy consumption of data types, operators, control statements, exception, and object in Java at a granular level to help in standardizing the energy consumption traits of Java which can be leveraged by software developers to generate energy efficient code in future.
Abstract: There has been a 10,000-fold increase in the performance of supercomputers since 1992, but only a 300-fold improvement in performance per watt. Dynamic adaptation of hardware techniques, such as fine-grain clock gating, power gating and dynamic voltage/frequency scaling, has been used for many years to improve computers' energy efficiency. However, recent demands of exascale computation, as well as the increasing carbon footprint, require new breakthroughs to make ICT systems more energy efficient. Energy-efficient software has not been well studied in the last decade. In this paper, we take an early step to investigate the energy efficiency of Java, which is one of the most common languages used in ICT systems. We evaluate the energy consumption of data types, operators, control statements, exceptions, and objects in Java at a granular level. Intel Running Average Power Limit (RAPL) technology is applied to measure the relative power consumption of small code snippets. Several observations are made, and these results will help in standardizing the energy consumption traits of Java, which can be leveraged by software developers to generate energy-efficient code in the future.
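RAPL, used here for the measurements, exposes a monotonically increasing energy counter in microjoules that wraps around at a platform-specific maximum, so snippet-level measurements must handle the wrap when taking deltas. A sketch of that delta logic (the counter values are illustrative, and the hardware-specific sysfs/MSR reading itself is omitted):

```python
# Hedged sketch of RAPL-style energy accounting. On Linux the counter and its
# range appear under /sys/class/powercap/intel-rapl:*/energy_uj and
# max_energy_range_uj; here we only model the wrap-tolerant arithmetic.

def rapl_delta_uj(start_uj, end_uj, max_range_uj):
    """Energy consumed between two counter samples, tolerating one wrap."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    return (max_range_uj - start_uj) + end_uj

def avg_power_w(start_uj, end_uj, max_range_uj, seconds):
    """Average power over the interval, in watts (1 W = 1e6 uJ/s)."""
    return rapl_delta_uj(start_uj, end_uj, max_range_uj) / seconds / 1e6

print(rapl_delta_uj(100, 400, 1000))   # 300
print(rapl_delta_uj(900, 200, 1000))   # 300 (counter wrapped)
print(avg_power_w(0, 5_000_000, 262_143_328_850, 0.5))  # 10.0 W
```

Because RAPL updates at roughly millisecond granularity, measuring very small code snippets typically requires running them in a loop and dividing, which is consistent with the paper's "relative" power comparison.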

Proceedings ArticleDOI
01 May 2017
TL;DR: This work proposes a Field Programmable Gate Array (FPGA) implementation of the Pearson Correlation Coefficient (PCC) algorithm, applied to a Brain Network (BN) case study, and shows that the proposed implementation can achieve up to a 10x speedup with respect to a single-threaded Central Processing Unit (CPU) implementation, while guaranteeing a 2x performance per watt ratio in comparison to a Graphics Processing Unit (GPU) implementation.
Abstract: Thanks to the availability of new biomedical technologies and analysis methodologies, the quality of clinical exams and medical research is increasing. These improvements have given the opportunity to analyze large amounts of data with a higher level of accuracy. Therefore, processors able to handle compute-intensive algorithms and large datasets are needed, and the use of homogeneous processors is becoming inefficient for this purpose. As a result, heterogeneous architectures are becoming the key technology to improve the efficiency of these computations, by allowing a concurrent elaboration of data. In this work, we propose a Field Programmable Gate Array (FPGA) implementation of the Pearson Correlation Coefficient (PCC) algorithm, applied to a Brain Network (BN) case study. It will be shown that the proposed implementation can achieve up to a 10x speedup with respect to a single-threaded Central Processing Unit (CPU) implementation, while guaranteeing a 2x performance per watt ratio in comparison to a Graphics Processing Unit (GPU) implementation. These considerations open up the possibility of using FPGA architectures in application fields, such as data centers and biomedical embedded systems, where power capping and heat are relevant issues to be considered.
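The accelerated kernel, the Pearson correlation coefficient between pairs of signals (e.g. brain-region time series), has a compact reference form. A plain Python version of the kind a single-threaded CPU baseline might use (this is a generic reference implementation, not the paper's FPGA design):

```python
# Reference Pearson correlation coefficient: covariance of the two signals
# normalized by the product of their standard deviations, giving a value
# in [-1, 1].
import math

def pcc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pcc([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfectly correlated)
print(pcc([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (anti-correlated)
```

For a brain network with R regions, the kernel is evaluated for all R*(R-1)/2 region pairs, which is the embarrassingly parallel structure the FPGA and GPU implementations exploit.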

Book ChapterDOI
26 Apr 2017
TL;DR: A performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method demonstrates a good correlation between the performance attained and the extra energy required.
Abstract: We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored at different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration and minimize power consumption. Experimental results using CUDA on a set of GeForce GTX 980 GPUs illustrate their capabilities as high-performance and low-power devices, with an energy cost that becomes more attractive as the number of GPUs increases. Overall, our results demonstrate a good correlation between the performance attained and the extra energy required, even in scenarios where multi-GPUs do not show great scalability.

Proceedings ArticleDOI
14 Jun 2017
TL;DR: OptiBook is presented, a system that improves energy proportionality and/or resource utilization to optimize performance and energy efficiency and applies performance isolation techniques such as CPU pinning and quota enforcement as well as online resource tuning to effectively improve energy efficiency.
Abstract: A lack of energy proportionality, low resource utilization, and interference in virtualized infrastructure make the cloud a challenging target environment for improving energy efficiency. In this paper we present OptiBook, a system that improves energy proportionality and/or resource utilization to optimize performance and energy efficiency. OptiBook shares servers between latency-sensitive services and batch jobs, overbooks the system in a controllable manner, uses vertical (CPU and DVFS) scaling for prioritized virtual machines, and applies performance isolation techniques such as CPU pinning and quota enforcement as well as online resource tuning to effectively improve energy efficiency. Our evaluations show that on average, OptiBook improves performance per watt by 20% and reduces energy consumption by 9% while minimizing SLO violations.

Journal ArticleDOI
TL;DR: It is shown that application performance does not grow linearly with network power in an NoC-based CMP, and a new figure of merit called Marginal Performance (MP) is proposed which evaluates the incremental performance per power increment after the inertial region.
Abstract: In network-on-chip (NoC) based CMPs, DVFS is commonly used to co-optimize performance and power. To achieve optimal efficiency, it is important to gain proportional performance growth with power. However, power over/under provisioning often exists. To properly evaluate and guide NoC DVFS techniques, it is highly desirable to formalize and quantify power over/under provisioning. In this paper, we first show that application performance does not grow linearly with network power in an NoC-based CMP. Instead, their relationship is non-linear and can be captured using performance-power characteristics curve (PPCC) with three distinct regions: an inertial region, a linear region, and a saturation region. We note that conventional DVFS metrics such as Performance Per Watt (PPW) cannot accurately evaluate such non-linear relationship. Based on PPCC, we propose a new figure of merit called Marginal Performance (MP) which evaluates the incremental performance per power increment after the inertial region. The MP concept enables to formally define power over- and under-provisioning with reference to the linear region in which an efficient NoC DVFS should operate. Applying the PPCC and MP concepts in full-system simulations with PARSEC and SPEC OMP2012 benchmarks, we are able to identify power over/under provisioning occurrences, measure and compare their statistics in two latest NoC DVFS techniques. Moreover, we show evidences that MP can accurately and consistently evaluate the NoC DVFS techniques, avoiding the misjudgement and inconsistency of PPW-based evaluations.
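The Marginal Performance idea above can be illustrated directly: sample points on the performance-power characteristics curve (PPCC) and take the incremental performance per incremental watt. The curve values below are made up purely to show the inertial/linear/saturation shape the paper describes:

```python
# Hedged sketch of Marginal Performance (MP) over a sampled PPCC.
# MP = delta(performance) / delta(power) between consecutive samples;
# low MP at the low end marks the inertial region, low MP at the high
# end marks saturation (power over-provisioning).

def marginal_performance(ppcc):
    """ppcc: list of (power_w, perf) points sorted by increasing power."""
    return [(f2 - f1) / (p2 - p1)
            for (p1, f1), (p2, f2) in zip(ppcc, ppcc[1:])]

# Illustrative curve: inertial, then linear, then saturated.
ppcc = [(2, 10), (4, 12), (6, 30), (8, 48), (10, 50)]
print(marginal_performance(ppcc))   # [1.0, 9.0, 9.0, 1.0]
# Running at 10 W here is over-provisioning: the last 2 W buy almost nothing,
# even though plain PPW (perf/power) still looks reasonable at that point.
```

This is exactly the failure mode of PPW the paper points out: PPW averages over the whole curve, while MP isolates where each additional watt stops paying off.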

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper presents DENA, a DVFS-capable, heterogeneous NoC design, encompassing the coherence protocol, the application behavior and the need to minimize the energy budget, with results on a 64-core, 2D-mesh architecture executing the SPLASH2 benchmark suite.
Abstract: The current design drivers for multi-cores, namely performance per watt, scalability and flexibility, make Networks-on-Chip (NoCs) the de-facto on-chip interconnect. State-of-the-art NoCs can exploit heterogeneous solutions and complex DVFS techniques to accommodate the variability of application requirements. Relevant showstoppers to the design of a truly flexible NoC fitting all possible traffic conditions are the burstiness of the traffic generated by modern applications, magnified by the unbalanced usage of interconnect resources due to the implemented coherence protocol. This paper presents DENA, a DVFS-capable, heterogeneous NoC design, encompassing the coherence protocol, the application behavior and the need to minimize the energy budget. Simulation results on a 64-core, 2D-mesh architecture executing the SPLASH2 benchmark suite testify to the advantages of DENA from both the performance and energy viewpoints, with an average 34.3% energy-performance improvement over the state-of-the-art DVFS-capable NoC design.

Proceedings ArticleDOI
23 Jan 2017
TL;DR: In this article, a per-row channel interleaving scheme was proposed to improve the power efficiency of DRAM and investigate the memory access pattern of a BFS implementation using a cycle-accurate processor simulator.
Abstract: Graph analysis applications have been widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications; therefore, many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, it is necessary to improve power efficiency (i.e., performance per watt) when executing BFS. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not efficiently exploit row buffers in DRAM. Thus, we propose a new scheme called per-row channel interleaving and improve the DRAM power efficiency by 30.3% compared to a conventional scheme for a certain simulator setting. Moreover, we demonstrate that this proposed scheme is effective for various configurations of memory controllers.
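The proposed per-row channel interleaving differs from a conventional mapping mainly in which address bits select the channel. A sketch with illustrative bit widths (not the paper's actual controller parameters): picking channel bits just above the cache-line offset spreads consecutive lines across channels and keeps reopening rows, while picking them just above the row offset streams each DRAM row from a single open row buffer.

```python
# Hedged sketch of the two address mappings. Bit widths are illustrative.
LINE_BITS = 6    # 64 B cache line
ROW_BITS = 11    # 2 KiB DRAM row
CH_BITS = 2      # 4 channels

def channel_line_interleaved(addr):
    """Conventional mapping: channel bits just above the line offset."""
    return (addr >> LINE_BITS) & ((1 << CH_BITS) - 1)

def channel_row_interleaved(addr):
    """Per-row mapping: channel bits just above the row offset."""
    return (addr >> ROW_BITS) & ((1 << CH_BITS) - 1)

# A stream of consecutive 64 B lines inside one 2 KiB row:
addrs = [i * 64 for i in range(8)]
print([channel_line_interleaved(a) for a in addrs])  # [0, 1, 2, 3, 0, 1, 2, 3]
print([channel_row_interleaved(a) for a in addrs])   # [0, 0, 0, 0, 0, 0, 0, 0]
```

Keeping a whole row on one channel improves row-buffer hit rates for the row-local access bursts BFS generates, which is the source of the reported DRAM power-efficiency gain.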

Proceedings ArticleDOI
01 Nov 2017
TL;DR: The Memory Architecture using a Ring-based Scheme (MARS) can effectively trade off power, throughput, and latency to improve system performance for different application spaces and shows that its MARS variants can deliver better latency, power, and performance per watt over HBM when averaged over 11 SPEC CPU 2006 benchmarks.
Abstract: As computer memory increases in size and processors continue to get faster, the memory subsystem becomes an increasing bottleneck to system performance. To mitigate the relatively slow DRAM memory chip speeds, a new generation of 3D stacked DRAM is being developed, with lower power consumption and higher bandwidth. This paper proposes the use of 3D ring-based data fabrics for fast data transfer between these chips. The ring-based data fabric uses a fast standing wave oscillator to clock its transactions. With a fast clocking scheme, and multiple channels sharing the same bus, more channels are utilized while significantly reducing the number of through-silicon vias (TSVs). Experimental results show that our ring-based data fabric can reduce read latencies by almost 4X compared to traditional stacked memory chips. Variations of our scheme can also reduce power consumption compared to traditional memory stacks. Our Memory Architecture using a Ring-based Scheme (MARS) can effectively trade off power, throughput, and latency to improve system performance for different application spaces. We show that our MARS variants can deliver better latency (up to ~4X), power (up to ~8X), and performance per watt (up to ~4X) over HBM, when averaged over 11 SPEC CPU 2006 benchmarks. Other MARS variants provide higher throughput with similar power consumption compared to Wide I/O memory.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper introduces a novel scheme called OptiMatch to optimize the match between the power supply and the user-workload demand for massive storage systems that are mostly powered by renewable energy sources.
Abstract: To reduce energy consumption and carbon emissions, many data centers have deployed (or anticipate building) their own renewable-energy power plants. However, renewable energy (such as wind, tide, and solar energy) has serious issues of intermittency and variability that prevent green energy from being utilized effectively in practice. To cope with these issues, new power-supply management policies and workload scheduling algorithms have been designed. However, most existing work focuses on power optimization for computation only. In this paper, we introduce a novel scheme called OptiMatch to optimize the match between the power supply and the user-workload demand for massive storage systems that are mostly powered by renewable energy sources. OptiMatch has a hierarchical architecture, which consists of a number of heterogeneous storage devices. OptiMatch systematically utilizes the performance disparities between heterogeneous storage devices (i.e., performance per watt, IOPS/watt) to split the processing of every write request into two stages: an on-line stage and a deferred off-line stage. The deferred off-line requests are used to match the green energy supplies. To maximize green energy utilization and minimize the power budget without sacrificing quality of service, the fundamental methodology is to make the aggregate power supplies proportional to the I/O workload demand at any time. To this end, OptiMatch employs novel co-design optimizations. (1) We propose a dual-drive power control approach that makes the number of active nodes proportional to the workload demand when the green power supply is insufficient, and proportional to the green power supply when green power is sufficient. (2) During periods of insufficient green supplies, we exploit virtualization consolidation schemes which enable fine-grained power control to minimize grid budgets. (3) During periods of sufficient green supplies, we design an intelligent workload scheduling scheme which enables a near-optimal off-line request assignment to maximize green utilization. The experimental results demonstrate that the new OptiMatch framework can achieve high green utilization (up to 94.9%) with minor performance degradation (less than 9.8%).

Proceedings ArticleDOI
20 Jun 2017
TL;DR: It is shown that it is much harder to save energy by balancing workloads between the heterogeneous cores of the Tegra K1 than commonly assumed: only a 5% energy saving is demonstrated by offloading 10% of the DCT workload from the GPU to the CPU, while significantly more energy can be saved by using the appropriate processor for different workloads.
Abstract: Energy efficiency is a timely topic for modern mobile computing. Reducing the energy consumption of devices not only increases their battery lifetime, but also reduces the risk of hardware failure. Many researchers strive to understand the relationship between software activity and hardware power usage. A recurring strategy for saving power is to reduce operating frequencies. It is widely acknowledged that standard frequency scaling algorithms generally overreact to changes in hardware utilisation. More recent and original efforts attempt to balance software workloads on heterogeneous multicore architectures, such as the Tegra K1, which includes a quad-core CPU and a CUDA-capable GPU. However, it is not known whether it is possible to utilise these processor elements in parallel to save energy. Research into these types of systems is unfortunately often evaluated with the Performance Per Watt (PPW) metric, which is an inaccurate method because it ignores constant power usage from idle components. We show that this metric can end up increasing energy usage on the Tegra K1, and can give a false impression of how such systems consume energy. In reality, we show that it is much harder to save energy by balancing workloads between the heterogeneous cores of the Tegra K1, where we demonstrate only a 5% energy saving by offloading 10% of the DCT workload from the GPU to the CPU. Significantly more energy can be saved (up to 50%) by using the appropriate processor for different workloads.
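The paper's core criticism of PPW can be reproduced with toy numbers: a processor that looks better on active-power PPW can still cost more total energy once the constant idle power of the other component is counted. All wattages and runtimes below are illustrative, not Tegra K1 measurements:

```python
# Hedged sketch: total energy for running one job on one component while the
# other component sits idle, versus a PPW view that ignores idle power.

def energy_j(active_w, idle_w_other, seconds):
    """Total platform energy: busy component plus the other one idling."""
    return (active_w + idle_w_other) * seconds

# GPU finishes the job in 2 s at 10 W while the CPU idles at 3 W:
gpu_run = energy_j(10, 3, 2)   # (10 + 3) * 2 = 26 J
# CPU finishes the same job in 4 s at 4 W while the GPU idles at 5 W:
cpu_run = energy_j(4, 5, 4)    # (4 + 5) * 4 = 36 J
print(gpu_run, cpu_run)

# Naive PPW on active power only: GPU = 1 / (10 * 2) = 0.0500 jobs per joule
# of active energy; CPU = 1 / (4 * 4) = 0.0625. The CPU "wins" on PPW yet
# costs 10 J more end to end once idle power is included.
```

This is the same inversion the paper demonstrates on real hardware: the metric rewards low active power while the platform as a whole burns more energy.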

Journal Article
TL;DR: Results reveal that TMAs exploit data locality effectively and are more suitable for MCEWSN applications that require integer manipulation of sensor data, such as information fusion, and have little or no communication between the parallelized tasks.
Abstract: Technological advancements in the silicon industry, as predicted by Moore's law, have enabled the integration of billions of transistors on a single chip. To exploit this high transistor density for high performance, embedded systems are undergoing a transition from single-core to multi-core. Although a majority of embedded wireless sensor networks (EWSNs) consist of single-core embedded sensor nodes, multi-core embedded sensor nodes are envisioned to burgeon in selected application domains that require complex in-network processing of the sensed data. In this project, an architecture for heterogeneous hierarchical multi-core embedded wireless sensor networks (MCEWSNs), as well as an architecture for the multi-core embedded sensor nodes used in MCEWSNs, is proposed. This project also investigates the feasibility of two multi-core architectural paradigms, symmetric multiprocessors (SMPs) and tiled many-core architectures (TMAs), for MCEWSNs. This work compares and analyzes the performance of an SMP (an Intel-based SMP) and a TMA (Tilera's TILEPro64) based on a parallelized information fusion application for various performance metrics (e.g., runtime, speedup, efficiency, cost, and performance per watt). Results reveal that TMAs exploit data locality effectively and are more suitable for MCEWSN applications that require integer manipulation of sensor data, such as information fusion, and have little or no communication between the parallelized tasks. To demonstrate the practical relevance of MCEWSNs, this project also discusses several state-of-the-art multi-core embedded sensor node prototypes developed in academia and industry.

Book ChapterDOI
05 Sep 2017
TL;DR: Today's IT services are provided by centralized infrastructure referred to as datacenters, which consist of low-cost servers for high-volume data processing, communication and storage.
Abstract: Today's IT services are provided by centralized infrastructure referred to as datacenters. In contrast to supercomputers aimed at the high-cost/high-performance scientific domain, datacenters consist of low-cost servers for high-volume data processing, communication and storage. Datacenter owners prioritize capital and operating costs (often measured in performance per watt) over ultimate performance.