Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Proceedings Article
01 Dec 2012
TL;DR: This paper compares and analyzes the performance of an Intel-based SMP and Tilera's TILEPro64 TMA based on parallelized benchmarks for the following performance metrics: runtime, speedup, efficiency, cost, scalability, and performance per watt.
Abstract: With Moore's law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, there exists a plethora of multi-core architectures and the suitability of these multi-core architectures for different embedded domains (e.g., distributed, real-time, reliability-constrained) requires investigation. Despite the diversity of embedded domains, one of the critical applications in many embedded domains (especially distributed embedded domains) is information fusion. Furthermore, many other applications consist of various kernels, such as Gaussian elimination (used in network coding), that dominate the execution time. In this paper, we evaluate two embedded systems multi-core architectural paradigms: symmetric multiprocessors (SMPs) and tiled multi-core architectures (TMAs). We base our evaluation on a parallelized information fusion application and benchmarks that are used as building blocks in applications for SMPs and TMAs. We compare and analyze the performance of an Intel-based SMP and Tilera's TILEPro64 TMA based on our parallelized benchmarks for the following performance metrics: runtime, speedup, efficiency, cost, scalability, and performance per watt. Results reveal that TMAs are more suitable for applications requiring integer manipulation of data with little communication between the parallelized tasks (e.g., information fusion) whereas SMPs are more suitable for applications with floating point computations and a large amount of communication between processor cores.

6 citations
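
The metrics listed in this entry have widely used textbook definitions, which can be sketched as below. This is a minimal illustration, not the paper's own evaluation code: the function names and the example numbers are assumptions, and "cost" is taken here to mean the common parallel-computing definition (runtime multiplied by core count), which may differ from the paper's usage.

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial runtime to parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Speedup normalized by the number of cores (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / cores


def cost(t_parallel, cores):
    """Parallel cost: runtime multiplied by the number of cores used."""
    return t_parallel * cores


def performance_per_watt(operations, runtime_s, avg_power_w):
    """Throughput (operations per second) divided by average power draw."""
    return (operations / runtime_s) / avg_power_w


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    t_serial, t_parallel, cores = 8.0, 2.5, 4
    print("speedup:", speedup(t_serial, t_parallel))
    print("efficiency:", efficiency(t_serial, t_parallel, cores))
    print("cost:", cost(t_parallel, cores))
    print("ops/s/W:", performance_per_watt(1e9, t_parallel, 20.0))
```

Comparing two platforms on performance per watt then reduces to evaluating `performance_per_watt` with each platform's measured runtime and average power for the same workload.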

16 Sep 2014
TL;DR: This dissertation focuses on operating-system-level power management and exploits the available sleep states to improve energy efficiency, concentrating mainly on leakage power dissipation.
Abstract: Modern embedded systems have increasingly penetrated our daily life, and have facilitated and accelerated our regular activities. Some of these systems are constrained by strict timing requirements, and have a limited and/or intermittent power supply. One of the major challenges in the design process of such systems is to minimise their energy consumption and thus to increase the battery life and enhance their mobility. In order to address this objective, it is important to understand the current trends in the embedded systems industry. With progressing CMOS technology miniaturisation, the leakage power dissipation, once neglected, has become a major contributor to the overall power dissipation of modern embedded systems and, as a matter of fact, it has started to dominate its counterpart, the dynamic power dissipation. To cope with the current trend of increasing leakage current, hardware vendors have equipped modern embedded processors with several sleep states and reduced the overhead (energy/time) of a sleep transition. Secondly, there is a trend towards an increased number of devices, as an ever-increasing need for extra functionality in a single embedded system demands extra Input/Output (I/O) devices, which are expensive in terms of energy consumption. Similar to processors, these devices are also equipped with low-power sleep states to reduce their energy consumption. Thirdly, modern embedded processors have started to suffer from thermal issues due to the increase in power density. It is essential to keep the temperature within recommended limits for the safe operation of the system and to increase the durability/reliability of hardware platforms. Finally, the CMOS industry experienced a paradigm shift in the last decade from single processor design to multicore hardware platforms as the clock frequency cannot be further increased efficiently to enhance the performance of the system.
This is driven by the increase in performance per watt ratio that demands special packaging techniques to dissipate the generated heat at high frequencies. This dissertation attempts to provide energy-efficient solutions and techniques to cope with the aforementioned trends, while closing the gap between theoretical research and practice. In particular, it focuses on operating-system-level power management and exploits the available sleep states to improve energy efficiency, concentrating mainly on leakage power dissipation. Uniprocessor power management has been widely explored in the last two decades. Several procrastination approaches have been proposed in the literature to deal with the leakage current. However, these solutions approximate the procrastination interval to ease the analysis and sub-optimally utilise the available resources to minimise energy consumption. This approximation is eliminated in this dissertation with an optimal algorithm that maximises energy savings. A practical limitation of the procrastination scheduling algorithm is relaxed by eliminating the need for external hardware to implement the power saving algorithm. These newly developed low-complexity algorithms save energy comparable to procrastination scheduling. Furthermore, this dissertation demonstrates that idealised dynamic voltage and frequency scaling and thermally constrained dynamic power management are equivalent in nature. Hence, existing solutions proposed for dynamic voltage and frequency scaling can be easily ported to increase energy efficiency in thermally constrained systems.

5 citations
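
The trade-off between sleep-state savings and transition overhead mentioned in this abstract is often reasoned about through a break-even time: the shortest idle interval for which entering a sleep state uses less energy than staying idle. The sketch below uses a simple generic model and is not the dissertation's algorithm; the function names, the power model, and all numeric values are assumptions for illustration.

```python
def break_even_time(e_transition_j, t_transition_s, p_idle_w, p_sleep_w):
    """Shortest idle interval (seconds) for which sleeping saves energy.

    Assumed model: staying idle for t seconds costs p_idle_w * t, while
    sleeping costs e_transition_j for the round trip plus p_sleep_w for
    the remaining (t - t_transition_s) seconds.
    """
    if p_idle_w <= p_sleep_w:
        raise ValueError("sleep state must draw less power than idle")
    # Solve p_idle * t = e_transition + p_sleep * (t - t_transition) for t.
    t_energy = (e_transition_j - p_sleep_w * t_transition_s) / (p_idle_w - p_sleep_w)
    # The interval must also be long enough to complete the transition.
    return max(t_energy, t_transition_s)


def should_sleep(idle_interval_s, e_transition_j, t_transition_s, p_idle_w, p_sleep_w):
    """True if the predicted idle interval is at least the break-even time."""
    threshold = break_even_time(e_transition_j, t_transition_s, p_idle_w, p_sleep_w)
    return idle_interval_s >= threshold


if __name__ == "__main__":
    # Hypothetical platform parameters, for illustration only.
    params = dict(e_transition_j=0.002, t_transition_s=0.001,
                  p_idle_w=0.5, p_sleep_w=0.05)
    print("break-even (s):", break_even_time(**params))
    print("sleep for 10 ms idle?", should_sleep(0.010, **params))
    print("sleep for 2 ms idle?", should_sleep(0.002, **params))
```

Procrastination-style schedulers aim to lengthen idle intervals so that more of them exceed this threshold, subject to the tasks' timing constraints.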

Proceedings Article
04 May 2009
TL;DR: A novel profile-guided compiler technique is presented for cache-aware scheduling of iteration spaces of parallel loops which captures the effect of variation in the number of cache misses across the iteration space.
Abstract: The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores, which in turn has a direct impact on performance. Because parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.

5 citations
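
The idea of balancing load when cache misses vary across the iteration space can be illustrated with a small sketch. This is not the paper's compiler technique: it assumes a one-dimensional loop, a hypothetical per-iteration profile of compute cycles and cache misses, and a fixed miss penalty, and it simply splits the iterations into contiguous chunks of roughly equal estimated cost.

```python
from itertools import accumulate


def partition_iterations(compute_cycles, cache_misses, miss_penalty, cores):
    """Split iterations into `cores` contiguous (start, end) chunks.

    Each iteration's estimated cost is its profiled compute cycles plus
    miss_penalty cycles per profiled cache miss. A chunk is closed once
    the cumulative cost reaches the next equal share of the total.
    """
    costs = [c + miss_penalty * m for c, m in zip(compute_cycles, cache_misses)]
    if not costs:
        raise ValueError("profile must contain at least one iteration")
    prefix = list(accumulate(costs))  # cumulative estimated cost
    total = prefix[-1]

    chunks = []
    start = 0
    next_share = 1  # index of the next 1/cores boundary to cross
    for i, running in enumerate(prefix):
        # One heavy iteration may cross several boundaries; later chunks
        # closed in the same step are empty.
        while next_share < cores and running >= total * next_share / cores:
            chunks.append((start, i + 1))
            start = i + 1
            next_share += 1
    chunks.append((start, len(costs)))
    return chunks


if __name__ == "__main__":
    # Hypothetical profile: the second half of the loop misses the cache often.
    compute = [10] * 8
    misses = [0, 0, 0, 0, 5, 5, 5, 5]
    print(partition_iterations(compute, misses, miss_penalty=10, cores=2))
```

With this hypothetical profile, an even split of four iterations per core would give estimated costs of 40 and 240 cycles, while the cost-aware split assigns six iterations to the first core and two to the second, for estimated costs of 160 and 120 cycles.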

Proceedings Article
01 Sep 2021
TL;DR: In this paper, a parameterizable vector processing unit (VPU) is presented, based on a subset of the V extension of the RISC-V instruction set architecture (ISA), for embedded processing.
Abstract: The computational intensity in embedded processing applications is increasing. This requires domain-specific embedded platforms in order to achieve maximum performance per watt of the system. With the arrival of open-source instruction set architectures such as RISC-V, and different domain-specific architecture development toolchains, the trend of application-specific architectures is increasing. In this paper, a parameterizable Vector Processing Unit (VPU) is presented, based on a subset of the V extension of the RISC-V instruction set architecture (ISA), for embedded processing. Two key configurable parameters for the proposed VPU are the vector length (VLEN) and the number of execution lanes. These parameters allow design space exploration of the VPU across different configurations and help to understand which application scenarios fit which configurations. The proposed VPU was integrated into a 32-bit RISC-V processor. For the maximum parallelization configuration, 2.3x fewer cycles per instruction were achieved as compared to a RISC-V processor. Moreover, a relative cycle gain of 33-73% was achieved for different configurations as compared with the RISC-V processor.

5 citations

Proceedings Article
30 Sep 2013
TL;DR: It is shown that the GPU provides better performance per watt than the ARM cluster, but by less than an order of magnitude, which indicates the great potential of ARM clusters, given the differences in hardware and software between the alternatives.
Abstract: This paper compares two parallel architectures, the GPU and the integrated ARM cluster, for the execution of map-reduce applications. The comparison targets performance and power usage. The increasing importance of energy efficiency, especially for large distributed systems such as those frequently used for map-reduce, motivates the comparison of alternative parallel architectures. Because the different hardware platforms require specific map-reduce implementations, we selected two different implementations and showed that the GPU provides better performance per watt than the ARM cluster, but by less than an order of magnitude. These results indicate the great potential of ARM clusters, given the differences in hardware and software between the alternatives.

5 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 81% related
Benchmark (computing): 19.6K papers, 419.1K citations, 80% related
Programming paradigm: 18.7K papers, 467.9K citations, 77% related
Compiler: 26.3K papers, 578.5K citations, 77% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31