scispace - formally typeset
Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Journal Article
TL;DR: Results indicate that multitier topologies are comparable to the single-level dragonfly topology in terms of power and latency while providing higher bisection and reduced area overhead, albeit at higher packet latency owing to increased diameter.
Abstract: Exascale and datacenter systems require terabits per second of internode communication bandwidth to meet the performance demands of high-performance computing applications. High-radix routers combined with the scalable dragonfly topology have been proposed to reduce execution time and power dissipation. Although the dragonfly network has a low diameter for exascale networks, its relatively few global links reduce the bisection bandwidth and require adaptive routing to prevent congestion hot spots. Moreover, the number of ports in a high-radix router affects the router cost when implemented with alternate emerging technologies. In this article, the authors advocate multitier network topologies that combine scalable topologies for local (intracabinet) and global (intercabinet) interconnects, such as the k-ary n-cube, the flattened butterfly, and the dragonfly, leading to improved bisection bandwidth, manageable radix, and reduced link costs, albeit at higher packet latency owing to increased diameter. Because the power that metallic interconnects or coaxial cables would need to deliver this performance significantly exceeds the available power budget, we envision an entire exascale network composed of photonic links for communication and CMOS routers for switching. Results indicate that multitier topologies are comparable to the single-level dragonfly topology in terms of power and latency while providing higher bisection bandwidth and reduced area overhead.

7 citations
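
The abstract above turns on how router radix and global link count grow in a dragonfly. The following is a minimal sketch, assuming the widely used dragonfly parameterization (p terminals per router, a routers per group fully connected inside the group, h global links per router, and the maximum of a*h + 1 groups with one global link between every pair of groups). The class and its field names are illustrative and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    """Illustrative dragonfly sizing (assumed standard parameterization)."""
    p: int  # terminals (compute nodes) attached to each router
    a: int  # routers per group, all-to-all connected within the group
    h: int  # global (inter-group) links per router

    @property
    def radix(self) -> int:
        # Ports per router: p terminal ports, a - 1 local ports, h global ports.
        return self.p + (self.a - 1) + self.h

    @property
    def groups(self) -> int:
        # Maximum group count when every pair of groups shares one global link.
        return self.a * self.h + 1

    @property
    def terminals(self) -> int:
        return self.a * self.p * self.groups

    @property
    def global_links(self) -> int:
        # Each group exposes a*h global ports; each link uses two ports.
        return self.groups * self.a * self.h // 2


# Balanced configuration a = 2p = 2h.
df = Dragonfly(p=4, a=8, h=4)
print(df.radix, df.groups, df.terminals)  # 15 33 1056
```

With the balance a = 2p = 2h, a larger machine requires larger p, a and h together, so router radix rises with system size. This is the per-router port cost the abstract cites as a concern for emerging router technologies.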

Journal Article
TL;DR: The usefulness of the simulator for this type of study is demonstrated, and it is concluded that the superior behavior of multiobjective algorithms means they are recommended for use in modern scheduling systems.
Abstract: Today, in an energy-aware society, job scheduling is becoming an important task for computer engineers and system analysts, one that may lead to a performance-per-watt trade-off in computing infrastructures. Thus, new algorithms and a simulator of computing environments may help information and communications technology and data center managers make decisions on a solid experimental basis. Several simulators try to address performance and, to some extent, estimate energy consumption, but none has an energy model based on benchmark data countersigned by independent bodies such as the Standard Performance Evaluation Corporation. This is why we have implemented a performance and energy-aware scheduling (PEAS) simulator for high-performance computing. Furthermore, to evaluate the simulator, we propose an implementation of the non-dominated sorting genetic algorithm II (NSGA-II), a fast and elitist multiobjective genetic algorithm, for resource selection. With the help of the PEAS simulator, we have studied whether it is possible to provide an intelligent job allocation policy that saves energy and time without compromising performance. Our simulations show a great improvement in response time and power consumption. In most cases, NSGA-II performs better than other 'intelligent' algorithms such as multiobjective heterogeneous earliest finish time, and it clearly outperforms the first-fit algorithm. We demonstrate the usefulness of the simulator for this type of study and conclude that the superior behavior of multiobjective algorithms means they are recommended for use in modern scheduling systems. Copyright © 2015 John Wiley & Sons, Ltd.

7 citations
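
NSGA-II ranks candidate solutions into Pareto fronts before selection, which is what lets it trade response time against energy instead of optimizing one metric. The following is a minimal sketch of that fast non-dominated sorting step only, applied to hypothetical job allocations scored as (response time, energy), both minimized. It omits crowding distance, selection, crossover and mutation, and it is not the PEAS implementation; the candidate values are made up.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]  # values to minimize, e.g. (response_time, energy)


def dominates(u: Objectives, v: Objectives) -> bool:
    """u dominates v if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))


def fast_non_dominated_sort(points: Sequence[Objectives]) -> List[List[int]]:
    """Return Pareto fronts as lists of indices; front 0 is non-dominated."""
    n = len(points)
    dominated_by: List[List[int]] = [[] for _ in range(n)]  # solutions each one dominates
    domination_count = [0] * n  # how many solutions dominate each one
    fronts: List[List[int]] = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                dominated_by[i].append(j)
            elif dominates(points[j], points[i]):
                domination_count[i] += 1
        if domination_count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        next_front: List[int] = []
        for i in fronts[k]:
            # Removing front k may leave some solutions undominated.
            for j in dominated_by[i]:
                domination_count[j] -= 1
                if domination_count[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
        k += 1
    return fronts[:-1]  # drop the trailing empty front


# Hypothetical allocations: (response time in s, energy in kJ).
candidates = [(10, 5.0), (8, 6.0), (12, 4.0), (11, 6.5), (9, 7.0)]
print(fast_non_dominated_sort(candidates))  # [[0, 1, 2], [3, 4]]
```

Front 0 holds the allocations where neither time nor energy can improve without worsening the other. A first-fit policy, by contrast, commits to one allocation without comparing such trade-offs.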

Journal Article
TL;DR: This paper proposes high-performance and energy-efficient multicore architectures for workloads with a variety of parallelism levels and memory intensities, and applies dynamic voltage and frequency scaling within Amdahl's law to decrease the amount of dark silicon and improve performance and performance per watt/joule.
Abstract: As technology scales further, multicore and many-core processors have emerged as an alternative to keep up with performance demands. However, because of power and thermal constraints, we are obliged to power off a considerable area of the chip. Many innovative techniques have been presented to improve energy efficiency and keep utilization at the highest level. In this paper, we discuss different models and methods of exploiting dark silicon, and by applying dynamic voltage and frequency scaling within Amdahl's law and considering memory overheads, we attempt to decrease the amount of dark silicon and improve performance and performance per watt/joule. We propose high-performance and energy-efficient multicore architectures for workloads with a variety of parallelism levels and memory intensities. According to the results, with voltage scaling, for a highly parallel CPU-intensive workload we reach improvements of approximately 5.2× and 3.78× in performance per watt and performance per joule, respectively, while a performance reduction of about 27% must be tolerated. For memory-intensive applications, scaling produces a negligible change in speedup, while performance per watt and performance per joule for both serial and parallel applications improve by around 6×.

7 citations
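
To see why voltage-frequency scaling can raise performance per watt while lowering performance, the following sketch combines Amdahl's law with a deliberately simple power model. The model is an assumption for illustration, not the one used in the paper: runtime scales as 1/s for a frequency scale s (CPU-bound work), supply voltage scales linearly with frequency so an active core's dynamic power scales as s**3, and all n cores stay powered and leak a constant hypothetical `leak` units of static power each.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Classic Amdahl speedup for parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)


def dvfs_metrics(f: float, n: int, s: float, leak: float = 0.1) -> dict:
    """Normalized performance, power and performance per watt at frequency scale s.

    Illustrative assumptions only:
      * runtime scales as 1/s (no memory stalls),
      * voltage scales linearly with frequency, so one active core's dynamic
        power is s**3 (1.0 at nominal voltage and frequency),
      * all n cores stay powered, each leaking `leak` units of static power.
    """
    if not (0.0 < s <= 1.0 and 0.0 <= f <= 1.0 and n >= 1):
        raise ValueError("need 0 < s <= 1, 0 <= f <= 1, n >= 1")
    t_serial = (1.0 - f) / s     # serial phase: one core active
    t_parallel = f / (n * s)     # parallel phase: all n cores active
    runtime = t_serial + t_parallel
    dyn_energy = s**3 * (t_serial * 1 + t_parallel * n)
    static_energy = leak * n * runtime
    energy = dyn_energy + static_energy
    performance = 1.0 / runtime  # work units per unit time (1 unit of work)
    power = energy / runtime
    return {"performance": performance, "power": power,
            "perf_per_watt": performance / power}


nominal = dvfs_metrics(f=0.99, n=64, s=1.0)
scaled = dvfs_metrics(f=0.99, n=64, s=0.6)
print(scaled["perf_per_watt"] / nominal["perf_per_watt"])  # gain in perf/watt
print(scaled["performance"] / nominal["performance"])      # fraction of performance kept
```

Under these assumptions performance per watt reduces to work per joule (1/energy), and running at 60% frequency keeps 60% of the performance while cutting dynamic energy to 0.36 of nominal. For memory-bound work the 1/s runtime assumption breaks down, because stalls on memory do not shorten when the core is scaled.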

Proceedings Article
22 Jul 2015
TL;DR: This study examines a class of embedded system applications relevant to mobile vehicles to understand the limits of achievable energy efficiency under varying levels of system resilience constraints and considers static optimization of voltage-frequency settings on a per-application-segment basis.
Abstract: Low-power embedded processing typically relies on dynamic voltage-frequency scaling (DVFS) in order to optimize energy usage (and therefore, battery life). However, low voltage operation exacerbates the incidence of soft errors. Similarly, higher voltage operation (to meet real-time deadlines) is constrained by hard-failure rate limits. In this paper, we examine a class of embedded system applications relevant to mobile vehicles. We investigate the problem of assigning optimal voltage-frequency settings to individual segments within target workflows. The goal of this study is to understand the limits of achievable energy efficiency (performance per watt) under varying levels of system resilience constraints. To optimize for energy efficiency, we consider static optimization of voltage-frequency settings on a per-application-segment basis. We consider both linear and graph-structured workflows. In order to understand the loss in energy efficiency in the face of environmental uncertainties encountered by the mobile vehicle, we also study the effect of injecting random variations in the actual runtime of individual application segments. A dynamic re-optimization of the voltage-frequency settings is required to cope with such in-field uncertainties.

7 citations
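
The static per-segment assignment described above can be sketched for a small linear workflow as an exhaustive search. This is a minimal illustration, not the paper's method. It assumes three hypothetical voltage-frequency operating points whose power and soft-error-rate numbers are made up, models resilience as a cap on the expected number of soft errors, and handles only linear workflows, not graph-structured ones.

```python
from itertools import product

# Hypothetical V-F operating points: (frequency in GHz, power in W,
# soft-error rate in errors per second). Lower voltage gives lower power but a
# higher soft-error rate; the numbers are invented for illustration.
LEVELS = [
    (0.6, 0.5, 2e-3),
    (1.0, 1.2, 5e-4),
    (1.4, 2.6, 1e-4),
]


def best_assignment(cycles, deadline, max_expected_errors):
    """Pick one V-F level per segment of a linear workflow by exhaustive search.

    cycles: work per segment in gigacycles (segment runtime = cycles / frequency).
    Minimizes total energy subject to the deadline and a cap on the expected
    number of soft errors. Cost is exponential in len(cycles), so this only
    suits small workflows.
    Returns (energy_J, level_indices, runtime_s), or None if nothing is feasible.
    """
    best = None
    for choice in product(range(len(LEVELS)), repeat=len(cycles)):
        total_time = total_energy = expected_errors = 0.0
        for work, level in zip(cycles, choice):
            freq, power, ser = LEVELS[level]
            t = work / freq
            total_time += t
            total_energy += power * t
            expected_errors += ser * t
        if total_time <= deadline and expected_errors <= max_expected_errors:
            if best is None or total_energy < best[0]:
                best = (total_energy, choice, total_time)
    return best


result = best_assignment([2.0, 1.0, 3.0], deadline=8.0, max_expected_errors=1e-2)
print(result)
```

Here the search runs the 2-gigacycle segment at the lowest-power level and the other two at the middle level. Running more work at the lowest level would push the expected soft errors past the cap, and running everything there would also miss the deadline. Re-running such a search when segment runtimes drift is one simple form of the dynamic re-optimization the abstract mentions.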

Proceedings Article
15 Nov 2015
TL;DR: A course in GPU programming for senior undergraduates and first-year graduate students that has been taught at Clemson University annually since 2010 is described, with a focus on a large, real-world problem, in particular, a system for the parallel solution of partial differential equations.
Abstract: Compared to CPUs, modern GPUs exhibit a high ratio of computing performance per watt, so current supercomputer designs often include multiple racks of GPUs in order to achieve high teraflop counts at minimal energy cost. GPU programming is thus becoming increasingly important, yet it remains a challenging task. This paper describes a course in GPU programming for senior undergraduates and first-year graduate students that has been taught at Clemson University annually since 2010. The course uses problem-based learning, with a focus on a large, real-world problem: a system for the parallel solution of partial differential equations. Although the system for solving PDEs is useful in its own right, the problem is used as a vehicle for exploring the design issues faced by those attempting to achieve new levels of performance on architectures.

7 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance
Metrics
Number of papers in the topic in previous years
Year: Papers
2021: 14
2020: 15
2019: 15
2018: 36
2017: 25
2016: 31