scispace - formally typeset

Showing papers on "Performance per watt" published in 2010


Proceedings ArticleDOI
13 Nov 2010
TL;DR: Describes the programmer's view of the 48-core SCC processor, an intermediate design that shares traits of message passing and shared memory architectures, and RCCE, the native message passing model created for it.
Abstract: The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.
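The message-passing style described above can be sketched in ordinary Python. This is not the real RCCE API; the names and the queue-based transport are illustrative only. As on the SCC, each "core" owns a small message buffer, and send/receive block until the transfer completes.

```python
import queue
import threading

# Illustrative sketch of blocking message passing between cores.
# NOT the actual RCCE API: Core, send, and recv are hypothetical names.

class Core:
    def __init__(self, rank):
        self.rank = rank
        self.inbox = queue.Queue(maxsize=1)  # one buffer slot per core

def send(cores, data, dest):
    cores[dest].inbox.put(data)    # blocks while the destination slot is full

def recv(cores, me):
    return cores[me].inbox.get()   # blocks until a message arrives

cores = [Core(r) for r in range(2)]

# "Core 0" runs in a thread and sends a message to core 1.
sender = threading.Thread(target=send, args=(cores, "hello from core 0", 1))
sender.start()
msg = recv(cores, me=1)            # core 1 receives
sender.join()
print(msg)  # hello from core 0
```

The single-slot buffer mirrors why such designs feel like "message passing on shared memory": the sender stalls until the receiver drains the slot.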

267 citations


Proceedings ArticleDOI
13 Apr 2010
TL;DR: Proposes, implements, and evaluates CAMP, a Comprehensive AMP scheduler that delivers both efficiency and TLP specialization, along with a new lightweight technique for discovering which threads utilize fast cores most efficiently.
Abstract: Symmetric-ISA (instruction set architecture) asymmetric-performance multicore processors were shown to deliver higher performance per watt and area for applications with diverse architectural requirements, and so it is likely that future multicore processors will combine a few fast cores characterized by complex pipelines, high clock frequency, high area requirements and power consumption, and many slow ones, characterized by simple pipelines, low clock frequency, low area requirements and power consumption. Asymmetric multicore processors (AMP) derive their efficiency from core specialization. Efficiency specialization ensures that fast cores are used for "CPU-intensive" applications, which efficiently utilize these cores' "expensive" features, while slow cores would be used for "memory-intensive" applications, which utilize fast cores inefficiently. TLP (thread-level parallelism) specialization ensures that fast cores are used to accelerate sequential phases of parallel applications, while leaving slow cores for energy-efficient execution of parallel phases. Specialization is effected by an asymmetry-aware thread scheduler, which maps threads to cores in consideration of the properties of both. Previous asymmetry-aware schedulers employed one type of specialization (either efficiency or TLP), but not both. As a result, they were effective only for limited workload scenarios. We propose, implement, and evaluate CAMP, a Comprehensive AMP scheduler, which delivers both efficiency and TLP specialization. Furthermore, we propose a new light-weight technique for discovering which threads utilize fast cores most efficiently. Our evaluation in the OpenSolaris operating system demonstrates that CAMP accomplishes an efficient use of an AMP system for a variety of workloads, while existing asymmetry-aware schedulers were effective only in limited scenarios.
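The two specializations CAMP combines can be sketched as a single ranking policy. The heuristic and the numbers below are illustrative, not the paper's actual algorithm: sequential-phase threads get fast cores first (TLP specialization), then threads with the largest fast-core speedup (efficiency specialization).

```python
# Hypothetical sketch of combined TLP + efficiency specialization.

def assign_cores(threads, n_fast):
    """threads: list of (name, fast_core_speedup, is_sequential_phase).
    Sequential bottlenecks first, then the best fast-core utilizers."""
    ranked = sorted(threads, key=lambda t: (not t[2], -t[1]))
    fast = [t[0] for t in ranked[:n_fast]]
    slow = [t[0] for t in ranked[n_fast:]]
    return fast, slow

threads = [
    ("parallel_worker", 1.1, False),  # memory-bound: little fast-core benefit
    ("cpu_hog",         1.9, False),  # CPU-intensive: big fast-core benefit
    ("serial_phase",    1.5, True),   # sequential bottleneck
]
fast, slow = assign_cores(threads, n_fast=1)
print(fast, slow)  # ['serial_phase'] ['cpu_hog', 'parallel_worker']
```

With a second fast core available, the CPU-intensive thread would also be promoted, which is the "both specializations at once" behavior that distinguishes CAMP from single-policy schedulers.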

144 citations


Proceedings ArticleDOI
18 Jan 2010
TL;DR: This work proposes a joint thermal and energy management technique specifically designed for heterogeneous MPSoCs that simultaneously reduces the thermal hot spots, temperature gradients, and energy consumption significantly.
Abstract: Heterogeneous multiprocessor system-on-chips (MPSoCs) which consist of cores with various power and performance characteristics can customize their configuration to achieve higher performance per Watt. On the other hand, inherent imbalance in power densities across MPSoCs leads to non-uniform temperature distributions, which affect performance and reliability adversely. In addition, managing temperature might result in conflicting decisions with achieving higher energy efficiency. In this work, we propose a joint thermal and energy management technique specifically designed for heterogeneous MPSoCs. Our technique identifies the performance demands of the current workload. By utilizing job scheduling and voltage/frequency scaling dynamically, we meet the desired performance while minimizing the energy consumption and the thermal imbalance. In comparison to performance-aware policies such as load balancing, our technique simultaneously reduces the thermal hot spots, temperature gradients, and energy consumption significantly.
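The voltage/frequency-scaling half of such a policy can be sketched as follows. The operating points and the capacitance constant are hypothetical, not the paper's controller: the idea is simply that dynamic power grows roughly as C·V²·f, so the manager picks the lowest-power point that still meets the workload's performance demand.

```python
# Illustrative DVFS selection: hypothetical (frequency GHz, voltage V) pairs.
OPERATING_POINTS = [(0.8, 0.9), (1.2, 1.0), (1.6, 1.1), (2.0, 1.2)]
CAPACITANCE = 1.0  # arbitrary units

def dynamic_power(freq, volt):
    return CAPACITANCE * volt ** 2 * freq  # P ~ C * V^2 * f

def pick_operating_point(demand_ghz):
    """Lowest-power point whose frequency covers the performance demand."""
    feasible = [p for p in OPERATING_POINTS if p[0] >= demand_ghz]
    if not feasible:
        return OPERATING_POINTS[-1]  # saturate at the top point
    return min(feasible, key=lambda p: dynamic_power(*p))

print(pick_operating_point(1.0))  # (1.2, 1.0)
```

A thermal-aware version would additionally penalize points that raise the temperature of an already-hot core, which is where the job-scheduling half of the technique comes in.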

61 citations


Journal ArticleDOI
TL;DR: The Sparc64 VIIIfx eight-core processor, developed for use in petascale computing systems, runs at speeds of up to 2 GHz and achieves a peak performance of 128 gigaflops while consuming as little as 58 watts of power.
Abstract: The Sparc64 VIIIfx eight-core processor, developed for use in petascale computing systems, runs at speeds of up to 2 GHz and achieves a peak performance of 128 gigaflops while consuming as little as 58 watts of power. Sparc64 VIIIfx realizes a six-fold improvement in performance per watt over previous generation Sparc64 processors.
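The headline figures quoted above imply the following performance per watt, by simple arithmetic on the numbers in the abstract:

```python
# Peak performance per watt implied by the abstract's figures.
peak_gflops = 128.0
power_watts = 58.0

gflops_per_watt = peak_gflops / power_watts
print(round(gflops_per_watt, 2))  # 2.21
```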

56 citations


Proceedings ArticleDOI
15 Aug 2010
TL;DR: An inexpensive hardware system for monitoring power usage of individual CPU hosts and externally attached GPUs in HPC clusters and the software stack for integrating the power usage data streamed in real-time by the power monitoring hardware with the cluster management software tools are presented.
Abstract: We present an inexpensive hardware system for monitoring power usage of individual CPU hosts and externally attached GPUs in HPC clusters and the software stack for integrating the power usage data streamed in real-time by the power monitoring hardware with the cluster management software tools. We introduce a measure for quantifying the overall improvement in performance-per-watt for applications that have been ported to work on the GPUs. We use the developed hardware/software infrastructure to demonstrate the overall improvement in performance-per-watt for several HPC applications implemented to work on GPUs.

44 citations


Proceedings ArticleDOI
18 Mar 2010
TL;DR: Describes a heterogeneous multi-core SoC for applications such as digital TV systems with IP networks (IP-TV), including image recognition and database search; automatic parallelization compilers analyze data-flow parallelism, generate coarse-grain tasks, and schedule them to minimize execution time considering data transfer overhead on the general-purpose CPUs and FEs.
Abstract: We develop a heterogeneous multi-core SoC for applications such as digital TV systems with IP networks (IP-TV), including image recognition and database search. Figure 5.3.1 shows the chip features. This SoC is capable of decoding 1080i audio/video data using part of the SoC (one general-purpose CPU core, a video processing unit called VPU5, and a sound processing unit called SPU) [1]. Four dynamically reconfigurable processors called FE [2] are integrated, with a total theoretical performance of 41.5GOPS and power consumption of 0.76W. Two 1024-way matrix-processors called MX-2 [3] are integrated, with a total theoretical performance of 36.9GOPS and power consumption of 1.10W. Overall, the performance per watt of our SoC is 37.3GOPS/W at 1.15V, the highest among comparable processors [4–6] excluding special-purpose codecs. The operation granularities of the CPU, FE, and MX-2 are 32bit, 16bit, and 4bit respectively, so we can assign the appropriate processor to each task in an effective manner. A heterogeneous multi-core approach is one of the most promising ways to attain high performance at low frequency, and thus low power, for consumer electronics and scientific applications, compared to homogeneous multi-core SoCs [4]. For example, for the image-recognition application in the IP-TV system, the FEs are assigned to calculate the optical flow [7] of VGA (640×480) video data at 15fps, which requires 0.62GOPS. The MX-2s are used for face detection and calculation of the feature quantity of the VGA video data at 15fps, which requires 30.6GOPS. In addition, the general-purpose CPU cores are used for database search based on the results of the above operations, which requires further CPU enhancement. The automatic parallelization compilers analyze the parallelism of the data flow, generate coarse-grain tasks, and schedule the tasks across the general-purpose CPUs and FEs to minimize execution time while accounting for data transfer overhead.
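The per-accelerator efficiency implied by the abstract's figures can be checked directly (the chip's overall 37.3GOPS/W also accounts for the rest of the SoC, so it sits between the two accelerator figures):

```python
# Performance per watt of each accelerator, from the quoted figures.
fe_gops, fe_watts = 41.5, 0.76    # four FE processors combined
mx2_gops, mx2_watts = 36.9, 1.10  # two MX-2 processors combined

print(round(fe_gops / fe_watts, 1))    # 54.6 GOPS/W
print(round(mx2_gops / mx2_watts, 1))  # 33.5 GOPS/W
```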

33 citations


Proceedings ArticleDOI
17 May 2010
TL;DR: Two different implementations of the parallelism-aware (PA) scheduling policy are created in OpenSolaris and evaluated on real hardware, where asymmetry is emulated via CPU frequency scaling.
Abstract: Asymmetric multicore processors (AMP) promise higher performance per watt than their symmetric counterparts, and it is likely that future processors will integrate a few fast out-of-order cores, coupled with a large number of simpler, slow cores, all exposing the same instruction-set architecture (ISA). It is well known that one of the most effective ways to leverage the potential of these systems is to use fast cores to accelerate sequential phases of parallel applications, and to use slow cores for running parallel phases. At the same time, we are not aware of any implementation of this parallelism-aware (PA) scheduling policy in an operating system. So the questions of whether this policy can be delivered efficiently by the operating system to unmodified applications, and what the associated overheads are, remain open. To answer these questions we created two different implementations of the PA policy in OpenSolaris and evaluated them on real hardware, where asymmetry was emulated via CPU frequency scaling. This paper reports our findings with regard to benefits and drawbacks of this scheduling policy.
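The core of the PA policy fits in a few lines. This is an illustrative sketch, not the OpenSolaris implementation: when an application's runnable thread count drops to one (a sequential phase), that thread is migrated to a fast core; during parallel phases, threads stay on the slow cores.

```python
# Hypothetical sketch of parallelism-aware (PA) placement.

def pa_placement(runnable_threads):
    """Map each runnable thread to a core type."""
    if len(runnable_threads) == 1:
        return {runnable_threads[0]: "fast"}   # accelerate the sequential phase
    return {t: "slow" for t in runnable_threads}  # energy-efficient parallel phase

print(pa_placement(["t0"]))              # {'t0': 'fast'}
print(pa_placement(["t0", "t1", "t2"]))  # all on slow cores
```

The real scheduler's difficulty, which the paper evaluates, is detecting phase transitions cheaply and keeping migration overhead below the speedup the fast core provides.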

30 citations


Proceedings ArticleDOI
16 May 2010
TL;DR: Proposes a self-adaptive scheduler that exploits program behavior at runtime by matching the computational demands of threads to the capabilities of cores, using a novel empirical model to predict the appropriate core for changing program phases within threads.
Abstract: Asymmetric chip multiprocessors are imminent in the multi-core era primarily due to their potential for power-performance efficiency. In order for software to fully realize this potential, the scheduling of threads to cores must be automated to adapt to the changing program behavior. However, strict system abstraction layers limit the controllability and observability of low-level hardware details, thereby limiting state-of-the-art systems to manual or static mapping of threads to cores in an asymmetric multi-core. In this paper, we propose a self-adaptive scheduler that exploits program behavior at runtime by matching the computational demands of threads to the capabilities of cores. We present a novel empirical model to predict the selection of an appropriate core (based on optimizing throughput, power, or performance per watt) for changing program phases within threads. Thread migration is initiated when an optimal mapping of threads to cores is predicted. Results show that our predictive schedulers for the three target optimizations are within 10% of the ideal scheduler.

28 citations


Journal ArticleDOI
TL;DR: The article explores the Gordon design space and the design of a specialized flash translation layer for data-centric applications; Gordon systems can outperform disk-based clusters by 1.5x and deliver 2.5x more performance per watt.
Abstract: Gordon is a system architecture for data-centric applications that combines low-power processors, flash memory, and data-centric programming systems to improve performance and efficiency. The article explores the Gordon design space and the design of a specialized flash translation layer. Gordon systems can outperform disk-based clusters by 1.5x and deliver 2.5x more performance per watt.

15 citations


Proceedings ArticleDOI
18 Aug 2010
TL;DR: This paper proposes and implements mechanisms and policies for a commercial OS scheduler and load balancer which incorporates thread characteristics, and shows that it results in improvements of up to 30% in performance per watt.
Abstract: Runtime characteristics of individual threads (such as IPC, cache usage, etc.) are a critical factor in making efficient scheduling decisions in modern chip-multiprocessor systems. They provide key insights into how threads interact when they share processor resources, and affect the overall system power and performance efficiency. In this paper, we propose and implement mechanisms and policies for a commercial OS scheduler and load balancer which incorporates thread characteristics, and show that it results in improvements of up to 30% in performance per watt.
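The idea of folding thread characteristics into load balancing can be sketched as follows. This is illustrative, not the commercial scheduler's code: instead of balancing raw thread counts, the balancer spreads an IPC-derived load estimate so that resource-hungry threads are not co-located.

```python
# Hypothetical characteristic-aware load balancer: greedy least-loaded
# placement of threads weighted by a per-thread load estimate.

def balance(threads, n_cores):
    """threads: list of (name, load_estimate)."""
    cores = [{"load": 0.0, "threads": []} for _ in range(n_cores)]
    # Place the heaviest threads first, each onto the least-loaded core.
    for name, load in sorted(threads, key=lambda t: -t[1]):
        target = min(cores, key=lambda c: c["load"])
        target["threads"].append(name)
        target["load"] += load
    return [c["threads"] for c in cores]

threads = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.2)]
print(balance(threads, 2))  # [['a', 'd'], ['b', 'c']]
```

A count-based balancer would happily pair the two heavy threads; weighting by measured characteristics is what produces the power/performance gains the paper reports.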

14 citations


Proceedings ArticleDOI
15 Dec 2010
TL;DR: The improved parameterized FPGA implementation is a system-level abstraction of hardware-oriented parallel programming, as an alternative to gate-level Hardware Descriptive Language (HDL), to satisfy the high performance computation of parallel multidimensional filtering algorithms at a minimal development-to-market time.
Abstract: Two hardware architectures are developed via an improved parameterized FPGA implementation method for parallel 1-D real-time signal filtering algorithms, to provide higher performance per Watt and minimum logic area at maximum frequency. The improvement is evident in rapid system-level-abstraction FPGA prototyping and in optimized speed, area, and power: targeting a Virtex-6 XC6VLX130TL-1LFF1156 FPGA, the designs achieve a power consumption of 820 mW and 27%–44% less device utilization at a maximum frequency of up to 231 MHz using Xilinx System Generator. The improved parameterized FPGA implementation is a system-level abstraction for hardware-oriented parallel programming, offered as an alternative to gate-level Hardware Description Language (HDL) design, to satisfy the high-performance computation requirements of parallel multidimensional filtering algorithms at minimal development-to-market time.
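The 1-D filtering algorithms mapped to the FPGA are convolutions; a pure-software reference of one such filter (a 2-tap moving average, chosen here purely for illustration) clarifies what the parallel hardware computes per output sample:

```python
# Direct-form FIR filter: y[n] = sum_k taps[k] * x[n-k].
# On the FPGA, the inner multiply-accumulates run in parallel per sample.

def fir(signal, taps):
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:          # samples before the signal start are zero
                acc += h * signal[n - k]
        out.append(acc)
    return out

print(fir([2.0, 4.0, 6.0], [0.5, 0.5]))  # [1.0, 3.0, 5.0]
```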

Proceedings ArticleDOI
18 Dec 2010
TL;DR: This paper provides an autonomic power management scheme for the resource provisioning process for large-scale data centers while meeting the Service-Level Agreement (SLA) and power requirements.
Abstract: The characteristic of dramatic fluctuation in the resource provisioning for real-time applications calls for an elastic delivery of computing services. Current data center deployment schemes, which feature a strong tie between servers and applications, are increasingly challenged to ensure power efficiency in terms of multiple peak-load provisioning, optimal average resource utilization, variable runtime workload profiling, data center manageability, and overhead control on the data center Total Cost of Ownership (TCO). Researchers have exploited paradigms such as virtualization and migration for large-scale computing systems; however, there is still a long way to go before we can optimally address the power-performance trade-off. This paper provides an autonomic power management scheme for the resource provisioning process in large-scale data centers while meeting the Service-Level Agreement (SLA) and power requirements. The system status is continuously monitored using a cross-layered hierarchy to optimally scale the virtual machine resources up and down such that power and performance can be optimized. We have applied our technique to autonomically manage high performance platforms with multi-core processors and multi-rank memory subsystems. Our experimental results show around 56.25 percent platform energy savings for a memory-intensive workload, 63.75 percent for a processor-intensive workload, and 4.75 percent for a mixed workload while maintaining
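The scale-up/scale-down loop at the heart of such a scheme can be sketched as a threshold controller. The function name, thresholds, and bounds below are hypothetical, not the paper's controller: monitored utilization drives the virtual machine's allocation so that SLA headroom is kept without powering resources for idle capacity.

```python
# Illustrative autonomic vCPU controller (hypothetical thresholds).

def adjust_vcpus(vcpus, utilization, lo=0.3, hi=0.8, max_vcpus=16):
    """Scale a VM's vCPUs up when hot, down when cold, within bounds."""
    if utilization > hi and vcpus < max_vcpus:
        return vcpus + 1   # scale up to protect the SLA
    if utilization < lo and vcpus > 1:
        return vcpus - 1   # scale down to save power
    return vcpus           # in the comfort band: no change

print(adjust_vcpus(4, 0.9))  # 5
print(adjust_vcpus(4, 0.2))  # 3
print(adjust_vcpus(4, 0.5))  # 4
```

The paper's cross-layered monitoring generalizes this to multiple resource types (cores, memory ranks), but the control structure is the same.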

Book ChapterDOI
21 May 2010
TL;DR: Presents a finite difference scheme solving the general convection-diffusion-reaction equations, adapted for Graphics Processing Units (GPUs) and multithreading on many-core systems.
Abstract: Many-core systems play a key role in High Performance Computing (HPC) nowadays. This platform shows great potential in performance per watt, performance per floor area, cost performance, and so on. This paper presents a finite difference scheme solving the general convection-diffusion-reaction equations, adapted for Graphics Processing Units (GPUs) and multithreading. A two-dimensional nonlinear Burgers' equation was chosen as the test case. The best results we measured are a speed-up ratio of 12 times at mesh size 1026×1026 using the GPU and 20 times at mesh size 514×514 using all 8 CPU cores, compared with an equivalent single-CPU code.
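A minimal serial reference shows the kind of stencil such codes map to GPUs and multithreaded CPUs. This sketch does one explicit time step of the 1-D viscous Burgers' equation u_t + u·u_x = ν·u_xx (upwind convection, central diffusion); it is a 1-D simplification for illustration, not the paper's 2-D solver, but the per-grid-point independence of the update is exactly what parallelizes so well.

```python
# One explicit time step of 1-D viscous Burgers' equation.

def burgers_step(u, dt, dx, nu):
    new = u[:]  # boundary values kept fixed
    for i in range(1, len(u) - 1):
        conv = u[i] * (u[i] - u[i - 1]) / dx                    # upwind u*u_x
        diff = nu * (u[i + 1] - 2 * u[i] + u[i - 1]) / dx ** 2  # central u_xx
        new[i] = u[i] + dt * (diff - conv)
    return new

u = [0.0, 1.0, 0.0, 0.0]
u1 = burgers_step(u, dt=0.1, dx=1.0, nu=0.1)
print([round(v, 4) for v in u1])  # [0.0, 0.88, 0.01, 0.0]
```

On a GPU, each interior point's update becomes one thread's work; the speed-ups quoted above come from running the whole mesh of such updates concurrently.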