
Showing papers on "Performance per watt" published in 2009


Proceedings ArticleDOI
07 Mar 2009
TL;DR: The paper presents an exhaustive analysis of the design space of Gordon systems, focusing on the trade-offs between power, energy, and performance that Gordon must make, and describes a novel flash translation layer tailored to data-intensive workloads and large flash storage arrays.
Abstract: As our society becomes more information-driven, we have begun to amass data at an astounding and accelerating rate. At the same time, power concerns have made it difficult to bring the necessary processing power to bear on querying, processing, and understanding this data. We describe Gordon, a system architecture for data-centric applications that combines low-power processors, flash memory, and data-centric programming systems to improve performance for data-centric applications while reducing power consumption. The paper presents an exhaustive analysis of the design space of Gordon systems, focusing on the trade-offs between power, energy, and performance that Gordon must make. It analyzes the impact of flash storage and the Gordon architecture on the performance and power efficiency of data-centric applications. It also describes a novel flash translation layer tailored to data-intensive workloads and large flash storage arrays. Our data show that, using technologies available in the near future, Gordon systems can outperform disk-based clusters by 1.5× and deliver up to 2.5× more performance per Watt.

277 citations
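
The two headline ratios pin down a third number: performance per watt is throughput divided by power, so 1.5× the performance at 2.5× the performance per Watt implies Gordon draws roughly 60% of the cluster's power. A minimal sketch of that arithmetic:

```c
/* Minimal sketch: how the two headline ratios from the Gordon paper
 * relate. performance-per-watt = throughput / power, so the implied
 * power ratio follows directly from the two reported ratios. */
#include <stdio.h>

int main(void) {
    double perf_ratio = 1.5;          /* Gordon vs. disk-based cluster */
    double perf_per_watt_ratio = 2.5;

    /* (perf_g/power_g) / (perf_d/power_d) = 2.5 and perf_g/perf_d = 1.5
     * => power_g/power_d = 1.5 / 2.5 */
    double power_ratio = perf_ratio / perf_per_watt_ratio;
    printf("implied power ratio: %.2f (%.0f%% of the cluster's power)\n",
           power_ratio, power_ratio * 100.0);
    return 0;
}
```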


Journal ArticleDOI
TL;DR: This work proposes a Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm that matches jobs to cores using per-thread architectural signatures (compact summaries of threads' architectural properties collected offline) and is comparatively simple and scalable.
Abstract: Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm that does the matching using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline. The resulting algorithm does not rely on dynamic profiling, and is comparatively simple and scalable. We implemented HASS in OpenSolaris, and achieved average workload speedups of up to 13%, matching the best static assignment, achievable only by an oracle. We have also implemented a dynamic IPC-driven algorithm proposed earlier that relies on online profiling. We found that the complexity, load imbalance and associated performance degradation resulting from dynamic profiling are significant challenges to using this algorithm successfully. As a result it failed to deliver expected performance gains and to outperform HASS.

256 citations
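
The abstract does not spell out the signature format or the matching rule, so the following is only a minimal sketch of signature-supported assignment under a strong simplifying assumption: each thread's offline signature is reduced to a single number, its estimated speedup on a fast core relative to a slow one (HASS's actual signatures are richer).

```c
/* Sketch of signature-supported thread-to-core matching; an
 * illustration of the idea, not the paper's algorithm. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { const char *name; double fast_core_speedup; } thread_sig;

static int by_speedup_desc(const void *a, const void *b) {
    double d = ((const thread_sig *)b)->fast_core_speedup -
               ((const thread_sig *)a)->fast_core_speedup;
    return (d > 0) - (d < 0);
}

int main(void) {
    /* Offline-collected signatures (hypothetical values). */
    thread_sig t[] = {
        {"cpu-bound",   1.90},  /* benefits a lot from a fast core   */
        {"mixed",       1.40},
        {"mem-bound-a", 1.10},  /* mostly stalled on memory anyway   */
        {"mem-bound-b", 1.05},
    };
    int n = (int)(sizeof t / sizeof t[0]), fast_cores = 2;

    /* Greedy matching: the threads that profit most get the fast cores. */
    qsort(t, (size_t)n, sizeof t[0], by_speedup_desc);
    for (int i = 0; i < n; i++)
        printf("%-11s -> %s core\n", t[i].name,
               i < fast_cores ? "fast" : "slow");
    return 0;
}
```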


Proceedings ArticleDOI
27 Jul 2009
TL;DR: This work adapts the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric, and is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
Abstract: As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has shifted its course toward higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

177 citations
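
FCUDA's actual output targets the AutoPilot high-level synthesis tool; the sketch below only illustrates the heart of such a source-to-source transformation as the abstract describes it: the implicit SPMD thread grid of a CUDA kernel becomes explicit C loops that an HLS tool can unroll and pipeline. The kernel and all constants are hypothetical.

```c
/* Illustrative sketch of an FCUDA-style transformation (not the tool's
 * actual output): an SPMD CUDA kernel body such as
 *
 *     __global__ void saxpy(float a, float *x, float *y)
 *     { int i = blockIdx.x * blockDim.x + threadIdx.x; y[i] += a * x[i]; }
 *
 * becomes explicit C loops over block and thread indices, which a
 * high-level synthesis tool like AutoPilot can then unroll/pipeline. */
#include <stdio.h>

#define GRID_DIM  4
#define BLOCK_DIM 8

static void saxpy_fcuda(float a, const float *x, float *y) {
    for (int blockIdx_x = 0; blockIdx_x < GRID_DIM; blockIdx_x++) {
        /* Thread loop: the candidate for HLS unrolling into parallel cores. */
        for (int threadIdx_x = 0; threadIdx_x < BLOCK_DIM; threadIdx_x++) {
            int i = blockIdx_x * BLOCK_DIM + threadIdx_x;
            y[i] += a * x[i];
        }
    }
}

int main(void) {
    float x[GRID_DIM * BLOCK_DIM], y[GRID_DIM * BLOCK_DIM];
    for (int i = 0; i < GRID_DIM * BLOCK_DIM; i++) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy_fcuda(2.0f, x, y);
    printf("y[5] = %.1f\n", y[5]);   /* 1 + 2*5 = 11.0 */
    return 0;
}
```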


Proceedings ArticleDOI
29 Jul 2009
TL;DR: Results show that, for gravitational force calculation and many-body simulations in general, GPUs are very competitive in terms of performance and performance per dollar figures, whereas FPGAs are competitive in terms of performance per Watt figures.
Abstract: In this paper, we describe the implementation of gravitational force calculation for N-body simulations in the context of astrophysics. We describe high-performance implementations on general-purpose processors, GPUs, and FPGAs, and compare them using a number of criteria including speed performance, power efficiency, and cost of development. These results show that, for gravitational force calculation and many-body simulations in general, GPUs are very competitive in terms of performance and performance-per-dollar figures, whereas FPGAs are competitive in terms of performance-per-Watt figures.

39 citations
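
For reference, the computational core being compared across CPUs, GPUs, and FPGAs is the classic O(N²) pairwise force calculation. A minimal softened-gravity version in C (constants and bodies are illustrative):

```c
/* Minimal sketch of the O(N^2) gravitational force kernel: for each
 * body, accumulate the softened pairwise acceleration from every
 * other body. Compile with -lm. */
#include <stdio.h>
#include <math.h>

#define N   3
#define G   6.674e-11   /* gravitational constant */
#define EPS 1e-9        /* softening to avoid division by zero */

typedef struct { double x, y, z, m; } body;

static void accel(const body b[N], double ax[N], double ay[N], double az[N]) {
    for (int i = 0; i < N; i++) {
        ax[i] = ay[i] = az[i] = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + EPS;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            ax[i] += G * b[j].m * dx * inv_r3;
            ay[i] += G * b[j].m * dy * inv_r3;
            az[i] += G * b[j].m * dz * inv_r3;
        }
    }
}

int main(void) {
    /* Illustrative bodies: roughly Earth, Moon, and a test mass. */
    body b[N] = {{0,0,0,5.97e24}, {3.84e8,0,0,7.35e22}, {0,3.84e8,0,1e3}};
    double ax[N], ay[N], az[N];
    accel(b, ax, ay, az);
    printf("a[1].x = %g m/s^2\n", ax[1]);  /* pull of body 0 on body 1 */
    return 0;
}
```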


Proceedings ArticleDOI
20 Jun 2009
TL;DR: The analyses and optimizations of the CHiMPS compiler that construct many-cache caches are presented, showing a performance advantage of 7.8x over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x less, and consequently performance per watt that is also greater.
Abstract: Many-cache is a memory architecture that efficiently supports caching in commercially available FPGAs. It facilitates FPGA programming for high-performance computing (HPC) developers by providing them with memory performance that is greater and power consumption that is less than their current CPU platforms, but without sacrificing their familiar, C-based programming environment. Many-cache creates multiple, multi-banked caches on top of an FPGA's small, independent memories, each targeting a particular data structure or region of memory in an application and each customized for the memory operations that access it. The caches are automatically generated from C source by the CHiMPS C-to-FPGA compiler. This paper presents the analyses and optimizations of the CHiMPS compiler that construct many-cache caches. An architectural evaluation of CHiMPS-generated FPGAs demonstrates a performance advantage of 7.8x (geometric mean) over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x less, and consequently performance per watt that is also greater, by a geometric mean of 21.3x.

37 citations
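
As an illustration of the many-cache idea (not the CHiMPS implementation), the toy model below gives each memory region its own direct-mapped cache with its own geometry: a streamed array with no reuse gets almost nothing from its cache, while a small reused table hits every time once warmed.

```c
/* Toy per-data-structure caches; all geometries are illustrative. */
#include <stdio.h>

typedef struct {
    unsigned lines;        /* number of direct-mapped lines */
    unsigned tags[64];     /* tag store (max geometry for this toy) */
    unsigned hits, misses;
} toy_cache;

static void cache_init(toy_cache *c, unsigned lines) {
    c->lines = lines;
    c->hits = c->misses = 0;
    for (unsigned i = 0; i < lines; i++) c->tags[i] = ~0u;  /* invalid */
}

static void cache_access(toy_cache *c, unsigned addr) {
    unsigned idx = addr % c->lines, tag = addr / c->lines;
    if (c->tags[idx] == tag) c->hits++;
    else { c->tags[idx] = tag; c->misses++; }
}

int main(void) {
    /* One cache per memory region: the streamed array gets few lines,
     * the small reused table gets enough lines to hold it entirely. */
    toy_cache stream, table;
    cache_init(&stream, 4);
    cache_init(&table, 16);

    for (int pass = 0; pass < 2; pass++) {
        for (unsigned a = 0; a < 64; a++) cache_access(&stream, a); /* no reuse */
        for (unsigned a = 0; a < 16; a++) cache_access(&table, a);  /* full reuse */
    }
    printf("stream cache: %u hits, %u misses\n", stream.hits, stream.misses);
    printf("table cache : %u hits, %u misses\n", table.hits, table.misses);
    return 0;
}
```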


Journal ArticleDOI
TL;DR: This article presents the RC Amenability Test (RAT), a methodology and model supporting rapid exploration and prediction of strategic design tradeoffs during the formulation stage of application development.
Abstract: While the promise of achieving speedup and additional benefits such as high performance per watt with FPGAs continues to expand, chief among the challenges with the emerging paradigm of reconfigurable computing is the complexity in application design and implementation. Before a lengthy development effort is undertaken to map a given application to hardware, it is important that a high-level parallel algorithm crafted for that application first be analyzed relative to the target platform, so as to ascertain the likelihood of success in terms of potential speedup. This article presents the RC Amenability Test, or RAT, a methodology and model developed for this purpose, supporting rapid exploration and prediction of strategic design tradeoffs during the formulation stage of application development.

34 citations
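
The abstract does not reproduce RAT's equations, so the following is only a simplified sketch of the kind of back-of-envelope prediction such a model formalizes: estimate reconfigurable-computing execution time as communication plus computation and divide it into a measured software baseline. All parameter names and values are hypothetical.

```c
/* Simplified speedup-prediction sketch, not the paper's RAT model. */
#include <stdio.h>

int main(void) {
    /* Hypothetical design parameters for one kernel invocation. */
    double bytes_in   = 64e6;   /* data sent to the FPGA          */
    double bytes_out  = 64e6;   /* data read back                 */
    double bw         = 500e6;  /* interconnect bytes per second  */
    double ops        = 4e9;    /* operations per invocation      */
    double f_clk      = 100e6;  /* FPGA clock (Hz)                */
    double ops_per_cy = 64;     /* parallelism of the pipeline    */
    double t_soft     = 2.0;    /* measured software time (s)     */

    double t_comm = (bytes_in + bytes_out) / bw;
    double t_comp = ops / (f_clk * ops_per_cy);
    double t_rc   = t_comm + t_comp;   /* assumes no comm/comp overlap */

    printf("t_comm=%.3fs t_comp=%.3fs -> predicted speedup %.2fx\n",
           t_comm, t_comp, t_soft / t_rc);
    return 0;
}
```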


Journal ArticleDOI
TL;DR: It is proposed that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance.
Abstract: Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a heuristic that can aid in intelligently scheduling these virtualized workloads to maximize performance while reducing power consumption. We propose that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance. We test and validate our hypothesis and further propose a metric, the Combined Usage of a domain, to assist in future energy-efficient scheduling. Our preliminary findings show that the Combined Usage metric can be used as a starting point to gauge the sensitivity of a guest domain to variations in the controlling domain's frequency.

26 citations
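
The abstract names the Combined Usage metric but does not define it, so the sketch below is a labeled guess at its flavor, not the paper's formula: sum a guest's own CPU usage with the control-domain usage incurred on its behalf, and treat guests that generate substantial control-domain load as sensitive to the control domain's core frequency.

```c
/* Hypothetical sketch of a combined-usage heuristic; the metric's
 * real definition is in the paper, not reproduced here. */
#include <stdio.h>

typedef struct { const char *guest; double guest_cpu, dom0_cpu; } domain;

int main(void) {
    domain d[] = {
        {"web-vm",  0.30, 0.25},  /* I/O heavy: much dom0 work on its behalf */
        {"calc-vm", 0.90, 0.02},  /* compute bound: dom0 nearly idle         */
    };
    for (int i = 0; i < 2; i++) {
        /* Assumed form of the metric, not the paper's definition. */
        double combined = d[i].guest_cpu + d[i].dom0_cpu;
        /* Crude proxy: heavy dom0 involvement suggests the guest would
         * notice a slow control-domain core. */
        int sensitive = d[i].dom0_cpu > 0.10;
        printf("%s: combined usage %.2f -> run dom0 on a %s core\n",
               d[i].guest, combined, sensitive ? "fast" : "slow");
    }
    return 0;
}
```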


Proceedings ArticleDOI
22 Feb 2009
TL;DR: This poster presents the analyses and optimizations of the CHiMPS compiler that construct many-cache caches, and presents the details of the cache parameters on a Xilinx Virtex-5 LX110T FPGA.
Abstract: CHiMPS is a C-based compiler for high-performance computing (HPC) on heterogeneous CPU-FPGA computing platforms. CHiMPS efficiently supports random accesses to main memory through the many-cache memory model, enabling a broader range of applications to take advantage of FPGA-based acceleration. Many-cache creates multiple caches on top of an FPGA's small, independent memories, each targeting a particular data structure or region of memory in an application and each customized for the memory operations that access it. This poster presents the analyses and optimizations of the CHiMPS compiler that construct many-cache caches, and presents the details of the cache parameters on a Xilinx Virtex-5 LX110T FPGA. Detailed simulation results on HPC kernels demonstrate a 7.8x (geometric mean) performance boost over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x less, and consequently performance per watt that is also greater, by a geometric mean of 21.3x.

20 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: In this article, the authors reviewed recent industry activities around the recommended environmental conditions in the data center, the impact of air-side economizers on ICT equipment and where they can best be applied, and provided data from a case study recently concluded at Intel's site in New Mexico.
Abstract: Moore’s Law continues to drive increased compute capability and greater performance per watt in today’s and future server platforms. However, the increased demand for compute services has outstripped these gains, and the energy consumption in the data center continues to rise. The challenge for the data center operator is to limit the operational costs and reduce the energy required to run the Information and Communications Technology (ICT) equipment and the supporting infrastructure. The cooling systems can represent a large portion of the energy use in the support infrastructure. There is significant focus in industry today on applying advanced cooling technologies to reduce this energy. One potential solution is the use of air-side economizers in the cooling system. This technology can provide a reduction in cooling energy by being able to maintain the required temperatures in the data center with the mechanical refrigeration turned off, significantly reducing the PUE for the data center. This paper reviews recent industry activities around the recommended environmental conditions in the data center, the impact of air-side economizers on the ICT equipment, where they can best be applied, and provides data from a case study recently concluded at Intel’s site in New Mexico. In that case study, servers from an engineering compute data center were split into a standard configuration (closed system, tight temperature control) and a very aggressive air-side economization section (open system, significant outdoor-air quantities, moderate temperature control). Both sections performed equally well over a year-long online test, with significant energy-savings potential demonstrated by the economizer section. The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) has recently published new ICT-vendor consensus-based recommendations for the environmental conditions in data centers. These new limits are discussed in light of the successful experiment run in New Mexico, as the revised operational envelope allows a far greater number of hours per year when a data center can be run in “free-cooling” mode to obtain the energy savings. Server design features, as well as lessons learned from the experiment and their applicability to the potential use of air-side economizers, are also discussed.

12 citations
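
The PUE arithmetic behind the economizer claim is straightforward: PUE is total facility power divided by ICT power, so displacing mechanical refrigeration with outside air lowers it directly. A minimal sketch with illustrative numbers:

```c
/* PUE = total facility power / ICT power; values are illustrative. */
#include <stdio.h>

static double pue(double it_kw, double cooling_kw, double other_kw) {
    return (it_kw + cooling_kw + other_kw) / it_kw;
}

int main(void) {
    double it = 1000.0, other = 80.0;   /* kW: servers, power distribution */
    printf("chillers on : PUE = %.2f\n", pue(it, 700.0, other));
    printf("free cooling: PUE = %.2f\n", pue(it, 150.0, other));
    return 0;
}
```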


Proceedings ArticleDOI
04 May 2009
TL;DR: A novel profile-guided compiler technique is presented for cache-aware scheduling of iteration spaces of parallel loops which captures the effect of variation in the number of cache misses across the iteration space.
Abstract: The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores, which in turn has a direct impact on performance. In light of the fact that parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.

5 citations
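
As an illustration of the idea (not the paper's algorithm): instead of giving each core an equal share of iterations, split the iteration space into contiguous ranges of roughly equal profiled cost, where the per-iteration cost can fold in cache-miss counts as well as computation. A minimal sketch with made-up profile data:

```c
/* Cost-balanced iteration-space partitioning; profile data is invented. */
#include <stdio.h>

#define ITERS 16
#define CORES 4

int main(void) {
    /* Profiled per-iteration cost (e.g., cycles including cache misses). */
    double cost[ITERS] = {1,1,1,1, 2,2,4,4, 8,8,8,8, 2,2,1,1};
    double total = 0;
    for (int i = 0; i < ITERS; i++) total += cost[i];

    double target = total / CORES, acc = 0;
    int core = 0, start = 0;
    for (int i = 0; i < ITERS; i++) {
        acc += cost[i];
        /* Close a chunk once it reaches its fair share of the cost. */
        if (acc >= target && core < CORES - 1) {
            printf("core %d: iterations [%d, %d]\n", core, start, i);
            core++; start = i + 1; acc = 0;
        }
    }
    printf("core %d: iterations [%d, %d]\n", core, start, ITERS - 1);
    return 0;
}
```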


Patent
Thomas J. Heller Jr.
16 Nov 2009
TL;DR: In this article, a stack of microprocessor chips designed to work together in a multiprocessor system is discussed, in which the hypervisor or operating system controls the utilization of the individual chips of the stack.
Abstract: A computing system has a stack of microprocessor chips that are designed to work together in a multiprocessor system. The chips are interconnected with 3D through vias, or alternatively by compatible package carriers having the interconnections, while logically the chips in the stack are interconnected via specialized cache-coherent interconnections. All of the chips in the stack use the same logical chip design, even though they can be easily personalized by setting specialized latches on the chips. One or more of the individual microprocessor chips utilized in the stack are implemented in a silicon process that is optimized for high performance, while others are implemented in a silicon process that is optimized for power consumption, i.e. for the best performance per Watt of electrical power consumed. The hypervisor or operating system controls the utilization of individual chips of a stack.

Proceedings ArticleDOI
Ronny Ronen
18 May 2009
TL;DR: This talk presents Larrabee, a many-core visual computing architecture that provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads and greatly increases the flexibility and programmability of the architecture as compared to standard GPUs.
Abstract: The ample supply of transistors provided by advancements in process technology, combined with the increased difficulty of exploiting single-thread performance, moved the industry to populate several cores on a single die. This talk presents Larrabee -- the next bold step in this direction. Larrabee is a many-core visual computing architecture. Larrabee uses multiple in-order X86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures.

Book ChapterDOI
07 Mar 2009
TL;DR: This paper introduces an open generic operating system interface concept, which the authors call the Accelerator File System (ACCFS), for integrating application accelerators into Linux-based platforms, and contributes to a broader discussion of this challenging topic.
Abstract: For a number of applications, integrating specialized computational accelerators into a general-purpose computing environment yields more performance per watt and per dollar than a pure multi-core approach. In contrast to fully application-specific hybrid solutions, we offer the advantage of maintaining traditional programming models and development environments to a certain extent. In this paper we introduce an open generic operating system interface concept, which we call the Accelerator File System (ACCFS), for integrating application accelerators into Linux-based platforms. By describing the proposed concepts and interface we contribute to a broader discussion of this challenging topic.
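
The paper defines the actual ACCFS interface; the sketch below only illustrates the general flavor of a file-system-based accelerator interface, with entirely hypothetical paths and semantics.

```c
/* Hypothetical file-based accelerator usage; the paths and semantics
 * below are illustrative, not ACCFS's actual definition. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical: configure an accelerator and feed it data through
     * ordinary file operations, keeping the familiar UNIX model. */
    int cfg = open("/accfs/fpga0/config", O_WRONLY);
    int dat = open("/accfs/fpga0/data", O_RDWR);
    if (cfg < 0 || dat < 0) { perror("accfs"); return 1; }

    write(cfg, "bitstream=fft.bit", 17);   /* load a design   */
    float in[256] = {0}, out[256];
    write(dat, in, sizeof in);             /* stream input    */
    read(dat, out, sizeof out);            /* collect results */

    close(dat);
    close(cfg);
    return 0;
}
```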

Proceedings ArticleDOI
10 May 2009
TL;DR: This presentation first explains what is meant by green computing and how the greenness of information processing may be quantified; it then reviews energy-efficient computing paradigms that utilize chip multi-processing, multiple voltage domains, dynamic voltage/frequency scaling, and power/clock gating techniques.
Abstract: Digital information management is the key enabler for unprecedented rise in productivity and efficiency gains experienced by the world economies during the 21st century. Information processing systems have thus become essential to the functioning of business, service, academic, and governmental institutions. As institutions increase their offerings of digital information services, the demand for computation and storage capability also increases. Examples include online banking, e-filing of taxes, music and video downloads, online shipment tracking, real-time inventory/supply-chain management, electronic medical recording, insurance database management, surveillance and disaster recovery. It is estimated that, in some industries, the number of records that must be retained is growing at a CAGR of 50 percent or greater. This exponential increase in the digital intensity of human existence is driven by many factors, including ease of use and availability of a rich set of information technology (IT) devices and services. Indeed, it would be difficult to imagine how significant societal transformations that better our world could occur without the productivity and innovation enabled by the IT. Unfortunately, the energy cost and carbon footprint of the IT devices and services has become exorbitant. Moreover, current technological and digital service utilization trends result in a doubling of the energy cost of the IT infrastructure and its carbon footprint in less than five years. In an energy-constrained world, this consumption trend is unsustainable and comes at increasingly unacceptable societal and environmental costs. This presentation will first explain what is meant by green computing and how greenness of information processing may be quantified. Next, energy-efficient computing paradigms which utilize chip multi-processing, multiple-voltage domains, dynamic voltage/frequency scaling, and power/clock gating techniques will be reviewed. Finally, techniques for improving performance per Watt of large-scale information processing and storage systems (e.g., a data center), including hierarchical dynamic power management, task placement and scheduling, energy balancing, resource virtualization, and application optimizations that dynamically configure hardware for higher efficiency will be discussed.
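
Among the paradigms listed, dynamic voltage/frequency scaling illustrates the performance-per-watt lever most directly: dynamic CMOS power scales roughly as C·V²·f, so lowering voltage along with frequency cuts power faster than it cuts performance. A minimal sketch with illustrative operating points:

```c
/* Dynamic power P = C * V^2 * f at three hypothetical DVFS states. */
#include <stdio.h>

int main(void) {
    double c = 1e-9;                    /* effective capacitance (F)   */
    double states[][2] = {              /* {V, f in Hz} operating points */
        {1.2, 3.0e9}, {1.0, 2.2e9}, {0.8, 1.4e9},
    };
    for (int i = 0; i < 3; i++) {
        double v = states[i][0], f = states[i][1];
        double p = c * v * v * f;       /* dynamic power */
        printf("V=%.1fV f=%.1fGHz  P=%.2fW  perf/W=%.2f GHz/W\n",
               v, f / 1e9, p, (f / 1e9) / p);
    }
    return 0;
}
```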

01 Jan 2009
TL;DR: Proof-of-concept testing and total cost of ownership (TCO) analysis were conducted and seamless live migration between servers based on Intel Xeon processor 5500 series and previous Intel processor generations was verified using VMware Enhanced VMotion* and Intel Virtualization Technology FlexMigration assist.
Abstract: Intel IT, together with Intel’s Digital Enterprise Group, End User Platform Integration, and Intel’s Software and Services Group, conducted proof-of-concept testing and total cost of ownership (TCO) analysis to assess the virtualization capabilities of Intel® Xeon® processor 5500 series. A server based on Intel® Xeon® processor X5570 delivered up to 2.6x the performance and up to 2.05x the performance per watt of a server based on Intel® Xeon® processor E5450, resulting in the ability to support approximately twice as many virtual machines for the same TCO. We also verified seamless live migration between servers based on Intel Xeon processor 5500 series and previous Intel® processor generations using VMware Enhanced VMotion* and Intel® Virtualization Technology FlexMigration assist.

Proceedings ArticleDOI
29 Jul 2009
TL;DR: This paper presents an application-specific reconfigurable processor architecture which is fine-tuned for high-performance computing and has higher functional density and lower power consumption per inch due to its runtime partial reconfiguration ability.
Abstract: One design goal of future processors is to maximize performance per watt. However, the performance of general-purpose processors can hardly be improved merely by increasing clock frequency. This paper presents an application-specific reconfigurable processor architecture that is fine-tuned for high-performance computing. It benefits from application-specific hardware customized to significantly improve its efficiency. In comparison with existing work on configurable processor architectures, the proposed architecture has higher functional density and lower power consumption per inch due to its runtime partial reconfiguration ability. Moreover, it can adaptively change its architecture to further improve average performance and feasibility for other applications.