
Showing papers on "Performance per watt" published in 2012


Proceedings ArticleDOI
25 Oct 2012
TL;DR: This paper explores techniques that allow programmers to efficiently use FPGAs at a level of abstraction that is closer to traditional software-centric approaches by using the emerging parallel language, OpenCL.
Abstract: The FPGA can be a tremendously efficient computational fabric for many applications. In particular, the performance-to-power ratios of FPGAs make them attractive solutions for data centers, which are constrained largely by power and cooling costs. However, the complexity of the FPGA design flow requires the programmer to understand cycle-accurate details of how data is moved and transformed through the fabric. In this paper, we explore techniques that allow programmers to efficiently use FPGAs at a level of abstraction that is closer to traditional software-centric approaches by using the emerging parallel language, OpenCL. Although the field of high-level synthesis has evolved greatly in the last few decades, several fundamental parts were missing from a complete software abstraction of the FPGA. These include standard and portable methods of describing HW/SW codesign, memory hierarchy, data movement, and control of parallelism. We believe that OpenCL addresses all of these issues and allows for a highly efficient description of FPGA designs at a higher level of abstraction. We demonstrate this premise by examining the performance of a document filtering algorithm, implemented in OpenCL and automatically compiled to a Stratix IV 530 FPGA. We show that our implementation achieves 5.5× and 5.25× better performance per watt ratios than GPU and CPU implementations, respectively.

63 citations
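
As a rough illustration of how a performance-per-watt comparison like the one above is expressed, the sketch below divides sustained throughput by measured power and forms ratios against the FPGA result. The throughput and power figures are invented placeholders, not numbers from the paper.

```python
# Illustrative only: form performance-per-watt ratios between platforms.
# All throughput and power numbers below are made-up placeholders.

def perf_per_watt(throughput, power_w):
    """Performance per watt = sustained throughput / measured power."""
    return throughput / power_w

platforms = {
    "FPGA (OpenCL)": {"throughput": 1200.0, "power_w": 25.0},   # e.g. documents/s, W
    "GPU":           {"throughput": 2100.0, "power_w": 240.0},
    "CPU":           {"throughput": 400.0,  "power_w": 95.0},
}

ppw = {name: perf_per_watt(p["throughput"], p["power_w"]) for name, p in platforms.items()}
fpga = ppw["FPGA (OpenCL)"]
for name, value in ppw.items():
    print(f"{name:14s} {value:7.2f} units/W  (FPGA advantage: {fpga / value:4.2f}x)")
```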


Journal ArticleDOI
TL;DR: The paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM's Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm.
Abstract: This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM's Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.

58 citations
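
The performance-per-dollar criterion used in this study folds purchase and development cost into the denominator, which is why FPGAs need much larger speed-ups to win on it than on performance per watt. A minimal sketch of that bookkeeping follows; the speed-ups and costs are invented placeholders, not figures from the study.

```python
# Hypothetical performance-per-dollar bookkeeping: speed-up over a baseline CPU
# divided by total cost (purchase + development). All numbers are invented
# placeholders, not values from the paper.

def perf_per_dollar(speedup, purchase_cost, dev_cost):
    return speedup / (purchase_cost + dev_cost)

platforms = {
    # name: (speed-up vs. CPU, purchase cost in $, development cost in $)
    "FPGA":    (150.0, 8000.0, 30000.0),   # big speed-up, but high development effort
    "GPU":     (15.0,  1500.0,  5000.0),
    "Cell BE": (10.0,  3000.0,  8000.0),
    "CPU":     (1.0,   1000.0,  1000.0),
}

for name, (speedup, purchase, dev) in platforms.items():
    print(f"{name:8s} perf/$ = {perf_per_dollar(speedup, purchase, dev):.5f}")
```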


Journal ArticleDOI
TL;DR: State-of-the-art hardware/software high-performance energy-efficient embedded computing (HPEEC) techniques that help meet typical requirements of embedded applications are discussed, along with modern multicore processors that leverage these HPEEC techniques to deliver high performance per watt.
Abstract: With Moore's law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multicore to exploit this high transistor density for high performance. Embedded systems differ from traditional high-performance supercomputers in that power is a first-order constraint for embedded systems, whereas performance is the major benchmark for supercomputers. The increase in on-chip transistor density exacerbates power/thermal issues in embedded systems, which necessitates novel hardware/software power/thermal management techniques to meet the ever-increasing high-performance embedded computing demands in an energy-efficient manner. This paper outlines typical requirements of embedded applications and discusses state-of-the-art hardware/software high-performance energy-efficient embedded computing (HPEEC) techniques that help meet these requirements. We also discuss modern multicore processors that leverage these HPEEC techniques to deliver high performance per watt. Finally, we present design challenges and future research directions for HPEEC system development.

48 citations


Proceedings ArticleDOI
01 Apr 2012
TL;DR: The results show that the GPGPU has outstanding results in performance, power consumption and energy efficiency for many applications, but it requires significant programming effort and is not general enough to show the same level of efficiency for all the applications.
Abstract: Power dissipation and energy consumption are becoming increasingly important architectural design constraints in different types of computers, from embedded systems to large-scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the best use of exponentially increasing numbers of transistors within the power and energy budgets. Intel SCC is an appealing option for future many-core architectures. In this paper, we use various scalable applications to quantitatively compare and analyze the performance, power consumption and energy efficiency of different cutting-edge platforms that differ in architectural build. These platforms include the Intel Single-Chip Cloud Computer (SCC) many-core, the Intel Core i7 general-purpose multi-core, the Intel Atom low-power processor, and the Nvidia ION2 GPGPU. Our results show that the GPGPU has outstanding results in performance, power consumption and energy efficiency for many applications, but it requires significant programming effort and is not general enough to show the same level of efficiency for all the applications. The “light-weight” many-core presents an opportunity for better performance per watt over the “heavy-weight” multi-core, although the multi-core is still very effective for some sophisticated applications. In addition, the low-power processor is not necessarily energy-efficient, since the runtime delay effect can be greater than the power savings.

39 citations
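
The closing observation above, that a low-power processor is not necessarily energy-efficient, follows directly from energy being power multiplied by runtime. A minimal sketch with invented numbers (not measurements from the paper):

```python
# The "runtime delay effect": a lower-power processor can still consume more
# energy if it takes long enough to finish. Numbers are invented placeholders.

def energy_joules(power_w, runtime_s):
    return power_w * runtime_s

runs = {
    "Core i7 (multi-core)": {"power_w": 95.0, "runtime_s": 10.0},
    "Atom (low-power)":     {"power_w": 8.0,  "runtime_s": 150.0},
}

for name, r in runs.items():
    e = energy_joules(r["power_w"], r["runtime_s"])
    print(f"{name:22s} {r['power_w']:5.1f} W x {r['runtime_s']:6.1f} s = {e:7.1f} J")
# Despite drawing roughly 12x less power, the low-power part uses more energy
# here because it runs 15x longer.
```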


Journal ArticleDOI
TL;DR: In this article, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.
Abstract: Performance and total cost of ownership (TCO) are key optimization metrics in large-scale data centers. According to these metrics, data centers designed with conventional server processors are inefficient. Recently introduced processors based on low-power cores can improve both throughput and energy efficiency compared to conventional server chips. However, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.

38 citations


Proceedings ArticleDOI
21 May 2012
TL;DR: Results show that this underutilization is present, and resource optimization can increase the energy efficiency of GPU-based computation, and different strategies and proposals to increase energy efficiency in future GPU designs are discussed.
Abstract: In the last few years, Graphics Processing Units (GPUs) have become a great tool for massively parallel computing. GPUs are specifically designed for throughput and face several design challenges, especially what are known as the Power and Memory Walls. In these devices, available resources should be used to enhance performance and throughput, as the performance per watt is really high. For massively parallel applications or kernels, using the available silicon resources for power management was unproductive, as the main objective of the unit was to execute the kernel as fast as possible. However, not all the applications that are currently being ported to GPUs can make use of all the available resources, due to data dependencies, bandwidth requirements, legacy software on new hardware, etc., which reduces the performance per watt. This new scenario requires new designs and optimizations to make these GPGPUs (General Purpose Graphics Processing Units) more energy efficient. But first things first: we should begin by analyzing the applications we are running on these processors, looking for bottlenecks and opportunities to optimize for energy efficiency. In this paper we analyze some kernels taken from the CUDA SDK (Software Development Kit) in order to discover resource underutilization. Results show that this underutilization is present, and that resource optimization can increase the energy efficiency of GPU-based computation. We then discuss different strategies and proposals to increase energy efficiency in future GPU designs.

29 citations


Proceedings ArticleDOI
Ami Marowka1
10 Jul 2012
TL;DR: This work investigated how energy efficiency and scalability are affected by the power constraints imposed on contemporary hybrid CPU-GPU processors, and shows clearly that greater parallelism is the most important factor affecting power consumption.
Abstract: Energy will be a major limiting factor in future multi-core architectures, so optimizing performance per watt should be a key driver for next generation massive-core architectures. Recent studies show that heterogeneous chips integrating different core architectures, such as CPU and GPU, on a single die is the most promising solution. We investigated how energy efficiency and scalability are affected by the power constraints imposed on contemporary hybrid CPU-GPU processors. Analytical models were developed to extend Amdahl's Law by accounting for energy limitations before examining the three processing modes available to heterogeneous processors, i.e., symmetric, asymmetric, and simultaneous asymmetric. The analysis shows clearly that greater parallelism is the most important factor affecting power consumption.

23 citations
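
The abstract above does not reproduce the analytical models themselves, but the flavor of an energy-aware extension of Amdahl's Law can be sketched as follows. Everything here, the asymmetric-mode formula, the core counts, and the power figures, is an assumption for illustration, not the paper's model.

```python
# Hypothetical sketch (not the paper's exact model): Amdahl's-Law-style speedup
# and performance per watt for a hybrid CPU-GPU chip in "asymmetric" mode,
# where the serial fraction runs on the CPU core and the parallel fraction
# runs on the CPU core plus all GPU cores.

def asymmetric_speedup(f_par, cpu_perf, gpu_cores, gpu_core_perf):
    serial_time = (1.0 - f_par) / cpu_perf
    parallel_time = f_par / (cpu_perf + gpu_cores * gpu_core_perf)
    return 1.0 / (serial_time + parallel_time)

def perf_per_watt(speedup, cpu_power_w, gpu_cores, gpu_core_power_w):
    return speedup / (cpu_power_w + gpu_cores * gpu_core_power_w)

# Invented parameters: one CPU core (perf 1.0, 20 W) plus 64 GPU cores
# (perf 0.1 each, 0.5 W each).
for f in (0.5, 0.9, 0.99):
    s = asymmetric_speedup(f, cpu_perf=1.0, gpu_cores=64, gpu_core_perf=0.1)
    w = perf_per_watt(s, cpu_power_w=20.0, gpu_cores=64, gpu_core_power_w=0.5)
    print(f"parallel fraction {f:4.2f}: speedup {s:5.2f}, perf/W {w:6.3f}")
# Both speedup and perf/W rise with the parallel fraction, consistent with the
# conclusion that greater parallelism is the dominant factor.
```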


Journal ArticleDOI
TL;DR: An analytical estimation model for performance and power using different server processor microarchitecture parameters is implemented and verified to estimate power and performance with less than 10% error deviation.
Abstract: Given the rapid expansion in cloud computing in the past few years, there is a driving necessity to analyze and characterize the performance and power consumption of cloud workloads running on backend servers. In this research, we focus on the Hadoop framework and Memcached, which are distributed frameworks for processing large-scale, data-intensive applications for different purposes. Hadoop is used for short jobs requiring low response time; it is a popular open-source implementation of MapReduce for the analysis of large datasets. Memcached is a high-performance distributed memory object caching system that can speed up the throughput of web applications by reducing the effect of database load bottlenecks. In this paper, we characterize different workloads running on the Hadoop framework and Memcached for different processor configurations and microarchitecture parameters. We implement an analytical estimation model for performance and power using different server processor microarchitecture parameters. The proposed model analytically scales different processor microarchitecture parameters, such as CPI, with respect to processor core frequency. We also propose an analytical model to estimate how power consumption scales with processor core frequency. The combination of both performance and power consumption models enables the estimation of performance per watt for different cloud benchmarks. The proposed estimation models are verified to estimate power and performance with less than 10% error deviation.

21 citations
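
The abstract does not give the model equations, but the general idea of combining a frequency-scaled performance model with a power model to obtain performance per watt can be sketched as below. The functional forms and every constant are assumptions for illustration, not the authors' fitted model.

```python
# Minimal sketch, not the authors' equations: performance (instructions/s) is
# frequency / CPI, with memory-stall cycles per instruction growing with
# frequency because memory latency is fixed in time; power is a dynamic term
# scaling with f*V^2 plus a static term. All constants are invented.

def cpi(freq_ghz, cpi_core=1.0, stall_cpi_at_1ghz=0.4):
    return cpi_core + stall_cpi_at_1ghz * freq_ghz

def performance_mips(freq_ghz):
    return freq_ghz * 1000.0 / cpi(freq_ghz)            # million instructions/s

def power_watts(freq_ghz, volts, c_dyn=12.0, p_static=8.0):
    return c_dyn * freq_ghz * volts ** 2 + p_static

voltage_at = {1.0: 0.85, 2.0: 1.00, 3.0: 1.15}          # assumed V-f operating points
for f, v in voltage_at.items():
    perf, pwr = performance_mips(f), power_watts(f, v)
    print(f"{f:.1f} GHz: {perf:7.1f} MIPS, {pwr:5.1f} W, {perf / pwr:5.1f} MIPS/W")
# Performance per watt typically falls at high frequency: performance grows
# sub-linearly (CPI worsens) while power grows super-linearly (voltage rises).
```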


Proceedings ArticleDOI
10 Nov 2012
TL;DR: The design and performance of GRAPE-8 accelerator processor for gravitational N-body simulations is described, designed to evaluate gravitational interaction with cutoff between particles.
Abstract: In this paper, we describe the design and performance of the GRAPE-8 accelerator processor for gravitational N-body simulations. It is designed to evaluate gravitational interactions with cutoff between particles. The cutoff function is useful for schemes like TreePM or Particle-Particle Particle-Tree, in which the gravitational force is divided into short-range and long-range components. A single GRAPE-8 processor chip integrates 48 pipeline processors. The effective number of floating-point operations per interaction is around 40. Thus the peak performance of a single GRAPE-8 processor chip is 480 Gflops. A GRAPE-8 processor card houses two GRAPE-8 chips and one FPGA chip for the PCI-Express interface. The total power consumption of the board is 46 W. Thus, the theoretical peak performance per watt is 20.5 Gflops/W. The effective performance of the total system, including the host computer, is around 5 Gflops/W. This is more than a factor of two higher than the highest number in the current Green500 list.

15 citations
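
The quoted figures can be cross-checked with simple arithmetic; the implied pipeline clock below is an inference from the quoted numbers, not something stated in the abstract, and the small gap to the quoted 20.5 Gflops/W presumably comes from rounding in the underlying figures.

```python
# Back-of-the-envelope check of the GRAPE-8 numbers quoted in the abstract.
pipelines_per_chip = 48
flops_per_interaction = 40          # "effective" operations per interaction
peak_per_chip_gflops = 480.0        # quoted single-chip peak
chips_per_card = 2
card_power_w = 46.0

# Implied pipeline clock (an inference, not stated in the abstract):
clock_ghz = peak_per_chip_gflops / (pipelines_per_chip * flops_per_interaction)
print(f"implied pipeline clock ~ {clock_ghz * 1000:.0f} MHz")        # ~250 MHz

card_peak_gflops = chips_per_card * peak_per_chip_gflops             # 960 Gflops
print(f"card peak: {card_peak_gflops:.0f} Gflops, "
      f"{card_peak_gflops / card_power_w:.1f} Gflops/W")             # ~20.9 Gflops/W
```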


Journal ArticleDOI
TL;DR: A new metric, ASTPI (Average Stall Time Per Instruction), is proposed, and a new online monitoring approach called ESHMP, based on this metric, is designed, implemented, and evaluated. The study shows that, among HMP systems in which heterogeneity-aware schedulers are adopted and there is more than one LLC, architectures where heterogeneous cores share LLCs achieve better performance than those where homogeneous cores share LLCs.

14 citations


01 Jan 2012
TL;DR: It is argued that performance per watt, which is often cited in the graphics hardware industry, is not a particularly useful unit for power efficiency in scientific and engineering discussions, and joules per task and watts are more reasonable units.
Abstract: In this short note, we argue that performance per watt, which is often cited in the graphics hardware industry, is not a particularly useful unit for power efficiency in scientific and engineering discussions. We argue that joules per task and watts are more reasonable units. We show a concrete example where nanojoules per pixel is much more intuitive, easier to compute aggregate statistics from, and easier to reason about.
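
The note's point is easy to make concrete: energy per task is just power divided by task throughput, and per-task energies of successive processing stages add up directly, which is one sense in which they are easier to aggregate than performance-per-watt figures. The numbers below are invented for illustration.

```python
# Illustration (invented numbers): joules per task = power / throughput, and
# per-pixel energies of pipeline stages simply add.

def nanojoules_per_pixel(power_w, pixels_per_second):
    return power_w / pixels_per_second * 1e9

stages = {
    "shading":      {"power_w": 60.0, "pixels_per_s": 2.0e9},
    "post-process": {"power_w": 40.0, "pixels_per_s": 4.0e9},
}

per_stage = {name: nanojoules_per_pixel(s["power_w"], s["pixels_per_s"])
             for name, s in stages.items()}
for name, nj in per_stage.items():
    print(f"{name:12s} {nj:5.1f} nJ/pixel")
print(f"whole pipeline: {sum(per_stage.values()):5.1f} nJ/pixel  (energies simply add)")
```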

Proceedings ArticleDOI
18 Mar 2012
TL;DR: Three possible technology components that would improve the operational performance of data centers are discussed: integration of renewable energy sources, increasing energy efficiency through IT consolidation and workload optimization, and multifunctional sensor networks for better cooling and infrastructure management.
Abstract: The energy consumption of data centers (DCs) has dramatically increased in recent years, primarily due to the massive computing demands driven by communications, banking, online retail, and entertainment services. In today's data centers, the cooling and infrastructure operations require almost the same energy as the IT operations. The large energy consumption in data centers prompted government agencies, industries, professional organizations, and academic institutions to investigate sustainable growth paths. We discuss such scenarios based on current trends and projections and propose the required innovations to achieve a 10-fold increase in “performance per Watt” of IT operations. We discuss three possible technology components that would improve the operational performance of data centers: (1) integration of renewable energy sources, (2) increasing energy efficiency through IT consolidation and workload optimization, and (3) multifunctional sensor networks for better cooling and infrastructure management. We discuss the key requirements and how these technologies can be combined to achieve a sustainable path.

Proceedings ArticleDOI
22 Oct 2012
TL;DR: This paper proposes a novel VM scheduling algorithm that exploits core performance heterogeneity to optimize the overall system energy efficiency, and introduces a metric termed energy-efficiency factor to characterize the power and performance behaviors of the applications hosted by VMs on different cores.
Abstract: Multi-core architectures with asymmetric core performance have recently shown great promise, because applications with different needs can benefit from either the high performance of a fast core or the high parallelism and power efficiency of a group of slow cores. This performance heterogeneity can be particularly beneficial to applications running in virtual machines (VMs) on virtualized servers, which often have different needs and exhibit different performance and power characteristics. Therefore, scheduling VMs on performance-asymmetric multi-core architectures can have a great impact on a system's overall energy efficiency. Unfortunately, existing VM managers, such as Xen, have not taken the heterogeneity into account and thus often result in low energy efficiencies. In this paper, we propose a novel VM scheduling algorithm that exploits core performance heterogeneity to optimize the overall system energy efficiency. We first introduce a metric termed energy-efficiency factor to characterize the power and performance behaviors of the applications hosted by VMs on different cores. We then present a method to dynamically estimate the VM's energy-efficiency factors and then map the VMs to heterogeneous cores, such that the energy efficiency of the entire system is maximized. We implement the proposed algorithm in Xen and evaluate it with standard benchmarks on a real testbed. The experimental results show that our solution improves the system energy efficiency (i.e., performance per watt) by 13.5% on average and up to 55% for some benchmarks, compared to the default Xen scheduler.
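
The abstract does not define the energy-efficiency factor or the mapping method precisely; the sketch below simply assumes the factor is a throughput-to-power ratio measured per VM on each core type and uses a greedy placement, so it illustrates the idea rather than the paper's algorithm.

```python
# Hedged sketch, not the paper's algorithm: treat the "energy-efficiency
# factor" of a VM on a core type as throughput / power there, and greedily
# place each VM on the core type where it is most efficient, subject to the
# number of cores available. All profile numbers are invented.

from collections import Counter

profiles = {                      # vm: {core type: (throughput, power_w)}
    "vm_web":   {"fast": (900.0, 30.0), "slow": (500.0, 8.0)},
    "vm_batch": {"fast": (800.0, 28.0), "slow": (200.0, 9.0)},
    "vm_cache": {"fast": (600.0, 25.0), "slow": (450.0, 7.0)},
}
capacity = Counter({"fast": 1, "slow": 2})

def ee_factor(vm, core):
    throughput, power_w = profiles[vm][core]
    return throughput / power_w

assignment = {}
# Place the VMs that benefit most from their preferred core type first.
for vm in sorted(profiles, key=lambda v: -max(ee_factor(v, c) for c in capacity)):
    for core in sorted(capacity, key=lambda c: -ee_factor(vm, c)):
        if capacity[core] > 0:
            capacity[core] -= 1
            assignment[vm] = core
            break

print(assignment)   # e.g. {'vm_cache': 'slow', 'vm_web': 'slow', 'vm_batch': 'fast'}
```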

Proceedings ArticleDOI
01 Aug 2012
TL;DR: This work proposes using dedicated hardware accelerators such as squaring and cubing units to compute squares and cubes, reducing power consumption per computation by more than 50% in squaring units and more than 40% in cubing units.
Abstract: With power becoming a precious resource in current VLSI systems, performance per Watt has become a more important metric than chip area. With a large number of applications benefitting from support for complex functional units like squaring and cubing, it becomes imperative that such functions be implemented in hardware. Implementing these functions using existing general purpose multipliers in a design may result in area savings in some cases but results in power and latency penalties. We propose to use dedicated hardware accelerators like squaring and cubing units to perform squares and cubes, respectively. We study the trade-off for computing squares and cubes using a general purpose multiplier versus dedicated units from a software perspective. We compare area and power requirements for various widths. We are able to reduce power consumption per computation by more than 50% in squaring units and more than 40% in cubing units using dedicated units. Depending on the requirements of the applications, dedicated squaring and cubing units can also aid multipliers in improving the performance and latency of various applications.

Proceedings Article
07 Jun 2012
TL;DR: This paper experimentally verifies the scalability of shared-memory, PThreads-based applications on a Cycle-Accurate Bit-Accurate (CABA) simulated 512-core platform, and shows a scalability limitation beyond 64 cores for FFT and 256 cores for EPFilter.
Abstract: Nowadays, single-chip cache-coherent multi-cores with up to 100 cores are a reality, and many-cores with hundreds of cores are planned for the near future. Due to the large number of cores, and for power efficiency reasons (performance per watt), cores are becoming simpler, with small caches. To make efficient use of the parallelism offered by these architectures, applications must be multi-threaded. The POSIX Threads (PThreads) standard is the most portable way to use threads across operating systems. It is also used as a low-level layer to support other portable, shared-memory, parallel environments like OpenMP. In this paper, we propose to verify experimentally the scalability of shared-memory, PThreads-based applications on a Cycle-Accurate Bit-Accurate (CABA) simulated 512-core platform. Using two unmodified, highly multi-threaded applications, SPLASH-2 FFT and EPFilter (a medical image noise-filtering application provided by Philips), our study shows a scalability limitation beyond 64 cores for FFT and 256 cores for EPFilter. Based on hardware event counters, our analysis shows: (i) the detected scalability limitation is a conceptual problem related to the notion of thread and process; and (ii) the small per-core caches found in many-cores exacerbate the problem. Finally, we present our solution in principle and future work.

Book
20 Apr 2012
TL;DR: This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores including CPU cores and special-purpose processor cores that achieve highly arithmetic-level parallelism.
Abstract: To satisfy the higher requirements of digitally converged embedded systems, this book describes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and greater flexibility and superior performance per watt can then be achieved. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores including CPU cores and special-purpose processor cores that achieve highly arithmetic-level parallelism. The authors developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book. Provides readers an overview and practical discussion of heterogeneous multicore technologies from both a hardware and software point of view; Discusses a new, high-performance and energy efficient approach to designing SoCs for digitally converged, embedded systems; Covers hardware issues such as architecture and chip implementation, as well as software issues such as compilers, operating systems, and application programs; Describes three chips developed according to the defined heterogeneous multicore architecture, including chip implementations, software environments, and working applications.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A hardware-based power measurement technique combined with a multi-agent framework is used to analyze the power of HPC systems in real time, and the power consumed while running workloads such as High Performance Linpack and the NAS Parallel Benchmarks is clearly demonstrated.
Abstract: Power measurement and analysis are important aspects of optimizing the power consumption of High Performance Computing (HPC) systems. With the huge increase in the power consumption of HPC systems, it is important to compare systems with metrics based on performance per watt. There are various hardware- and software-based power measurement techniques available for HPC systems. However, it is a complex task to accurately measure and analyze the power consumption of entire HPC nodes in real time. Hence, we have used a hardware-based power measurement technique with a multi-agent framework for analyzing power in HPC systems in real time. We clearly demonstrate the power consumed while running various workloads such as High Performance Linpack (HPL) and the NAS Parallel Benchmarks (NPB).

Proceedings Article
21 May 2012
TL;DR: MPI and OpenMP are considered core scientific HPC programming libraries for distributed-memory and shared-memory computer systems, respectively, but there are dissimilarities in their performance per watt ratios.
Abstract: The power consumption of high performance computing (HPC) systems has become an issue lately. Many programming techniques still focus on performance gains, but only a few of them consider the energy footprint of the increased computing power. MPI and OpenMP are considered core scientific HPC programming libraries for distributed-memory and shared-memory computer systems, respectively. Each of them brings performance when used on a parallel system, but there are dissimilarities in their performance per watt ratios. The key is to find the best use for each of them on a shared-memory computer system.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper compares and analyzes the performance of an Intel-based SMP and Tilera's TILEPro64 TMA based on parallelized benchmarks for the following performance metrics: runtime, speedup, efficiency, cost, scalability, and performance per watt.
Abstract: With Moore's law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, there exists a plethora of multi-core architectures and the suitability of these multi-core architectures for different embedded domains (e.g., distributed, real-time, reliability-constrained) requires investigation. Despite the diversity of embedded domains, one of the critical applications in many embedded domains (especially distributed embedded domains) is information fusion. Furthermore, many other applications consist of various kernels, such as Gaussian elimination (used in network coding), that dominate the execution time. In this paper, we evaluate two embedded systems multi-core architectural paradigms: symmetric multiprocessors (SMPs) and tiled multi-core architectures (TMAs). We base our evaluation on a parallelized information fusion application and benchmarks that are used as building blocks in applications for SMPs and TMAs. We compare and analyze the performance of an Intel-based SMP and Tilera's TILEPro64 TMA based on our parallelized benchmarks for the following performance metrics: runtime, speedup, efficiency, cost, scalability, and performance per watt. Results reveal that TMAs are more suitable for applications requiring integer manipulation of data with little communication between the parallelized tasks (e.g., information fusion) whereas SMPs are more suitable for applications with floating point computations and a large amount of communication between processor cores.
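
The metrics used in this comparison have standard definitions (speedup as serial over parallel runtime, efficiency as speedup per core, cost as the processor-time product, and performance per watt as throughput over power). The sketch below applies them to placeholder measurements, not values from the paper.

```python
# Standard definitions of the comparison metrics, applied to placeholder
# measurements (not numbers from the paper).

def metrics(serial_s, parallel_s, cores, power_w, work_units):
    speedup = serial_s / parallel_s
    return {
        "speedup":       speedup,
        "efficiency":    speedup / cores,
        "cost":          cores * parallel_s,          # processor-seconds
        "perf_per_watt": (work_units / parallel_s) / power_w,
    }

print("SMP:", metrics(serial_s=120.0, parallel_s=18.0, cores=8,  power_w=130.0, work_units=1000))
print("TMA:", metrics(serial_s=180.0, parallel_s=6.0,  cores=64, power_w=22.0,  work_units=1000))
```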

Journal ArticleDOI
TL;DR: In this article, the authors summarize their recent experience with the simulation of a wide range of spin models on GPU employing an equally broad range of update algorithms, ranging from Metropolis and heat bath updates, over cluster algorithms to generalized ensemble simulations.
Abstract: The use of graphics processing units (GPUs) in scientific computing has gathered considerable momentum in the past five years. While GPUs in general promise high performance and excellent performance per Watt ratios, not every class of problems is equally well suited to exploiting the massively parallel architecture they provide. Lattice spin models appear to be prototypic examples of problems suitable for this architecture, at least as long as local update algorithms are employed. In this review, I summarize our recent experience with the simulation of a wide range of spin models on GPU, employing an equally wide range of update algorithms ranging from Metropolis and heat bath updates, over cluster algorithms, to generalized ensemble simulations.

Book ChapterDOI
04 Sep 2012
TL;DR: This paper is the first to describe a study evaluating the per-watt performance of block ciphers on GPUs, and shows that the performance per watt of AES-128 on the 80-core APU was 743.0 Mbps/W, a 44.0% increase compared with that on a system equipped with a discrete AMD Radeon HD 6770.
Abstract: Computer systems with discrete GPUs are expected to become the standard methodology for high-speed encryption processing, but they consume large amounts of power and are inapplicable to embedded devices. Therefore, we have specifically examined a new heterogeneous multicore processor with CPU-GPU integration architecture. We first implemented three 128-bit block ciphers (AES, Camellia, and SC2000) from several symmetric block ciphers in the e-government recommended ciphers list by CRYPTREC in Japan, using OpenCL on an AMD E-350 APU with CPU-GPU integration architecture and on two traditional systems with discrete GPUs. Then we evaluated their respective power efficiencies. Results showed that the performance per watt of AES-128 on the 80-core APU was 743.0 Mbps/W, a 44.0% increase compared with that on a system equipped with a discrete 800-core AMD Radeon HD 6770. This paper is the first to describe a study to evaluate the per-watt performance of block ciphers on GPUs.

Patent
04 Jan 2012
TL;DR: In this article, a heterogeneous multi-core digital signal processor for an OFDM wireless communication system is presented. But the design of a dedicated task scheduling unit or a master control processor is eliminated, and the expandability and simplicity of the multicell processor are ensured.
Abstract: The invention provides a heterogeneous multi-core digital signal processor for an orthogonal frequency division multiplexing (OFDM) wireless communication system, and relates to the field of microprocessor system structures. The processor consists of a set of processor cores which are distributed in a row, wherein the processor cores can be divided into different types according to computing capability; the different types of processor cores are mutually connected in an open loop interconnection mode; the processor cores are very long instruction word (VLIW) processors; data transmission among the processor cores is realized by a shared memory; and control signals are transmitted through bus control units in the processor cores and task scheduling buses outside the processor cores. Each processor core can receive task scheduling information from other processor cores, so that the design of a dedicated task scheduling unit or a master control processor is eliminated, and the expandability and simplicity of the multi-core processor are ensured; and the structure of the multi-core processor can effectively accord with the characteristic of wireless communication baseband processing and achieves high performance per watt.

Book ChapterDOI
Karl Huppler1
27 Aug 2012
TL;DR: The electrical cost of managing information systems has always been a concern for those investing in technology, but in recent years the focus has increased, both because of increased costs of electricity and decreased costs of other components of the equation.
Abstract: The electrical cost of managing information systems has always been a concern for those investing in technology. However, in recent years the focus has increased, both because of increased costs of electricity and decreased costs of other components of the equation.

Proceedings ArticleDOI
03 Mar 2012
TL;DR: This paper describes the parallelization techniques applied to a Synthetic Aperture Radar application based on the 2-Dimensional Fourier Matched Filtering and Interpolation (2DFMFI) Algorithm, applying parallelization techniques for shared memory, distributed shared memory, and distributed memory environments.
Abstract: Future space applications will require High Performance Computing (HPC) capabilities to be available on board future spacecraft. To cope with this requirement, multi- and many-core processor technologies have to be integrated into the computing platforms of the spacecraft. One of the most important requirements, coming from the nature of space applications, is efficiency in terms of performance per Watt. In order to improve the efficiency of such systems, algorithms and applications have to be optimized and scaled to the number of cores available in the computing platform. In this paper we describe the parallelization techniques applied to a Synthetic Aperture Radar (SAR) application based on the 2-Dimensional Fourier Matched Filtering and Interpolation (2DFMFI) Algorithm. In addition to sequential optimizations, we applied parallelization techniques for shared memory, distributed shared memory, and distributed memory environments, using parallel programming models like OpenMP and MPI. It turns out that parallelizing this type of algorithm is not an easy and straightforward task, but with a bit of effort one can improve performance and scalability, increasing the level of efficiency.

Proceedings ArticleDOI
19 Jun 2012
TL;DR: The continuing development of the RADSPEED-HB™, DDR2 memory, and switching POL regulators over the next year or so will set the base for this capability to be realized in a fully qualified space board in the coming year.
Abstract: The RAD750® has provided the standard processing benchmarks for the past decade or so. With the RADSPEED™ DSP and associated processor products, the raw performance and performance per watt will increase significantly, opening up many new applications to advanced programmed general and signal processing. In the near term, the RADSPEED DSP User Board will provide the initial hardware platform for realizing this. The continuing development of the RADSPEED-HB™, DDR2 memory, and switching POL regulators over the next year or so will set the base for this capability to be realized in a fully qualified space board in the coming year.

Proceedings ArticleDOI
23 Aug 2012
TL;DR: This work implements an asymmetry-aware scheduler based on offline analysis of program behavior, which makes up for the shortcomings of online analysis and can achieve an accurate initial thread-to-core assignment when a thread is created.
Abstract: Previous work has already shown, in both theory and simulation, that for sufficiently diverse workloads a heterogeneous chip multi-core processor (CMP) can deliver higher performance per watt than a comparable homogeneous multi-core processor. The prerequisite, however, is that the operating system can recognize this diversity and then perform effective and reasonable task scheduling. We implemented an asymmetry-aware scheduler based on offline analysis of program behavior, which makes up for the shortcomings of online analysis and can achieve an accurate initial thread-to-core assignment when a thread is created. A preliminary evaluation shows that our scheduler gains a performance improvement over the default heterogeneity-agnostic scheduler while keeping quality of service guaranteed.

Journal Article
TL;DR: A priority matrix is proposed to overcome the deficiency of using height to decide the execution order of tasks and to change the chromosome encoding and decoding structure; simulation results show that, compared to height, the priority matrix leads to a more optimized scheduling list.
Abstract: Compared with homogeneous multi-core processors, single-ISA heterogeneous multi-core processors can achieve better performance per watt since they can adapt to workload diversity. The energy efficiency of this architecture depends on reasonable and intelligent task scheduling. This is a typical multi-objective optimization problem since both performance and power are optimization objectives. This paper applies a Pareto-based multi-objective optimization genetic algorithm to static task scheduling on heterogeneous multi-core systems. A priority matrix is proposed to overcome the deficiency of using height to decide the execution order of tasks and to change the chromosome encoding and decoding structure. By using this method, the number of valid chromosomes in the initial population increases to 100%. Simulation results show that, compared to height, the priority matrix leads to a more optimized scheduling list.
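
At the core of any Pareto-based multi-objective scheduler is a dominance test between candidate schedules, here over a (makespan, energy) pair where lower is better for both. The sketch below shows only that generic test with invented candidate values; it does not reproduce the paper's chromosome encoding or priority matrix.

```python
# Generic Pareto-dominance test for (makespan, energy) objectives, the basic
# comparison in Pareto-based multi-objective scheduling. Candidate values are
# invented; the paper's priority-matrix encoding is not reproduced here.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better
    in at least one (lower is better for both makespan and energy)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

candidates = {              # schedule: (makespan_ms, energy_mJ)
    "s1": (120.0, 450.0),
    "s2": (150.0, 300.0),
    "s3": (160.0, 460.0),   # dominated by s1
}

pareto_front = [name for name, obj in candidates.items()
                if not any(dominates(other, obj)
                           for o, other in candidates.items() if o != name)]
print("Pareto-optimal schedules:", pareto_front)   # ['s1', 's2']
```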

01 Jan 2012
TL;DR: It is shown that system energy can be reduced by 28% retaining a decrease in performance within 1% by controlling the voltage and frequency levels of GPUs, and that energy savings can be achieved when GPU core and memory clock frequencies are appropriately scaled considering the workload characteristics.
Abstract: Graphics processing units (GPUs) have been increasingly used in general-purpose applications due to their significant benefits in performance and performance per watt. Figure 1 depicts the trend in performance per watt for widely deployed NVIDIA GPU and Intel CPU architectures. Albeit energy efficient, GPUs consume significant power during operation, and commodity system software for GPUs is not well designed to control their power consumption. This is largely due to the fact that GPUs are primarily designed to accelerate computations. It is desirable that the system software can manage the power consumption of GPUs in a reliable manner. Dynamic voltage and frequency scaling (DVFS) is widely used to reduce power consumption at runtime for CPUs [2, 4]; however, there has been little study of whether DVFS works efficiently for GPUs. This paper presents our initial work on the power and performance analysis of GPU-accelerated systems. In particular, we leverage Gdev [5], a new open-source implementation of first-class GPU resource management, to demonstrate the effect of GPU power scaling on real-world hardware, whereas the evaluations provided by previous work have been limited to simulations [3]. The ultimate goal of our project is to establish the theory and practice of DVFS schemes for GPU-accelerated systems, addressing correlative power and performance optimization problems. Toward this end, we make the following contributions in this paper: (i) verify the availability of voltage and frequency scaling for NVIDIA’s Fermi GPU architecture using Gdev, (ii) analyze the implications of voltage and frequency scaling for the GPU and CPU, and (iii) identify the necessity and open issues of GPU and CPU coordinated DVFS algorithms.
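
A toy model makes the intuition behind coordinated core/memory frequency scaling concrete: energy is power multiplied by runtime, and for a memory-bound kernel lowering the core clock cuts power far more than it stretches runtime. All scaling factors below are assumptions for illustration, not measurements from this work.

```python
# Toy DVFS model (all constants assumed): for a memory-bound GPU kernel, only
# the compute-bound portion of the runtime stretches when the core clock is
# lowered, while dynamic power drops roughly with frequency, so total energy
# (power * time) can fall.

def runtime_s(core_ghz, compute_fraction=0.2, base_runtime_s=10.0, ref_ghz=1.4):
    compute = compute_fraction * base_runtime_s * (ref_ghz / core_ghz)
    memory = (1.0 - compute_fraction) * base_runtime_s   # unaffected by core clock
    return compute + memory

def power_w(core_ghz, dyn_w_at_ref=120.0, static_w=40.0, ref_ghz=1.4):
    return static_w + dyn_w_at_ref * (core_ghz / ref_ghz)

for f in (1.4, 1.0, 0.7):
    t, p = runtime_s(f), power_w(f)
    print(f"core {f:.1f} GHz: {t:5.2f} s, {p:6.1f} W, energy {t * p:7.1f} J")
# Energy drops as the core clock is lowered for this memory-bound case; a
# compute-bound kernel (compute_fraction near 1.0) would behave differently.
```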

01 Jan 2012
TL;DR: The widespread availability of hardware using multi-core technology will change the computing universe, helping professional animation companies produce more realistic movies faster, for less money and in less time, or enabling breakthrough ways to make a PC more natural and intuitive.
Abstract: One thing that never changes in the field of everyday computing is the requirement for faster speed and more performance; we are not satisfied even with new technologies. Each new technology and performance advance in processors leads to the next level of increased performance demands from consumers. The performance criteria are not just speed but also smaller and more powerful devices, longer battery life, quieter desktop PCs, and, in business, better price/performance per watt and lower cooling costs. Enthusiasts want improvements in productivity, security, multitasking, data protection, games, and many other capabilities. General consumers, too, will now get their hands on greater performance than ever before, which will significantly expand the utility of their home PCs and digital computing systems. Multi-core processors may also have the benefit of offering more performance without increasing power requirements, which will translate to greater performance per watt. Merging two or more powerful computing cores on a single processor opens up a world of new possibilities. The next generation of software applications will be developed for multi-core processors because of the performance and efficiency they can deliver compared to single-core processors. Whether these applications help professional animation companies produce more realistic movies faster for less money and in less time, or create breakthrough ways to make a PC more natural and intuitive, the widespread availability of hardware using multi-core technology will change the computing universe.

Patent
01 Aug 2012
TL;DR: In this article, a performance per watt optimization method and device for a computer based on renewable energy sources is presented, which comprises the following steps: pre-defining the load grade of each processor core in a multi-core processor according to the load support situation, pre-establishing a load regulation and control buffer area of the multicore processor, and predefining three system initialization variables: a front pointer, a tail pointer and an energy consumption keyword; then, figuring out an optimal electric energy distribution scheme among all the processor cores with the help of the load regulation
Abstract: The invention discloses a performance per watt optimization method and device for a computer based on renewable energy sources. The method comprises the following steps: pre-defining the load grade of each processor core in a multi-core processor according to the load support situation, pre-establishing a load regulation and control buffer area of the multi-core processor, and pre-defining three system initialization variables: a front pointer, a tail pointer and an energy consumption keyword; knowing the electric energy supply situation of the renewable energy sources by checking the energy consumption keyword after initialization, then figuring out an optimal electric energy distribution scheme among all the processor cores with the help of the load regulation and control buffer area, feeding the scheme back to an external power supply management controller as output, simultaneously updating the load regulation and control buffer area according to the adjustment result fed back by the power supply management controller and further realizing the optimization of the performance per watt of the multi-core computer based on the renewable energy sources.