Topic

Performance per watt

About: Performance per watt is a research topic. Over the lifetime, 315 publications have been published within this topic receiving 5778 citations.


Papers
Proceedings ArticleDOI
01 Dec 2018
TL;DR: This paper optimizes a widely used kernel, the radial basis function in a support vector machine, as a case study to evaluate the potential of FPGAs and the capabilities of high-level synthesis (HLS) for data-intensive applications.
Abstract: In this paper, we optimize a widely used kernel, radial basis function, in a support vector machine as a case study to evaluate the potential of using FPGAs and the capabilities of high-level synthesis (HLS) for data intensive applications. We explain the HLS flow, and use it to develop and evaluate the kernels optimized with vectorization, loop unrolling, and half-precision storage format. Our optimizations improve the kernel performance by a factor of 15.8 compared to a baseline kernel on the Nallatech 385A FPGA card that features an Intel Arria 10 GX 1150 FPGA. The half storage format can reduce the DSP and memory utilizations at the cost of increasing the logic utilization. Compared to the single-precision floating-point kernels, the half-precision kernels can reduce the dynamic power consumption on the FPGA by approximately 30%. In terms of energy efficiency, the performance per watt on the FPGA platform is approximately 3X higher than that on an Intel Xeon 16-core CPU, and 1.8X higher than that on an Nvidia Tesla K80 GPU. On the other hand, the raw performance on the FPGA is approximately 2X and 2.7X lower than that on the CPU and GPU, respectively.
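The abstract's relative figures imply the platforms' power draw, since performance per watt is just throughput divided by power. A small sketch of that arithmetic (illustrative only; the absolute throughput and power numbers are not given in the abstract):

```python
# Illustrative arithmetic, not from the paper: deriving relative power draw
# from the reported relative performance and performance-per-watt figures.
# Since perf_per_watt = performance / power, it follows that
# relative_power = relative_performance / relative_perf_per_watt.

def relative_power(rel_perf, rel_perf_per_watt):
    """Power of a platform relative to the FPGA baseline (FPGA = 1.0)."""
    return rel_perf / rel_perf_per_watt

# Reported in the abstract, with the FPGA as the 1.0 baseline:
cpu_rel_perf = 2.0          # CPU raw performance is ~2x the FPGA's
gpu_rel_perf = 2.7          # GPU raw performance is ~2.7x the FPGA's
cpu_rel_ppw = 1 / 3.0       # FPGA perf/watt is ~3x the CPU's
gpu_rel_ppw = 1 / 1.8       # FPGA perf/watt is ~1.8x the GPU's

print(relative_power(cpu_rel_perf, cpu_rel_ppw))  # CPU draws ~6x the FPGA's power
print(relative_power(gpu_rel_perf, gpu_rel_ppw))  # GPU draws ~4.9x the FPGA's power
```

So the FPGA's efficiency win comes from its much lower power draw outweighing its lower raw throughput.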

1 citation

Book ChapterDOI
08 Jan 2016
TL;DR: This chapter presents a novel, queueing theory-based modeling technique for evaluating multicore embedded architectures that does not require architectural-level benchmark simulation, and proposes a method to quantify the computing requirements of real benchmarks probabilistically.
Abstract: This chapter presents a novel, queueing theory-based modeling technique for evaluating multicore embedded architectures that do not require architectural-level benchmark simulation. This modeling technique enables quick and inexpensive architectural evaluation, with respect to design time and resources, as compared to developing and/or using existing multicore simulators and running benchmarks on these simulators. Based on a preliminary evaluation using the models, architectural designers can run targeted benchmarks to verify the performance characteristics of selected multicore architectures. The chapter proposes a method to quantify computing requirements of real benchmarks probabilistically. The modeling technique provides performance evaluation for workloads with any computing requirements as opposed to simulation-driven architectural evaluation that can provide performance results for specific benchmarks. The queueing theoretic modeling approach can be used for performance per watt and performance per unit area characterizations of multicore embedded architectures, with varying number of processor cores and cache configurations, to provide a comparative analysis.
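The core idea of queueing-theoretic architectural evaluation can be illustrated with the simplest such model; the chapter's actual models are more elaborate, so the following is only a minimal M/M/1 sketch under assumed arrival and service rates:

```python
# Minimal M/M/1 queueing sketch (illustrative; the chapter's models are
# richer). Each core is treated as a server with service rate mu, and a
# workload's computing demand as a Poisson arrival rate lam.

def mm1_metrics(lam, mu):
    """Return (utilization, mean response time) for an M/M/1 queue."""
    if lam >= mu:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    rho = lam / mu              # core utilization
    resp = 1.0 / (mu - lam)    # mean time a request spends in the system
    return rho, resp

# Compare two hypothetical core configurations for the same workload,
# without simulating a single benchmark instruction:
print(mm1_metrics(lam=8.0, mu=10.0))  # slower core: (0.8, 0.5)
print(mm1_metrics(lam=8.0, mu=16.0))  # faster core: (0.5, 0.125)
```

Models like this make the comparative evaluation cheap: changing a core or cache configuration only changes the model's rates, not a simulation run.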

1 citation

Proceedings ArticleDOI
02 Sep 2011
TL;DR: This paper proposes a new metric, ASTPI (Average Stall Time Per Instruction), and designs, implements, and evaluates ESHMP, an online monitoring approach based on that metric; the evaluation shows that ESHMP delivers scalability while adapting to a wide variety of applications.
Abstract: Recent research advocates performance heterogeneous multicore processors, where cores in the same processor have same instruction set architecture (ISA) but often different performance characteristics. These architectures are able to deliver higher performance per watt and area for programs with diverse architectural requirements than comparable homogeneous ones. However, such power and area efficiencies of performance heterogeneous multicore systems can only be accomplished when thread-to-core assignment is made according to the characteristics of both the workload and the core. In this paper, we propose a new metric, ASTPI (Average Stall Time Per Instruction), to measure the properties of threads. We design, implement and evaluate a new online monitoring approach called ESHMP, which is based on the metric. Our evaluation in the Linux 2.6.21 operating system shows that ESHMP delivers scalability while adapting to a wide variety of applications.
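The intuition behind a stall-based metric is that threads that rarely stall gain the most from a fast core. A hedged sketch of that idea (the counter names, threshold-free greedy policy, and numbers below are illustrative, not the paper's actual ESHMP algorithm):

```python
# Illustrative sketch of thread-to-core assignment driven by average stall
# time per instruction. Values are hypothetical hardware-counter readings.

def astpi(stall_cycles, instructions):
    """Average stall time per instruction from counter readings."""
    return stall_cycles / instructions

def assign_cores(threads, n_fast):
    """Greedy illustration: give the n_fast lowest-ASTPI threads fast cores,
    since compute-bound threads exploit a fast core's extra throughput."""
    ranked = sorted(threads, key=lambda t: astpi(t["stalls"], t["insns"]))
    return {t["name"]: ("fast" if i < n_fast else "slow")
            for i, t in enumerate(ranked)}

threads = [
    {"name": "cpu_bound", "stalls": 1_000_000,  "insns": 50_000_000},
    {"name": "mem_bound", "stalls": 40_000_000, "insns": 20_000_000},
]
print(assign_cores(threads, n_fast=1))  # cpu_bound -> fast, mem_bound -> slow
```

The memory-bound thread spends most of its time waiting on memory either way, so parking it on a slow core costs little performance while saving power.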

Posted Content
TL;DR: In this paper, the authors propose an information-theoretic framework referred to as PaRMIS to create Pareto-optimal resource management policies for given target applications and design objectives.
Abstract: Mobile system-on-chips (SoCs) are growing in their complexity and heterogeneity (e.g., Arm's Big-Little architecture) to meet the needs of emerging applications, including games and artificial intelligence. This makes it very challenging to optimally manage the resources (e.g., controlling the number and frequency of different types of cores) at runtime to meet the desired trade-offs among multiple objectives such as performance and energy. This paper proposes a novel information-theoretic framework referred to as PaRMIS to create Pareto-optimal resource management policies for given target applications and design objectives. PaRMIS specifies parametric policies to manage resources and learns statistical models from candidate policy evaluation data in the form of target design objective values. The key idea is to select a candidate policy for evaluation in each iteration guided by statistical models that maximize the information gain about the true Pareto front. Experiments on a commercial heterogeneous SoC show that PaRMIS achieves better Pareto fronts and is easily usable to optimize complex objectives (e.g., performance per Watt) when compared to prior methods.
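At the heart of any Pareto-optimal policy search is the dominance test over objective vectors. A minimal sketch of that building block for a (performance, energy) trade-off (PaRMIS's information-theoretic candidate selection is not shown; the policy values are hypothetical):

```python
# Illustrative Pareto-dominance filter for resource-management policies
# evaluated on two objectives: performance (higher is better) and
# energy (lower is better).

def dominates(a, b):
    """True if policy a is at least as good as b on both objectives
    and strictly better on at least one."""
    perf_a, energy_a = a
    perf_b, energy_b = b
    return (perf_a >= perf_b and energy_a <= energy_b and
            (perf_a > perf_b or energy_a < energy_b))

def pareto_front(points):
    """Keep only the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# (performance, energy) for hypothetical candidate policies:
policies = [(10, 5), (12, 7), (9, 4), (12, 6), (8, 8)]
print(pareto_front(policies))  # [(10, 5), (9, 4), (12, 6)]
```

Every policy on the front represents a distinct, defensible trade-off; the framework's job is to find this front with as few costly policy evaluations as possible.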

Book ChapterDOI
15 Sep 2020
TL;DR: NuPow as discussed by the authors is a hierarchical scheduling and power management framework for architectures with multiple cores per voltage and frequency domain and non-uniform memory access (NUMA) properties.
Abstract: Power management and task placement pose two of the greatest challenges for future many-core processors in data centers. With hundreds of cores on a single die, cores experience varying memory latencies and cannot individually regulate voltage and frequency, therefore calling for new approaches to scheduling and power management. This work presents NuPow, a hierarchical scheduling and power management framework for architectures with multiple cores per voltage and frequency domain and non-uniform memory access (NUMA) properties. NuPow considers the conflicting goals of grouping virtual machines (VMs) with similar load patterns while also placing them as close as possible to the accessed data. Implemented and evaluated on existing hardware, NuPow achieves significantly better performance per watt compared to competing approaches.
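The conflicting goals the abstract describes can be captured as a placement cost with two terms: load-pattern mismatch within a voltage/frequency domain, and distance to the accessed data. A hedged sketch (the weights, scoring formula, and numbers are illustrative, not NuPow's actual policy):

```python
# Illustrative placement cost balancing NuPow's two goals: group VMs with
# similar load patterns in one voltage/frequency domain, and keep each VM
# close to its data (low NUMA distance). Lower cost is better.

def placement_cost(vm_load, domain_loads, numa_dist, alpha=1.0, beta=1.0):
    """Load mismatch within the domain plus weighted memory distance."""
    mismatch = sum(abs(vm_load - l) for l in domain_loads) / max(len(domain_loads), 1)
    return alpha * mismatch + beta * numa_dist

def best_domain(vm_load, domains):
    """domains: list of (loads_of_resident_vms, numa_distance) pairs."""
    costs = [placement_cost(vm_load, loads, dist) for loads, dist in domains]
    return costs.index(min(costs))

domains = [
    ([0.9, 0.8], 2),  # busy domain, far from the VM's data
    ([0.2, 0.3], 0),  # lightly loaded domain, local to the data
]
print(best_domain(0.25, domains))  # 1: similar load AND close to the data
```

When the two goals conflict (e.g. the similar-load domain is the distant one), the weights decide the trade-off, which is exactly the tension a hierarchical framework has to resolve.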

Network Information
Related Topics (5)
Cache
59.1K papers, 976.6K citations
81% related
Benchmark (computing)
19.6K papers, 419.1K citations
80% related
Programming paradigm
18.7K papers, 467.9K citations
77% related
Compiler
26.3K papers, 578.5K citations
77% related
Scalability
50.9K papers, 931.6K citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31