Proceedings ArticleDOI

Managing performance vs. accuracy trade-offs with loop perforation

TL;DR: The results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.
Abstract: Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use ad-hoc, domain-specific techniques developed specifically for the computation at hand. Loop perforation provides a general technique to trade accuracy for performance by transforming loops to execute a subset of their iterations. A criticality testing phase filters out critical loops (whose perforation produces unacceptable behavior) to identify tunable loops (whose perforation produces more efficient and still acceptably accurate computations). A perforation space exploration algorithm perforates combinations of tunable loops to find Pareto-optimal perforation policies. Our results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.
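To make the transformation concrete, here is a minimal C sketch of the kind of loop perforation the abstract describes; the perforation factor k, the stride-k skipping pattern, and the rescaling of the partial result are illustrative choices, not the paper's exact policies.

    #include <stddef.h>

    /* Original loop: accumulates over every element. */
    double sum_all(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Perforated variant: executes only every k-th iteration (k >= 1),
     * then extrapolates the partial result back to the full range. */
    double sum_perforated(const double *x, size_t n, size_t k) {
        double s = 0.0;
        size_t used = 0;
        for (size_t i = 0; i < n; i += k) {   /* skip k-1 of every k iterations */
            s += x[i];
            used++;
        }
        return used ? s * ((double)n / (double)used) : 0.0;
    }

In the paper's workflow, criticality testing would run such a perforated variant on representative inputs and keep it only if the output distortion stays within the accuracy bound (for example, the 10% threshold quoted above).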


Citations
Journal ArticleDOI
TL;DR: A survey of techniques for approximate computing (AC), which discusses strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units, processor components, memory technologies, and so forth, as well as programming frameworks for AC.
Abstract: Approximate computing trades off computation quality with effort expended, and as rising performance demands confront plateauing resource budgets, approximate computing has become not merely attractive, but even imperative. In this article, we present a survey of techniques for approximate computing (AC). We discuss strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units (e.g., CPU, GPU, and FPGA), processor components, memory technologies, and so forth, as well as programming frameworks for AC. We classify these techniques based on several key characteristics to emphasize their similarities and differences. The aim of this article is to provide researchers with insights into the working of AC techniques and inspire more efforts in this area to make AC the mainstream computing approach in future systems.

890 citations

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A programming model is defined that allows programmers to identify approximable code regions -- code that can produce imprecise but acceptable results -- and offloading such regions to a low-power neural accelerator is faster and more energy efficient than executing the original code.
Abstract: This paper describes a learning-based approach to the acceleration of approximate programs. We describe the \emph{Parrot transformation}, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a \emph{neural processing unit} (NPU). The NPU is tightly coupled to the processor pipeline to accelerate small code regions. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions -- code that can produce imprecise but acceptable results. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3x and energy savings of 3.0x on average with quality loss of at most 9.6%.
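As a rough illustration of the Parrot idea, the C sketch below stands in for an approximable region and its neural replacement; the 2-4-1 topology, the ReLU-style activation, and the zeroed placeholder weights are assumptions for illustration, not the NPU design from the paper.

    #include <math.h>

    /* Approximable region the programmer would mark for the Parrot
     * transformation (annotation syntax omitted). */
    double kernel(double a, double b) {
        return sqrt(a * a + b * b);            /* exact computation */
    }

    /* Stand-in for the compiled NPU invocation: a tiny 2-4-1 perceptron
     * that would be trained offline to mimic kernel().  The zero weights
     * below are placeholders; real values come from the training phase. */
    static const double W1[4][2] = {{0}};
    static const double B1[4]    = {0};
    static const double W2[4]    = {0};
    static const double B2       = 0.0;

    double kernel_npu(double a, double b) {
        double h[4];
        for (int i = 0; i < 4; i++) {
            double z = W1[i][0] * a + W1[i][1] * b + B1[i];
            h[i] = z > 0.0 ? z : 0.0;          /* ReLU-style activation */
        }
        double out = B2;
        for (int i = 0; i < 4; i++)
            out += W2[i] * h[i];
        return out;                            /* approximate result */
    }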

532 citations


Cites background or result from "Managing performance vs. accuracy trade-offs with loop perforation"

  • ...As prior work on approximate programming [2, 8, 29, 41, 43] has shown, it is not difficult to deem regions approximable....


  • ...The domains are commensurate with evaluations of previous work on approximate computing [1, 11, 28, 29, 41, 43]....


Proceedings ArticleDOI
29 May 2013
TL;DR: This work analyzes and characterizes the inherent application resilience present in a suite of 12 widely used applications from the domains of recognition, data mining, and search, and proposes a systematic framework for Application Resilience Characterization (ARC) that characterizes the resilient parts of an application using approximation models that abstract a wide range of approximate computing techniques.
Abstract: Approximate computing is an emerging design paradigm that enables highly efficient hardware and software implementations by exploiting the inherent resilience of applications to inexactness in their computations. Previous work in this area has demonstrated the potential for significant energy and performance improvements, but largely consists of ad hoc techniques that have been applied to a small number of applications. Taking approximate computing closer to mainstream adoption requires (i) a deeper understanding of inherent application resilience across a broader range of applications, (ii) tools that can quantitatively establish the inherent resilience of an application, and (iii) methods to quickly assess the potential of various approximate computing techniques for a given application. We make two key contributions in this direction. Our primary contribution is the analysis and characterization of inherent application resilience present in a suite of 12 widely used applications from the domains of recognition, data mining, and search. Based on this analysis, we present several new insights into the nature of resilience and its relationship to various key application characteristics. To facilitate our analysis, we propose a systematic framework for Application Resilience Characterization (ARC) that (a) partitions an application into resilient and sensitive parts and (b) characterizes the resilient parts using approximation models that abstract a wide range of approximate computing techniques. We believe that the key insights that we present can help shape further research in the area of approximate computing, while automatic resilience characterization frameworks such as ARC can greatly aid designers in the adoption of approximate computing.
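In the spirit of ARC's approximation models, the C sketch below perturbs the output of a kernel flagged as resilient so that the quality impact of a generic approximation can be measured without committing to one concrete technique; the uniform-noise model and the 1% error magnitude are assumptions for illustration.

    #include <stdlib.h>

    /* Generic approximation model: inject bounded relative error into a
     * value produced by a resilient program part. */
    double approx_model(double exact, double rel_err) {
        double noise = ((double)rand() / RAND_MAX) * 2.0 - 1.0;  /* in [-1, 1] */
        return exact * (1.0 + rel_err * noise);
    }

    /* Resilient part: its output passes through the approximation model.
     * Sensitive parts (control flow, data structure management) stay exact. */
    double resilient_kernel(double x) {
        double exact = x * x;                  /* stand-in computation */
        return approx_model(exact, 0.01);      /* up to 1% relative error */
    }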

464 citations

Journal ArticleDOI
TL;DR: NPUs leverage the approximate algorithmic transformation that converts regions of code from a von Neumann model to a neural model, and show that significant performance and efficiency gains are possible when the abstraction of full accuracy is relaxed in general-purpose computing.
Abstract: As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are needed to continue improvements in the performance and energy efficiency of general-purpose processors. One such departure is approximate computing, where error in computation is acceptable and the traditional robust digital abstraction of near-perfect accuracy is relaxed. Conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension---error---and trades the accuracy of computation for gains in both energy and performance. Techniques to harvest large savings from small errors have proven elusive. This paper describes a new approach that uses machine learning-based transformations to accelerate approximation-tolerant programs. The core idea is to learn how an approximable region of code---code that can produce imprecise but acceptable results---behaves, and to replace the original code region with an efficient computation of the learned model. We use neural networks to learn code behavior and approximate it. We describe the Parrot algorithmic transformation, which leverages a simple programmer annotation ("approximable") to transform a code region from a von Neumann model to a neural model. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to permit profitable acceleration even when small regions of code are transformed. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3× and energy savings of 3.0× on average with average quality loss of at most 9.6%. NPUs form a new class of accelerators and show that significant gains in both performance and efficiency are achievable when the traditional abstraction of near-perfect accuracy is relaxed in general-purpose computing.

290 citations

Proceedings ArticleDOI
07 Dec 2013
TL;DR: Quality programmable processors, in which the notion of quality is explicitly codified in the HW/SW interface, are suggested to be a significant step towards bringing approximate computing to the mainstream.
Abstract: Approximate computing leverages the intrinsic resilience of applications to inexactness in their computations, to achieve a desirable trade-off between efficiency (performance or energy) and acceptable quality of results. To broaden the applicability of approximate computing, we propose quality programmable processors, in which the notion of quality is explicitly codified in the HW/SW interface, i.e., the instruction set. The ISA of a quality programmable processor contains instructions associated with quality fields to specify the accuracy level that must be met during their execution. We show that this ability to control the accuracy of instruction execution greatly enhances the scope of approximate computing, allowing it to be applied to larger parts of programs. The micro-architecture of a quality programmable processor contains hardware mechanisms that translate the instruction-level quality specifications into energy savings. Additionally, it may expose the actual error incurred during the execution of each instruction (which may be less than the specified limit) back to software. As a first embodiment of quality programmable processors, we present the design of Quora, an energy efficient, quality programmable vector processor. Quora utilizes a 3-tiered hierarchy of processing elements that provide distinctly different energy vs. quality trade-offs, and uses hardware mechanisms based on precision scaling with error monitoring and compensation to facilitate quality programmable execution. We evaluate an implementation of Quora with 289 processing elements in 45nm technology. The results demonstrate that leveraging quality-programmability leads to 1.05×–1.7× savings in energy for virtually no loss (< 0.5%) in application output quality, and 1.18×–2.1× energy savings for modest impact (<2.5%) on output quality. Our work suggests that quality programmable processors are a significant step towards bringing approximate computing to the mainstream.
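A software-level sketch of the precision-scaling idea is given below, assuming a hypothetical quality field that says how many low-order bits an add may drop; the encoding and the error-monitoring convention are illustrative, not the Quora ISA.

    #include <stdint.h>
    #include <stdlib.h>

    /* Emulated quality-programmable add: drop_bits plays the role of the
     * quality field (assumed small), and the error actually incurred is
     * exposed back to software via err_out. */
    int32_t qadd(int32_t a, int32_t b, unsigned drop_bits, int32_t *err_out) {
        int32_t mask   = (int32_t)~((1u << drop_bits) - 1u);
        int32_t approx = (a & mask) + (b & mask);   /* reduced-precision add */
        int32_t exact  = a + b;                     /* reference value */
        if (err_out)
            *err_out = abs(exact - approx);         /* actual error incurred */
        return approx;
    }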

250 citations

References


Proceedings ArticleDOI
20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Abstract: We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.

4,841 citations

Journal ArticleDOI
TL;DR: The Baseline method has been by far the most widely implemented JPEG method to date, and is sufficient in its own right for a large number of applications.
Abstract: For the past few years, a joint ISO/CCITT committee known as JPEG (Joint Photographic Experts Group) has been working to establish the first international compression standard for continuous-tone still images, both grayscale and color. JPEG's proposed standard aims to be generic, to support a wide variety of applications for continuous-tone images. To meet the differing needs of many applications, the JPEG standard includes two basic compression methods, each with various modes of operation. A DCT-based method is specified for "lossy" compression, and a predictive method for "lossless" compression. JPEG features a simple lossy technique known as the Baseline method, a subset of the other DCT-based modes of operation. The Baseline method has been by far the most widely implemented JPEG method to date, and is sufficient in its own right for a large number of applications. This article provides an overview of the JPEG standard, and focuses in detail on the Baseline method.
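The lossy step of the Baseline method can be sketched in a few lines of C: each DCT coefficient is divided by a quantization-table entry and rounded, which is exactly where accuracy is traded for compression. Real encoders quantize full 8x8 blocks with standard tables; the single-coefficient form below is only a sketch.

    #include <math.h>

    /* Quantize one DCT coefficient with table entry q (the lossy step). */
    int quantize(double dct_coeff, int q) {
        return (int)lround(dct_coeff / q);
    }

    /* Dequantize during decoding; the reconstruction generally differs
     * from the original coefficient, which is the accepted accuracy loss. */
    double dequantize(int level, int q) {
        return (double)(level * q);
    }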

3,944 citations

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations