Proceedings ArticleDOI

# Data Subsetting: A Data-Centric Approach to Approximate Computing

25 Mar 2019-pp 576-581

TL;DR: A data-centric approach to AxC is proposed, which can boost the performance of memory-subsystem-limited applications and proposes a data-access approximation technique called data subsetting, in which all accesses to a data structure are redirected to a subset of its elements so that the overall footprint of memory accesses is decreased.

AbstractApproximate Computing (AxC), which leverages the intrinsic resilience of applications to approximations in their underlying computations, has emerged as a promising approach to improving computing system efficiency. Most prior efforts in AxC take a compute-centric approach and approximate arithmetic or other compute operations through design techniques at different levels of abstraction. However, emerging workloads such as machine learning, search and data analytics process large amounts of data and are significantly limited by the memory sub-systems of modern computing platforms.In this work, we shift the focus of approximations from computations to data, and propose a data-centric approach to AxC, which can boost the performance of memory-subsystem-limited applications. The key idea is to modulate the application’s data-accesses in a manner that reduces off-chip memory traffic. Specifically, we propose a data-access approximation technique called data subsetting, in which all accesses to a data structure are redirected to a subset of its elements so that the overall footprint of memory accesses is decreased. We realize data subsetting in a manner that is transparent to hardware and requires only minimal changes to application software. Recognizing that most applications of interest represent and process data as multi-dimensional arrays or tensors, we develop a templated data structure called SubsettableTensor that embodies mechanisms to define the accessible subset and to suitably redirect accesses to elements outside the subset. As a further optimization, we observe that data subsetting may cause some computations to become redundant and propose a mechanism for application software to identify and eliminate such computations. We implement SubsettableTensor as a C++ class and evaluate it using parallel software implementations of 7 machine learning applications on a 48-core AMD Opteron server. Our experiments indicate that data subsetting enables 1.33×–4.44× performance improvement with <0.5% loss in application-level quality, underscoring its promise as a new approach to approximate computing.

##### Citations
More filters
Journal ArticleDOI
, Xiao Sun1, Wei Wang1
10 Nov 2020
TL;DR: RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core that is built from the ground-up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware, is presented.
Abstract: Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency of AI platforms and play a pivotal role in the broader adoption of AI-based applications and services. To this end, we present RaPiD, a multi-tera operations per second (TOPS) AI hardware accelerator core (fabricated at 14-nm technology) that we built from the ground-up using AxC techniques across the stack including algorithms, architecture, programmability, and hardware. We highlight the workload-guided systematic explorations of AxC techniques for AI, including custom number representations, quantization/pruning methodologies, mixed-precision architecture design, instruction sets, and compiler technologies with quality programmability, employed in the RaPiD accelerator.

8 citations

• ...[123] Y....

[...]

Journal ArticleDOI
TL;DR: This work proposes an approximation-aware design approach for optimizing the energy, memory and performance of an IBC system, making it suitable for embedded implementation and designs an optimal approximation- aware controller that models the approximation error as sensor noise and shows QoC improvements.
Abstract: Image-based control (IBC) systems are common in many modern applications. In such systems, image-based sensing imposes massive compute workload, making them challenging to implement on embedded platforms. Approximate image processing is a way to handle this challenge. In essence, approximation reduces the workload at the cost of additional sensor noise. In this work, we propose an approximation-aware design approach for optimizing the energy, memory and performance of an IBC system, making it suitable for embedded implementation. First, we perform compute- and data-centric approximations and evaluate its impact on the energy efficiency, memory utilization and closed-loop quality-of-control (QoC) of the IBC system. We observe that the workload reductions due to approximations allow mapping these lighter approximated IBC tasks to embedded platforms with lower power consumption while still ensuring proper system functionality. Therefore, we explore the interplay between approximations and platform mappings to improve the energy-efficiency of IBC systems. Further, an IBC system operates under several environmental scenarios e.g., weather conditions. We evaluate the sensitivity of the IBC system to our approximation-aware design approach when operated under different scenarios and perform a failure probability (FP) analysis using Monte-Carlo simulations to analyze the robustness of the approximate system. Finally, we design an optimal approximation-aware controller that models the approximation error as sensor noise and show QoC improvements. We demonstrate the effectiveness of our approach using a concrete case-study of a lane keeping assist system (LKAS) using a heterogeneous NVIDIA AGX Xavier embedded platform in a hardware-in-the-loop (HiL) framework. We show energy and memory reduction of up to 92% and 88% respectively, for 44% QoC improvements with respect to the accurate implementation. We show that our approximation-aware design approach has an FP (per km) $\leq 9.6\times 10^{-6}$ %.

2 citations

Proceedings ArticleDOI
01 Feb 2021
TL;DR: VSX as mentioned in this paper leverages the application property of value similarity, i.e., input operands to computations that occur close-in-time take similar values, and the fetch-decode-execute of entire instruction sequences are skipped to benefit performance.
Abstract: Approximate Computing (AxC) is a popular design paradigm wherein selected computations are executed approximately to gain efficiency with minimal impact on application-level quality. Most efforts in AxC target specialized accelerators and domain-specific processors, with relatively limited focus on General-Purpose Processors (GPPs). However, GPPs are still broadly used to execute applications that are amenable to AxC, making AxC for GPPs a critical challenge. A key bottleneck in applying AxC to GPPs is that their execution units account for only a small fraction of total energy, requiring a holistic approach targeting compute, memory and control front-ends. This paper proposes such an approach that leverages the application property of value similarity, i.e., input operands to computations that occur close-in-time take similar values. Such similar computations are dynamically pre-detected and the fetch-decode-execute of entire instruction sequences are skipped to benefit performance. To this end, we propose a set of lightweight micro-architectural and ISA extensions called VSX that enable: (i) similarity detection amongst values in a cache-line, (ii) skipping of pre-defined instructions and/or loop iterations when similarity is detected, and (iii) substituting outputs of skipped instructions with saved results from previously executed computations. We also develop compiler techniques, guided by user annotations, to benefit from VSX in the context of common Machine Learning (ML) kernels. Our RTL implementation of VSX for a low-power RISC-V processor incurred 2.13% area overhead and yielded 1.19×-3.84× speedup with <0.5 % accuracy loss on 6 ML benchmarks.

##### References
More filters
Journal ArticleDOI

04 Jun 2011
TL;DR: EnerJ is developed, an extension to Java that adds approximate data types and a hardware architecture that offers explicit approximate storage and computation and allows a programmer to control explicitly how information flows from approximate data to precise data.
Abstract: Energy is increasingly a first-order concern in computer systems. Exploiting energy-accuracy trade-offs is an attractive choice in applications that can tolerate inaccuracies. Recent work has explored exposing this trade-off in programming models. A key challenge, though, is how to isolate parts of the program that must be precise from those that can be approximated so that a program functions correctly even as quality of service degrades.We propose using type qualifiers to declare data that may be subject to approximate computation. Using these types, the system automatically maps approximate variables to low-power storage, uses low-power operations, and even applies more energy-efficient algorithms provided by the programmer. In addition, the system can statically guarantee isolation of the precise program component from the approximate component. This allows a programmer to control explicitly how information flows from approximate data to precise data. Importantly, employing static analysis eliminates the need for dynamic checks, further improving energy savings. As a proof of concept, we develop EnerJ, an extension to Java that adds approximate data types. We also propose a hardware architecture that offers explicit approximate storage and computation. We port several applications to EnerJ and show that our extensions are expressive and effective; a small number of annotations lead to significant potential energy savings (10%-50%) at very little accuracy cost.

645 citations

### "Data Subsetting: A Data-Centric App..." refers background in this paper

• ...Next, at the architecture level, research efforts have explored approximate architectures for both general-purpose and domain-specific processors [7], [8], with suitable programming support [6]....

[...]

Proceedings ArticleDOI
09 Sep 2011
TL;DR: The results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factors of seven) while changing the result that the application produces by less than 10%.
Abstract: Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use ad-hoc, domain-specific techniques developed specifically for the computation at hand. Loop perforation provides a general technique to trade accuracy for performance by transforming loops to execute a subset of their iterations. A criticality testing phase filters out critical loops (whose perforation produces unacceptable behavior) to identify tunable loops (whose perforation produces more efficient and still acceptably accurate computations). A perforation space exploration algorithm perforates combinations of tunable loops to find Pareto-optimal perforation policies. Our results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.

438 citations

### "Data Subsetting: A Data-Centric App..." refers methods in this paper

• ...We compare the performance of data subsetting with a wellknown compute-centric approximation technique called loop perforation [10], wherein iterations of loops are periodically or randomly skipped from execution....

[...]

Proceedings ArticleDOI
05 Mar 2011
TL;DR: Flikker exposes and leverages an interesting trade-off between energy consumption and hardware correctness, and shows that many applications are naturally tolerant to errors in the non-critical data, and in the vast majority of cases, the errors have little or no impact on the application's final outcome.
Abstract: Energy has become a first-class design constraint in computer systems. Memory is a significant contributor to total system power. This paper introduces Flikker, an application-level technique to reduce refresh power in DRAM memories. Flikker enables developers to specify critical and non-critical data in programs and the runtime system allocates this data in separate parts of memory. The portion of memory containing critical data is refreshed at the regular refresh-rate, while the portion containing non-critical data is refreshed at substantially lower rates. This partitioning saves energy at the cost of a modest increase in data corruption in the non-critical data. Flikker thus exposes and leverages an interesting trade-off between energy consumption and hardware correctness. We show that many applications are naturally tolerant to errors in the non-critical data, and in the vast majority of cases, the errors have little or no impact on the application's final outcome. We also find that Flikker can save between 20-25% of the power consumed by the memory sub-system in a mobile device, with negligible impact on application performance. Flikker is implemented almost entirely in software, and requires only modest changes to the hardware.

426 citations

### "Data Subsetting: A Data-Centric App..." refers background in this paper

• ...However, recent work suggests that applying it to other system components has a potential to result in additional improvements [15]–[20]....

[...]

• ...These include reducing DRAM refresh rate [15], [21], storing/accessing data in a compressed format [17], and speculating on the results of loads [18], [19], among others....

[...]

Proceedings ArticleDOI
03 Mar 2012
TL;DR: An ISA extension that provides approximate operations and storage is described that gives the hardware freedom to save energy at the cost of accuracy and Truffle, a microarchitecture design that efficiently supports the ISA extensions is proposed.
Abstract: Disciplined approximate programming lets programmers declare which parts of a program can be computed approximately and consequently at a lower energy cost. The compiler proves statically that all approximate computation is properly isolated from precise computation. The hardware is then free to selectively apply approximate storage and approximate computation with no need to perform dynamic correctness checks.In this paper, we propose an efficient mapping of disciplined approximate programming onto hardware. We describe an ISA extension that provides approximate operations and storage, which give the hardware freedom to save energy at the cost of accuracy. We then propose Truffle, a microarchitecture design that efficiently supports the ISA extensions. The basis of our design is dual-voltage operation, with a high voltage for precise operations and a low voltage for approximate operations. The key aspect of the microarchitecture is its dependence on the instruction stream to determine when to use the low voltage. We evaluate the power savings potential of in-order and out-of-order Truffle configurations and explore the resulting quality of service degradation. We evaluate several applications and demonstrate energy savings up to 43%.

399 citations

### "Data Subsetting: A Data-Centric App..." refers background in this paper

• ...Next, at the architecture level, research efforts have explored approximate architectures for both general-purpose and domain-specific processors [7], [8], with suitable programming support [6]....

[...]

Proceedings ArticleDOI
02 Jan 2011
TL;DR: A novel multiplier architecture with tunable error characteristics, that leverages a modified inaccurate 2x2 building block, that can achieve 2X - 8X better Signal-Noise-Ratio (SNR) for the same power savings when compared to recent voltage over-scaling based power-error tradeoff methods is proposed.
Abstract: We propose a novel multiplier architecture with tunable error characteristics, that leverages a modified inaccurate 2x2 building block. Our inaccurate multipliers achieve an average power saving of 31.78% ? 45.4% over corresponding accurate multiplier designs, for an average error of 1.39%?3.32%. Using image filtering and JPEG compression as sample applications we show that our architecture can achieve 2X - 8X better Signal-Noise-Ratio (SNR) for the same power savings when compared to recent voltage over-scaling based power-error tradeoff methods. We project the multiplier power savings to bigger designs highlighting the fact that the benefits are strongly design dependent. We compare this circuit-centric approach to power quality tradeoffs with a pure software adaptation approach for a JPEG example. We also enhance the design to allow for correct operation of the multiplier using a residual adder, for non error resilient applications.

364 citations

### "Data Subsetting: A Data-Centric App..." refers methods in this paper

• ...This ranges from manual approximate designs of adders and multipliers [3], [4] to automatic methodologies capable of approximating arbitrary logic [5]....

[...]