SAM: A Segmentation Based Approximate Multiplier for Error Tolerant Applications
22 May 2021-pp 1-5
TL;DR: A novel technique to multiply two unsigned binary numbers through a Segmentation based Approximate Multiplier (SAM) that reduces the size of the Partial Products Matrix to a Reduced Partial Product Matrix (R-PPM) and eliminates the extra hardware required for compression and rearrangement of partial products.
Abstract: In recent times, approximate computing has found significant use in applications that can tolerate partially inaccurate results. This tolerance can be exploited to design simpler hardware that saves area and energy. In this work, we propose a novel technique to multiply two unsigned binary numbers through a Segmentation based Approximate Multiplier (SAM). The proposed design reduces the Partial Products Matrix (PPM) from the order of n × (2n − 1) to a Reduced Partial Product Matrix (R-PPM) of the order 4 × 2n. It also eliminates the extra hardware required for compression and rearrangement of partial products. μ-SAM, an optimized version of our basic design, is also proposed; it further reduces the on-chip area and power consumption of the basic design. The basic design consumes 32.43% less on-chip area than the conventional Wallace tree multiplier [1] and produces results that are 89.1% more accurate than other existing state-of-the-art designs such as TOSAM [2], LETAM [3], and DQ4:2C4 [4].
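To make the matrix sizes in the abstract concrete, the following is a minimal Python sketch of the conventional (exact) partial products matrix for two n-bit unsigned operands, together with the cell counts implied by the two orders quoted above. It does not reproduce SAM's segmentation scheme, which the abstract does not describe; the function names `partial_product_matrix`, `sum_ppm`, `ppm_cells` and `rppm_cells` are hypothetical and chosen for illustration only.

```python
def partial_product_matrix(a: int, b: int, n: int) -> list[list[int]]:
    """Exact PPM: n rows by (2n - 1) columns, one row per multiplier bit."""
    width = 2 * n - 1
    rows = []
    for i in range(n):
        b_i = (b >> i) & 1
        row = [0] * width
        for j in range(n):
            # Column index equals the bit weight i + j (at most 2n - 2).
            row[i + j] = ((a >> j) & 1) & b_i
        rows.append(row)
    return rows


def sum_ppm(ppm: list[list[int]]) -> int:
    """Add every bit at its column weight to recover the product."""
    return sum(bit << col for row in ppm for col, bit in enumerate(row))


def ppm_cells(n: int) -> int:
    """Cell count of the conventional PPM, n x (2n - 1)."""
    return n * (2 * n - 1)


def rppm_cells(n: int) -> int:
    """Cell count of a 4 x 2n matrix, the R-PPM order quoted in the abstract."""
    return 4 * 2 * n


if __name__ == "__main__":
    ppm = partial_product_matrix(200, 123, n=8)
    print(len(ppm), len(ppm[0]), sum_ppm(ppm))
    for n in (8, 16, 32):
        print(n, ppm_cells(n), rppm_cells(n))
```

Under these counts the 4 × 2n matrix grows linearly in n while the conventional matrix grows quadratically, so the gap widens with operand width.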
Citations
28 May 2022
TL;DR: This paper proposes a novel architecture for an (unsigned × unsigned) approximate rounding and truncation based MAC unit named ART-MAC, which replaces the accurate multiplier architecture with an approximate multiplier proposed along with this work, thus improving the overall Quality of Results (QoR).
Abstract: In recent times, approximate computing has emerged as a promising technique to achieve significant power and energy benefits in computational systems. It is widely employed in fault-tolerant, computationally intensive applications that require large arithmetic blocks. Applications such as image processing and machine learning often invoke the Multiply-Accumulate (MAC) unit for convolution operations. This paper proposes a novel architecture for an (unsigned × unsigned) approximate rounding and truncation based MAC unit named ART-MAC. It replaces the accurate multiplier architecture with an approximate multiplier proposed along with this work, thus improving the overall Quality of Results (QoR). The proposed design consumes 35.35% less power and achieves a speedup of 1.23 times compared to the conventional MAC unit. On average, the ART-MAC consumes 7.44% less on-chip area and has a 13.49% lower power-delay-product (PDP) than existing state-of-the-art designs.
1 citation
06 Apr 2022
TL;DR: This paper proposes an efficient reconfigurable carry speculative approximate adder with rectification (EFCSA adder), which can be used in most error-resilient applications.
Abstract: Approximate computing offers the flexibility to trade off accuracy for computational speed, reduced power consumption, and smaller on-chip area. Such techniques have attracted extensive attention in recent times, as they can be used in most error-resilient applications. Although several approximate adder designs have been proposed in the past, there is still scope for further improvement. Existing state-of-the-art designs often involve a trade-off between the margin of acceptable error and the Quality of Results (QoR). This paper proposes an approximate adder with higher accuracy and better QoR for error-resilient applications, called an efficient reconfigurable carry speculative approximate adder with rectification, or simply the EFCSA adder. Its sister version, the REFCSA adder, is reconfigurable, allowing an accurate configuration at runtime. The proposed design limits the length of the carry chain in the conventional ripple carry adder (RCA) using a block-based mechanism. EFCSA produces results 12.3x faster than the conventional RCA. On average, the adder is 45.1% more accurate and has a 31.97% better power-delay-product (PDP) than several existing state-of-the-art approximate designs.
1 citation
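The idea of limiting the carry chain with a block-based mechanism can be illustrated with a generic carry-speculative segmented adder. The Python sketch below is not the EFCSA design and omits its rectification logic; it is a minimal model, assuming each k-bit block receives a carry speculated from the previous block alone. The function name `block_speculative_add` and the default widths are hypothetical.

```python
def block_speculative_add(a: int, b: int, n: int = 16, k: int = 4) -> int:
    """Approximate n-bit addition with the carry chain limited to two blocks.

    Assumption: the carry into block i is speculated from block i-1 only,
    ignoring whatever carry block i-1 itself received.
    """
    assert n % k == 0, "this sketch assumes n is a multiple of k"
    mask = (1 << k) - 1
    result = 0
    carry_in = 0
    for lo in range(0, n, k):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        total = a_blk + b_blk + carry_in
        result |= (total & mask) << lo
        # Speculated carry for the next block: computed without carry_in,
        # so a carry can never ripple through more than two blocks.
        carry_in = (a_blk + b_blk) >> k
    # Carry-out of the top block, also speculated.
    result |= carry_in << n
    return result


if __name__ == "__main__":
    print(hex(block_speculative_add(0x0F0F, 0x0101)))  # carries stay local
    print(hex(block_speculative_add(0x00FF, 0x0001)))  # long carry chain is cut
```

The second call shows the cost of the shortcut: when a carry would have to propagate across more than one block boundary, it is dropped, which is the kind of error a rectification stage would aim to reduce.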
18 May 2022
TL;DR: In this paper, a new shift-add segmented hybrid approximated (SASHA) multiplier for image processing applications is proposed, which uses segmentation to achieve high performance in power reduction and accuracy of results.
Abstract: In this paper, a new shift-add segmented hybrid approximated (SASHA) multiplier for image processing applications is proposed. The new multiplier uses segmentation to achieve high performance in power reduction and accuracy of results. It segments the operands and most significant bits. Three hardware implementations of the 2-bit, 4-bit and 6-bit SASHA approximate multiplier are presented in this paper. The power consumption and accuracy of the proposed multipliers are evaluated by comparing their performance with non-approximated multipliers using different design parameters. Experimental results show that the approximation error of the 2-bit, 4-bit and 6-bit SASHA multipliers, measured as mean relative error percentage, has minimal impact on the performance of image processing applications such as edge detection. Additionally, a reduction of signal and logic power consumption is observed.
References
TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, and is compared with both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000.
Abstract: Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
40,609 citations
TL;DR: A design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step, using straightforward diode-transistor logic.
Abstract: It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction, a design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under 1 μsec, and quotients in 3 μsec. A rapid square-root process is also outlined. Approximate component counts are given for the proposed design, and it is found that the cost of the unit would be about 10 per cent of the cost of a modern large-scale computer.
1,750 citations
02 Nov 2015
TL;DR: This paper designs a novel approximate multiplier to have an unbiased error distribution, which leads to lower computational errors in real applications because errors cancel each other out, rather than accumulate, as the multiplier is used repeatedly for a computation.
Abstract: Many applications for signal processing, computer vision and machine learning show an inherent tolerance to some computational error. This error resilience can be exploited to trade off accuracy for savings in power consumption and design area. Since multiplication is an essential arithmetic operation for these applications, in this paper we focus specifically on this operation and propose a novel approximate multiplier with a dynamic range selection scheme. We design the multiplier to have an unbiased error distribution, which leads to lower computational errors in real applications because errors cancel each other out, rather than accumulate, as the multiplier is used repeatedly for a computation. Our approximate multiplier design is also scalable, enabling designers to parameterize it depending on their accuracy and power targets. Furthermore, our multiplier benefits from a reduction in propagation delay, which enables its use on the critical path. We theoretically analyze the error of our design as a function of its parameters and evaluate its performance for a number of applications in image processing and machine classification. We demonstrate that our design can achieve power savings of 54% to 80%, while introducing bounded errors with a Gaussian distribution with near-zero average and standard deviations of 0.45% to 3.61%. We also report power savings of up to 58% when using the proposed design in applications. We show that our design significantly outperforms other approximate multipliers recently proposed in the literature.
231 citations
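The combination of dynamic range selection and an unbiased error distribution described in the abstract above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the paper's circuit: each operand keeps only the k bits starting at its leading one, and the lowest kept bit is forced to 1 so that the truncation error is roughly centred on zero. Whether this matches the paper's exact scheme is an assumption; the names `select_range` and `approx_mul` and the default k = 6 are hypothetical.

```python
def select_range(x: int, k: int) -> tuple[int, int]:
    """Return (segment, shift) such that x is approximated by segment << shift."""
    if x < (1 << k):
        return x, 0  # values that already fit in k bits are kept exactly
    shift = x.bit_length() - k
    # Keep the k bits below the leading one and force the lowest kept bit
    # to 1 (assumed unbiasing step: it stands in for the dropped bits).
    segment = (x >> shift) | 1
    return segment, shift


def approx_mul(a: int, b: int, k: int = 6) -> int:
    """Multiply two unsigned integers using only k-bit segments."""
    seg_a, shift_a = select_range(a, k)
    seg_b, shift_b = select_range(b, k)
    return (seg_a * seg_b) << (shift_a + shift_b)


if __name__ == "__main__":
    for a, b in [(5, 7), (1000, 1000), (40000, 123)]:
        exact = a * b
        approx = approx_mul(a, b)
        print(a, b, exact, approx, (approx - exact) / exact)
```

Because only a k × k multiplication is performed regardless of operand width, the parameter k is the knob that trades accuracy against hardware cost in this sketch.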
29 Mar 2015
TL;DR: The proposed cell library is intended to provide access to advanced technology node for universities and other research institutions, in order to design digital integrated circuits and also to develop cell-based design flows, EDA tools and associated algorithms.
Abstract: This paper presents the 15nm FinFET-based Open Cell Library (OCL) and describes the methodological challenges of designing a standard cell library for such an advanced technology node. The 15nm OCL is based on a generic predictive state-of-the-art technology node. The proposed cell library is intended to provide access to an advanced technology node for universities and other research institutions, in order to design digital integrated circuits and also to develop cell-based design flows, EDA tools and associated algorithms. Developing a 15nm standard cell library brings out design challenges which are not present in previous technology nodes. Some of these challenges include double-patterning for both metal and poly layers, a very restrictive set of physical design rules, and the demand for lithography-friendly patterns. This paper discusses the development of the library considering the challenges associated with advanced technology nodes.
194 citations
24 Feb 2014
TL;DR: This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware systems and yields an average performance gain of 2.7x on an NVIDIA GTX 560 GPU and 2.5x on an Intel Core i7 quad-core processor.
Abstract: Approximate computing is an approach where reduced accuracy of results is traded off for increased speed, throughput, or both. Loss of accuracy is not permissible in all computing domains, but there are a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results or even noticeable differences to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge with approximate computing is transparency to insulate both software and hardware developers from the time, cost, and difficulty of using approximation. This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware systems. Paraprox starts with a data-parallel kernel implemented using OpenCL or CUDA and creates a parameterized approximate kernel that is tuned at runtime to maximize performance subject to a target output quality (TOQ) that is supplied by the user. Approximate kernels are created by recognizing common computation idioms found in data-parallel programs (e.g., Map, Scatter/Gather, Reduction, Scan, Stencil, and Partition) and substituting approximate implementations in their place. Across a set of 13 soft data-parallel applications with at most 10% quality degradation, Paraprox yields an average performance gain of 2.7x on an NVIDIA GTX 560 GPU and 2.5x on an Intel Core i7 quad-core processor compared to accurate execution on each platform.
192 citations