scispace - formally typeset
Author

Muhammad Husnain Mubarik

Bio: Muhammad Husnain Mubarik is an academic researcher from the University of Illinois at Urbana–Champaign. The author has contributed to research in the topics of Computer science and Software rendering. The author has an h-index of 2, having co-authored 2 publications receiving 8 citations.

Papers
Proceedings ArticleDOI
01 Oct 2020
TL;DR: This work explores the hardware cost of inference engines for popular classification algorithms in EGT and CNT-TFT printed technologies and determines that Decision Trees and SVMs provide a good balance between accuracy and cost and concludes that their area and power overhead must be reduced.
Abstract: A large number of application domains have requirements on cost, conformity, and non-toxicity that silicon-based computing systems cannot meet, but that may be met by printed electronics. For several of these domains, a typical computational task to be performed is classification. In this work, we explore the hardware cost of inference engines for popular classification algorithms (Multi-Layer Perceptrons, Support Vector Machines (SVMs), Logistic Regression, Random Forests and Binary Decision Trees) in EGT and CNT-TFT printed technologies and determine that Decision Trees and SVMs provide a good balance between accuracy and cost. We evaluate conventional Decision Tree and SVM architectures in these technologies and conclude that their area and power overhead must be reduced. We explore, through SPICE and gate-level hardware simulations and multiple working prototypes, several classifier architectures that exploit the unique cost and implementation tradeoffs in printed technologies - a) Bespoke printed classifiers that are customized to a model generated for a given application using specific training datasets, b) Lookup-based printed classifiers where key hardware computations are replaced by lookup tables, and c) Analog printed classifiers where some classifier components are replaced by their analog equivalents. Our evaluations show that bespoke implementations of EGT printed Decision Trees have 48.9× lower area (average) and 75.6× lower power (average) than their conventional equivalents; corresponding benefits for bespoke SVMs are 12.8× and 12.7×, respectively. Lookup-based Decision Trees outperform their non-lookup bespoke equivalents by 38% and 70%; lookup-based SVMs are better by 8% and 0.6%. Analog printed Decision Trees provide 437× area and 27× power benefits over digital bespoke counterparts; analog SVMs yield 490× area and 12× power improvements.
Our results and prototypes demonstrate feasibility of fabricating and deploying battery and self-powered printed classifiers in the application domains of interest.
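The bespoke idea above, baking a trained model's structure and constants directly into the implementation instead of storing and traversing a generic model, can be sketched in software terms. The tree, thresholds, and node encoding below are hypothetical, purely for illustration:

```python
def generic_tree(x, nodes):
    """Generic evaluator: walks a stored model.
    nodes[i] is either (feature, threshold, left_id, right_id) or a class label."""
    node = nodes[0]
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = nodes[left] if x[feature] <= threshold else nodes[right]
    return node

def bespoke_tree(x):
    """The same (hypothetical) model with structure and constants baked in:
    no model storage, just fixed comparisons."""
    if x[0] <= 2.5:
        return 0 if x[1] <= 1.0 else 1
    return 1

# Hypothetical trained model for the generic evaluator.
NODES = {0: (0, 2.5, 1, 2), 1: (1, 1.0, 3, 4), 2: 1, 3: 0, 4: 1}

for sample in ([1.0, 0.5], [1.0, 2.0], [3.0, 0.0]):
    assert generic_tree(sample, NODES) == bespoke_tree(sample)
```

In printed hardware the analogous win is that comparisons against fixed constants synthesize to much smaller logic than a generic model-memory datapath, which is where the reported average area and power savings come from.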

20 citations

Proceedings ArticleDOI
30 May 2020
TL;DR: This paper performs a design space exploration of printed microprocessor architectures over multiple parameters (datawidths, pipeline depth, etc.) and shows that the best cores outperform pre-existing cores by at least one order of magnitude in terms of power and area.
Abstract: Printed electronics holds the promise of meeting the cost and conformality needs of emerging disposable and ultra-low cost margin applications. Recent printed circuit technologies also have low supply voltages and can, therefore, be battery-powered. In this paper, we explore the design space of microprocessors implemented in such printing technologies - these printed microprocessors will be needed for battery-powered applications with requirements of low cost, conformality, and programmability. To enable this design space exploration, we first present standard cell libraries for the EGFET and CNT-TFT printed technologies - to the best of our knowledge, these are the first synthesis and physical design ready standard cell libraries for any low voltage printing technology. We then present an area, power, and delay characterization of several off-the-shelf low gate count microprocessors (Z80, light8080, ZPU, and openMSP430) in EGFET and CNT-TFT technologies. Our characterization shows that several printing applications can be feasibly targeted by battery-powered printed microprocessors. However, our results also show the need to significantly reduce the area and power of such printed microprocessors. We perform a design space exploration of printed microprocessor architectures over multiple parameters - datawidths, pipeline depth, etc. We show that the best cores outperform pre-existing cores by at least one order of magnitude in terms of power and area. Finally, we show that printing-specific architectural and low-level optimizations further improve the area and power characteristics of low voltage battery-compatible printed microprocessors. A program-specific ISA, for example, improves power and area by up to 4.18× and 1.93×, respectively. A crosspoint-based instruction ROM outperforms a RAM-based design by 5.77×, 16.8×, and 2.42×, respectively, in terms of power, area, and delay.
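A design space exploration like the one described can be sketched as a sweep over configuration knobs followed by Pareto filtering. The area/delay cost model below is invented purely for illustration; the paper characterizes synthesized netlists in the actual printed technologies:

```python
from itertools import product

def cost(datawidth, pipeline_depth):
    # Assumed toy model: area grows with both knobs; delay shrinks with both
    # (deeper pipelines raise clock rate, wider datapaths cut cycle counts).
    area = datawidth * 100 + pipeline_depth * 40
    delay = 50 / pipeline_depth + 200 / datawidth
    return area, delay

def pareto(points):
    """Keep points not weakly dominated in (area, delay) by a different point."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                       for q in points)]

configs = list(product([4, 8, 16, 32], [1, 2, 3]))      # (datawidth, depth)
evaluated = [(*cost(dw, pd), dw, pd) for dw, pd in configs]
frontier = pareto(evaluated)
assert any(p[2:] == (4, 1) for p in frontier)       # minimum-area point survives
assert not any(p[2:] == (8, 1) for p in frontier)   # dominated by (4, 2)
```

A real exploration would replace `cost` with synthesis results per configuration, but the sweep-and-filter structure is the same.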

15 citations

Proceedings ArticleDOI
11 Jun 2022
TL;DR: This paper proposes SpEaC, a coarse-grained reconfigurable spatial architecture, as an energy-efficient programmable processor for earable applications; it outperforms programmable cores modeled after the M4, M7, A53, and HiFi4 DSP by up to 99.3× and outperforms a low power Mali T628 MP6 GPU across all kernels.
Abstract: Earables such as earphones [15, 16, 73], hearing aids [28], and smart glasses [2, 14] are poised to be a prominent programmable computing platform in the future. In this paper, we ask the question: what kind of programmable hardware would be needed to support earable computing in the future? To understand hardware requirements, we propose EarBench, a suite of representative emerging earable applications with diverse sensor-based inputs and computation requirements. Our analysis of EarBench applications shows that, on average, there is a 3.97×-13.54× performance gap between the computational needs of EarBench applications and the performance of the microprocessors that several of today's programmable earable SoCs are based on; more complex microprocessors have unacceptable energy efficiency for earable applications. Our analysis also shows that EarBench applications are dominated by a small number of digital signal processing (DSP) and machine learning (ML)-based kernels that have significant computational similarity. We propose SpEaC, a coarse-grained reconfigurable spatial architecture, as an energy-efficient programmable processor for earable applications. SpEaC targets earable applications efficiently using a) a reconfigurable fixed-point multiply-and-add augmented reduction tree-based substrate with support for vectorized complex operations that is optimized for the earable ML and DSP kernel code and b) a tightly coupled control core for executing other code (including non-matrix computation, or non-multiply or add operations in the earable DSP kernel code). Unlike other CGRAs that typically target general-purpose computations, the SpEaC substrate is optimized for energy-efficient execution of the earable kernels at the expense of generality. Across all our kernels, SpEaC outperforms programmable cores modeled after the M4, M7, A53, and HiFi4 DSP by 99.3×, 32.5×, 14.8×, and 9.8×, respectively.
At 63 mW in 28 nm, the energy efficiency benefits are 1.55×, 9.04×, 68.3×, and 32.7×, respectively; energy efficiency benefits are 15.7×-1087× over a low power Mali T628 MP6 GPU.
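The multiply-and-add reduction-tree substrate at the core of the architecture can be illustrated in miniature: pairwise products feed a binary adder tree, so an N-term dot product completes in about log2(N) addition levels. Fixed-point formats and the vectorized complex operations are omitted in this sketch:

```python
def reduce_tree(values):
    """Sum values by pairwise reduction, the way a hardware adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels with an identity term
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def dot(a, b):
    products = [x * y for x, y in zip(a, b)]    # multiply stage (parallel)
    return reduce_tree(products)                # log2(N)-depth addition stage

assert dot([1, 2, 3, 4], [5, 6, 7, 8]) == 70   # 5 + 12 + 21 + 32
```

In hardware, the log-depth structure is what keeps latency low for the ML and DSP kernels that reduce many products per output.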

1 citation

Journal ArticleDOI
TL;DR: In this article, the authors investigate how the architecture of a family of chips influences how it is affected by supply and demand uncertainties, and they develop a model to analyze the impact of architectural techniques on supply chain costs under different regimes of uncertainty and evaluate what happens when these techniques are combined.
Abstract: Mitigating losses from supply and demand volatility in the semiconductor supply chain and market has traditionally been cast as a logistics and forecasting problem. We investigate how the architecture of a family of chips influences how it is affected by supply and demand uncertainties. We observe that semiconductor supply chains become fragile, in part, due to single demand paths, where one chip can satisfy only one demand. Chip architects can enable multiple paths to satisfy a chip demand, which improves supply chain resilience. Based on this observation, we study composition and adaptation as architectural strategies to improve resilience to volatility and also introduce a third strategy of dispersion. These strategies allow multiple paths to satisfy a given chip demand. We develop a model to analyze the impact of these architectural techniques on supply chain costs under different regimes of uncertainties and evaluate what happens when they are combined. We present several interesting and even counterintuitive observations about the product configurations and market conditions where these interventions are impactful and where they are not. In all, we show that product redesign supported by architectural changes can mitigate nearly half of the losses caused by supply and demand volatility. As far as we know, this is the first such investigation concerning chip architecture.
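The "multiple demand paths" observation can be illustrated with a toy Monte Carlo experiment (this is not the paper's model; all numbers and distributions are made up): when one flexible chip can satisfy either of two uncertain demands, expected unmet demand never exceeds that of two dedicated single-path chips with the same total supply.

```python
import random

random.seed(0)
SUPPLY = 100    # units built per product in the dedicated (single-path) case

def unmet_single_path(d1, d2):
    # Each chip design serves exactly one demand stream.
    return max(d1 - SUPPLY, 0) + max(d2 - SUPPLY, 0)

def unmet_multi_path(d1, d2):
    # One flexible pool of the same total size serves both streams.
    return max(d1 + d2 - 2 * SUPPLY, 0)

trials = [(random.gauss(100, 30), random.gauss(100, 30)) for _ in range(10_000)]
single = sum(unmet_single_path(a, b) for a, b in trials) / len(trials)
multi = sum(unmet_multi_path(a, b) for a, b in trials) / len(trials)
assert multi < single   # pooling reduces expected unmet demand here
```

Per trial, max(d1 + d2 - 2S, 0) never exceeds max(d1 - S, 0) + max(d2 - S, 0), so flexibility can only help; the gap is largest when demands are volatile or negatively correlated.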
Proceedings ArticleDOI
10 Mar 2023
TL;DR: The neural graphics processing cluster (NGPC) proposed in this paper is a scalable and flexible hardware architecture that directly accelerates the input encoding and multi-layer perceptron kernels through dedicated engines and supports a wide range of neural graphics applications.
Abstract: Rendering and inverse rendering techniques have recently attained powerful new capabilities and building blocks in the form of neural representations (NR), with derived rendering techniques quickly becoming indispensable tools next to classic computer graphics algorithms, covering a wide range of functions throughout the full pipeline from sensing to pixels. NRs have recently been used to directly learn the geometric and appearance properties of scenes that were previously hard to capture, and to re-synthesize photorealistic imagery based on this information, thereby promising simplifications and replacements for several complex traditional computer graphics problems and algorithms with scalable quality and predictable performance. In this work, we ask the question: Does neural graphics (graphics based on NRs) need hardware support? We studied four representative neural graphics applications (NeRF, NSDF, NVR, and GIA) showing that, if we want to render 4k resolution frames at 60 frames per second (FPS), there is a gap of ~1.51× to 55.50× between the desired performance and that of current GPUs. For AR and VR applications, there is an even larger gap of ~2-4 orders of magnitude (OOM) between the desired performance and the required system power. We identify that the input encoding and the multi-layer perceptron kernels are the performance bottlenecks, consuming 72.37%, 60.0%, and 59.96% of application time for multi-resolution hashgrid encoding, multi-resolution densegrid encoding, and low resolution densegrid encoding, respectively. We propose a neural graphics processing cluster (NGPC), a scalable and flexible hardware architecture that directly accelerates the input encoding and multi-layer perceptron kernels through dedicated engines and supports a wide range of neural graphics applications.
To achieve good overall application-level performance improvements, we also accelerate the remaining kernels by fusing them into a single kernel, leading to a ~9.94× speedup compared to previous optimized implementations [17], which is sufficient to remove this performance bottleneck. Our results show that NGPC gives up to 58.36× end-to-end application-level performance improvement. For multi-resolution hashgrid encoding, on average across the four neural graphics applications, the performance benefits are 12.94×, 20.85×, 33.73×, and 39.04× for hardware scaling factors of 8, 16, 32, and 64, respectively. Our results also show that, with multi-resolution hashgrid encoding, NGPC enables the rendering of 4k Ultra HD resolution frames at 30 FPS for NeRF and of 8k Ultra HD resolution frames at 120 FPS for all our other neural graphics applications.
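The multi-resolution hashgrid input encoding that the paper identifies as a bottleneck can be sketched as follows (an Instant-NGP-style scheme shown in 2D; the prime constants, table size, and feature width are illustrative, and interpolation and training are omitted):

```python
PRIMES = (1, 2_654_435_761)     # per-dimension hash primes (2D example)
TABLE_SIZE = 2 ** 14            # entries per level's hash table
FEATURES = 2                    # learned feature floats per entry

def corner_hash(ix, iy):
    return (ix * PRIMES[0] ^ iy * PRIMES[1]) % TABLE_SIZE

def encode(x, y, levels=(16, 32, 64), table=None):
    """Concatenate hashed corner features across resolutions for (x, y) in [0, 1)."""
    table = table or {}             # untrained table: features default to zeros
    out = []
    for res in levels:
        ix, iy = int(x * res), int(y * res)     # grid cell containing the point
        for dx in (0, 1):                        # the cell's four corners
            for dy in (0, 1):
                h = corner_hash(ix + dx, iy + dy)
                out.extend(table.get(h, [0.0] * FEATURES))
    return out

vec = encode(0.3, 0.7)
assert len(vec) == len((16, 32, 64)) * 4 * FEATURES   # levels x corners x features
```

Each query touches many scattered table entries per level, which is why this memory-gather-heavy step benefits from the dedicated engines the paper proposes.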

Cited by
Posted Content
TL;DR: SpAtten is presented, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access; it introduces novel cascade token pruning to prune away unimportant tokens in the sentence.
Abstract: The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing performance superior to convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, 347× speedup, and 1.4×, 3.2×, 1193×, 4059× energy savings over the A3 accelerator, MNNFast accelerator, TITAN Xp GPU, and Xeon CPU, respectively.
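The cascade token-pruning step can be sketched in software: accumulate per-token importance from attention probabilities, then keep only the top-k tokens for subsequent layers. The sentence, scores, and keep budget below are illustrative; SpAtten performs this ranking on the fly with a dedicated hardware top-k engine.

```python
def prune_tokens(tokens, attn_probs, keep):
    """attn_probs[i][j]: attention paid by query i to token j (one head, one layer)."""
    importance = [sum(row[j] for row in attn_probs)    # column sums: total
                  for j in range(len(tokens))]         # attention received
    ranked = sorted(range(len(tokens)),
                    key=lambda j: importance[j], reverse=True)
    kept = sorted(ranked[:keep])                       # preserve sentence order
    return [tokens[j] for j in kept]

tokens = ["the", "cat", "sat", "on", "a", "mat"]
attn = [[0.05, 0.30, 0.25, 0.05, 0.05, 0.30]] * 6      # same row for each query
assert prune_tokens(tokens, attn, keep=3) == ["cat", "sat", "mat"]
```

Because pruned tokens are dropped for all later layers, the savings cascade through the model, which is what distinguishes this from static weight pruning.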

104 citations

Proceedings ArticleDOI
01 Feb 2021
TL;DR: SpAtten, as discussed by the authors, leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access in NLP applications, and proposes cascade token pruning to prune away unimportant tokens in the sentence.
Abstract: The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing performance superior to convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, 347× speedup, and 1.4×, 3.2×, 1193×, 4059× energy savings over the A3 accelerator, MNNFast accelerator, TITAN Xp GPU, and Xeon CPU, respectively.

43 citations

Proceedings ArticleDOI
01 Oct 2020
TL;DR: This work explores the hardware cost of inference engines for popular classification algorithms in EGT and CNT-TFT printed technologies and determines that Decision Trees and SVMs provide a good balance between accuracy and cost and concludes that their area and power overhead must be reduced.
Abstract: A large number of application domains have requirements on cost, conformity, and non-toxicity that silicon-based computing systems cannot meet, but that may be met by printed electronics. For several of these domains, a typical computational task to be performed is classification. In this work, we explore the hardware cost of inference engines for popular classification algorithms (Multi-Layer Perceptrons, Support Vector Machines (SVMs), Logistic Regression, Random Forests and Binary Decision Trees) in EGT and CNT-TFT printed technologies and determine that Decision Trees and SVMs provide a good balance between accuracy and cost. We evaluate conventional Decision Tree and SVM architectures in these technologies and conclude that their area and power overhead must be reduced. We explore, through SPICE and gate-level hardware simulations and multiple working prototypes, several classifier architectures that exploit the unique cost and implementation tradeoffs in printed technologies - a) Bespoke printed classifiers that are customized to a model generated for a given application using specific training datasets, b) Lookup-based printed classifiers where key hardware computations are replaced by lookup tables, and c) Analog printed classifiers where some classifier components are replaced by their analog equivalents. Our evaluations show that bespoke implementations of EGT printed Decision Trees have 48.9× lower area (average) and 75.6× lower power (average) than their conventional equivalents; corresponding benefits for bespoke SVMs are 12.8× and 12.7×, respectively. Lookup-based Decision Trees outperform their non-lookup bespoke equivalents by 38% and 70%; lookup-based SVMs are better by 8% and 0.6%. Analog printed Decision Trees provide 437× area and 27× power benefits over digital bespoke counterparts; analog SVMs yield 490× area and 12× power improvements.
Our results and prototypes demonstrate feasibility of fabricating and deploying battery and self-powered printed classifiers in the application domains of interest.

20 citations

Proceedings ArticleDOI
01 Feb 2021
TL;DR: In this paper, the authors proposed a printed mixed-signal system, which substitutes complex and power-hungry conventional stochastic computing (SC) components by printed analog designs.
Abstract: Printed electronics (PE) offers flexible, extremely low-cost, and on-demand hardware due to its additive manufacturing process, enabling emerging ultra-low-cost applications, including machine learning applications. However, large feature sizes in PE limit the complexity of a machine learning classifier (e.g., a neural network (NN)) in PE. Stochastic computing Neural Networks (SC-NNs) can reduce area in silicon technologies, but still require complex designs due to unique implementation tradeoffs in PE. In this paper, we propose a printed mixed-signal system, which substitutes complex and power-hungry conventional stochastic computing (SC) components with printed analog designs. The printed mixed-signal SC design consumes only 35% of the power and requires only 25% of the area of a conventional 4-bit NN implementation. We also show that the proposed mixed-signal SC-NN provides good accuracy for popular neural network classification problems. We consider this work an important step towards the realization of printed SC-NN hardware for near-sensor processing.
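The stochastic computing primitive that such designs build on is easy to sketch: a value in [0, 1] is encoded as the probability of a 1 in a bitstream, and multiplication reduces to a bitwise AND of two independent streams. The mixed-signal approach in the paper replaces such components with printed analog circuits; this shows only the digital SC baseline.

```python
import random

def to_stream(p, n, rng):
    """Encode p in [0, 1] as an n-bit stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    return sum(bits) / len(bits)

rng = random.Random(42)
a = to_stream(0.5, 4096, rng)
b = to_stream(0.25, 4096, rng)
product = from_stream([x & y for x, y in zip(a, b)])   # SC multiply: bitwise AND
assert abs(product - 0.5 * 0.25) < 0.05                # close to 0.125
```

The appeal for low-complexity technologies is that a multiplier becomes a single AND gate, at the cost of long bitstreams and stochastic error.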

7 citations

Proceedings ArticleDOI
29 Oct 2022
TL;DR: A stochastic aging model is developed to describe the behavior of aged printed resistors, and the training objective is modified to consider the expected loss over the lifetime of the device, providing acceptable accuracy over the device lifetime.
Abstract: Printed electronics allows for ultra-low-cost circuit fabrication with unique properties such as flexibility, non-toxicity, and stretchability. Because of these advanced properties, there is a growing interest in adapting printed electronics for emerging areas such as fast-moving consumer goods and wearable technologies. In such domains, analog signal processing in or near the sensor is favorable. Printed neuromorphic circuits have recently been proposed as a solution to perform such analog processing natively. Additionally, their learning-based design process allows high efficiency in their optimization and enables them to mitigate the high process variations associated with low-cost printed processes. In this work, we address the aging of the printed components. This effect can significantly degrade the accuracy of printed neuromorphic circuits over time. To this end, we develop a stochastic aging model to describe the behavior of aged printed resistors and modify the training objective by considering the expected loss over the lifetime of the device. This approach ensures acceptable accuracy over the device lifetime. Our experiments show that an overall 35.8% improvement in terms of expected accuracy over the device lifetime can be achieved using the proposed learning approach.
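The modified training objective can be sketched as follows: instead of minimizing the loss at nominal parameter values, minimize the loss expected over sampled aged values. The one-parameter model and drift law below are made up for illustration; the paper's aging model and neuromorphic circuits are far richer.

```python
import random

def loss(weight, data):
    # Squared error of a one-parameter model y = weight * x.
    return sum((x * weight - y) ** 2 for x, y in data) / len(data)

def aged(weight, t, rng):
    # Assumed drift law: the effective weight decays ~1% per time unit,
    # with multiplicative noise (both numbers invented).
    return weight * (1 - 0.01 * t) * rng.gauss(1.0, 0.02)

def expected_lifetime_loss(weight, data, lifetime=10, samples=200, seed=0):
    rng = random.Random(seed)
    draws = [aged(weight, rng.uniform(0, lifetime), rng) for _ in range(samples)]
    return sum(loss(w, data) for w in draws) / samples

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # ideal weight is exactly 2.0
grid = [w / 100 for w in range(150, 260)]
nominal = min(grid, key=lambda w: loss(w, data))
aged_aware = min(grid, key=lambda w: expected_lifetime_loss(w, data))
# The aged-aware objective picks a slightly larger weight that stays accurate
# as the device drifts downward over its lifetime.
assert nominal == 2.0 and aged_aware > nominal
```

Gradient-based training of real circuit parameters would replace the grid search, but the objective, an expectation of the loss over the aging distribution, is the same idea.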

6 citations