Author

Harvinder Singh

Bio: Harvinder Singh is an academic researcher from STMicroelectronics. The author has contributed to research in topics: Multiplexer & Bandwidth (signal processing). The author has an h-index of 3 and has co-authored 4 publications receiving 140 citations.

Papers
Proceedings ArticleDOI
01 Feb 2017
TL;DR: A booming number of computer vision, speech recognition, and signal processing applications are increasingly benefiting from deep convolutional neural networks, with a DCNN significantly outperforming classical approaches for the first time in 2012.
Abstract: A booming number of computer vision, speech recognition, and signal processing applications are increasingly benefiting from the use of deep convolutional neural networks (DCNN) stemming from the seminal work of Y. LeCun et al. [1] and others that led to winning the 2012 ImageNet Large Scale Visual Recognition Challenge with AlexNet [2], a DCNN significantly outperforming classical approaches for the first time. In order to deploy these technologies in mobile and wearable devices, hardware acceleration plays a critical role in achieving real-time operation with very limited power consumption, with embedded memory overcoming the limitations of fully programmable solutions.

143 citations

Patent
29 Aug 2005
TL;DR: In this article, a minimal-area integrated polyphase interpolation filter that exploits coefficient symmetry for a channel of input data is proposed. It includes an input interface block that synchronizes the input signal to a first internal clock signal, a memory block that provides multiple delayed output signals, and a multiplexer input interface block that outputs a selected plurality of signals for generating mirror-image coefficient sets in response to a second set of internal control signals.
Abstract: A minimal-area integrated polyphase interpolation filter uses the symmetry of coefficients for a channel of input data. The filter includes an input interface block for synchronizing the input signal to a first internal clock signal; a memory block for providing multiple delayed output signals; a multiplexer input interface block for outputting a selected plurality of signals for generating mirror-image coefficient sets in response to a second set of internal control signals; a coefficient block for generating mirror-image and/or symmetric coefficient sets and outputting a plurality of filtered signals; an output multiplexer block for performing selection, gain control, and data-width control on said plurality of filtered signals; an output register block for synchronizing the filtered signals; and a control block for generating the clock signals that realize the filter and the delay between the two channels for accessing a coefficient set, thereby minimizing hardware in the filter.
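The saving the patent exploits is that a linear-phase FIR filter's impulse response is symmetric, so only half the coefficients need to be stored; the other half can be recreated by address reversal. The Python sketch below illustrates that idea in software under illustrative assumptions (tap count, cutoff, and all function names are ours, not from the patent): a polyphase interpolator is driven from only the first half of a symmetric, even-length coefficient set and checked against direct zero-stuffing with the full set.

```python
import numpy as np

def symmetric_lowpass(num_taps, cutoff):
    # Windowed-sinc prototype; linear phase, so h[n] == h[N-1-n].
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff * n) * np.hamming(num_taps)
    return h / h.sum()

def polyphase_interpolate(x, half_taps, factor):
    # Rebuild the full even-length tap set by mirroring the stored half:
    # the address-reversal trick that halves coefficient memory.
    h = np.concatenate([half_taps, half_taps[::-1]])
    y = np.zeros(len(x) * factor)
    for p in range(factor):  # one polyphase subfilter per output phase
        y[p::factor] = np.convolve(x, h[p::factor])[:len(x)]
    return y

M, N = 4, 32                               # interpolation factor, taps (even)
h = symmetric_lowpass(N, cutoff=0.5 / M)
x = np.random.default_rng(0).standard_normal(256)

# Reference: zero-stuff by M, then filter with the full coefficient set.
up = np.zeros(len(x) * M)
up[::M] = x
ref = np.convolve(up, h)[:len(x) * M]

y = polyphase_interpolate(x, h[:N // 2], M)   # only N/2 taps are stored
assert np.allclose(y, ref)
```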

10 citations

Proceedings ArticleDOI
29 May 2009
TL;DR: Introducing digital processing in this RF-dominated application to sort and assemble user channels removes the need for in-band SAW filters, offers full flexibility of channel selection, and supports up to 50 users simultaneously.
Abstract: Satellite digital TV broadcast reception today requires a multiple Low-Noise Block (multi-LNB) head on the dish, as well as a multi-tuner set-top box (STB). Connecting multiple OutDoor Units (ODU) to the set-top boxes traditionally required multiple cables. A first step has been achieved with so-called satellite Channel Stacking Switch™ (CSS) technology, able to deliver the full suite of TV programs to all STBs in a single home through a reduced number of cables. However, this purely analog/RF technology does not offer enough flexibility in terms of the number of simultaneous users (12 users maximum) and requires multiple external components such as SAW filters, significantly increasing the cost of the solution [1]. Introducing digital processing in this RF-dominated application to sort and assemble user channels removes the need for in-band SAW filters, offers full flexibility of channel selection, and supports up to 50 users simultaneously.
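Channel stacking amounts to digitally extracting each user's channel from the wideband input and re-placing it at an assigned slot on the single cable. As a rough illustration only (the sample rate, frequencies, bandwidth, and filter are our assumptions, not the paper's design), here is a minimal Python sketch of one such sort-and-assemble step:

```python
import numpy as np

fs = 1e9                                  # assumed sample rate, 1 GS/s
t = np.arange(4096) / fs

def lowpass(num_taps, cutoff_hz):
    # Windowed-sinc channel-selection filter (digital stand-in for the
    # in-band SAW filter that the digital approach removes).
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff_hz / fs * n) * np.hamming(num_taps)
    return h / h.sum()

def stack_channel(x, f_in, f_out, bw):
    # Sort: mix the selected user channel down to DC and filter it out.
    base = x * np.exp(-2j * np.pi * f_in * t)
    base = np.convolve(base, lowpass(129, bw / 2), mode="same")
    # Assemble: re-place it at its assigned slot on the output cable.
    return base * np.exp(2j * np.pi * f_out * t)

# Wideband input: the wanted channel at 210 MHz plus another user at 350 MHz.
x = (np.exp(2j * np.pi * 210e6 * t)
     + 0.5 * np.exp(2j * np.pi * 350e6 * t))
y = stack_channel(x, f_in=210e6, f_out=80e6, bw=40e6)
```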

3 citations

Journal ArticleDOI
TL;DR: In this article, a digital channel multiplexer for a satellite outdoor unit running at a 1 GHz clock frequency is implemented in 65 nm CMOS mixed-oxide dual-voltage technology, based on a 1 GS/s digital signal processor (DSP) approach with 500 MHz of input and output bandwidth.
Abstract: A digital channel multiplexer for a satellite outdoor unit running at a 1 GHz clock frequency is implemented in 65 nm CMOS mixed-oxide dual-voltage technology. This multiplexer, based on a 1 GS/s digital signal processor (DSP) approach with 500 MHz of input and output bandwidth, embeds two 8-bit 1 GS/s analog-to-digital converters (ADCs) and two 8-bit 1 GS/s digital-to-analog converters (DACs). It consumes less than 1022 mW at ambient temperature while achieving noise rejection of up to 42.5 dB on a single tone and > 37 dB on modulated satellite channels.

2 citations


Cited by
Journal ArticleDOI
23 Jan 2018
TL;DR: This comprehensive review summarizes the state of the art, challenges, and prospects of neuro-inspired computing with emerging nonvolatile memory devices and presents a device-circuit-algorithm codesign methodology to evaluate the impact of nonideal device effects on system-level performance.
Abstract: This comprehensive review summarizes the state of the art, challenges, and prospects of neuro-inspired computing with emerging nonvolatile memory devices. First, we discuss the demand for developing neuro-inspired architecture beyond today’s von Neumann architecture. Second, we summarize the various approaches to designing the neuromorphic hardware (digital versus analog, spiking versus nonspiking, online training versus offline training) and discuss why emerging nonvolatile memory is attractive for implementing the synapses in the neural network. Then, we discuss the desired device characteristics of the synaptic devices (e.g., multilevel states, weight update nonlinearity/asymmetry, variation/noise), and survey a few representative material systems and device prototypes reported in the literature that show the analog conductance tuning. These candidates include phase change memory, resistive memory, ferroelectric memory, floating-gate transistors, etc. Next, we introduce the crossbar array architecture to accelerate the weighted sum and weight update operations that are commonly used in neuro-inspired machine learning algorithms, and review recent progress in array-level experimental demonstrations for pattern recognition tasks. In addition, we discuss the peripheral neuron circuit design issues and present a device-circuit-algorithm codesign methodology to evaluate the impact of nonideal device effects on the system-level performance (e.g., learning accuracy). Finally, we give an outlook on the customization of the learning algorithms for efficient hardware implementation.
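The crossbar idea the review describes maps each synaptic weight to a device conductance, so a matrix-vector product is computed in one step as summed column currents. The toy Python sketch below shows one common way to model this for codesign studies; the conductance range, differential weight mapping, and Gaussian variation model are illustrative assumptions, not taken from any specific device in the review.

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4     # assumed device conductance range, in siemens

def weights_to_conductances(W):
    # Map signed weights onto a differential pair of conductance arrays,
    # since a single device cannot hold a negative weight.
    Wn = W / np.max(np.abs(W))                 # normalize to [-1, 1]
    g_pos = G_MIN + (G_MAX - G_MIN) * np.clip(Wn, 0, None)
    g_neg = G_MIN + (G_MAX - G_MIN) * np.clip(-Wn, 0, None)
    return g_pos, g_neg

def crossbar_matvec(W, v, sigma=0.05, rng=np.random.default_rng(0)):
    # Weighted sum as a Kirchhoff current sum per column, with Gaussian
    # device-to-device conductance variation of relative width sigma.
    g_pos, g_neg = weights_to_conductances(W)
    g_pos = g_pos * (1 + sigma * rng.standard_normal(g_pos.shape))
    g_neg = g_neg * (1 + sigma * rng.standard_normal(g_neg.shape))
    return v @ (g_pos - g_neg)

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 4))            # 8 inputs, 4 output columns
v = rng.standard_normal(8)                 # input voltage vector
ideal = (v @ W) / np.max(np.abs(W)) * (G_MAX - G_MIN)
noisy = crossbar_matvec(W, v)
print("relative error:", np.linalg.norm(noisy - ideal) / np.linalg.norm(ideal))
```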

730 citations

Journal ArticleDOI
01 Aug 2018
TL;DR: This Perspective argues that electronics is poised to enter a new era of scaling – hyper-scaling – driven by advances in beyond-Boltzmann transistors, embedded non-volatile memories, monolithic three-dimensional integration and heterogeneous integration techniques.
Abstract: In the past five decades, the semiconductor industry has gone through two distinct eras of scaling: the geometric (or classical) scaling era and the equivalent (or effective) scaling era. As transistor and memory features approach 10 nanometres, it is apparent that room for further scaling in the horizontal direction is running out. In addition, the rise of data abundant computing is exacerbating the interconnect bottleneck that exists in conventional computing architecture between the compute cores and the memory blocks. Here we argue that electronics is poised to enter a new, third era of scaling — hyper-scaling — in which resources are added when needed to meet the demands of data abundant workloads. This era will be driven by advances in beyond-Boltzmann transistors, embedded non-volatile memories, monolithic three-dimensional integration and heterogeneous integration techniques.

343 citations

Proceedings ArticleDOI
14 Oct 2017
TL;DR: The CirCNN architecture is proposed: a universal DNN inference engine that can be implemented on various hardware/software platforms with a configurable network architecture (e.g., layer type, size, scales), in which the FFT serves as the key computing kernel, ensuring universal and small-footprint implementations.
Abstract: Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve their energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of a rigorous guarantee of compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from $O(n^2)$ to $O(n \log n)$ and the storage complexity from $O(n^2)$ to $O(n)$, with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: DNNs based on CirCNN can converge to the same "effectiveness" as DNNs without compression. We propose the CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with a configurable network architecture (e.g., layer type, size, scales, etc.). In the CirCNN architecture: 1) due to its recursive property, the FFT can be used as the key computing kernel, which ensures universal and small-footprint implementations; 2) the compressed but regular network structure avoids the pitfalls of network pruning and facilitates high performance and throughput with a highly pipelined and parallel design. To demonstrate the performance and energy efficiency, we test CirCNN on FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results.
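The complexity reduction rests on the identity that multiplying by a circulant matrix is a circular convolution, which the FFT diagonalizes. A minimal Python check of that identity (the paper's block partitioning, training procedure, and hardware kernel are not reproduced; the function names here are ours):

```python
import numpy as np

def circulant(c):
    # Full n x n circulant matrix with first column c: C[i, j] = c[(i - j) % n].
    return np.array([np.roll(c, k) for k in range(len(c))]).T

def circulant_matvec_fft(c, x):
    # O(n log n): a circulant matvec is a circular convolution, i.e. an
    # elementwise product in the FFT domain.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

rng = np.random.default_rng(0)
c = rng.standard_normal(8)          # one block stores n values, not n^2
x = rng.standard_normal(8)

dense = circulant(c) @ x            # O(n^2) reference
fast = circulant_matvec_fft(c, x)   # O(n log n) FFT path
assert np.allclose(dense, fast)
```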

262 citations

Journal ArticleDOI
Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, Hoi-Jun Yoo
TL;DR: An energy-efficient deep neural network (DNN) accelerator, unified neural processing unit (UNPU), is proposed for mobile deep learning applications and is the first DNN accelerator ASIC that can support fully variable weight bit precision from 1 to 16 bit.
Abstract: An energy-efficient deep neural network (DNN) accelerator, the unified neural processing unit (UNPU), is proposed for mobile deep learning applications. The UNPU can support both convolutional layers (CLs) and recurrent or fully connected layers (FCLs) to accommodate versatile workload combinations and accelerate various mobile deep learning applications. In addition, the UNPU is the first DNN accelerator ASIC that supports fully variable weight bit precision from 1 to 16 bit, enabling it to operate at the accuracy-energy optimal point. Moreover, the lookup table (LUT)-based bit-serial processing element (LBPE) in the UNPU reduces energy consumption compared to a conventional fixed-point multiply-and-accumulate (MAC) array by 23.1%, 27.2%, 41%, and 53.6% for 16-, 8-, 4-, and 1-bit weight precision, respectively. Besides the energy efficiency improvement, the unified DNN core architecture of the UNPU improves the peak performance for CLs by 1.15 $\times$ compared to the previous work, allowing the UNPU to operate at a lower voltage and frequency for a given DNN to increase energy efficiency. The UNPU is implemented in 65-nm CMOS technology and occupies a $4 \times 4$ mm$^2$ die area. It operates from a 0.63- to 1.1-V supply voltage with a maximum frequency of 200 MHz, and has a peak performance of 345.6 GOPS for 16-bit weight precision and 7372 GOPS for 1-bit weight precision. This wide operating range lets the UNPU achieve a power efficiency of 3.08 TOPS/W for 16-bit weight precision and 50.6 TOPS/W for 1-bit weight precision. The functionality of the UNPU is successfully demonstrated on a verification system using an ImageNet deep CNN (VGG-16).
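The bit-serial idea behind fully variable weight precision is that the datapath consumes one weight bit-plane per step and shifts-and-accumulates the partial sums, so the same hardware serves any precision from 1 to 16 bit. The Python sketch below illustrates that arithmetic for two's-complement weights; it models the bit-serial schedule only, not the LUT-based LBPE circuit, and all names are ours.

```python
import numpy as np

def bit_serial_dot(acts, weights, bits):
    # Dot product computed one weight bit-plane per step, the way a
    # bit-serial datapath would; the two's-complement MSB is subtracted.
    w = weights.astype(np.int64)
    acc = np.int64(0)
    for b in range(bits):
        plane = (w >> b) & 1                    # current weight bit-plane
        partial = np.dot(acts, plane)           # only 1-bit "multiplies"
        acc = acc - (partial << b) if b == bits - 1 else acc + (partial << b)
    return acc

rng = np.random.default_rng(0)
acts = rng.integers(0, 16, size=64).astype(np.int64)
for bits in (1, 4, 8, 16):                      # fully variable precision
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    w = rng.integers(lo, hi + 1, size=64).astype(np.int64)
    assert bit_serial_dot(acts, w, bits) == np.dot(acts, w)
```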

225 citations

Journal ArticleDOI
TL;DR: Thinker is an energy-efficient reconfigurable hybrid-NN processor fabricated in 65-nm technology; its fused data-pattern-based multi-bank memory system exploits data reuse and guarantees parallel data access, improving computing throughput and energy efficiency.
Abstract: Hybrid neural networks (hybrid-NNs) have been widely used and have brought new challenges to NN processors. Thinker is an energy-efficient reconfigurable hybrid-NN processor fabricated in 65-nm technology. To achieve high energy efficiency, three optimization techniques are proposed. First, each processing element (PE) supports bit-width-adaptive computing to meet the various bit-widths of neural layers, which raises computing throughput by 91% and improves energy efficiency by $1.93 \times $ on average. Second, the PE array supports on-demand array partitioning and reconfiguration for processing different NNs in parallel, which results in a 13.7% improvement in PE utilization and improves energy efficiency by $1.11 \times $. Third, a fused data-pattern-based multi-bank memory system is designed to exploit data reuse and guarantee parallel data access, which improves computing throughput and energy efficiency by $1.11 \times $ and $1.17 \times $, respectively. Measurement results show that this processor achieves a peak energy efficiency of 5.09 TOPS/W.

185 citations