Showing papers by "Massoud Pedram" published in 2017


Journal ArticleDOI
TL;DR: Four 4:2 compressors that can switch between exact and approximate operating modes are proposed; using them in the structures of parallel multipliers provides configurable multipliers whose accuracies may change dynamically at runtime.
Abstract: In this paper, we propose four 4:2 compressors, which have the flexibility of switching between the exact and approximate operating modes. In the approximate mode, these dual-quality compressors provide higher speeds and lower power consumptions at the cost of lower accuracy. Each of these compressors has its own level of accuracy in the approximate mode as well as different delays and power dissipations in the approximate and exact modes. Using these compressors in the structures of parallel multipliers provides configurable multipliers whose accuracies (as well as their powers and speeds) may change dynamically during the runtime. The efficiencies of these compressors in a 32-bit Dadda multiplier are evaluated in a 45-nm standard CMOS technology by comparing their parameters with those of the state-of-the-art approximate multipliers. The results of comparison indicate, on average, 46% and 68% lower delay and power consumption in the approximate mode. Also, the effectiveness of these compressors is assessed in some image processing applications.
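For orientation, the exact operating mode of a 4:2 compressor can be sketched as two cascaded full adders, which is the textbook construction. This is only an illustration of the exact mode: the four approximate-mode designs are not specified in the abstract and are not modeled here, and the function names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) for three input bits."""
    total = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return total, carry


def exact_compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """Exact 4:2 compressor built from two full adders (textbook form).

    Satisfies x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)       # first stage; cout goes to the next column
    total, carry = full_adder(s1, x4, cin)  # second stage absorbs x4 and the incoming carry
    return total, carry, cout
```

A dual-quality design would select between this exact path and a simplified approximate path at runtime; the selection logic depends on details the listing does not give.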

185 citations


Journal ArticleDOI
TL;DR: A high-speed yet energy-efficient approximate multiplier is proposed that rounds the operands to the nearest exponent of two, improving speed and energy consumption at the price of a small error.
Abstract: In this paper, we propose an approximate multiplier that is high speed yet energy efficient. The approach is to round the operands to the nearest exponent of two. This way, the computationally intensive part of the multiplication is omitted, improving speed and energy consumption at the price of a small error. The proposed approach is applicable to both signed and unsigned multiplications. We propose three hardware implementations of the approximate multiplier, including one for unsigned and two for signed operations. The efficiency of the proposed multiplier is evaluated by comparing its performance with those of some approximate and accurate multipliers using different design parameters. In addition, the efficacy of the proposed approximate multiplier is studied in two image processing applications, i.e., image sharpening and smoothing.
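The rounding idea can be sketched in software as follows. This is a minimal sketch for unsigned integers and rests on assumptions not stated in the abstract: it uses the identity A·B = Ar·B + Br·A − Ar·Br + (A − Ar)(B − Br), where Ar and Br are the operands rounded to the nearest power of two, and drops the last term so that only shifts and additions remain. The tie-breaking rule (round up) and the function names are hypothetical choices.

```python
def round_to_power_of_two(x: int) -> int:
    """Round a positive integer to the nearest power of two (ties round up: an assumption)."""
    if x <= 0:
        raise ValueError("x must be positive")
    lower = 1 << (x.bit_length() - 1)
    upper = lower << 1
    return lower if (x - lower) < (upper - x) else upper


def approx_multiply(a: int, b: int) -> int:
    """Approximate unsigned a * b using only shifts, adds, and one subtract.

    Omits the (a - ar) * (b - br) term of the exact product, which is the
    only part that would need a real multiplier.
    """
    if a == 0 or b == 0:
        return 0
    shift_a = round_to_power_of_two(a).bit_length() - 1  # log2 of rounded a
    shift_b = round_to_power_of_two(b).bit_length() - 1  # log2 of rounded b
    return (b << shift_a) + (a << shift_b) - (1 << (shift_a + shift_b))
```

Under these assumptions the result is exact whenever either operand is already a power of two, and the absolute error otherwise equals |(a − ar)(b − br)|.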

109 citations


Journal ArticleDOI
TL;DR: This paper presents a row-based design methodology covering cell placement, clock tree synthesis, and routing steps for large SFQ circuits; with it, the overall chip area can be reduced by 27% compared with the results of a conventional CMOS placement accompanied by an H-tree clock network.
Abstract: This paper presents a row-based design methodology covering cell placement, clock tree synthesis, and routing steps for large SFQ circuits. The proposed placement tool initiates by running a state-of-the-art CMOS placer, which places fixed-height but variable-width cells in rows on the chip. Cells in each row are then grouped together such that each group contains at most $k$ cells with the same logic level. Next, for clock routing, this paper proposes HL-tree, which adopts an H-tree with passive transmission line connections to distribute the clock to groups, and within each group, a linear path composed of splitters and Josephson transmission lines (JTLs) provides the clock to cells. Increasing $k$ reduces the chip area, but also may incur a performance loss. To evaluate the effectiveness of the proposed approach, place-and-route results of a 32-bit Kogge–Stone adder for different values of $k$ are reported. By using this new design methodology, the overall chip area can be reduced by 27% compared with the results of a conventional CMOS placement accompanied by an H-tree clock network.
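The grouping step can be illustrated with a short sketch. It assumes, beyond what the abstract states, that each cell's logic level is known as an integer and that a group is a run of adjacent cells in placement order; the function name and data layout are hypothetical.

```python
def group_row_cells(logic_levels: list[int], k: int) -> list[list[int]]:
    """Split one placed row into groups of at most k adjacent cells sharing a logic level.

    logic_levels[i] is the logic level of the i-th cell in the row (left to right).
    Returns groups as lists of cell indices.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    groups: list[list[int]] = []
    for index, level in enumerate(logic_levels):
        last = groups[-1] if groups else None
        # Extend the current group only if the level matches and the group is not full.
        if last and logic_levels[last[0]] == level and len(last) < k:
            last.append(index)
        else:
            groups.append([index])
    return groups
```

With a larger $k$, fewer groups are produced, so fewer H-tree endpoints are needed, which is consistent with the area trend the abstract reports.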

65 citations


Proceedings ArticleDOI
27 Mar 2017
TL;DR: A high-speed yet energy-efficient approximate divider is presented in which division is performed by multiplying the truncated value of the dividend by the approximate inverse of the divisor.
Abstract: In this paper, we present a high speed yet energy efficient approximate divider where the division operation is performed by multiplying the dividend by the inverse of the divisor. In this structure, the truncated value of the dividend is multiplied exactly (approximately) by the approximate inverse value of the divisor. To assess the efficacy of the proposed divider, its design parameters are extracted and compared to those of a number of prior art dividers in a 45nm CMOS technology. Results reveal that this structure provides 66% and 52% improvements in the area and energy consumption, respectively, compared to the most advanced prior art approximate divider. In addition, delay and energy consumption of the division operation are reduced by about 94.4% and 99.93%, respectively, compared to those of an exact SRT radix-4 divider. Finally, the efficacy of the proposed divider in an image processing application is studied.

43 citations


Journal ArticleDOI
TL;DR: An energy efficient approximate multiplier design obtained by truncating the input operands is proposed, with an output quality-tunable multiplier providing ability to change the output quality during the multiplication operation.

34 citations


Proceedings ArticleDOI
01 Feb 2017
TL;DR: A partitioned FinFET RF is proposed that is able to save 39% and 54% of the RF leakage and the dynamic energy, respectively, and suffers less than 2% performance overhead.
Abstract: GPU adoption for general purpose computing has been accelerating. To support a large number of concurrently active threads, GPUs are provisioned with a very large register file (RF). The RF power consumption is a critical concern. One option to reduce the power consumption dramatically is to use near-threshold voltage (NTV) to operate the RF. However, operating MOSFET devices at NTV is fraught with stability and reliability concerns. The adoption of FinFET devices in chip industry is providing a promising path to operate the RF at NTV while satisfactorily tackling the stability and reliability concerns. However, the fundamental problem of NTV operation, namely slow access latency, remains. To tackle this challenge, in this paper we propose to build a partitioned RF using FinFET technology. The partitioned RF design exploits our observation that applications exhibit strong preference to utilize a small subset of their registers. One way to exploit this behavior is to cache the RF content as has been proposed in recent works. However, caching leads to unnecessary area overheads since a fraction of the RF must be replicated. Furthermore, we show that caching is not efficient as we increase the number of issued instructions per cycle, which is the expected trend in GPU designs. The proposed partitioned RF splits the registers into two partitions: the highly accessed registers are stored in a small RF that switches between high and low power modes. We use the FinFET's back gate control to provide low overhead switching between the two power modes. The remaining registers are stored in a large RF partition that always operates at NTV. The assignment of the registers to the two partitions will be based on statistics collected by a hybrid profiling technique that combines the compiler based profiling and the pilot warp profiling technique proposed in this paper. The partitioned FinFET RF is able to save 39% and 54% of the RF leakage and the dynamic energy, respectively, and suffers less than 2% performance overhead.

30 citations


Proceedings ArticleDOI
01 Jun 2017
TL;DR: In this article, the authors present designs of rapid single flux quantum (RSFQ) logic cells comprising multiple inputs and high-fanout splitters, which are subsequently utilized by an automated logic synthesis flow to reduce the area/power costs and logic depth of RSFQ logic circuits.
Abstract: This paper presents designs of rapid single flux quantum (RSFQ) logic cells comprising multiple (more than two) inputs and high-fanout splitters, which are subsequently utilized by an automated logic synthesis flow to reduce the area/power costs and logic depth of RSFQ logic circuits. The complex cells include multi-input AND and OR gates as well as specialized gates such as A+BC. The logic synthesis tool builds on the ABC tool, but adds features and capabilities that are unique to RSFQ circuits. Results show a sizeable reduction in the Area-Delay product of large multiplexer, decoder, and carry-lookahead adder circuits while exhibiting small (manageable) degradation in margins.

24 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: A method to modify the standard RSFQ cells at their input/output interfaces to other cells in order to support a multiple-fanout drive capability, focusing on the clock signal driving more than one cell without the use of splitters.
Abstract: Rapid Single Flux Quantum (RSFQ) logic cells have traditionally been limited to driving one fanout cell because of the difficulty in distributing the single flux quantum pulse to multiple fanouts. This paper presents a method to modify the standard RSFQ cells at their input/output interfaces to other cells in order to support a multiple-fanout drive capability. This capability is especially useful for clock distribution in RSFQ logic, because RSFQ logic requires the clock signal to be provided to every logic gate. This is why this paper focuses on the clock signal driving more than one cell without the use of splitters. The potential tradeoff is in lower margins for the cells. However, by careful design of the RSFQ cells, the yield is not compromised by our proposed technique.

24 citations


Proceedings ArticleDOI
27 Mar 2017
TL;DR: The results of the study reveal that using the proposed framework provides, on average, 17X higher output accuracy compared to cases in which the impact of the process variation is not considered at all.
Abstract: In this paper, an approach for increasing the sustainability of inverter-based memristive neuromorphic circuits in the presence of process variation is presented. The approach works by extracting the impact of process variations on the neurons' characteristics during the test phase through a proposed algorithm. In this method, first, some combinations of inputs and weights (based on the neuromorphic circuit structure) are injected into the circuit and the features of the neurons are determined. Next, these features, which are back-annotated, are utilized in an efficient ex-situ training approach to determine the proper weights of the neurons. The approach provides a considerable improvement in the output accuracy. To evaluate the effectiveness of the proposed approach, some approximate applications are studied using 90nm CMOS technology. The results of the study reveal that using this framework provides, on average, 17X higher output accuracy compared to cases in which the impact of the process variation is not considered at all.

18 citations


Posted Content
TL;DR: In this paper, the authors proposed a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded platforms with reduced asymptotic complexity of both computation and storage.
Abstract: Deep learning has demonstrated its power in many application domains, especially in image and speech recognition. As the backbone of deep learning, deep neural networks (DNNs) consist of multiple layers of various types with hundreds to thousands of neurons. Embedded platforms are now becoming essential for deep learning deployment due to their portability, versatility, and energy efficiency. The large model size of DNNs, while providing excellent accuracy, also burdens the embedded platforms with intensive computation and storage. Researchers have investigated reducing DNN model size with negligible accuracy loss. This work proposes a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded platforms with reduced asymptotic complexity of both computation and storage, distinguishing our approach from existing approaches. We develop the training and inference algorithms based on FFT as the computing kernel and deploy the FFT-based inference model on embedded platforms, achieving extraordinary processing speed.
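To make the complexity claim concrete, the sketch below shows one well-known way an FFT can reduce both computation and storage for a layer's matrix-vector product: if the weight matrix is circulant, it is fully described by its first column (O(n) storage) and the product becomes a circular convolution computed in O(n log n). The abstract does not state the weight structure used, so the circulant assumption, the function names, and the pure-Python radix-2 FFT (used only to keep the example self-contained) are all illustrative choices.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec(first_column, x):
    """Compute C @ x for circulant C (C[i][j] = first_column[(i - j) % n]) via FFT."""
    n = len(x)
    product = [a * b for a, b in zip(fft(first_column), fft(x))]
    return [v.real / n for v in fft(product, inverse=True)]


def dense_circulant_matvec(first_column, x):
    """Reference O(n^2) product against the explicitly indexed circulant matrix."""
    n = len(x)
    return [sum(first_column[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]
```

In practice a library FFT would replace the hand-written one; the dense reference is included only to check that both paths agree.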

18 citations


Journal ArticleDOI
TL;DR: An accuracy-aware operating voltage management unit is presented that improves the lifetime of processors by considering the error-resilient nature of some applications and dynamically adjusting the minimum acceptable operating voltage based on the impact of aging mechanisms.

Journal ArticleDOI
TL;DR: TEI-power is presented, a dynamic voltage and frequency scaling-based dynamic thermal management technique that considers the TEI phenomenon and also the superlinear dependencies of power consumption components on the temperature and outlines a real-time trade-off between delay and power consumption as a function of the chip temperature to provide significant energy savings.
Abstract: FinFETs have emerged as a promising replacement for planar CMOS devices in sub-20nm technology nodes. However, based on the temperature effect inversion (TEI) phenomenon observed in FinFET devices, the delay characteristics of FinFET circuits in sub-, near-, and superthreshold voltage regimes may be fundamentally different from those of CMOS circuits with nominal voltage operation. For example, FinFET circuits may run faster at higher temperatures. Therefore, the existing CMOS-based and TEI-unaware dynamic power and thermal management techniques would not be applicable. In this article, we present TEI-power, a dynamic voltage and frequency scaling-based dynamic thermal management technique that considers the TEI phenomenon and also the superlinear dependencies of power consumption components on the temperature and outlines a real-time trade-off between delay and power consumption as a function of the chip temperature to provide significant energy savings, with no performance penalty: namely, up to 42% energy savings for small circuits where the logic cell delay is dominant and up to 36% energy savings for larger circuits where the interconnect delay is considerable.

Journal ArticleDOI
01 Oct 2017
TL;DR: This study addresses the problem of concurrent task scheduling and storage management for residential energy consumers with PV and storage systems, in order to minimise the electric bill using a negotiation-based iterative approach and a near-optimal storage control algorithm.
Abstract: Dynamic energy pricing policy introduces real-time power-consumption-reflective pricing in the smart grid in order to incentivise energy consumers to schedule electricity-consuming applications (tasks) more prudently to minimise electric bills. This has become a particularly interesting problem with the availability of photovoltaic (PV) power generation facilities and controllable energy storage systems. This study addresses the problem of concurrent task scheduling and storage management for residential energy consumers with PV and storage systems, in order to minimise the electric bill. A general type of dynamic pricing scenario is assumed where the energy price is both time-of-use and power dependent. Tasks are allowed to support suspend-now and resume-later operations. A negotiation-based iterative approach has been proposed. In each iteration, all tasks are ripped-up and rescheduled under a fixed storage charging/discharging scheme, and then the storage control scheme is derived based on the latest task scheduling. The concept of congestion is introduced to gradually adjust the schedule of each task, whereas dynamic programming is used to find the optimal schedule. A near-optimal storage control algorithm is effectively implemented. Experimental results demonstrate that the proposed algorithm can achieve up to 60.95% in the total energy cost reduction compared with various baseline methods.

Journal ArticleDOI
TL;DR: This paper addresses the co-scheduling problem of HVAC control and HEES system management to achieve energy-efficient smart buildings, while also accounting for the degradation of the battery state-of-health during charging and discharging operations (which determines the amortized cost of owning and utilizing a battery storage system).

Journal ArticleDOI
TL;DR: A modified carry select adder (CSLA) structure that is more power/energy- and area-efficient than existing CSLAs is proposed and assessed using HSPICE simulations based on a 45nm bulk CMOS technology.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: The 6T SRAM cell design for gate-all-around nanowire transistors using a device-circuit co-optimization framework is investigated, and read and write assist techniques are studied to relieve the negative impact of low on-currents on SRAM stabilities incurred by nanowire channels.
Abstract: Gate-all-around nanowire transistor is deemed as one of the most promising solutions that enables continued CMOS scaling. Compared with FinFET, it further suppresses short-channel effects by providing superior electrostatic control over the channel. Due to the unique device structure, gate-all-around nanowire transistor also allows more efficient layout design by exploiting 3-dimensional stacking configurations. In this paper, we investigate the 6T SRAM cell design for gate-all-around nanowire transistors using a device-circuit co-optimization framework. At the device level, TCAD simulation and current source modeling method are applied to extract the model. Layout designs with horizontal, lateral, vertical stacking device structures are explored. At the circuit level, read and write assist techniques are studied to relieve the negative impact of low on-currents on SRAM stabilities incurred by nanowire channels. Operating at 300 mV, assist techniques can increase the read static noise margin and the write static noise margin of 6T SRAM up to 82% and 92%, respectively.

Journal ArticleDOI
TL;DR: A hybrid TFET-MOSFET soft-error resilient and low power master-slave flip-flop is introduced, exploiting advantages of TFETs and MOSFETs to increase the reliability of low power digital circuits in the presence of soft errors.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A variational inference based Bayesian neural network is proposed as the solution method, which implicitly finds a proper balance between exploration and exploitation in a cache-enabled multitier heterogeneous cellular network.
Abstract: Aggressive network densification in next generation cellular networks is accompanied by an increase of the system energy consumption and calls for more advanced power management techniques in base stations. In this paper, we present a novel proactive and decentralized power management method for small cell base stations in a cache-enabled multitier heterogeneous cellular network. User contexts are utilized to drive the decision of dynamically switching a small cell base station between the active mode and the sleep mode to minimize the total energy consumption. The online control problem is formulated as a contextual multi-armed bandit problem. A variational inference based Bayesian neural network is proposed as the solution method, which implicitly finds a proper balance between exploration and exploitation. Experimental results show that the proposed solution can achieve up to 46.9% total energy reduction compared to baseline algorithms in the high density deployment scenario and has comparable performance to an offline optimal solution.

Posted Content
TL;DR: In this article, an FPGA implementation of adaptive independent component analysis (ICA) is presented, which can be used in various machine learning problems that use stochastic gradient descent optimization.
Abstract: Independent Component Analysis (ICA) is a dimensionality reduction technique that can boost efficiency of machine learning models that deal with probability density functions, e.g. Bayesian neural networks. Algorithms that implement adaptive ICA converge slower than their nonadaptive counterparts, however, they are capable of tracking changes in underlying distributions of input features. This intrinsically slow convergence of adaptive methods combined with existing hardware implementations that operate at very low clock frequencies necessitate fundamental improvements in both algorithm and hardware design. This paper presents an algorithm that allows efficient hardware implementation of ICA. Compared to previous work, our FPGA implementation of adaptive ICA improves clock frequency by at least one order of magnitude and throughput by at least two orders of magnitude. Our proposed algorithm is not limited to ICA and can be used in various machine learning problems that use stochastic gradient descent optimization.

Journal ArticleDOI
TL;DR: The results show that the proposed technique leads to, on average, $18\times$ higher speed compared to that of the state-of-the-art technique, at the cost of a 1.7% loss in precision compared to that of the Monte-Carlo method.
Abstract: In this brief, we propose an effective adaptation of viability analysis in statistical static timing analysis. The adaptation benefits well from a dynamic programming implementation of the viability function. For a rapid identification of statistical longest true paths, the technique makes use of a fast preprocessing step identifying the gates with a small probability of being viable in the circuit, and a number of simple optimization techniques. This makes the approach fast without lowering its accuracy. The efficacy of the proposed statistical timing analysis is assessed using ISCAS benchmark circuits and carry skip adders. The results show that the proposed technique leads to, on average, $18\times$ higher speed compared to that of the state-of-the-art technique. This improvement is achieved at the cost of a 1.7% loss in precision compared to that of the Monte-Carlo method.

Proceedings ArticleDOI
29 Mar 2017
TL;DR: In this article, a cost minimization problem in the form of mixed integer linear programming is formulated to optimally find the amount of charging power and/or the battery pack to swap in at the same time for each EV that requests service.
Abstract: Battery charging and battery swapping are the two major techniques to re-energize an electric vehicle. This paper for the first time looks into the joint control problem in a battery charging and swapping station that adopts both re-energizing techniques. A cost minimization problem in the form of mixed integer linear programming is formulated to optimally find the amount of charging power and/or the battery pack to swap in at the same time for each EV that requests service. The utility cost under a dynamic energy pricing scheme and the cost associated with battery aging are carefully modeled. Experimental results show that the proposed optimization framework consistently reduces the total energy cost by 80%, 11% and 63% compared with three baseline algorithms.

Proceedings ArticleDOI
10 Jul 2017
TL;DR: In this paper, an FPGA implementation of adaptive independent component analysis (ICA) is presented, which can be used in various machine learning problems that use stochastic gradient descent optimization.
Abstract: Independent Component Analysis (ICA) is a dimensionality reduction technique that can boost efficiency of machine learning models that deal with probability density functions, e.g. Bayesian neural networks. Algorithms that implement adaptive ICA converge slower than their nonadaptive counterparts, however, they are capable of tracking changes in underlying distributions of input features. This intrinsically slow convergence of adaptive methods combined with existing hardware implementations that operate at very low clock frequencies necessitate fundamental improvements in both algorithm and hardware design. This paper presents an algorithm that allows efficient hardware implementation of ICA. Compared to previous work, our FPGA implementation of adaptive ICA improves clock frequency by at least one order of magnitude and throughput by at least two orders of magnitude. Our proposed algorithm is not limited to ICA and can be used in various machine learning problems that use stochastic gradient descent optimization.

Journal ArticleDOI
01 Oct 2017
TL;DR: This paper investigates a service level agreements (SLAs)-based resource allocation problem in a server cluster and proposes a near-optimal solution comprised of a central manager and distributed local agents, thereby achieving a desirable tradeoff between service response time and power consumption.
Abstract: This paper investigates a service level agreements (SLAs)-based resource allocation problem in a server cluster. The objective is to maximise the total profit, which is the total revenue minus the operational cost of the server cluster. The total revenue depends on the average request response time, whereas the operating cost depends on the total energy consumption of the server cluster. A joint optimisation framework is proposed, comprised of request dispatching, dynamic voltage and frequency scaling (DVFS) for individual cores of the servers, as well as server- and core-level consolidations. Each DVFS-enabled core in the server cluster is modelled by using a continuous-time Markov decision process (CTMDP). A near-optimal solution comprised of a central manager and distributed local agents is presented. Each local agent employs linear programming-based CTMDP solving method to solve the DVFS problem for the corresponding core. On the other hand, the central manager solves the request dispatch problem and finds the optimal number of ON cores and servers, thereby achieving a desirable tradeoff between service response time and power consumption. To reduce the computational overhead, a two-tier hierarchical solution is utilized. Experimental results demonstrate the outstanding performance of the proposed algorithm over the baseline algorithms.

Journal ArticleDOI
TL;DR: An SoH-aware charging aggregator design is presented, which decides the control sequences of a group of PEVs, and experimental results show that the proposed optimal charging algorithm minimizes the combination of electricity cost and battery aging cost in the RS provisioning power market.
Abstract: Plug-in electric vehicles (PEVs) are considered the key to reducing fossil fuel consumption and an important part of the smart grid. The plug-in electric vehicle-to-grid (V2G) technology in the smart grid infrastructure enables energy flow from PEV batteries to the power grid so that the grid stability is enhanced and the peak power demand is shaped. PEV owners will also benefit from V2G technology, as they will be able to reduce energy cost through proper PEV charging and discharging scheduling. Moreover, power regulation service (RS) reserves have been playing an increasingly important role in modern power markets. It has been shown that by providing RS reserves, the power grid achieves a better match between energy supply and demand in the presence of volatile and intermittent renewable energy generation. This article starts with the problem of PEV charging under dynamic energy pricing, properly taking into account the degradation of battery state-of-health (SoH) during V2G operations as well as RS provisioning. An overall optimization throughout the whole parking period is proposed for the PEV and an adaptive control framework is presented to dynamically update the optimal charging/discharging decision at each hour to mitigate the effect of RS tracking error. As more and more PEVs are being plugged into the power grid, the control or management issue of PEV charging arises, since mass unregulated charging processes of PEVs may result in degradation of power quality and damage utility equipment and customer appliances. To solve this problem, this article also presents an SoH-aware charging aggregator design, which decides the control sequences of a group of PEVs. An energy storage system is used in the charging aggregator to perform peak power shaving, and future parking PEVs are properly taken care of. Experimental results show that the proposed optimal charging algorithm minimizes the combination of electricity cost and battery aging cost in the RS provisioning power market. Experimental results also show that the introduction of a charging aggregator can significantly reduce the peak power consumption caused by simultaneous PEV charging.

Proceedings ArticleDOI
27 Mar 2017
TL;DR: If buffers are judiciously inserted in global interconnects, the buffer delay decrease is more pronounced than the interconnect delay increase, resulting in an overall performance improvement at higher temperatures, as shown in this paper.
Abstract: As a result of the Temperature Effect Inversion (TEI) in FinFET-based designs, gate delays decrease with the increase of temperature. In contrast, the resistive characteristic and hence delay of global interconnects increase with the temperature. However, as shown in this paper, if buffers are judiciously inserted in global interconnects, the buffer delay decrease is more pronounced than the interconnect delay increase, resulting in an overall performance improvement at higher temperatures. More specifically, this work models the delay of buffer-inserted global interconnects vs. temperature in order to derive the optimal number and size of buffers for a given interconnect length and temperature. Furthermore, the paper addresses the problem of minimizing the buffered interconnect energy consumption by changing the supply voltage level or FinFET threshold voltage, and also presents a temperature-aware optimization policy for solving this problem. Simulation results show average interconnect energy savings of 16% with no performance penalty for five different benchmarks implemented on a 14nm FinFET technology.

01 Oct 2017
TL;DR: This project drew up a comprehensive research plan for developing a standard design methodology and supporting computer-aided design tools for the SFQ logic at the register-transfer-level and below and produced several preliminary, prototype software tools for proof-of-concept demonstrations.
Abstract: The goal of this project was to investigate the state-of-the-art in design and optimization of single-flux quantum (SFQ) logic circuits, e.g., RSFQ and ERSFQ, and draw up a comprehensive research plan for developing a standard design methodology and supporting computer-aided design tools for the SFQ logic at the register-transfer-level and below. In the process, this project produced several preliminary, prototype software tools for proof-of-concept demonstrations, including an RSFQ cell library, a prototype standard cell timing characterization tool, a prototype static timing analysis tool, a prototype frontend logic synthesis tool, and a prototype backend place and route tool. The RSFQ library and software tools can be accessed at http://sportlab.usc.edu/downloads/download-protected/. For username and password, please contact pedram@usc.edu.