Showing papers on "Clock gating" published in 2014



Journal Article•DOI•

16.1 A 340mV-to-0.9V 20.2Tb/s source-synchronous hybrid packet/circuit-switched 16×16 network-on-chip in 22nm tri-gate CMOS


Gregory K. Chen¹, Mark A. Anders¹, Himanshu Kaul¹, Sudhir K. Satpathy¹, Sanu Mathew¹, Steven K. Hsu¹, Amit Agarwal¹, Ram Krishnamurthy¹, Shekhar Borkar¹, Vivek De¹•Institutions (1)

Intel¹

11 Dec 2014

TL;DR: In this article, a 16×16 mesh, 112b data, 256 voltage/clock domain NoC with source-synchronous operation, hybrid packet/circuit-switched flow control, and ultra-low-voltage optimizations is fabricated in 22nm tri-gate CMOS.


Abstract: Energy-efficient networks-on-chip (NoCs) are key enablers for exa-scale computation by shifting power budget from communication toward computation. As core counts scale into the 100s, on-chip interconnect fabrics must support increasing heterogeneity and voltage/clock domains. Synchronous NoCs require either a single clock distributed globally or clock-crossing data FIFOs between clock domains [1]. A global clock requires costly full-chip margining and significant power and area for clock distribution, while synchronizing data FIFOs add power, performance, and area overhead per clock crossing. Source-synchronous NoCs mitigate these penalties by forwarding a local clock along with each packet, but still suffer from high data storage power due to packet switching. Circuit switching removes intra-route data storage, but suffers from low network utilization due to serialized channel setup and data transfer [2]. Hybrid packet/circuit switching parallelizes these operations for higher network utilization. A 16×16 mesh, 112b data, 256 voltage/clock domain NoC with source-synchronous operation, hybrid packet/circuit-switched flow control, and ultra-low-voltage optimizations is fabricated in 22nm tri-gate CMOS [3] to enable: i) 20.2Tb/s total throughput at 0.9V, 25°C, ii) a 2.7× increase in bisection bandwidth to 2.8Tb/s and 93% reduction in circuit-switched latency at 407ps/hop through source-synchronous operation, iii) a 62% latency improvement and 55% increase in energy efficiency to 7.0Tb/s/W through circuit switching, iv) a peak energy efficiency of 18.3Tb/s/W for near-threshold operation at 430mV, 25°C, and v) ultra-low-voltage operation down to 340mV with router power scaling to 363μW.


69 citations

Patent•

Method and apparatus for source-synchronous signaling


Jared L. Zerbe¹, Brian S. Leibowitz¹, Hsuan-Jung (Bruce) Su¹, John Eble¹, Barry Daly¹, Lei Luo¹, Teva Stone¹, John Wilson¹, Jihong Ren¹, Wayne Dettloff¹•Institutions (1)

Rambus¹

11 Aug 2014

TL;DR: In this paper, a low power, high performance source-synchronous chip interface which provides rapid turn-on and facilitates high signaling rates between a transmitter and a receiver located on different chips is described.


Abstract: A low-power, high-performance source-synchronous chip interface which provides rapid turn-on and facilitates high signaling rates between a transmitter and a receiver located on different chips is described in various embodiments. Some embodiments of the chip interface include, among others: a segmented “fast turn-on” bias circuit to reduce power supply ringing during the rapid power-on process; current mode logic clock buffers in a clock path of the chip interface to further reduce the effect of power supply ringing; a multiplying injection-locked oscillator (MILO) clock generator to generate higher frequency clock signals from a reference clock; a digitally controlled delay line which can be inserted in the clock path to mitigate deterministic jitter caused by the MILO clock generator; and circuits for periodically re-evaluating whether it is safe to retime transmit data signals in the reference clock domain directly with the faster clock signals.


66 citations

Journal Article•DOI•

Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating


Shmuel Wimer¹, Israel Koren²•Institutions (2)

Technion – Israel Institute of Technology¹, University of Massachusetts Amherst²

01 Apr 2014-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: Data-driven clock gating is integrated into an Electronic Design Automation (EDA) commercial backend design flow, achieving total power reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40 and 65 nanometer process technologies.


Abstract: Clock gating is a predominant technique used for power saving. It is observed that the commonly used synthesis-based gating still leaves a large amount of redundant clock pulses. Data-driven gating aims to disable these. To reduce the hardware overhead involved, flip-flops (FFs) are grouped so that they share a common clock enabling signal. The question of what is the group size maximizing the power savings is answered in a previous paper. Here we answer the question of which FFs should be placed in a group to maximize the power reduction. We propose a practical solution based on the toggling activity correlations of FFs and their physical position proximity constraints in the layout. Our data-driven clock gating is integrated into an Electronic Design Automation (EDA) commercial backend design flow, achieving total power reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40 and 65 nanometer process technologies. These savings are achieved on top of the savings obtained by clock gating synthesis performed by commercial EDA tools, and gating manually inserted into the register transfer level design.
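A toy model may help show why grouping by toggling correlation matters. The sketch below is not the paper's algorithm. It assumes that a flip-flop needs a clock pulse only in cycles where its input differs from its stored value, and that a group sharing one enable is clocked whenever any member needs a pulse. The function name, the trace format, and the example data are all hypothetical.

```python
def clock_pulses(traces, groups):
    """Count clock pulses delivered to FFs under group-level data-driven gating.

    traces: dict ff_name -> list of 0/1 flags, where 1 means the FF's input
            differs from its output in that cycle (a pulse is really needed).
    groups: list of lists of ff_names that share one clock enable signal.
    """
    total = 0
    for group in groups:
        cycles = len(traces[group[0]])
        for t in range(cycles):
            # Shared enable: OR of the members' "needs a pulse" flags.
            if any(traces[ff][t] for ff in group):
                total += len(group)  # every FF in the group is clocked
    return total


# Hypothetical toggle traces: a/b toggle together, c/d toggle together.
traces = {
    "a": [1, 0, 0, 1],
    "b": [1, 0, 0, 1],
    "c": [0, 1, 0, 0],
    "d": [0, 1, 0, 0],
}
correlated = clock_pulses(traces, [["a", "b"], ["c", "d"]])
uncorrelated = clock_pulses(traces, [["a", "c"], ["b", "d"]])
ungated = len(traces) * 4  # every FF clocked in every one of the 4 cycles
```

In this example the correlated grouping delivers no redundant pulses, while the uncorrelated grouping wastes half of the pulses it delivers.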


51 citations

Journal Article•DOI•

Temperature-Assisted Clock Synchronization and Self-Calibration for Sensor Networks


Zhe Yang¹, Liang He², Lin Cai³, Jianping Pan³•Institutions (3)

Northwestern Polytechnical University¹, Singapore University of Technology and Design², University of Victoria³

17 Jun 2014-IEEE Transactions on Wireless Communications

TL;DR: The temperature-assisted clock self-calibration (TACSC) scheme can improve the synchronization accuracy by more than one order of magnitude, which is verified by both simulation and testbed experimentation.


Abstract: Synchronization is a pre-requisite for many sensor network applications. However, it remains challenging in sensor networks due to both the limited resources and the dynamic environments. In this paper, we propose a new two-phase clock synchronization scheme. The first one is the external clock synchronization phase, during which nodes update their clock by exchanging timestamp messages with the reference clock. Different from the conventional solutions, we propose to directly remove the clock skew during the external synchronization to achieve a higher synchronization accuracy and lower computational complexity. The second one is the clock self-calibration phase, as the accumulated clock skew will make the synchronized clock drift away again, we need to compensate the clock skew to maintain the clock synchronization accuracy. However, the compensation is non-trivial as the clock skew may not be constant due to the changing environment. Thus we propose the temperature-assisted clock self-calibration (TACSC) to dynamically compensate the clock skew according to the working temperature. Extensive simulation demonstrates that the proposed synchronization scheme can achieve a much lower root mean square error in the external synchronization phase. Furthermore, during the clock self-calibration phase, the TACSC scheme can improve the synchronization accuracy by more than one order of magnitude, which is verified by both simulation and testbed experimentation.
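The self-calibration idea can be sketched as fitting a skew-versus-temperature model and using it to correct local time. This is only an illustration: the abstract does not specify the model, so the linear form, the function names, and the sample values below are assumptions.

```python
def fit_skew_model(temps, skews):
    """Least-squares fit of skew = a * temp + b (assumed linear model)."""
    n = len(temps)
    mean_t = sum(temps) / n
    mean_s = sum(skews) / n
    var_t = sum((t - mean_t) ** 2 for t in temps)
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(temps, skews))
    a = cov / var_t
    b = mean_s - a * mean_t
    return a, b


def calibrate(local_elapsed, temp, model):
    """Remove the predicted skew-induced drift from a local elapsed time."""
    a, b = model
    skew = a * temp + b  # predicted drift in seconds per second
    return local_elapsed / (1.0 + skew)


# Hypothetical calibration points: (temperature in deg C, measured skew).
model = fit_skew_model([10.0, 20.0, 30.0], [1e-5, 2e-5, 3e-5])
# A clock running 20 ppm fast at 20 deg C reports 100.002 s for a true 100 s.
corrected = calibrate(100.0 * (1.0 + 2e-5), 20.0, model)
```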


41 citations

Proceedings Article•DOI•

Dynamic fine-grained scheduling for energy-efficient main-memory queries


Iraklis Psaroudakis¹, Thomas Kissinger², Danica Porobic¹, Thomas Ilsche², Erietta Liarou¹, Pinar Tözün¹, Anastasia Ailamaki¹, Wolfgang Lehner²•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Dresden University of Technology²

23 Jun 2014

TL;DR: It is argued that databases should employ a fine-grained approach by dynamically scheduling tasks using precise hardware models, and it is experimentally shown that energy efficiency can be improved by up to 4x for fundamental memory-intensive database operations, such as scans.


Abstract: Power and cooling costs are some of the highest costs in data centers today, which make improvement in energy efficiency crucial. Energy efficiency is also a major design point for chips that power whole ranges of computing devices. One important goal in this area is energy proportionality, arguing that the system's power consumption should be proportional to its performance. Currently, a major trend among server processors, which stems from the design of chips for mobile devices, is the inclusion of advanced power management techniques, such as dynamic voltage-frequency scaling, clock gating, and turbo modes. A lot of recent work on energy efficiency of database management systems is focused on coarse-grained power management at the granularity of multiple machines and whole queries. These techniques, however, cannot efficiently adapt to the frequently fluctuating behavior of contemporary workloads. In this paper, we argue that databases should employ a fine-grained approach by dynamically scheduling tasks using precise hardware models. These models can be produced by calibrating operators under different combinations of scheduling policies, parallelism, and memory access strategies. The models can be employed at run-time for dynamic scheduling and power management in order to improve the overall energy efficiency. We experimentally show that energy efficiency can be improved by up to 4x for fundamental memory-intensive database operations, such as scans.


26 citations

Journal Article•DOI•

10.1 A 28nm DSP powered by an on-chip LDO for high-performance and energy-efficient mobile applications


Martin Saint-Laurent¹, Paul Bassett¹, Ken Lin¹, Baker Mohammad¹, Yuhe Wang¹, Xufeng Chen¹, Maen Alradaideh¹, Tom Wernimont¹, Kartik Ayyar¹, Dan Bui¹, Dwight Galbi¹, Allan Lester¹, Marzio Pedrali-Noy¹, Willie Anderson¹•Institutions (1)

Qualcomm¹

12 Dec 2014

TL;DR: The very-long instruction word Hexagon™ DSP is fabricated using a 28 nm high-κ metal-gate process technology optimized for mobile applications, and pursues high IPC as opposed to high frequency.


Abstract: This paper describes the implementation of a Qualcomm Hexagon digital signal processor (DSP) in a 28 nm high-κ metal gate technology. The DSP is a multi-threaded very-long-instruction-word (VLIW) machine optimized for low leakage and energy efficiency. It uses a clock distribution network, clock gating cells, and pulsed latches that are optimized for low switching energy. The processor can be powered using a low-dropout (LDO) voltage regulator or a head switch. It operates from 255 MHz at 0.60 V to 1.24 GHz at 1.05 V. When operating from the LDO, the power consumption of the core can be as low as 58 µW/MHz, which is two to three times lower than comparable cores optimized for ultra-low voltage operation.


24 citations

Journal Article•DOI•

A Look-Ahead Clock Gating Based on Auto-Gated Flip-Flops


Shmuel Wimer¹, Arye Albahari²•Institutions (2)

Technion – Israel Institute of Technology¹, Intel²

06 Jan 2014-IEEE Transactions on Circuits and Systems

TL;DR: A novel method called Look-Ahead Clock Gating (LACG) is presented, which combines all the three gating methods of AGFF, and implies a breakeven curve, dividing the FFs space into two regions of positive and negative gating return on investment.


Abstract: Clock gating is very useful for reducing the power consumed by digital systems. Three gating methods are known. The most popular is synthesis-based, deriving clock enabling signals based on the logic of the underlying system. It unfortunately leaves the majority of the clock pulses driving the flip-flops (FFs) redundant. A data-driven method stops most of those and yields higher power savings, but its implementation is complex and application dependent. A third method called auto-gated FFs (AGFF) is simple but yields relatively small power savings. This paper presents a novel method called Look-Ahead Clock Gating (LACG), which combines all the three. LACG computes the clock enabling signals of each FF one cycle ahead of time, based on the present cycle data of those FFs on which it depends. It avoids the tight timing constraints of AGFF and data-driven by allotting a full clock cycle for the computation of the enabling signals and their propagation. A closed-form model characterizing the power saving per FF is presented. It is based on data-to-clock toggling probabilities, capacitance parameters and FFs' fan-in. The model implies a breakeven curve, dividing the FFs space into two regions of positive and negative gating return on investment. While the majority of the FFs fall in the positive region and hence should be gated, those falling in the negative region should not. Experimentation on industry-scale data showed 22.6% reduction of the clock power, translated to 12.5% power reduction of the entire system.
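As a rough consistency check on the two percentages reported above, one can ask what share of total power the clock network would need to have. This assumes, which the abstract does not state, that the entire system-level saving comes from the clock network and that non-clock power is unchanged. Writing f for the clock's share of total power:

```latex
0.226 \, f \approx 0.125
\quad\Longrightarrow\quad
f \approx \frac{0.125}{0.226} \approx 0.55
```

Under that assumption the clock network would account for roughly 55% of total system power in the evaluated data.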


24 citations

Patent•

Clock gating for system-on-chip elements


Sailesh Kumar, Sandip Das, Poonacha Kongetira

01 Oct 2014

TL;DR: In this paper, a hardware element in a NoC includes a clock gating circuit that configures one or more neighboring hardware elements to activate before receiving new incoming data and to sleep after a defined number of cycles.


Abstract: An aspect of the present disclosure provides a hardware element in a Network on Chip (NoC), wherein the hardware element includes a clock gating circuit that configures one or more neighboring hardware elements to activate before receiving new incoming data and to sleep after a defined number of cycles, wherein the defined number of cycles can be counted from a cycle having non-receipt of incoming data and/or having a clearance of all data within an input queue of a source hardware element.
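The wake-early, sleep-after-idle behavior described above can be sketched as a small cycle-level model. This is an illustrative simplification, not the patented circuit: the class name, the `idle_limit` parameter, and the single boolean wake hint from the upstream neighbor are assumptions.

```python
class GatedElement:
    """Toy model of a NoC element whose clock is gated after idle cycles."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit  # hypothetical "defined number of cycles"
        self.idle_count = 0
        self.clock_on = False

    def tick(self, upstream_has_data):
        """Advance one cycle; upstream_has_data is the neighbor's wake hint."""
        if upstream_has_data:
            self.clock_on = True  # activate before the data arrives
            self.idle_count = 0
        elif self.clock_on:
            self.idle_count += 1
            if self.idle_count >= self.idle_limit:
                self.clock_on = False  # sleep after idle_limit idle cycles
        return self.clock_on


elem = GatedElement(idle_limit=2)
hints = [False, True, False, False, False, True]
states = [elem.tick(h) for h in hints]
```

Here the element stays clocked through one idle cycle after the hint, gates its clock on the second idle cycle, and wakes again on the next hint.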


22 citations

Patent•

Self-test solution for delay locked loops


Edzel Gerald Dela Cruz Raffiñan

14 Mar 2014

TL;DR: In this article, a built-in self test (BIST) circuit and method is provided to test a first and a second DLL, where the BIST circuitry provides a first delay amount over the first delay input creating a start offset between the first and second clock output signals.


Abstract: A built-in self test (BIST) circuit and method is provided to test a first and a second DLL. The first DLL has a first delay input, a first clock input disposed to receive a clock input signal, and a first clock output that provides a first clock output signal delayed in comparison with the clock input signal. The second DLL has a second delay input, a second clock input disposed to receive the clock input signal, and a second clock output signal delayed in comparison with the clock input signal. The BIST circuitry provides a first delay amount over the first delay input creating a start offset between the first and second clock output signals. If the first DLL is functioning properly the start offset between the output signals should remain unchanged even after the BIST circuitry provides an additional common delay amount to the first and second delay inputs.
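The pass criterion in the abstract (the start offset stays unchanged after a common delay is added) can be sketched behaviorally. The model below is an assumption for illustration: each DLL is a function from a delay code to an output delay in picoseconds, and the function name, tolerance, and example DLL models are hypothetical.

```python
def bist_check(dll1, dll2, offset_code, common_code, tol_ps=1.0):
    """Return True if the output offset survives a common added delay.

    dll1, dll2: callables mapping a delay code to an output delay in ps.
    """
    # Step 1: extra delay on the first DLL only creates the start offset.
    start = dll1(offset_code) - dll2(0)
    # Step 2: the same additional delay is applied to both DLLs.
    after = dll1(offset_code + common_code) - dll2(common_code)
    # Properly working DLLs keep the offset unchanged within tolerance.
    return abs(after - start) <= tol_ps


def good(code):
    """Hypothetical healthy DLL: 100 ps base delay plus 10 ps per code step."""
    return 100.0 + 10.0 * code


def stuck(code):
    """Hypothetical faulty DLL whose delay saturates at code 3."""
    return 100.0 + 10.0 * min(code, 3)


healthy_pair = bist_check(good, good, offset_code=2, common_code=4)
faulty_pair = bist_check(good, stuck, offset_code=2, common_code=4)
```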


21 citations

Proceedings Article•DOI•

5.3 Wide-frequency-range resonant clock with on-the-fly mode changing for the POWER8™ microprocessor


Phillip J. Restle¹, David Shan¹, David Hogenmiller¹, Yong Kim¹, Alan J. Drake¹, Jason D. Hibbeler¹, T.J. Bucelot¹, Gregory Scott Still¹, Keith A. Jenkins¹, Joshua Friedrich¹•Institutions (1)

IBM¹

06 Mar 2014

TL;DR: A resonant-clock design for the IBM POWER8 processor core was implemented with 2 resonant modes (and a non-resonant mode), saving clock power over a wide frequency range from 2.5GHz to more than 5GHz.


Abstract: A resonant-clock design for the IBM POWER8 processor core was implemented with 2 resonant modes (and a non-resonant mode), saving clock power over a wide frequency range from 2.5GHz to more than 5GHz. The POWER8 microprocessor is composed of 12 chiplets, each containing a single resonant clock grid for one core and its L2 cache, and a half-frequency, non-resonant clock grid for the L3 cache. The clock grids drive the local clock buffers (LCBs) that in turn drive the latches. The LCBs are gated off to measure the global clock power from the PLL to the LCBs. The resonant core communicates synchronously with the L3, requiring low skew between the domains. The chip was designed in a 22nm SOI process, including two ultra-thick-metal (UTM) layers (3 microns thick) for power distribution, I/O, all long global clock wires, and the resonant clock inductors. The UTM technology reduces wire resistance and simplifies inductor design, but requires accurate transmission line modeling and special routing.


21 citations

Book Chapter•DOI•

Electric Clock for NanoMagnet Logic Circuits


Marco Vacca¹, Mariagrazia Graziano¹, Alessandro Chiolerio², Andrea Lamberti², Marco Laurenti², D. Balma³, Emanuele Enrico, Federica Celegato, Paola Tiberto, Luca Boarino, Maurizio Zamboni¹•Institutions (3)

Polytechnic University of Turin¹, Istituto Italiano di Tecnologia², École Polytechnique Fédérale de Lausanne³

01 Jan 2014

TL;DR: Among Field-Coupled technologies, NanoMagnet Logic (NML) is one of the most promising, but the necessity of using an external magnetic field to locally control the circuit represents the weakest point of this technology.


Abstract: Among Field-Coupled technologies, NanoMagnet Logic (NML) is one of the most promising. Low dynamic power consumption, total absence of static power, remarkable heat and radiations resistance, in association with the possibility of combining memory and logic in the same device, make this technology the ideal candidate for low power, portable applications. However, the necessity of using an external magnetic field to locally control the circuit represents, currently, the weakest point of this technology. The high power losses in the clock generation system adopted up to now wipes out the most important advantages of this technology.


Journal Article•DOI•

High-performance hardware architectures for multi-level lifting-based discrete wavelet transform


Anand D. Darji¹, Shailendra Singh Kushwah, Shabbir N. Merchant¹, A.N. Chandorkar¹•Institutions (1)

Indian Institute of Technology Bombay¹

04 Oct 2014-Eurasip Journal on Image and Video Processing

TL;DR: The proposed PMA is highly efficient in terms of operating frequency due to pipelining and significantly reduces total computing cycles compared to the existing multi-level architectures.


Abstract: In this paper, three hardware efficient architectures to perform multi-level 2-D discrete wavelet transform (DWT) using lifting (5, 3) and (9, 7) filters are presented. They are classified as folded multi-level architecture (FMA), pipelined multi-level architecture (PMA), and recursive multi-level architecture (RMA). Efficient FMA is proposed using dual-input Z-scan block (B1) with 100% hardware utilization efficiency (HUE). Modular PMA is proposed with the help of block (B1) and dual-input raster scan block (B2) with 60% to 75% HUE. Block B1 and B2 are micro-pipelined to achieve critical path as single adder and single multiplier for lifting (5, 3) and (9, 7) filters, respectively. The clock gating technique is used in PMA to save power and area. Hardware-efficient RMA is proposed with the help of block (B1) and single-input recursive block (B3). Block (B3) uses only single processing element to compute both predict and update; thus, 50% multipliers and adders are saved. Dual-input per clock cycle minimizes total frame computing cycles, latency, and on-chip line buffers. PMA for five-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for Virtex-5 XC5VLX110T field-programmable gate array (FPGA) target device (Xilinx, Inc., San Jose, CA, USA). The proposed PMA is highly efficient in terms of operating frequency due to pipelining. Moreover, this approach significantly reduces total computing cycles compared to the existing multi-level architectures. RMA for three-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for Virtex-4 VFX100 FPGA target device.
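For readers unfamiliar with the predict and update steps mentioned above, the following is a minimal software sketch of one level of the standard reversible integer (5, 3) lifting transform in one dimension. It is not the paper's hardware architecture. The function names, the even-length input restriction, and the symmetric boundary handling are choices made for this illustration.

```python
def lift53_forward(x):
    """One-level 1-D integer (5, 3) lifting on an even-length list of ints."""
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # Predict: detail = odd sample minus the floor-average of its even neighbors.
    d = [odd[i] - ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    # Update: approximation = even sample plus a rounded quarter of two details.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the lifting steps in reverse order."""
    n = len(d)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    odd = [d[i] + ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    x = [0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x


signal = [3, 7, 1, 8, 2, 9, 4, 6]
approx, detail = lift53_forward(signal)
restored = lift53_inverse(approx, detail)
flat_approx, flat_detail = lift53_forward([5] * 8)
```

Because each lifting step only adds a function of the other half of the samples, the inverse subtracts the same quantity, so reconstruction is exact in integer arithmetic.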


Journal Article•DOI•

Energy-Efficient Soft-Input Soft-Output Signal Detector for Iterative MIMO Receivers


Liang Liu¹•Institutions (1)

Lund University¹

04 Mar 2014-IEEE Transactions on Circuits and Systems

TL;DR: This paper presents the VLSI design of an energy-efficient, high-throughput soft-input soft-output signal detector for iterative multiple-input multiple-output (MIMO) receiver and adopts several new algorithm-level techniques to exploit the available a priori information of transmitted bits.


Abstract: This paper presents the VLSI design of an energy-efficient, high-throughput soft-input soft-output signal detector for iterative multiple-input multiple-output (MIMO) receiver. The detector is evolved from our previously developed imbalanced fixed complexity sphere decoder and adopts several new algorithm-level techniques to exploit the available a priori information of transmitted bits. More specifically, an adaptive tree-travel control scheme, a reliability-dependent log-likelihood ratio correction method and an iteration-based hybrid node enumeration technique are proposed to provide near-optimal detection performance with much reduced computational complexity. A multi-stage parallel VLSI architecture is developed to implement the proposed algorithm with high detection throughput. Furthermore, the block-level clock gating is deployed to save power when the tree-search space is reduced, while still preserving the constant-throughput feature. As a proof of concept, we designed the iterative detector using a 65-nm CMOS technology and conducted post-layout simulation. The core area is 0.64 mm² with 198.2 k gates. Working at 240-MHz clock frequency with 1.0-V voltage supply, the detector achieves a maximum 1.44-Gbps throughput. Under frequency-selective channels, the detector core consumes 98.5-, 127.9-, and 149.5-pJ energy per bit detection in open-loop, 2-iteration, and 4-iteration modes, respectively.


Proceedings Article•DOI•

Coarse grain clock gating of streaming applications in programmable logic implementations


Endri Bezati¹, Simone Casale Brunet¹, Marco Mattavelli¹, Jorn W. Janneck²•Institutions (2)

École Normale Supérieure¹, Lund University²

10 Jul 2014

TL;DR: This paper presents a set of techniques for taking advantage of the streaming character of the algorithm by selectively switching off parts of the circuit that cannot execute, thus saving power.


Abstract: Streaming applications describe a broad class of computing algorithms in areas such as signal processing, media coding and compression, cryptography, video analytics, network routing and packet processing and many others. For many of these applications, programmable logic devices such as FPGAs are the implementation platform of choice due to their higher flexibility compared to ASICs and lower power consumption and higher performance compared to processors. This paper presents a set of techniques for taking advantage of the streaming character of the algorithm by selectively switching off parts of the circuit that cannot execute, thus saving power. The implementation is integrated into an existing high-level synthesis flow, and applied to a variety of applications, resulting in up to 20% power reduction with a very small additional logic footprint and no loss in throughput. © 2014 European Electronic Chips & Systems design ECSI.


Journal Article•DOI•

Pulsed-Latch Utilization for Clock-Tree Power Optimization


Hong-Ting Lin¹, Yi-Lin Chuang², Zong-Han Yang¹, Tsung-Yi Ho¹•Institutions (2)

National Cheng Kung University¹, TSMC²

01 Apr 2014-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This is the first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and flip-flops, based on minimum-cost maximum-flow formulation to globally determine the tree topology.


Abstract: Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine the tree topology, which maintains load balance and considers the wirelength between pulse generators and pulsed latches. Experimental results indicate that the proposed migration approach can improve the power consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.


Journal Article•DOI•

Active Mode Subclock Power Gating


Jatin N. Mistry, James Myers, Bashir M. Al-Hashimi¹, David Walter Flynn, John Philip Biggs, Geoff V. Merrett¹•Institutions (1)

University of Southampton¹

01 Sep 2014-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper presents a technique, called subclock power gating, for reducing leakage power during the active mode in low performance, energy-constrained applications and validated it by incorporating it with an ARM Cortex-M0 microprocessor, which was fabricated in a 65-nm process.


Abstract: This paper presents a technique, called subclock power gating, for reducing leakage power during the active mode in low performance, energy-constrained applications. The proposed technique achieves power reduction through two mechanisms: 1) power gating the combinational logic within the clock period (subclock) and 2) reducing the virtual supply to less than Vth rather than shutting down completely as is the case in conventional power gating. To achieve this reduced voltage, a pair of nMOS and pMOS transistors are used at the head and foot of the power gated logic for symmetric virtual rail clamping of the power and ground supplies. The subclock power gating technique has been validated by incorporating it with an ARM Cortex-M0 microprocessor, which was fabricated in a 65-nm process. Two sets of experiments are done: the first experimentally validates the functionality of the proposed technique in the fabricated test chip and the second investigates the utility of the proposed technique in example applications. Measured results from the fabricated chip show 27% power saving during the active mode for an example wireless sensor node application when compared with the same microprocessor without subclock power gating.


Proceedings Article•DOI•

OCV-aware top-level clock tree optimization


Tuck-Boon Chan¹, Kwangsoo Han¹, Andrew B. Kahng¹, Jae-Gon Lee², Siddhartha Nath¹•Institutions (2)

University of California, San Diego¹, Samsung²

20 May 2014

TL;DR: This paper presents a new CTS methodology that optimizes clock logic cell placements and buffer insertions in the top level of a clock tree as a linear program that minimizes a weighted sum of timing slacks, clock uncertainty and wirelength.


Abstract: The clock trees of high-performance synchronous circuits have many clock logic cells (e.g., clock gating cells, multiplexers and dividers) in order to achieve aggressive clock gating and required performance across a wide range of operating modes and conditions. As a result, clock tree structures have become very complex and difficult to optimize with automatic clock tree synthesis (CTS) tools. In advanced process nodes, CTS becomes even more challenging due to on-chip variation (OCV) effects. In this paper, we present a new CTS methodology that optimizes clock logic cell placements and buffer insertions in the top level of a clock tree. We formulate the top-level clock tree optimization problem as a linear program that minimizes a weighted sum of timing slacks, clock uncertainty and wirelength. Experimental results in a commercial 28nm FDSOI technology show that our method can improve post-CTS worst negative slack across all modes/corners by up to 320ps compared to a leading commercial provider's CTS flow.


Journal Article•DOI•

Design of an Elliptic Curve Cryptography processor for RFID tag chips.


Zilong Liu¹, Dongsheng Liu¹, Xuecheng Zou¹, Hui Lin², Jian Cheng¹•Institutions (2)

Huazhong University of Science and Technology¹, Wuhan University of Technology²

26 Sep 2014-Sensors

TL;DR: A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files and makes the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.


Abstract: Radio Frequency Identification (RFID) is an important technique for wireless sensor networks and the Internet of Things. Recently, considerable research has been performed in the combination of public key cryptography and RFID. In this paper, an efficient architecture of Elliptic Curve Cryptography (ECC) Processor for RFID tag chip is presented. We adopt a new inversion algorithm which requires fewer registers to store variables than the traditional schemes. A new method for coordinate swapping is proposed, which can reduce the complexity of the controller and shorten the time of iterative calculation effectively. A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files. Clock gating and asynchronous counter are exploited to reduce the power consumption. The simulation and synthesis results show that the time needed for one elliptic curve scalar point multiplication over GF(2^163) is 176.7 K clock cycles and the gate area is 13.8 K with UMC 0.13 μm Complementary Metal Oxide Semiconductor (CMOS) technology. Moreover, the low power and low cost consumption make the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.

Patent•

Clock signal error correction in a digital-to-analog converter

Bernd Schafferer¹, Ping Wing Lai¹, Qiurong He¹•Institutions (1)

Analog Devices¹

13 Mar 2014

TL;DR: In this paper, a digital-to-analog converter (DAC) including a correction circuit for a clock, including a differential clock, is described, where replica cells that are substantially similar to conversion cells are configured to provide a feedback signal to a clock receiver with information for correcting the clock signal.

Abstract: In an example, there is disclosed herein a digital-to-analog converter (DAC) including a correction circuit for a clock, including a differential clock. Error correction may take place within the DAC core, by means of replica cells that are substantially similar to conversion cells. Rather than contributing their output to the converted signal, the replica cells may be configured to provide a feedback signal to a clock receiver with information for correcting the clock signal. The feedback signal may be operable to correct errors, for example, in duty cycle and crosspoint, as measured at the DAC core.

Patent•

Data input circuit

Kyoung Hwan Kwon¹, Tae Jin Kang¹, Sang Kwon Lee¹•Institutions (1)

SK Hynix¹

18 Sep 2014

TL;DR: In this article, a data input circuit includes a clock sampling unit, a final clock generation unit, and a write latch signal generation unit; the final clock generation unit is configured to generate a level signal by latching the shifting signal in synchronization with the sampling clock.

Abstract: A data input circuit includes a clock sampling unit, a final clock generation unit, and a write latch signal generation unit. The clock sampling unit is configured to generate a shifting signal including a pulse generated after a write latency has elapsed, and generate a sampling clock by sampling an internal clock during a burst period from substantially a time when the pulse of the shifting signal is generated. The final clock generation unit is configured to generate a level signal by latching the shifting signal in synchronization with the sampling clock and generate a final clock from the level signal in response to a burst signal. The write latch signal generation unit is configured to generate an enable signal by latching the final clock and generate a write latch signal for latching and outputting aligned data in response to the enable signal.

An improved instruction-level power model for ARM11 microprocessor

Wei Wang¹, Mark Zwolinski¹•Institutions (1)

University of Southampton¹

23 Jan 2014

TL;DR: An instruction-level power model based on an ARM1176JZF-S processor is presented to predict the power of software applications; it is proved that energy per operation (EPO) decreases with increasing operations per clock cycle, and the relationship is confirmed empirically.

Abstract: The power and energy consumed by a chip have become the primary design constraint for embedded systems, which has led to a lot of work in hardware design techniques such as clock gating and power gating. The software can also affect the power usage of a chip, hence good software design can be used to reduce the power further. In this paper we present an instruction-level power model based on an ARM1176JZF-S processor to predict the power of software applications. Our model takes substantially less input data than existing high accuracy models and does not need to consider each instruction individually. We show that the power is related to both the distribution of instruction types and the operations per clock cycle (OPC) of the program. Our model does not need to consider the effect of two adjacent instructions, which saves a lot of calculation and measurements. Pipeline stall effects are also considered by OPC instead of cache miss, because there are a lot of other reasons that can cause the pipeline to stall. The model shows good performance with a maximum estimation error of -8.28% and an average absolute estimation error of 4.88% over six benchmarks. Finally, we prove that energy per operation (EPO) decreases with increasing operations per clock cycle, and we confirm the relationship empirically.

Patent•

Delay locked loop and method of generating clock

Seong-Ook Jung¹, Dong-Hoon Jung¹, Kyungho Ryu¹, Park Jung Hyun¹•Institutions (1)

Yonsei University¹

17 Jan 2014

TL;DR: In this article, a delay-locked loop (DLL) with a ring oscillator (RO) is presented, in which the RO includes a delay line that delays a reference clock signal to generate a delayed clock signal.

Abstract: Provided is a delay locked loop (DLL) including a ring oscillator (RO) including a delay line to delay a reference clock signal and generate a delayed clock signal, wherein the RO circulates, through the delay line, a feedback clock signal corresponding to the delayed clock signal to synchronize N cycles of the feedback clock signal with a cycle of the reference clock signal (where N is an integer number equal to or larger than 2); and a first frequency divider dividing the frequency of the delayed clock signal by 1/N (where N is an integer number equal to or larger than 2) to generate an output clock signal.

Patent•

Performing an operating frequency change using a dynamic clock control technique

Alexander Gendler¹, Inder M. Sodhi¹•Institutions (1)

Los Angeles Mission College¹

07 Nov 2014

TL;DR: In this article, a processor includes a core to execute instructions, where the core includes a clock generation circuit to receive and distribute a first clock signal at a first operating frequency provided from a phase lock loop of the processor to a plurality of units of the core.

Abstract: In an embodiment, a processor includes a core to execute instructions, where the core includes a clock generation circuit to receive and distribute a first clock signal at a first operating frequency provided from a phase lock loop of the processor to a plurality of units of the core. The clock generation circuit may include a dynamic clock logic to receive a dynamic clock frequency command and to cause the clock generation circuit to distribute the first clock signal to at least one of the units at a second operating frequency. Other embodiments are described and claimed.

Journal Article•DOI•

Intermittent Resonant Clocking Enabling Power Reduction at Any Clock Frequency for Near/Sub-Threshold Logic Circuits

Hiroshi Fuketa¹, Masahiro Nomura, Makoto Takamiya¹, Takayasu Sakurai¹•Institutions (1)

University of Tokyo¹

06 Jan 2014-IEEE Journal of Solid-state Circuits

TL;DR: Measurement results show that IRC reduces the clock power by 36% at 980 kHz and the clock leakage power by 81% compared with conventional non-resonant clocking when applied to the adder array with latches; IRC can reduce the clock power at any clock frequency, which enables flexible selection of the clock frequency.

Abstract: In order to eliminate the limitation of a narrow frequency range of conventional resonant clocking, intermittent resonant clocking (IRC) is proposed for near/sub-threshold logic circuits. In this paper, IRC is applied to a 0.37 V 32-bit adder array with latches and an adder array with flip-flops fabricated in a 40 nm CMOS process. Measurement results show that IRC reduces the clock power by 36% at 980 kHz and the clock leakage power by 81% compared with conventional non-resonant clocking when IRC is applied to the adder array with latches. The same power reduction is achieved when IRC is applied to the adder array with flip-flops. IRC can reduce the clock power at any clock frequency, which enables flexible selection of the clock frequency.

Patent•

Low power toggle latch-based flip-flop including integrated clock gating logic

Matthew S. Berzins¹, Christina Wells¹•Institutions (1)

Samsung¹

01 May 2014

TL;DR: In this paper, integrated clock gating logic that can generate an internal glitch-free clock signal is proposed; the clock gating logic signal can cause the internal clock signal to be quiescent when the input data to the flip-flop remains constant, thereby reducing power consumption.

Abstract: Inventive aspects include integrated clock gating logic that can generate an internal glitch-free clock signal. Inventive aspects further include a toggle latch that is coupled to the integrated clock gating logic. The toggle latch can receive the internal clock signal from the integrated clock gating logic. The toggle latch can toggle and latch a data value responsive to the internal clock signal. The integrated clock gating logic can include a latch to latch a clock gating logic signal responsive to a clock signal. The clock gating logic signal can cause the internal clock signal to be quiescent when the input data to the flip-flop remains constant, thereby reducing power consumption.

Proceedings Article•DOI•

A low-power pipelined MAC architecture using Baugh-Wooley based multiplier

Rakesh Warrier¹, Chan Hua Vun¹, Wei Zhang²•Institutions (2)

Nanyang Technological University¹, Hong Kong University of Science and Technology²

01 Oct 2014

TL;DR: A low power pipelined MAC architecture that incorporates a 16×16 multiplier using Baugh-Wooley algorithm with high performance multiplier tree, together with clock gating the idle pipeline stages to reduce the power consumption is proposed.

Abstract: Multiply-accumulator (MAC) is the central unit used in digital signal processors (DSP) that are now widely found in many consumer electronic devices. With current emphasis on minimizing operating power and yet maximizing computation performance for DSPs, efficient MAC architecture with low power consumption and high computation performance is hence desired. This paper proposes a low power pipelined MAC architecture that incorporates a 16×16 multiplier using Baugh-Wooley algorithm with high performance multiplier tree, together with clock gating the idle pipeline stages to reduce the power consumption. Our simulations show that the power consumption of the proposed architecture is 30% to 80% less than the other contemporary MAC architectures, without compromising its computation performance.

Clock Gating For Dynamic Power Reduction In Synchronous Circuits

Pooja Singh, Lakshay Sachdeva, Suresh Gyan, Archana Kumari

01 Jan 2014

TL;DR: A method to reduce power dissipation by automatically synthesizing gated-clocks is presented, which generates a derived clock synchronous with the master clock, and shows that the clock gating technique significantly improves total dynamic power consumption.

Abstract: A method to reduce power dissipation by automatically synthesizing gated-clocks is presented for low power VLSI (very large scale integration) circuit design. Clock power is a major source of dynamic power consumed in synchronous circuits because the clock is fed to most of the circuit blocks, and the clock switches every cycle. Thus the total clock power is a substantial component of total power dissipation in a digital circuit. Clock gating is a well-known technique to reduce clock power. In clock gating, the clock to an idle block is disabled. Thus a significant amount of power consumption is saved by employing clock gating. In this method a 4-bit synchronous counter is designed using clock gating. A technique for clock gating is also presented, which generates a derived clock synchronous with the master clock. Design examples using gated clocks are provided next. Simulation is performed on the Xilinx ISE design tool. Results show that the clock gating technique significantly improves total dynamic power consumption. It is observed that approximately 11% of dynamic power is saved. Index Terms: Clock gating, low power, synchronous counter.

Proceedings Article•DOI•

Asynchronous fine-grain power-gated logic

G Karthikeyan¹, S. Manickavasagam¹•Institutions (1)

Jerusalem College of Engineering, Chennai¹

01 Feb 2014

TL;DR: The proposed method reduces power consumption using a low-power logic family called asynchronous fine-grain power-gated logic (AFPL), and the power consumption of the proposed system is compared with that of a conventional system.

Abstract: With the increasing popularity of battery-driven portable electronics, there is a growing demand for low-power circuit designs. In a typical CMOS digital circuit, power dissipation can be categorized into dynamic power dissipation, leakage power dissipation, and short-circuit power dissipation. While dynamic power dissipation remains dominant in many digital circuits, leakage power dissipation has become increasingly significant, especially as fabrication processes enter the deep-submicron and nanometer-scale ranges. Asynchronous circuits are well known for their benefits in terms of dynamic power savings, because asynchronous logic does not switch when inactive. Nevertheless, in deep-submicron technologies, leakage currents have become an increasing issue, and thus asynchronous circuits need to focus on reducing power consumption. The proposed method is an innovative way to reduce power consumption using a low-power logic family called asynchronous fine-grain power-gated logic (AFPL). The power consumption of the proposed system is also compared with that of a conventional system.

Patent•

Image sensor synchronization without input clock and data transmission clock

Laurent Blanquart, Donald M. Wichern

15 Mar 2014

TL;DR: In this paper, systems and methods are described for reducing the area of an image sensor by reducing the imaging sensor pad count used for data transmission and clock generation.

Abstract: The disclosure extends to systems and methods for reducing the area of an image sensor by reducing the imaging sensor pad count used for data transmission and clock generation.

Journal Article•DOI•

A Fine-Grained Clock Buffer Polarity Assignment for High-Speed and Low-Power Digital Systems

Deokjin Joo¹, Taewhan Kim¹•Institutions (1)

Seoul National University¹

01 Mar 2014-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: A completely new fine-grained approach to the clock buffer polarity assignment combined with buffer sizing is proposed, formulating the problem into a multiobjective shortest path problem and solving it effectively for designs with a single power mode, while exploiting the flexibility of the multiobjective shortest path formulation for designs with multiple power modes.

Abstract: The clock buffer polarity assignment is one of the effective design schemes to mitigate the power/ground noise caused by the clock signal propagation in high-speed digital systems. This paper overcomes a set of fundamental limitations of the conventional clock buffer polarity assignment methods, which are: 1) the unawareness of the signal delay (i.e., arrival time) differences to the leaf clock buffering elements; 2) the ignorance of the effect of the current fluctuation of nonleaf clock buffering elements on the total peak current waveform; and 3) the inability to support low-power digital designs with multiple (dynamically operating) power modes. Clearly, not addressing 1) and 2) in the polarity assignment may cause a severe inaccuracy in the peak current estimation, which results in unnecessarily high peak current. Moreover, without tackling 3), designs may suffer from clock skew violation in some of the power modes, affecting circuit speed or reliability. To overcome the limitations, we propose a completely new fine-grained approach to the clock buffer polarity assignment combined with buffer sizing, formulating the problem into a multiobjective shortest path problem and solving it effectively for designs with a single power mode, while exploiting the flexibility of our multiobjective shortest path formulation for designs with multiple power modes. Through experiments using benchmark circuits, it is shown that the proposed approach is able to produce designs with 17% lower peak current and 20% lower power noise on average, compared with the results produced by the best known method.
