
Showing papers on "Clock gating" published in 2014


Journal ArticleDOI
11 Dec 2014
TL;DR: In this article, a 16×16 mesh, 112b data, 256 voltage/clock domain NoC with source-synchronous operation, hybrid packet/circuit-switched flow control, and ultra-low-voltage optimizations is fabricated in 22nm tri-gate CMOS.
Abstract: Energy-efficient networks-on-chip (NoCs) are key enablers for exa-scale computation by shifting power budget from communication toward computation. As core counts scale into the 100s, on-chip interconnect fabrics must support increasing heterogeneity and voltage/clock domains. Synchronous NoCs require either a single clock distributed globally or clock-crossing data FIFOs between clock domains [1]. A global clock requires costly full-chip margining and significant power and area for clock distribution, while synchronizing data FIFOs add power, performance, and area overhead per clock crossing. Source-synchronous NoCs mitigate these penalties by forwarding a local clock along with each packet, but still suffer from high data storage power due to packet switching. Circuit switching removes intra-route data storage, but suffers from low network utilization due to serialized channel setup and data transfer [2]. Hybrid packet/circuit switching parallelizes these operations for higher network utilization. A 16×16 mesh, 112b data, 256 voltage/clock domain NoC with source-synchronous operation, hybrid packet/circuit-switched flow control, and ultra-low-voltage optimizations is fabricated in 22nm tri-gate CMOS [3] to enable: i) 20.2Tb/s total throughput at 0.9V, 25°C, ii) a 2.7× increase in bisection bandwidth to 2.8Tb/s and 93% reduction in circuit-switched latency at 407ps/hop through source-synchronous operation, iii) a 62% latency improvement and 55% increase in energy efficiency to 7.0Tb/s/W through circuit switching, iv) a peak energy efficiency of 18.3Tb/s/W for near-threshold operation at 430mV, 25°C, and v) ultra-low-voltage operation down to 340mV with router power scaling to 363μW.

69 citations


Patent
11 Aug 2014
TL;DR: In this paper, a low power, high performance source-synchronous chip interface which provides rapid turn-on and facilitates high signaling rates between a transmitter and a receiver located on different chips is described.
Abstract: A low-power, high-performance source-synchronous chip interface which provides rapid turn-on and facilitates high signaling rates between a transmitter and a receiver located on different chips is described in various embodiments. Some embodiments of the chip interface include, among others: a segmented “fast turn-on” bias circuit to reduce power supply ringing during the rapid power-on process; current mode logic clock buffers in a clock path of the chip interface to further reduce the effect of power supply ringing; a multiplying injection-locked oscillator (MILO) clock generator to generate higher frequency clock signals from a reference clock; a digitally controlled delay line which can be inserted in the clock path to mitigate deterministic jitter caused by the MILO clock generator; and circuits for periodically re-evaluating whether it is safe to retime transmit data signals in the reference clock domain directly with the faster clock signals.

66 citations


Journal ArticleDOI
TL;DR: Data-driven clock gating is integrated into an Electronic Design Automation (EDA) commercial backend design flow, achieving total power reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40 and 65 nanometer process technologies.
Abstract: Clock gating is a predominant technique used for power saving. It is observed that the commonly used synthesis-based gating still leaves a large amount of redundant clock pulses. Data-driven gating aims to disable these. To reduce the hardware overhead involved, flip-flops (FFs) are grouped so that they share a common clock enabling signal. The question of what is the group size maximizing the power savings is answered in a previous paper. Here we answer the question of which FFs should be placed in a group to maximize the power reduction. We propose a practical solution based on the toggling activity correlations of FFs and their physical position proximity constraints in the layout. Our data-driven clock gating is integrated into an Electronic Design Automation (EDA) commercial backend design flow, achieving total power reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40 and 65 nanometer process technologies. These savings are achieved on top of the savings obtained by clock gating synthesis performed by commercial EDA tools, and gating manually inserted into the register transfer level design.

51 citations
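
As a rough illustration of why the choice of which flip-flops share an enable matters in the data-driven gating entry above, the sketch below counts the clock pulses a grouping avoids. The toggle traces, the FF names, and the `saved_pulses` helper are hypothetical. The rule that a group is clocked whenever any member toggles is a simplification, and the layout-proximity constraint mentioned in the abstract is not modeled.

```python
from typing import Dict, List, Sequence


def saved_pulses(traces: Dict[str, List[int]], groups: Sequence[Sequence[str]]) -> int:
    """Count FF clock pulses avoided when each group shares one enable.

    traces[ff][t] is 1 if flip-flop `ff` changes value in cycle t, else 0.
    A group is clocked in cycle t if any of its members toggles in that cycle.
    """
    total = 0
    for group in groups:
        cycles = len(traces[group[0]])
        # Cycles in which the shared enable is asserted for this group.
        enabled = sum(
            1 for t in range(cycles) if any(traces[ff][t] for ff in group)
        )
        # Every member skips one clock pulse in each gated cycle.
        total += len(group) * (cycles - enabled)
    return total


# Hypothetical activity: a/b toggle together, c/d toggle together.
traces = {
    "a": [1, 0, 0, 1],
    "b": [1, 0, 0, 1],
    "c": [0, 1, 1, 0],
    "d": [0, 1, 1, 0],
}

correlated = saved_pulses(traces, [["a", "b"], ["c", "d"]])
uncorrelated = saved_pulses(traces, [["a", "c"], ["b", "d"]])
print(correlated, uncorrelated)
```

With these made-up traces, pairing FFs that toggle in the same cycles avoids 8 pulses, while pairing FFs with opposite activity avoids none, because each shared enable is then asserted in every cycle.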


Journal ArticleDOI
TL;DR: The temperature-assisted clock self-calibration (TACSC) scheme can improve the synchronization accuracy by more than one order of magnitude, which is verified by both simulation and testbed experimentation.
Abstract: Synchronization is a prerequisite for many sensor network applications. However, it remains challenging in sensor networks due to both the limited resources and the dynamic environments. In this paper, we propose a new two-phase clock synchronization scheme. The first phase is external clock synchronization, during which nodes update their clocks by exchanging timestamp messages with the reference clock. Unlike conventional solutions, we propose to directly remove the clock skew during the external synchronization to achieve a higher synchronization accuracy and lower computational complexity. The second phase is clock self-calibration: because the accumulated clock skew makes the synchronized clock drift away again, the skew must be compensated to maintain the clock synchronization accuracy. However, the compensation is non-trivial, as the clock skew may not be constant due to the changing environment. Thus we propose the temperature-assisted clock self-calibration (TACSC) to dynamically compensate the clock skew according to the working temperature. Extensive simulation demonstrates that the proposed synchronization scheme can achieve a much lower root mean square error in the external synchronization phase. Furthermore, during the clock self-calibration phase, the TACSC scheme can improve the synchronization accuracy by more than one order of magnitude, which is verified by both simulation and testbed experimentation.

41 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: It is argued that databases should employ a fine-grained approach by dynamically scheduling tasks using precise hardware models, and it is experimentally shown that energy efficiency can be improved by up to 4x for fundamental memory-intensive database operations, such as scans.
Abstract: Power and cooling costs are some of the highest costs in data centers today, which make improvement in energy efficiency crucial. Energy efficiency is also a major design point for chips that power whole ranges of computing devices. One important goal in this area is energy proportionality, arguing that the system's power consumption should be proportional to its performance. Currently, a major trend among server processors, which stems from the design of chips for mobile devices, is the inclusion of advanced power management techniques, such as dynamic voltage-frequency scaling, clock gating, and turbo modes. A lot of recent work on energy efficiency of database management systems is focused on coarse-grained power management at the granularity of multiple machines and whole queries. These techniques, however, cannot efficiently adapt to the frequently fluctuating behavior of contemporary workloads. In this paper, we argue that databases should employ a fine-grained approach by dynamically scheduling tasks using precise hardware models. These models can be produced by calibrating operators under different combinations of scheduling policies, parallelism, and memory access strategies. The models can be employed at run-time for dynamic scheduling and power management in order to improve the overall energy efficiency. We experimentally show that energy efficiency can be improved by up to 4x for fundamental memory-intensive database operations, such as scans.

26 citations


Journal ArticleDOI
12 Dec 2014
TL;DR: The very-long instruction word Hexagon™ DSP is fabricated using a 28 nm high-κ metal-gate process technology optimized for mobile applications, and pursues high IPC as opposed to high frequency.
Abstract: This paper describes the implementation of a Qualcomm Hexagon digital signal processor (DSP) in a 28 nm high-κ metal gate technology. The DSP is a multi-threaded very-long-instruction-word (VLIW) machine optimized for low leakage and energy efficiency. It uses a clock distribution network, clock gating cells, and pulsed latches that are optimized for low switching energy. The processor can be powered using a low-dropout (LDO) voltage regulator or a head switch. It operates from 255 MHz at 0.60 V to 1.24 GHz at 1.05 V. When operating from the LDO, the power consumption of the core can be as low as 58 µW/MHz, which is two to three times lower than comparable cores optimized for ultra-low voltage operation.

24 citations


Journal ArticleDOI
TL;DR: A novel method called Look-Ahead Clock Gating (LACG) is presented, which combines the three known gating methods (synthesis-based, data-driven, and auto-gated FFs); its power model implies a breakeven curve that divides the FF space into two regions of positive and negative gating return on investment.
Abstract: Clock gating is very useful for reducing the power consumed by digital systems. Three gating methods are known. The most popular is synthesis-based, deriving clock enabling signals based on the logic of the underlying system. It unfortunately leaves the majority of the clock pulses driving the flip-flops (FFs) redundant. A data-driven method stops most of those and yields higher power savings, but its implementation is complex and application dependent. A third method called auto-gated FFs (AGFF) is simple but yields relatively small power savings. This paper presents a novel method called Look-Ahead Clock Gating (LACG), which combines all the three. LACG computes the clock enabling signals of each FF one cycle ahead of time, based on the present cycle data of those FFs on which it depends. It avoids the tight timing constraints of AGFF and data-driven by allotting a full clock cycle for the computation of the enabling signals and their propagation. A closed-form model characterizing the power saving per FF is presented. It is based on data-to-clock toggling probabilities, capacitance parameters and FFs' fan-in. The model implies a breakeven curve, dividing the FFs space into two regions of positive and negative gating return on investment. While the majority of the FFs fall in the positive region and hence should be gated, those falling in the negative region should not. Experimentation on industry-scale data showed 22.6% reduction of the clock power, translated to 12.5% power reduction of the entire system.

24 citations
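
The look-ahead idea in the abstract above can be sketched behaviorally: an FF needs a clock pulse in the next cycle only if at least one of the FFs it depends on changed in the current cycle. The small netlist, the `fanin` map, and the `lookahead_enables` helper are assumptions for illustration. Primary inputs and the paper's closed-form power model are not modeled.

```python
from typing import Dict, List


def lookahead_enables(
    toggled_now: Dict[str, bool], fanin: Dict[str, List[str]]
) -> Dict[str, bool]:
    """Return, for each FF, whether its clock should be enabled next cycle.

    toggled_now[ff] is True if `ff` changed value in the current cycle.
    fanin[ff] lists the FFs whose outputs feed the logic driving `ff`.
    """
    return {
        ff: any(toggled_now[src] for src in sources)
        for ff, sources in fanin.items()
    }


# Hypothetical netlist: f2 depends on f0 and f1, f3 depends on f2 only.
fanin = {"f2": ["f0", "f1"], "f3": ["f2"]}
toggled_now = {"f0": True, "f1": False, "f2": False, "f3": False}

enables = lookahead_enables(toggled_now, fanin)
print(enables)
```

In this example only `f2` is enabled for the next cycle, since one of its sources changed; `f3` stays gated because `f2` did not change in the current cycle.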


Patent
01 Oct 2014
TL;DR: A hardware element in a NoC includes a clock gating circuit that configures one or more neighboring hardware elements to activate before receiving new incoming data and to sleep after a defined number of cycles.
Abstract: An aspect of the present disclosure provides a hardware element in a Network on Chip (NoC), wherein the hardware element includes a clock gating circuit that configures one or more neighboring hardware elements to activate before receiving new incoming data and to sleep after a defined number of cycles, wherein the defined number of cycles can be counted from a cycle having non-receipt of incoming data and/or having a clearance of all data within an input queue of a source hardware element.

22 citations
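
The wake-before-data and sleep-after-idle behavior described in the patent abstract above can be modeled in a few lines. This is a behavioral sketch only: the `GatedElement` class, its `idle_limit` parameter, and the cycle-by-cycle interface are hypothetical, and the input-queue clearance condition from the abstract is not modeled.

```python
class GatedElement:
    """Behavioral model of a clock-gated NoC element with an idle timeout."""

    def __init__(self, idle_limit: int) -> None:
        self.idle_limit = idle_limit  # idle cycles tolerated before sleeping
        self.idle_cycles = 0
        self.clock_on = False

    def wake(self) -> None:
        """Called by a neighboring element ahead of sending data."""
        self.clock_on = True
        self.idle_cycles = 0

    def tick(self, has_data: bool) -> bool:
        """Advance one cycle and return whether the clock ran in this cycle."""
        ran = self.clock_on
        if not self.clock_on:
            return ran
        if has_data:
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_limit:
                self.clock_on = False  # gate the clock from the next cycle on
        return ran


router = GatedElement(idle_limit=2)
router.wake()  # the upstream neighbor activates the router before sending
trace = [router.tick(has_data) for has_data in [True, False, False, False]]
print(trace)
```

Here the router is clocked for the data cycle and the two idle cycles that follow, then sleeps until a neighbor calls `wake()` again.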


Patent
14 Mar 2014
TL;DR: In this article, a built-in self test (BIST) circuit and method is provided to test a first and a second DLL, where the BIST circuitry provides a first delay amount over the first delay input creating a start offset between the first and second clock output signals.
Abstract: A built-in self test (BIST) circuit and method is provided to test a first and a second DLL. The first DLL has a first delay input, a first clock input disposed to receive a clock input signal, and a first clock output that provides a first clock output signal delayed in comparison with the clock input signal. The second DLL has a second delay input, a second clock input disposed to receive the clock input signal, and a second clock output that provides a second clock output signal delayed in comparison with the clock input signal. The BIST circuitry provides a first delay amount over the first delay input creating a start offset between the first and second clock output signals. If the first DLL is functioning properly the start offset between the output signals should remain unchanged even after the BIST circuitry provides an additional common delay amount to the first and second delay inputs.

21 citations


Proceedings ArticleDOI
06 Mar 2014
TL;DR: A resonant-clock design for the IBM POWER8 processor core was implemented with 2 resonant modes (and a non-resonant mode), saving clock power over a wide frequency range from 2.5GHz to more than 5GHz.
Abstract: A resonant-clock design for the IBM POWER8 processor core was implemented with 2 resonant modes (and a non-resonant mode), saving clock power over a wide frequency range from 2.5GHz to more than 5GHz. The POWER8 microprocessor is composed of 12 chiplets, each containing a single resonant clock grid for one core and its L2 cache, and a half-frequency, non-resonant clock grid for the L3 cache. The clock grids drive the local clock buffers (LCBs) that in turn drive the latches. The LCBs are gated off to measure the global clock power from the PLL to the LCBs. The resonant core communicates synchronously with the L3, requiring low skew between the domains. The chip was designed in a 22nm SOI process, including two ultra-thick-metal (UTM) layers (3 microns thick) for power distribution, I/O, all long global clock wires, and the resonant clock inductors. The UTM technology reduces wire resistance and simplifies inductor design, but requires accurate transmission line modeling and special routing.

21 citations


Book ChapterDOI
01 Jan 2014
TL;DR: Among Field-Coupled technologies, NanoMagnet Logic (NML) is one of the most promising, but the necessity of using an external magnetic field to locally control the circuit represents the weakest point of this technology.
Abstract: Among Field-Coupled technologies, NanoMagnet Logic (NML) is one of the most promising. Low dynamic power consumption, total absence of static power, remarkable heat and radiation resistance, in association with the possibility of combining memory and logic in the same device, make this technology the ideal candidate for low power, portable applications. However, the necessity of using an external magnetic field to locally control the circuit represents, currently, the weakest point of this technology. The high power losses in the clock generation system adopted up to now wipe out the most important advantages of this technology.

Journal ArticleDOI
TL;DR: The proposed PMA is highly efficient in terms of operating frequency due to pipelining and significantly reduces total computing cycles compared to the existing multi-level architectures.
Abstract: In this paper, three hardware efficient architectures to perform multi-level 2-D discrete wavelet transform (DWT) using lifting (5, 3) and (9, 7) filters are presented. They are classified as folded multi-level architecture (FMA), pipelined multi-level architecture (PMA), and recursive multi-level architecture (RMA). Efficient FMA is proposed using dual-input Z-scan block (B1) with 100% hardware utilization efficiency (HUE). Modular PMA is proposed with the help of block (B1) and dual-input raster scan block (B2) with 60% to 75% HUE. Block B1 and B2 are micro-pipelined to achieve critical path as single adder and single multiplier for lifting (5, 3) and (9, 7) filters, respectively. The clock gating technique is used in PMA to save power and area. Hardware-efficient RMA is proposed with the help of block (B1) and single-input recursive block (B3). Block (B3) uses only single processing element to compute both predict and update; thus, 50% multipliers and adders are saved. Dual-input per clock cycle minimizes total frame computing cycles, latency, and on-chip line buffers. PMA for five-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for Virtex-5 XC5VLX110T field-programmable gate array (FPGA) target device (Xilinx, Inc., San Jose, CA, USA). The proposed PMA is highly efficient in terms of operating frequency due to pipelining. Moreover, this approach significantly reduces total computing cycles compared to the existing multi-level architectures. RMA for three-level 2-D wavelet decomposition is synthesized using Xilinx ISE 10.1 for Virtex-4 VFX100 FPGA target device.

Journal ArticleDOI
Liang Liu
TL;DR: This paper presents the VLSI design of an energy-efficient, high-throughput soft-input soft-output signal detector for an iterative multiple-input multiple-output (MIMO) receiver and adopts several new algorithm-level techniques to exploit the available a priori information of transmitted bits.
Abstract: This paper presents the VLSI design of an energy-efficient, high-throughput soft-input soft-output signal detector for an iterative multiple-input multiple-output (MIMO) receiver. The detector is evolved from our previously developed imbalanced fixed complexity sphere decoder and adopts several new algorithm-level techniques to exploit the available a priori information of transmitted bits. More specifically, an adaptive tree-travel control scheme, a reliability-dependent log-likelihood ratio correction method and an iteration-based hybrid node enumeration technique are proposed to provide near-optimal detection performance with much reduced computational complexity. A multi-stage parallel VLSI architecture is developed to implement the proposed algorithm with high detection throughput. Furthermore, block-level clock gating is deployed to save power when the tree-search space is reduced, while still preserving the constant-throughput feature. As a proof of concept, we designed the iterative detector using a 65-nm CMOS technology and conducted post-layout simulation. The core area is 0.64 mm² with 198.2 k gates. Working at 240-MHz clock frequency with 1.0-V voltage supply, the detector achieves a maximum 1.44-Gbps throughput. Under frequency-selective channels, the detector core consumes 98.5-, 127.9-, and 149.5-pJ energy per bit detection in open-loop, 2-iteration, and 4-iteration modes, respectively.

Proceedings ArticleDOI
10 Jul 2014
TL;DR: This paper presents a set of techniques for taking advantage of the streaming character of the algorithm by selectively switching off parts of the circuit that cannot execute, thus saving power.
Abstract: Streaming applications describe a broad class of computing algorithms in areas such as signal processing, media coding and compression, cryptography, video analytics, network routing and packet processing and many others. For many of these applications, programmable logic devices such as FPGAs are the implementation platform of choice due to their higher flexibility compared to ASICs and lower power consumption and higher performance compared to processors. This paper presents a set of techniques for taking advantage of the streaming character of the algorithm by selectively switching off parts of the circuit that cannot execute, thus saving power. The implementation is integrated into an existing high-level synthesis flow, and applied to a variety of applications, resulting in up to 20% power reduction with a very small additional logic footprint and no loss in throughput. © 2014 European Electronic Chips & Systems design ECSI.

Journal ArticleDOI
TL;DR: This is the first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and flip-flops, based on minimum-cost maximum-flow formulation to globally determine the tree topology.
Abstract: Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine the tree topology, which maintains load balance and considers the wirelength between pulse generators and pulsed latches. Experimental results indicate that the proposed migration approach can improve the power consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.

Journal ArticleDOI
TL;DR: This paper presents a technique, called subclock power gating, for reducing leakage power during the active mode in low performance, energy-constrained applications, and validates it by incorporating it into an ARM Cortex-M0 microprocessor fabricated in a 65-nm process.
Abstract: This paper presents a technique, called subclock power gating, for reducing leakage power during the active mode in low performance, energy-constrained applications. The proposed technique achieves power reduction through two mechanisms: 1) power gating the combinational logic within the clock period (subclock) and 2) reducing the virtual supply to less than Vth rather than shutting down completely as is the case in conventional power gating. To achieve this reduced voltage, a pair of nMOS and pMOS transistors are used at the head and foot of the power gated logic for symmetric virtual rail clamping of the power and ground supplies. The subclock power gating technique has been validated by incorporating it with an ARM Cortex-M0 microprocessor, which was fabricated in a 65-nm process. Two sets of experiments are done: the first experimentally validates the functionality of the proposed technique in the fabricated test chip and the second investigates the utility of the proposed technique in example applications. Measured results from the fabricated chip show 27% power saving during the active mode for an example wireless sensor node application when compared with the same microprocessor without subclock power gating.

Proceedings ArticleDOI
20 May 2014
TL;DR: This paper presents a new CTS methodology that optimizes clock logic cell placements and buffer insertions in the top level of a clock tree as a linear program that minimizes a weighted sum of timing slacks, clock uncertainty and wirelength.
Abstract: The clock trees of high-performance synchronous circuits have many clock logic cells (e.g., clock gating cells, multiplexers and dividers) in order to achieve aggressive clock gating and required performance across a wide range of operating modes and conditions. As a result, clock tree structures have become very complex and difficult to optimize with automatic clock tree synthesis (CTS) tools. In advanced process nodes, CTS becomes even more challenging due to on-chip variation (OCV) effects. In this paper, we present a new CTS methodology that optimizes clock logic cell placements and buffer insertions in the top level of a clock tree. We formulate the top-level clock tree optimization problem as a linear program that minimizes a weighted sum of timing slacks, clock uncertainty and wirelength. Experimental results in a commercial 28nm FDSOI technology show that our method can improve post-CTS worst negative slack across all modes/corners by up to 320ps compared to a leading commercial provider's CTS flow.

Journal ArticleDOI
26 Sep 2014 - Sensors
TL;DR: A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files and makes the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.
Abstract: Radio Frequency Identification (RFID) is an important technique for wireless sensor networks and the Internet of Things. Recently, considerable research has been performed in the combination of public key cryptography and RFID. In this paper, an efficient architecture of Elliptic Curve Cryptography (ECC) Processor for RFID tag chip is presented. We adopt a new inversion algorithm which requires fewer registers to store variables than the traditional schemes. A new method for coordinate swapping is proposed, which can reduce the complexity of the controller and shorten the time of iterative calculation effectively. A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files. Clock gating and an asynchronous counter are exploited to reduce the power consumption. The simulation and synthesis results show that the time needed for one elliptic curve scalar point multiplication over GF(2^163) is 176.7 K clock cycles and the gate area is 13.8 K with UMC 0.13 μm Complementary Metal Oxide Semiconductor (CMOS) technology. Moreover, the low power consumption and low cost make the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.

Patent
13 Mar 2014
TL;DR: In this paper, a digital-to-analog converter (DAC) including a correction circuit for a clock, including a differential clock, is described, where replica cells that are substantially similar to conversion cells are configured to provide a feedback signal to a clock receiver with information for correcting the clock signal.
Abstract: In an example, there is disclosed herein a digital-to-analog converter (DAC) including a correction circuit for a clock, including a differential clock. Error correction may take place within the DAC core, by means of replica cells that are substantially similar to conversion cells. Rather than contributing their output to the converted signal, the replica cells may be configured to provide a feedback signal to a clock receiver with information for correcting the clock signal. The feedback signal may be operable to correct errors, for example, in duty cycle and crosspoint, as measured at the DAC core.

Patent
18 Sep 2014
TL;DR: A data input circuit includes a clock sampling unit, a final clock generation unit, and a write latch signal generation unit; the final clock generation unit generates a level signal by latching the shifting signal in synchronization with the sampling clock.
Abstract: A data input circuit includes a clock sampling unit, a final clock generation unit, and a write latch signal generation unit. The sampling unit is configured to generate a shifting signal including a pulse generated after a write latency is elapsed, and generate a sampling clock by sampling an internal clock during a burst period from substantially a time when the pulse of the shifting signal is generated. The final clock generation unit is configured to generate a level signal by latching the shifting signal in synchronization with the sampling clock and generate a final clock from the level signal in response to a burst signal. The write latch signal generation unit is configured to generate an enable signal by latching the final clock and generate a write latch signal for latching and outputting aligned data in response to the enable signal.

23 Jan 2014
TL;DR: An instruction-level power model based on an ARM1176JZF-S processor is presented to predict the power of software applications; energy per operation (EPO) is shown to decrease with increasing operations per clock cycle, and the relationship is confirmed empirically.
Abstract: The power and energy consumed by a chip has become the primary design constraint for embedded systems, which has led to a lot of work in hardware design techniques such as clock gating and power gating. The software can also affect the power usage of a chip, hence good software design can be used to reduce the power further. In this paper we present an instruction-level power model based on an ARM1176JZF-S processor to predict the power of software applications. Our model takes substantially less input data than existing high accuracy models and does not need to consider each instruction individually. We show that the power is related to both the distribution of instruction types and the operations per clock cycle (OPC) of the program. Our model does not need to consider the effect of two adjacent instructions, which saves a lot of calculation and measurements. Pipeline stall effects are also considered by OPC instead of cache miss, because there are a lot of other reasons that can cause the pipeline to stall. The model shows good performance with a maximum estimation error of -8.28% and an average absolute estimation error of 4.88% over six benchmarks. Finally, we prove that energy per operation (EPO) decreases with increasing operations per clock cycle, and we confirm the relationship empirically.
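
The EPO and OPC relationship in the abstract above follows from the definition of energy per operation as power divided by operations per second. The sketch below applies that definition; the `energy_per_operation` helper, the clock frequency, and the two power readings are made-up values for illustration, not measurements from the paper. The decrease it shows relies on the assumption that power rises less than proportionally with OPC.

```python
def energy_per_operation(power_w: float, freq_hz: float, opc: float) -> float:
    """Energy per operation in joules: power / (operations per second)."""
    return power_w / (freq_hz * opc)


# Hypothetical readings at a hypothetical 500 MHz clock.
FREQ_HZ = 500e6
epo_low_opc = energy_per_operation(power_w=0.30, freq_hz=FREQ_HZ, opc=0.5)
epo_high_opc = energy_per_operation(power_w=0.36, freq_hz=FREQ_HZ, opc=1.0)

print(f"EPO at OPC 0.5: {epo_low_opc * 1e9:.2f} nJ")
print(f"EPO at OPC 1.0: {epo_high_opc * 1e9:.2f} nJ")
```

In this made-up case, doubling OPC raises power by 20% but lowers the energy spent per operation from 1.20 nJ to 0.72 nJ.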

Patent
17 Jan 2014
TL;DR: A delay-locked loop (DLL) is provided that includes a ring oscillator (RO) whose delay line delays a reference clock signal to generate a delayed clock signal, together with a frequency divider that generates the output clock signal.
Abstract: Provided is a delay locked loop (DLL) including a ring oscillator (RO) including a delay line to delay a reference clock signal and generate a delayed clock signal, wherein the RO circulates, through the delay line, a feedback clock signal corresponding to the delayed clock signal to synchronize N cycles of the feedback clock signal with a cycle of the reference clock signal (where N is an integer number equal to or larger than 2); and a first frequency divider dividing the frequency of the delayed clock signal by 1/N (where N is an integer number equal to or larger than 2) to generate an output clock signal.

Patent
07 Nov 2014
TL;DR: In this article, a processor includes a core to execute instructions, where the core includes a clock generation circuit to receive and distribute a first clock signal at a first operating frequency provided from a phase lock loop of the processor to a plurality of units of the core.
Abstract: In an embodiment, a processor includes a core to execute instructions, where the core includes a clock generation circuit to receive and distribute a first clock signal at a first operating frequency provided from a phase lock loop of the processor to a plurality of units of the core. The clock generation circuit may include a dynamic clock logic to receive a dynamic clock frequency command and to cause the clock generation circuit to distribute the first clock signal to at least one of the units at a second operating frequency. Other embodiments are described and claimed.

Journal ArticleDOI
TL;DR: Intermittent resonant clocking (IRC) is proposed for near/sub-threshold logic circuits; measurements on a 0.37 V adder array with latches show that IRC reduces the clock power by 36% at 980 kHz and the clock leakage power by 81% compared with conventional non-resonant clocking, and IRC can reduce the clock power at any clock frequency.
Abstract: In order to eliminate the limitation of a narrow frequency range of conventional resonant clocking, intermittent resonant clocking (IRC) is proposed for near/sub-threshold logic circuits. In this paper, IRC is applied to a 0.37 V 32-bit adder array with latches and an adder array with flip-flops, both fabricated in a 40 nm CMOS process. Measurement results show that IRC reduces the clock power by 36% at 980 kHz and the clock leakage power by 81% compared with conventional non-resonant clocking when IRC is applied to the adder array with latches. The same power reduction is achieved when IRC is applied to the adder array with flip-flops. IRC can reduce the clock power at any clock frequency, which enables flexible selection of the clock frequency.

Patent
01 May 2014
TL;DR: Integrated clock gating logic generates an internal glitch-free clock signal for a toggle latch; the clock gating logic signal keeps the internal clock quiescent while the input data to the flip-flop remains constant, thereby reducing power consumption.
Abstract: Inventive aspects include integrated clock gating logic that can generate an internal glitch-free clock signal. Inventive aspects further include a toggle latch that is coupled to the integrated clock gating logic. The toggle latch can receive the internal clock signal from the integrated clock gating logic. The toggle latch can toggle and latch a data value responsive to the internal clock signal. The integrated clock gating logic can include a latch to latch a clock gating logic signal responsive to a clock signal. The clock gating logic signal can cause the internal clock signal to be quiescent when the input data to the flip-flop remains constant, thereby reducing power consumption.
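To make the "glitch-free" property concrete, here is a minimal behavioral sketch of the textbook latch-plus-AND clock gating cell. It is an assumption for illustration, not the circuit claimed in the patent: the enable is captured by a latch that is transparent while the clock is low, so the gated clock cannot change in the middle of a high phase. The function names and the sampled waveforms are hypothetical.

```python
def gate_clock_naive(clk, en):
    """AND the clock directly with the enable (glitch-prone)."""
    return [c & e for c, e in zip(clk, en)]


def gate_clock_latched(clk, en):
    """AND the clock with an enable latched while the clock is low."""
    gated = []
    latched_en = 0
    for c, e in zip(clk, en):
        if c == 0:
            # Latch is transparent during the low phase only.
            latched_en = e
        gated.append(c & latched_en)
    return gated


def pulse_widths(signal):
    """Lengths (in samples) of each run of consecutive 1s."""
    widths, run = [], 0
    for s in signal:
        if s:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


# Two samples per clock phase; the enable changes in the middle of high phases.
clk = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
en = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

naive_widths = pulse_widths(gate_clock_naive(clk, en))
latched_widths = pulse_widths(gate_clock_latched(clk, en))
```

With these waveforms the naive gate emits two truncated one-sample pulses, while the latched version emits a single full-width pulse.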

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A low-power pipelined MAC architecture is proposed that incorporates a 16×16 multiplier using the Baugh-Wooley algorithm with a high-performance multiplier tree, together with clock gating of the idle pipeline stages to reduce power consumption.
Abstract: The multiply-accumulator (MAC) is the central unit in digital signal processors (DSPs), which are now widely found in consumer electronic devices. With the current emphasis on minimizing operating power while maximizing computation performance for DSPs, an efficient MAC architecture with low power consumption and high computation performance is desired. This paper proposes a low-power pipelined MAC architecture that incorporates a 16×16 multiplier using the Baugh-Wooley algorithm with a high-performance multiplier tree, together with clock gating of the idle pipeline stages to reduce power consumption. Our simulations show that the power consumption of the proposed architecture is 30% to 80% less than that of other contemporary MAC architectures, without compromising computation performance.
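For readers unfamiliar with the Baugh-Wooley scheme mentioned above, the following is a bit-level sketch of the commonly described modified Baugh-Wooley formulation for signed two's-complement multiplication. It models only the arithmetic, not the paper's multiplier tree, pipelining, or clock gating; the function name and default width are assumptions.

```python
def baugh_wooley_multiply(a, b, n=16):
    """Signed n x n multiply via modified Baugh-Wooley partial products.

    Partial products that involve exactly one sign bit are complemented,
    and the constants 2**n and 2**(2n-1) are added, so every term in the
    array is non-negative. The sum is taken modulo 2**(2n).
    """
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]

    total = 0
    # Ordinary partial products of the magnitude bits.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (a_bits[i] & b_bits[j]) << (i + j)
    # Product of the two sign bits carries positive weight.
    total += (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)
    # Complemented partial products involving exactly one sign bit.
    for i in range(n - 1):
        total += (1 - (a_bits[i] & b_bits[n - 1])) << (i + n - 1)
        total += (1 - (a_bits[n - 1] & b_bits[i])) << (i + n - 1)
    # Correction constants.
    total += (1 << n) + (1 << (2 * n - 1))

    total &= (1 << (2 * n)) - 1
    # Reinterpret the 2n-bit result as a signed integer.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

For example, `baugh_wooley_multiply(-3, 5, n=4)` agrees with ordinary signed multiplication; the appeal in hardware is that the array contains no subtractions.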

01 Jan 2014
TL;DR: A method to reduce power dissipation by automatically synthesizing gated clocks is presented, which generates a derived clock synchronous with the master clock, and results show that the clock gating technique significantly improves total dynamic power consumption.
Abstract: A method to reduce power dissipation by automatically synthesizing gated clocks is presented for low-power VLSI (very large scale integration) circuit design. Clock power is a major source of dynamic power consumed in synchronous circuits because the clock is fed to most of the circuit blocks and switches every cycle. Thus the total clock power is a substantial component of total power dissipation in a digital circuit. Clock gating is a well-known technique to reduce clock power: the clock to an idle block is disabled, which significantly reduces power consumption. In this method, a 4-bit synchronous counter is designed using clock gating. A technique for clock gating is also presented, which generates a derived clock synchronous with the master clock. Design examples using gated clocks are provided next. Simulation is performed on the Xilinx ISE design tool. Results show that the clock gating technique significantly improves total dynamic power consumption; approximately 11% of dynamic power is saved. Index Terms: Clock gating, low power, synchronous counter.
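The idea of disabling the clock to an idle block can be sketched behaviorally. The model below is an illustration under stated assumptions, not the paper's design: it counts how many clock edges reach the flip-flops of a 4-bit counter with and without gating, using the edge count as a rough proxy for clock switching activity. The function name and the enable trace are hypothetical.

```python
def run_counter(enable_trace, gated):
    """Simulate a 4-bit counter; return (final count, clock edges delivered).

    enable_trace holds one enable value per master-clock cycle. With
    gated=True the flip-flops only receive a clock edge when enabled;
    otherwise they are clocked every cycle and simply hold their value.
    """
    count = 0
    clock_edges = 0
    for enable in enable_trace:
        if gated and not enable:
            continue  # Gated clock stays quiet for this cycle.
        clock_edges += 1
        if enable:
            count = (count + 1) & 0xF  # Wrap at 4 bits.
    return count, clock_edges


# Hypothetical activity pattern: the counter is idle in 5 of 8 cycles.
enable_trace = [1, 0, 0, 1, 1, 0, 0, 0]
ungated_result = run_counter(enable_trace, gated=False)
gated_result = run_counter(enable_trace, gated=True)
```

Both variants reach the same count, but the gated one delivers fewer clock edges to the flip-flops. Actual power savings depend on the technology and on the overhead of the gating logic, which this sketch does not model.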

Proceedings ArticleDOI
01 Feb 2014
TL;DR: A low-power logic family called asynchronous fine-grain power-gated logic (AFPL) is proposed to reduce power consumption, and the power consumption of the proposed system is compared with that of a conventional system.
Abstract: With the increasing popularity of battery-driven portable electronics, there is a growing demand for low-power circuit designs. In a typical CMOS digital circuit, power dissipation can be categorized into dynamic power dissipation, leakage power dissipation, and short-circuit power dissipation. While dynamic power dissipation remains dominant in many digital circuits, leakage power dissipation has become increasingly significant, especially as fabrication processes enter the deep-submicron and nanometer ranges. Asynchronous circuits are well known for their benefits in terms of dynamic power savings, because asynchronous logic does not switch when inactive. Nevertheless, in deep-submicron technologies, leakage currents have become an increasing issue, and thus asynchronous circuits need to focus on reducing power consumption. The proposed method is an innovative way to reduce power consumption using a low-power logic family called asynchronous fine-grain power-gated logic (AFPL). The power consumption of the proposed system is also compared with that of a conventional system.

Patent
15 Mar 2014
TL;DR: Systems and methods are disclosed for reducing the area of an image sensor by reducing the imaging sensor pad count used for data transmission and clock generation.
Abstract: The disclosure extends to systems and methods for reducing the area of an image sensor by reducing the imaging sensor pad count used for data transmission and clock generation.

Journal ArticleDOI
TL;DR: A completely new fine-grained approach to clock buffer polarity assignment combined with buffer sizing is proposed, formulating the problem as a multiobjective shortest path problem and solving it effectively for designs with a single power mode, while exploiting the flexibility of the multiobjective shortest path formulation for designs with multiple power modes.
Abstract: Clock buffer polarity assignment is one of the effective design schemes to mitigate the power/ground noise caused by clock signal propagation in high-speed digital systems. This paper overcomes a set of fundamental limitations of the conventional clock buffer polarity assignment methods, which are: 1) the unawareness of the signal delay (i.e., arrival time) differences to the leaf clock buffering elements; 2) the neglect of the effect of the current fluctuation of nonleaf clock buffering elements on the total peak current waveform; and 3) the inability to support low-power digital designs with multiple (dynamically operating) power modes. Clearly, not addressing limitations 1) and 2) in the polarity assignment may cause severe inaccuracy in the peak current estimation, which results in unnecessarily high peak current. Moreover, without tackling limitation 3), designs may suffer from clock skew violations in some of the power modes, affecting circuit speed or reliability. To overcome the limitations, we propose a completely new fine-grained approach to clock buffer polarity assignment combined with buffer sizing, formulating the problem as a multiobjective shortest path problem and solving it effectively for designs with a single power mode, while exploiting the flexibility of our multiobjective shortest path formulation for designs with multiple power modes. Through experiments using benchmark circuits, it is shown that the proposed approach is able to produce designs with 17% lower peak current and 20% lower power noise on average, compared with the results produced by the best known method.