Showing papers on "Clock gating" published in 2013


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and accurately tracks the power consumption trend over time.
Abstract: General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.

558 citations
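
The last sentence of the abstract above refers to clock gating SIMD lanes that carry no active thread during branch divergence. The following is a minimal sketch of how such idle lane slots could be counted from per-instruction active masks. The 32-lane warp width, the function name `gated_lane_fraction`, and the example masks are assumptions for illustration only; the paper's 11.2% figure comes from its validated power model, not from a lane count like this one.

```python
WARP_SIZE = 32  # assumed SIMD width; not stated in the abstract


def gated_lane_fraction(active_masks, warp_size=WARP_SIZE):
    """Return the fraction of lane slots with no active thread.

    Each mask has one bit per lane; a 0 bit marks a lane that could be
    clock gated while that instruction executes.
    """
    total_slots = warp_size * len(active_masks)
    if total_slots == 0:
        return 0.0
    lane_bits = (1 << warp_size) - 1
    active = sum(bin(mask & lane_bits).count("1") for mask in active_masks)
    return (total_slots - active) / total_slots


# Hypothetical trace: full warp, a divergent branch (16 then 8 lanes), reconvergence.
masks = [0xFFFFFFFF, 0x0000FFFF, 0x000000FF, 0xFFFFFFFF]
fraction = gated_lane_fraction(masks)
print(f"idle lane slots: {fraction:.4f}")
```

The fraction is only an upper bound on the gating opportunity; translating it into dynamic power requires per-component energy numbers such as those the power model provides.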


Journal ArticleDOI
Keith Bowman, Carlos Tokunaga, Tanay Karnik, Vivek De, J. Tschanz
TL;DR: An all-digital dynamically adaptive clock distribution mitigates the impact of high-frequency supply voltage (VCC) droops on microprocessor performance and energy efficiency by integrating a tunable-length delay prior to the global clock distribution to prolong the clock-data delay compensation in critical paths during a VCC droop.
Abstract: An all-digital dynamically adaptive clock distribution mitigates the impact of high-frequency supply voltage (VCC) droops on microprocessor performance and energy efficiency. The design integrates a tunable-length delay prior to the global clock distribution to prolong the clock-data delay compensation in critical paths during a VCC droop. The tunable-length delay prevents critical-path timing-margin degradation for multiple cycles after the VCC droop occurs, thus allowing a sufficient response time for dynamic adaptation. An on-die dynamic variation monitor detects the onset of the VCC droop to proactively gate the clock at the end of the tunable-length delay to eliminate the clock edges that would otherwise degrade critical-path timing margin. In comparison to a conventional clock distribution, silicon measurements from a 22 nm test chip demonstrate simultaneous throughput gains and energy reductions of 14% and 3% at 1.0 V, 18% and 5% at 0.8 V, and 31% and 15% at 0.6 V, respectively, for a 10% VCC droop.

68 citations


Proceedings ArticleDOI
07 Dec 2013
TL;DR: This work proposes a Gating Aware Two-level warp scheduler (GATES) that issues clusters of instructions of the same type before switching to another instruction type, and proposes a new power gating scheme, called Blackout, that forces a power gated execution unit to sleep for at least the break-even time necessary to overcome the power gating overhead before returning to the active state.
Abstract: With the widespread adoption of GPGPUs in varied application domains, new opportunities open up to improve GPGPU energy efficiency. Due to inherent application-level inefficiencies, GPGPU execution units experience significant idle time. In this work we propose to power gate idle execution units to eliminate leakage power, which is becoming a significant concern with technology scaling. We show that GPGPU execution units are idle for short windows of time and conventional microprocessor power gating techniques cannot fully exploit these idle windows efficiently due to power gating overhead. Current warp schedulers greedily intersperse integer and floating point instructions, which limit power gating opportunities for any given execution unit type. In order to improve power gating opportunities in GPGPU execution units, we propose a Gating Aware Two-level warp scheduler (GATES) that issues clusters of instructions of the same type before switching to another instruction type. We also propose a new power gating scheme, called Blackout, that forces a power gated execution unit to sleep for at least the break-even time necessary to overcome the power gating overhead before returning to the active state. The combination of GATES and Blackout, which we call Warped Gates, can save 31.6% and 46.5% of integer and floating point unit static energy. The proposed solutions suffer less than 1% performance and area overhead.

60 citations
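
The break-even rule described in the abstract above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the authors' implementation: energy is counted in units of one cycle of leakage, each gating event costs `break_even` such units, wake-up latency is ignored, and the parameter values (`idle_detect=5`, `break_even=14`) and idle-window lengths are hypothetical.

```python
def net_leakage_savings(idle_windows, idle_detect, break_even, blackout):
    """Return (net_savings, stall_cycles) for a list of idle-window lengths.

    idle_detect: idle cycles observed before the unit is gated.
    break_even:  gated cycles needed to repay the gating overhead.
    blackout:    if True, a gated unit stays asleep for at least
                 break_even cycles, stalling work that arrives early.
    """
    net = 0
    stalls = 0
    for window in idle_windows:
        if window <= idle_detect:
            continue  # window too short to trigger gating at all
        gated = window - idle_detect
        if blackout and gated < break_even:
            stalls += break_even - gated  # early arrivals must wait
            gated = break_even
        net += gated - break_even  # negative if woken before break-even
    return net, stalls


windows = [3, 8, 12, 30]  # hypothetical idle-window lengths in cycles
conventional = net_leakage_savings(windows, idle_detect=5, break_even=14, blackout=False)
with_blackout = net_leakage_savings(windows, idle_detect=5, break_even=14, blackout=True)
print(conventional, with_blackout)
```

In this toy trace, waking early makes conventional gating a net loss, while the forced sleep never loses energy but introduces stall cycles. Reducing those stalls is the role the abstract assigns to the GATES scheduler, which lengthens idle windows by clustering instructions of the same type.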


Patent
Melvin P. Roberts
25 Jan 2013
TL;DR: In this paper, an implantable medical device includes a local clock generator for generating a system clock signal, which is periodically calibrated to maintain the accuracy of the generated system clock signals.
Abstract: An implantable medical device includes a local clock generator for generating a system clock signal. The local clock generator is periodically calibrated to maintain accuracy of the generated system clock signal. A clocking circuit is coupled to the local clock generator to provide the calibration factor for calibrating the local clock generator. The implantable medical device receives an accurate clock signal that is transmitted from an external device and the accurate clock signal is provided to the clocking circuit. The system clock signal is also provided to the clocking circuit and a computation is performed to derive the calibration factor.

51 citations


Patent
13 Feb 2013
TL;DR: In this paper, a cascaded phase-locked loop (PLL) clock generation technique was proposed to reduce frequency drift of a low-jitter clock signal in a holdover mode.
Abstract: A cascaded phase-locked loop (PLL) clock generation technique reduces frequency drift of a low-jitter clock signal in a holdover mode. An apparatus includes a first PLL circuit configured to generate a control signal based on a first clock signal and a first divider value. The apparatus includes a second PLL circuit configured to generate the first clock signal based on a low-jitter clock signal and a second divider value. The apparatus includes a third PLL circuit configured to generate the second divider value based on the first clock signal, a third divider value, and a second clock signal. The low-jitter clock signal may have a greater temperature dependence than the second clock signal and the second clock signal may have a higher jitter than the low-jitter clock signal.

43 citations


Proceedings ArticleDOI
10 Apr 2013
TL;DR: In this paper, latch-free clock gating techniques are applied to an ALU to reduce its clock power and dynamic power consumption, and there is a 14.57% reduction in junction temperature at a 10 GHz operating frequency compared with the design without clock gating.
Abstract: In this paper, latch-free clock gating techniques are applied to an ALU to reduce its clock power and dynamic power consumption. Clock power is 50%, 41.46%, 51.30%, 55.15% and 55.78% of total dynamic power when the device operating frequency is 100 MHz, 1 GHz, 10 GHz, 100 GHz and 1 THz, respectively. After implementing clock gating in the ALU, clock power falls to 17.85%, 23.39%, 26.49% and 27.19% of total dynamic power when the device operating frequency is 1 GHz, 10 GHz, 100 GHz and 1 THz. At a 1 THz operating frequency, clock gating gives a 72.77% reduction in clock power, a 38.88% reduction in I/O power and a 44% reduction in dynamic power compared with the power consumption without clock gating. The target device is a 90-nm Spartan-3. There is a 14.57% reduction in junction temperature at a 10 GHz operating frequency compared with the temperature without clock gating. Clock gating saves power but increases overall area. There is a 32.35%, 37.84%, 43.31% and 44% reduction in dynamic current when clock gating is used at 1 GHz, 10 GHz, 100 GHz and 1 THz operating frequency, respectively.

41 citations


Proceedings ArticleDOI
11 Apr 2013
TL;DR: This work designs and implements clock-gated sequential circuits on a Virtex-6 FPGA to confirm the power reduction, and shows a reduction in dynamic power, especially a significant reduction in clock power.
Abstract: In this work, we study and analyze various clock gating techniques, and design and analyze clock-gating-based low-power sequential circuits at the RTL level. Virtex-6 is a 40-nm FPGA, on which we implement our circuits to confirm the power reduction in sequential circuits. Clock gating is implemented on a smaller circuit, a D flip-flop, and on a larger circuit, a 16-bit register. The percentage reduction in dynamic power, especially clock power, is verified for different device operating frequencies. We achieved 87.09%, 88.02%, 88.02%, and 88.01% clock power reduction when the clock period is 1ns, 0.1ns, 0.01ns and 0.001ns respectively. The design and implementation results show a reduction in dynamic power, especially a significant reduction in clock power. We also achieved 15%, 14.22%, 14.58%, 14.57% and 14.57% dynamic power reduction when the clock period is 10ns, 1ns, 0.1ns, 0.01ns, and 01ps respectively.

36 citations


Patent
11 Mar 2013
TL;DR: In this article, the clock distribution network (CDN) is separated from the rest of the logic to improve the clock tree design and reduce the area footprint, and the CDN is connected to the logic tier(s) via high-density inter-tier vias.
Abstract: Exemplary embodiments of the invention are directed to systems and method for designing a clock distribution network for an integrated circuit. The embodiments identify critical sources of clock skew, tightly control the timing of the clock and build that timing into the overall clock distribution network and integrated circuit design. The disclosed embodiments separate the clock distribution network (CDN), i.e., clock generation circuitry, wiring, buffering and registers, from the rest of the logic to improve the clock tree design and reduce the area footprint. In one embodiment, the CDN is separated to a separate tier of a 3D integrated circuit, and the CDN is connected to the logic tier(s) via high-density inter-tier vias. The embodiments are particularly advantageous for implementation with monolithic 3D integrated circuits.

35 citations


Patent
11 Sep 2013
TL;DR: In this paper, the authors proposed a clock recovery mechanism including a phase-locked loop (PLL) with a PDV compensation feature built-in, which can enable a slave clock to recover the master clock to a higher quality as if the communication path between master and slave is free of PDV.
Abstract: This invention relates to methods and devices for frequency distribution based on, for example, the IEEE 1588 Precision Time Protocol (PTP). Packet delay variation (PDV) is a direct contributor to the noise in the recovered clock and various techniques have been proposed to mitigate its effects. Embodiments of the invention provide a mechanism to directly measure and remove PDV effects in the clock recovery mechanism at a slave clock. One particular embodiment provides a clock recovery mechanism including a phase-locked loop (PLL) with a PDV compensation feature built-in. An aim of the invention is to enable a slave clock to recover the master clock to a higher quality as if the communication path between master and slave is free of PDV. This technique may allow a packet network to provide clock synchronization services to the same level as time division multiplexing (TDM) networks and Global Positioning System (GPS).

34 citations


Journal ArticleDOI
TL;DR: This paper deals with the design and implementation of a Clock Gating Aware Low Power Arithmetic and Logic Unit that has been developed as part of low power processor design in the platform Xilinx ISE 14.2 and synthesized on 90nm Spartan-3 FPGA.
Abstract: This paper deals with the design and implementation of a Clock Gating Aware Low Power Arithmetic and Logic Unit that has been developed as part of low power processor design in the platform Xilinx ISE 14.2 and synthesized on 90nm Spartan-3 FPGA. Clock power contributes 45-60 percent of total dynamic power. Hence, clock power reduction is necessary in low power design. In this paper, we analyze theoretical 93.75% clock power reduction in ALU using clock gating techniques. On simulator, we achieved 88.23% clock power reduction using latch based clock gating and 70.58% clock power reduction using latch free clock gating. Index Terms—Clock gate, ALU, FPGA, LUT, clock power, register transfer level, dynamic power, leakage power

34 citations
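
The abstract above compares latch-based and latch-free clock gating. As a rough behavioral illustration (not the paper's RTL), the sketch below models both styles with AND-type gating on sampled waveforms: the latch-free gate ANDs the clock with the raw enable, while the latch-based gate first captures the enable in a latch that is transparent only while the clock is low. The waveform, sample resolution, and all function names are assumptions for illustration.

```python
def make_clock(num_samples, half_period=2):
    """Square wave sampled at discrete steps: low first, then high."""
    return [(t // half_period) % 2 for t in range(num_samples)]


def latch_free_gate(clk, en):
    """Gated clock = clk AND en, with no protection against enable changes."""
    return [c & e for c, e in zip(clk, en)]


def latch_based_gate(clk, en):
    """Gated clock = clk AND latched enable (latch transparent while clk is low)."""
    gated, held = [], 0
    for c, e in zip(clk, en):
        if c == 0:
            held = e  # latch follows the enable only during the low phase
        gated.append(c & held)
    return gated


def pulse_widths(signal):
    """Lengths of consecutive runs of 1s, in samples."""
    widths, run = [], 0
    for s in signal:
        if s:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


clk = make_clock(16)
en = [1] * 7 + [0] * 4 + [1] * 5  # enable toggles while the clock is high
free_widths = pulse_widths(latch_free_gate(clk, en))
latched_widths = pulse_widths(latch_based_gate(clk, en))
print(free_widths, latched_widths)
```

In this trace the latch-free gate emits two half-width pulses where the enable changes during the high phase, whereas the latch-based gate passes only full-width pulses. Latch-free gating of this kind therefore relies on the enable being held stable while the clock is high.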


Journal ArticleDOI
TL;DR: Fully asynchronous operation and boosted self-power gating are proposed to improve conversion accuracy and reduce static leakage power in an ultralow-power SAR ADC, and combining MOSFETs of high and low threshold voltage reduces leakage power without decreasing the maximum sampling frequency.
Abstract: This paper presents an ultralow-power and ultralow-voltage SAR ADC. Fully asynchronous operation and boosted self-power gating are proposed to improve conversion accuracy and reduce static leakage power. By designing with MOSFETs of high threshold voltage (HVt) and low threshold voltage (LVt), the leakage power is reduced without a decrease in maximum sampling frequency. The test chip in a 40-nm CMOS process has successfully reduced leakage power by 98%, and it achieves 8.2-bit ENOB while consuming only 650 pW at 0.1 kS/s from a 0.5-V power supply. The power consumption is scalable up to 4 MS/s, with a power supply range from 0.4 to 0.7 V. The best figure of merit at 0.5 V is 5.2 fJ/conversion-step at 20 kS/s.

Patent
26 Sep 2013
TL;DR: In this article, an integrated circuit (IC) with a fast-locking phase locked loop is described, which comprises: a node to provide a reference clock; a digitally controlled oscillator (DCO) to generate an output clock; a divider coupled to the DCO, the divider to divide the output clock and to generate a feedback clock; and control logic operable to reset the DCO and the divider, and operable to release reset in synchronization with the reference clock.
Abstract: Described is an integrated circuit (IC) with a phase locked loop with capability of fast locking. The IC comprises: a node to provide a reference clock; a digitally controlled oscillator (DCO) to generate an output clock; a divider coupled to the DCO, the divider to divide the output clock and to generate a feedback clock; and control logic operable to reset the DCO and the divider, and operable to release reset in synchronization with the reference clock. An apparatus for zeroing phase error is provided which comprises a first node to provide a reference clock; a second node to provide a feedback clock; a time-to-digital converter, coupled to the first and second nodes, to measure phase error between the reference and feedback clocks; a digital loop filter; and a control unit to adjust the measured phase error, and to provide the adjusted phase error to the digital loop filter.

Journal ArticleDOI
TL;DR: This paper presents an all-digital phase-locked loop (ADPLL) clock generator for globally asynchronous locally synchronous (GALS) multiprocessor systems-on-chip (MPSoCs) with low power consumption and ultra-small chip area, which meets the specification for DDR2/DDR3 memory interfaces.
Abstract: This paper presents an all-digital phase-locked loop (ADPLL) clock generator for globally asynchronous locally synchronous (GALS) multiprocessor systems-on-chip (MPSoCs). With its low power consumption of 2.7 mW and ultra-small chip area of 0.0078 mm², it can be instantiated per core for fine-grained power management like DVFS. It is based on an ADPLL providing a multiphase clock signal from which core frequencies from 83 to 666 MHz with 50% duty cycle are generated by phase rotation and frequency division. The clock meets the specification for DDR2/DDR3 memory interfaces. Additionally, it provides a dedicated high-speed clock up to 4 GHz for serial network-on-chip data links. Core frequencies can be changed arbitrarily within one clock cycle for fast dynamic frequency scaling applications. The performance including statistical analysis of mismatch has been verified by a prototype in 65-nm CMOS technology.

Patent
Young-Pyo Joo, Shin Taek-Kyun
08 Oct 2013
TL;DR: An application processor includes a main central processing device that operates based on an external main clock signal received from at least one external clock source when the application processor is in an active mode as discussed by the authors.
Abstract: An application processor includes a main central processing device that operates based on an external main clock signal received from at least one external clock source when the application processor is in an active mode, at least one internal clock source that generates an internal clock signal, and a sensor sub-system that processes sensing-data received from at least one sensor module on a predetermined cycle when the application processor is in the active mode or a sleep mode, and that operates based on the internal clock signal or an external sub clock signal received from the external clock source depending on an operating speed required for processing the sensing-data.

Journal ArticleDOI
TL;DR: The results of the study show that by using partial reconfiguration to eliminate the power consumption of the accelerator when it is inactive, the study can accelerate program execution and at the same time reduce the overall energy consumption by half.
Abstract: One major advantage of reconfigurable computing systems is their ability to reconfigure hardware at runtime. In this paper, we study the feasibility of achieving energy efficiency in reconfigurable computing systems (e.g., FPGAs) through runtime partial reconfiguration (PR) techniques. In the ideal scenario, we use a hardware accelerator to accelerate certain parts of the program execution; when the accelerator is not active, we use partial reconfiguration to unload it to reduce power consumption. Since the reconfiguration process may introduce a high energy overhead, it is unclear whether this approach is efficient. To approach this problem, we first analytically identify the conditions under which partial reconfiguration can reduce energy consumption. Our results indicate that the key to reduce partial reconfiguration energy overhead is to minimize the time overhead of the reconfiguration process. Based on this analysis, we design and implement a fast reconfiguration engine that achieves close-to-ideal throughput on Xilinx Virtex-4 FPGAs. Our fast reconfiguration engine utilizes a master-slave DMA pair to stream data between the SRAM and the Internal Configuration Access Port (ICAP). We experimentally verify our proposed solutions and compare our design to existing energy reduction techniques, such as clock gating. The results of our study show that by using partial reconfiguration to eliminate the power consumption of the accelerator when it is inactive, we can accelerate program execution and at the same time reduce the overall energy consumption by half.

Journal ArticleDOI
J. Aweya
TL;DR: This paper describes the architecture, servo algorithm, and phase-locked loop (PLL) of a method for implementing differential clock recovery over packet networks that is general enough to be applied in a wide variety of packet networks such as IP, MPLS, Ethernet, etc.
Abstract: Accurate timing transfer and recovery over packet networks (IP, Ethernet, MPLS, etc.) has become an important requirement for delivering many telecommunication services. This requirement stems from the fact that current networks are migrating from time-division multiplexing (TDM) technologies to packet based ones, and also the need to synchronize the many timing-dependent devices like TDM access devices and wireless base stations. Unlike TDM, packet networks are asynchronous by design and do not have embedded timing transfer capabilities. Differential clocking is used when there is a network interface with its own reference source clock (the service clock) and there is the need to transfer this clock over a core packet network (with its own independent reference network clock) to another interface. The network clock serves as a sampling clock for the service clock. Timing transfer and recovery over a packet network is a networked control problem given the difficulty in making the recovered clock at the remote location compliant with strict telecom standards. In this paper, we describe the architecture, servo algorithm, and phase-locked loop (PLL) of a method for implementing differential clock recovery over packet networks. The technique involves a clock source or transmitter sending counter values to a receiver from a counter that is clocked and reset, respectively, by the service clock and network clock. It is general enough to be applied in a wide variety of packet networks such as IP, MPLS, Ethernet, etc.
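
The counter-based idea in the abstract above can be sketched numerically. In this minimal model the transmitter counts service-clock ticks over a fixed number of network-clock cycles and sends each count; the receiver, which shares the network clock, averages the counts to estimate the service frequency. The sampling interval, the frequencies, and the function names are assumptions for illustration, and the paper's servo algorithm and PLL are not modeled.

```python
from fractions import Fraction


def transmit_counts(f_service, f_network, cycles_per_sample, num_samples):
    """Service-clock ticks counted in each window of network-clock cycles."""
    counts = []
    previous_ticks = 0
    for k in range(1, num_samples + 1):
        elapsed = Fraction(k * cycles_per_sample) / f_network  # seconds
        ticks = int(f_service * elapsed)  # whole ticks seen so far
        counts.append(ticks - previous_ticks)
        previous_ticks = ticks
    return counts


def recover_frequency(counts, f_network, cycles_per_sample):
    """Receiver-side estimate: mean count divided by the sampling interval."""
    interval = Fraction(cycles_per_sample) / f_network
    return Fraction(sum(counts), len(counts)) / interval


F_NETWORK = 125_000_000              # assumed common network clock, Hz
F_SERVICE = Fraction(15440772, 10)   # assumed service clock: 1.544 MHz + 50 ppm
CYCLES_PER_SAMPLE = 125_000          # assumed 1 ms sampling window

counts = transmit_counts(F_SERVICE, F_NETWORK, CYCLES_PER_SAMPLE, 1000)
estimate = recover_frequency(counts, F_NETWORK, CYCLES_PER_SAMPLE)
print(float(estimate))
```

Because the counts are expressed relative to the shared network clock, packet delay does not enter this estimate; the residual error here comes only from counting whole ticks.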

Patent
18 Mar 2013
TL;DR: In this article, a first local clock recovery circuit in a first receiver can be caused to produce a test clock which simulates a condition to be tested, while a second receiver that includes a second local clock recovery circuit is caused to use the test clock in place of the reference clock while receiving a test data sequence at its input.
Abstract: An integrated circuit includes a plurality of receivers, each having a clock and data recovery circuit. A first local clock recovery circuit in a first receiver can be caused to produce a test clock which simulates a condition to be tested, while a second receiver in the plurality of receivers that includes a second local clock recovery circuit is caused to use the test clock in place of the reference clock while receiving a test data sequence at its input. The clock and data recovery circuits in the receivers can include clock control loops responsive to loop control signals to modify the selected reference clock to generate the local clock in response to a selected one of (i) a corresponding data signal for normal operation or during a test, and (ii) a test signal applied to the clock control loop, in which case the test clock is produced.

Patent
04 Oct 2013
TL;DR: In this article, the forward path delay of a clock circuit is calculated in terms of a number of clock cycles of an output clock signal provided by the clock circuit, and a number of additional clock cycles of delay, based at least in part on that number, is added to the forward path delay of a signal path.
Abstract: Apparatuses and methods related to altering the timing of command signals for executing commands are disclosed. One such method includes calculating a forward path delay of a clock circuit in terms of a number of clock cycles of an output clock signal provided by the clock circuit and adding a number of additional clock cycles of delay to a forward path delay of a signal path. The forward path delay of the clock circuit is representative of the forward path delay of the signal path and the number of additional clock cycles is based at least in part on the number of clock cycles of forward path delay.

Journal ArticleDOI
TL;DR: This paper presents a four channel receiver for high-speed signal conditioning that consists of a continuous time linear equalizer (CTLE) and a dual loop CDR with phase-interpolator that generates and distributes quadrature clock phases to each CDR for data recovery.
Abstract: This paper presents a four channel receiver for high-speed signal conditioning. Each channel consists of a continuous time linear equalizer (CTLE) and a dual loop CDR with phase-interpolator. All channels share a single PLL that generates and distributes quadrature clock phases to each CDR for data recovery. Clock amplitude, phase INL and phase DNL are derived for IQ phase error and predict phase-dependent jitter contributions to the recovered clock. The multilane receiver was designed in 130-nm CMOS technology. The die occupies an area of 1930 μm by 1250 μm and consumes 67.9 mW per channel. It achieves a maximum data rate of 7 Gbps per channel for 0 and ±200 ppm clock frequency deviation.

Journal ArticleDOI
TL;DR: Harmonic rejection (HR) mixing techniques to obtain a high level of HR are described and a new technique is presented that enables better HR for the (N-1)th harmonic while preserving the level of rejection for other harmonics.
Abstract: In this paper, harmonic rejection (HR) mixing techniques to obtain a high level of HR are described. This is achieved by reducing the sensitivity to mismatches in devices operating at high frequencies. A design fabricated in a 110-nm CMOS process rejects up to the first 14 local oscillator (LO) harmonics and achieves third, fifth, and seventh HR ratios in excess of 52, 54, and 55 dB, respectively, without any calibration or trimming. This mixer also rejects flicker noise and has improved quadrature matching and IIP2 performance. By using a clock N times the desired LO frequency, this scheme rejects the (N-1)th LO harmonic only by an amount of 20log(N-1) dB. A new technique is presented that enables better HR for the (N-1)th harmonic while preserving the level of rejection for other harmonics. This mixer fabricated in a 55-nm standard CMOS process has a programmable number of 8, 10, 12, or 14 mixer phases and achieves an improvement of 29 dB for the (N-1)th harmonic while achieving 52 dB of rejection for the third harmonic. It also rejects flicker noise and has an IIP2 performance of 68 dBm.
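
The abstract above states that, with a clock at N times the LO frequency, the baseline scheme rejects the (N-1)th harmonic by only 20log(N-1) dB. The short calculation below evaluates that expression. Treating the programmable phase counts 8, 10, 12, and 14 as the values of N is an assumption made here for illustration, and the function name is hypothetical.

```python
import math


def baseline_rejection_db(n):
    """Rejection of the (n-1)th LO harmonic, 20*log10(n-1) dB, per the abstract."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 20 * math.log10(n - 1)


# Assumption: N equals the number of mixer phases listed in the abstract.
rejection = {n: baseline_rejection_db(n) for n in (8, 10, 12, 14)}
for n, db in rejection.items():
    print(f"N={n}: harmonic {n - 1} rejected by {db:.2f} dB")
```

Under that assumption the baseline rejection of the (N-1)th harmonic stays in the range of roughly 17 to 22 dB, well below the 52 dB or more reported for the third, fifth, and seventh harmonics, which is the gap the new technique targets.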

Patent
Michael D. Hutton, David Lewis
18 Dec 2013
TL;DR: In this article, a skew generator unit includes a delay chain coupled to a clock line that transmits a clock signal, and a selector is coupled to the delay chain and the clock line and may select one of the clock signal and skewed clock signal.
Abstract: A skew generator unit includes a delay chain. The delay chain is coupled to a clock line that transmits a clock signal. The delay chain generates a skewed clock signal having a unit of delay from the clock signal. The skew generator unit also includes a selector. The selector is coupled to the delay chain and the clock line and may select one of the clock signal and the skewed clock signal.

Proceedings ArticleDOI
05 Jun 2013
TL;DR: This paper presents a low-cost, low-power consumption design to calculate the square root using the IEEE754 single-precision floating-point format, whose ASIC realization consumes about threefold less power at 40 nm than at 65 nm.
Abstract: Floating-point square root is a fundamental operation in signal processing and various HPC applications. Since this is an expensive operation in resource and energy consumption, its efficient implementation should be of priority in future multicores that will face dark silicon issues. This paper presents a low-cost, low-power consumption design to calculate the square root using the IEEE754 single-precision floating-point format. Two versions of the design are investigated with and without clock gating (CG), respectively. Evaluation involves FPGA and ASIC technologies at 40 and 65 nm. Substantial performance growth and reduced power consumption are gained as compared to a popular iterative solution. The ASIC design demonstrates much lower power consumption, which at 40 nm is about threefold lower than at 65 nm. At 40 nm, CG for the ASIC realization is justified primarily for low activity rates.

Patent
Arifur Rahman
01 Aug 2013
TL;DR: In this article, a multichip package that includes an interposer and integrated circuits mounted on the interposer is provided, and the interposer's programmable power gating circuitry can be configured to support fine, intermediate, and/or coarse power gating granularities.
Abstract: A multichip package that includes an interposer and integrated circuits mounted on the interposer is provided. The interposer may include interposer routing circuitry and programmable power gating circuitry. At least one of the on-interposer integrated circuits may include power gating control logic that controls the programmable power gating circuitry. Circuitry on the integrated circuit may receive power supply voltage signals from the programmable power gating circuitry via the interposer routing circuitry. The programmable power gating circuitry may be configured to support fine, intermediate, and/or coarse power gating granularities. The programmable power gating circuitry may be used to selectively power down certain portions of the integrated circuit and may be used to provide desired power supply voltage levels to different voltage islands on the integrated circuit.

Proceedings ArticleDOI
24 Mar 2013
TL;DR: Inspired by ionic bonding in chemistry, this work directs flip-flops to merging-friendly locations during placement, thus facilitating flip-flop merging, and can save 27% clock power on average.
Abstract: Clock power contributes a significant portion of chip power in modern IC design. Applying multi-bit flip-flops can effectively reduce clock power. State-of-the-art work performs multi-bit flip-flop clustering at the post-placement stage. However, the solution quality may be limited because the combinational gates are immovable during the clustering process. To overcome the deficiency, in this paper, we propose multi-bit flip-flop bonding at placement. Inspired by ionic bonding in chemistry, we direct flip-flops to merging-friendly locations, thus facilitating flip-flop merging. Experimental results show that our algorithm, called FF-Bond, can save 27% clock power on average. Compared with state-of-the-art post-placement multi-bit flip-flop clustering, FF-Bond can further reduce clock power by 14%.

Journal ArticleDOI
TL;DR: This paper proposes a novel TSV fault-tolerant unit (TFU) to provide tolerance against TSV failures and is the first work in the literature that considers the fault tolerance of a 3-D clock network.
Abstract: Clock network synthesis is one of the most important and challenging problems in 3-D ICs. The clock signals have to be delivered by through-silicon vias (TSVs) to different tiers with minimum skew. While there are a few related works in literature, none consider the reliability of TSVs in a clock tree. Accordingly, the failure of any TSV in the clock tree yields a bad chip. The naive solution using double-TSV can alleviate the problem, but the significant area overhead renders it less practical for large designs. In this paper, we propose a novel TSV fault-tolerant unit (TFU) to provide tolerance against TSV failures. The TFU makes use of the existing 2-D redundant trees designed for prebond testing, and thus has minimum area overhead. In addition, the number of TSVs in a TFU is also adjustable to allow flexibility during clock network synthesis. Compared with the conventional double TSV technique, the 3-D clock network constructed by TFUs can achieve 58% area overhead reduction with similar yield rate on an industrial case. To the best of the authors' knowledge, this is the first work in the literature that considers the fault tolerance of a 3-D clock network. It can be easily integrated with any bottom-up clock network synthesis algorithm.

Patent
13 Dec 2013
TL;DR: In this paper, the authors describe a multi-chip apparatus capable of performing multi-rate synchronous communication between component chips, where each chip may receive a common clock reference signal, and may generate an internal clock signal dependent on the clock reference signals.
Abstract: Embodiments are disclosed of a multi-chip apparatus capable of performing multi-rate synchronous communication between component chips. Each chip may receive a common clock reference signal, and may generate an internal clock signal dependent on the clock reference signal. A clock distribution tree and phase-locked loop may be used to minimize internal clock skew at I/O circuitry at the chip perimeter. Each chip may also generate an internal synchronizing signal that is phase-aligned to the received clock reference signal. Each chip may use its respective synchronizing signal to synchronize multiple clock dividers that provide software-selectable reduced-frequency clock signals to the I/O cells of the chip. In this way, the reduced-frequency clock signals of the multiple chips are edge-aligned to the low-skew internal clock signals, and phase-aligned to the common clock reference signal, allowing the I/O cells of the multiple chips to perform synchronous communication at multiple rates with low clock skew.

Proceedings ArticleDOI
19 May 2013
TL;DR: Experimental results show that the architecture achieves the throughput that is required by the WiMax standard and the design has additional features compared to the previous approaches.
Abstract: This paper presents a reconfigurable FFT architecture for the variable-length, multi-streaming WiMax wireless standard. The architecture processes 1 stream of 2048-point FFT, up to 2 streams of 1024-point FFT, or up to 4 streams of 512-point FFT. The architecture consists of a modified radix-2 single delay feedback (SDF) FFT. The sampling frequency of the system is varied in accordance with the FFT length. The latch-free clock gating technique is used to reduce power consumption. The proposed architecture has been synthesized for the Virtex-6 XCVLX760 FPGA. Experimental results show that the architecture achieves the throughput required by the WiMax standard and that the design has additional features compared to previous approaches. The design uses 1% of the total available FPGA resources, and a maximum clock frequency of 313.67 MHz is achieved. Furthermore, this architecture can be expanded to suit other wireless standards.
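
The three operating modes listed in this abstract can be tabulated with a short sketch. It assumes the textbook radix-2 SDF pipeline, in which an N-point FFT uses log2(N) stages with feedback delay lines of length N/2, N/4, ..., 1. The paper's own modifications are not described in the abstract, so the names `MODES`, `sdf_feedback_delays`, and `describe_modes` are hypothetical and the figures describe the generic structure only.

```python
# Modes quoted in the abstract: (number of streams, FFT length).
MODES = [(1, 2048), (2, 1024), (4, 512)]


def sdf_feedback_delays(n_points):
    """Return the feedback delay-line length of each stage in a
    textbook radix-2 SDF pipeline for an n_points FFT (assumption:
    the unmodified structure, not the paper's variant)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("FFT length must be a power of two >= 2")
    stages = n_points.bit_length() - 1  # log2 of a power of two
    return [n_points >> (s + 1) for s in range(stages)]


def describe_modes():
    """Summarise each mode: stage count, delay words per stream,
    and total samples handled per frame across all streams."""
    rows = []
    for streams, n_points in MODES:
        delays = sdf_feedback_delays(n_points)
        rows.append({
            "streams": streams,
            "points": n_points,
            "stages": len(delays),
            "delay_words_per_stream": sum(delays),
            "samples_per_frame": streams * n_points,
        })
    return rows


if __name__ == "__main__":
    for row in describe_modes():
        print(row)
```

Under these assumptions every mode handles the same 2048 samples per frame, which is consistent with one datapath being shared across the stream configurations.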

Proceedings ArticleDOI
Kanad Basu1, Prabhat Mishra1, Priyadarsan Patra2, Amir Nahir3, Alon Adir3 
11 Dec 2013
TL;DR: An efficient signal selection algorithm and a low-overhead trace controller design that would enable verification engineers to dynamically select a set of trace signals for improved error detection and demonstrate that this approach can detect up to 3 times more errors compared to existing techniques.
Abstract: Post-silicon validation is one of the most expensive and complex tasks in today's System-on-Chip (SoC) design methodology. A major challenge in post-silicon debug is limited observability of the internal signals. Existing approaches address this issue by selecting a small set of useful signals. These signal states are stored in an on-chip trace buffer during execution. The applicability of existing methods is limited to a specific debug scenario where every component has equal importance all the time. In reality, a verification engineer would like to focus on a specific set of components (functional regions). Some regions can be ignored for a certain duration during execution due to clock gating and other considerations. Similarly, certain regions may be well-verified datapaths that are less likely to have errors than other control-intensive regions. In this paper, we propose an efficient signal selection algorithm and a low-overhead trace controller design that would enable verification engineers to dynamically select a set of trace signals for improved error detection. Our experimental results using both ISCAS'89 benchmarks and Opencores circuits demonstrate that our approach can detect up to 3 times more errors compared to existing techniques.

Patent
25 Nov 2013
TL;DR: This patent describes an integrated clock gating (ICG) cell in which a latch is coupled to a NOR gate that receives an enable signal; the latch generates a latch output in response to the state of the enable signal.
Abstract: In an integrated clock gating (ICG) cell, a latch is coupled to a NOR gate. The NOR gate receives an enable signal. The latch is configured to generate a latch output in response to the state of the enable signal. The latch includes a tri-state inverter. A NAND gate is coupled to the latch and the NAND gate is configured to generate an inverted clock signal in response to the latch output and a clock input.
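
The signal flow in this abstract (NOR gate, latch, NAND gate) can be illustrated with a minimal behavioural sketch. Several details are assumptions not stated in the abstract: the NOR gate's second input is modelled as a test-enable, the latch is transparent while the clock is low, and the tri-state inverter inside the latch restores the enable polarity. The class and function names are hypothetical.

```python
def nor(a, b):
    return int(not (a or b))


def nand(a, b):
    return int(not (a and b))


class IcgCell:
    """Behavioural model of a latch-based ICG cell (sketch only)."""

    def __init__(self):
        self.latch_out = 0  # latch starts with the clock gated off

    def step(self, clk, enable, test_enable=0):
        # NOR gate receives the enable signal (second input is an assumption).
        en_n = nor(enable, test_enable)
        # Assumption: latch is transparent while clk is low and holds
        # its value while clk is high, so enable changes cannot glitch
        # the output during the high phase.
        if clk == 0:
            self.latch_out = 1 - en_n  # inverter restores polarity
        # NAND gate produces the inverted gated clock.
        return nand(self.latch_out, clk)


def simulate(waveform):
    """Run a list of (clk, enable) pairs through a fresh cell."""
    cell = IcgCell()
    return [cell.step(clk, en) for clk, en in waveform]


if __name__ == "__main__":
    wave = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0), (1, 1)]
    print(simulate(wave))
```

In this model the output idles high when the clock is gated and pulses low only while both the latched enable and the clock are high. An enable change during the high phase is ignored until the next low phase.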

Patent
14 Mar 2013
TL;DR: This patent describes a technique that generates small-scale clock trees using a spine-based architecture (spine routing) together with clustered placement. The approach also gives the user ample structure and control to customize small, efficient clock trees.
Abstract: A technique generates small scale clock trees using a spine-based architecture (using spine routing) while also using clustered placement. Techniques are used to control clock sink cluster contents in order to minimize clock skew, minimize clock buffer count, and minimize use of routing resources. This approach also provides the user with ample structure and control to customize small efficient clock trees, and can also reduce clock power consumption.