Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2011"


Journal ArticleDOI
TL;DR: Two high-speed, low-power full-adder cells, designed with an alternative internal logic structure and pass-transistor logic styles that yield a reduced power-delay product (PDP), outperform their counterparts with an average PDP advantage.
Abstract: We present two high-speed and low-power full-adder cells designed with an alternative internal logic structure and pass-transistor logic styles that lead to a reduced power-delay product (PDP). We carried out a comparison against other full-adders reported as having a low PDP, in terms of speed, power consumption, and area. All the full-adders were designed in a 0.18-μm CMOS technology and were tested using a comprehensive testbench that allowed us to measure the current drawn from the full-adder inputs in addition to the current provided by the power supply. Post-layout simulations show that the proposed full-adders outperform their counterparts, exhibiting an average PDP advantage of 80% with only 40% of the relative area.
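To make the figure of merit concrete: the power-delay product is the average power drawn multiplied by the worst-case propagation delay, i.e., the energy per operation. The short Python sketch below computes it for two adder cells; the helper name and the numbers are illustrative placeholders, not values taken from the paper.

def power_delay_product(avg_power_w, worst_delay_s):
    """Return the PDP, i.e., energy per operation, in joules."""
    return avg_power_w * worst_delay_s

# Hypothetical post-layout figures (W, s) for two adder cells, not from the paper.
cells = {"proposed": (12e-6, 180e-12), "reference": (20e-6, 310e-12)}
for name, (power, delay) in cells.items():
    print(f"{name}: PDP = {power_delay_product(power, delay):.3e} J")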

283 citations


Journal ArticleDOI
TL;DR: A new approach towards the design and modeling of memory-resistor (memristor)-based CAM (MCAM) is provided, using a combination of memristor and MOS devices to form the core of a memory/compare logic cell that serves as the building block of the CAM architecture.
Abstract: Large-capacity content addressable memory (CAM) is a key element in a wide variety of applications. The inevitable complexities of scaling MOS transistors introduce a major challenge in the realization of such systems. Convergence of disparate technologies, which are compatible with CMOS processing, may allow extension of Moore's Law for a few more years. This paper provides a new approach towards the design and modeling of memory-resistor (memristor)-based CAM (MCAM), using a combination of memristor and MOS devices to form the core of a memory/compare logic cell that forms the building block of the CAM architecture. The non-volatile characteristic and nanoscale geometry of the memristor, together with its compatibility with CMOS processing technology, increase the packing density, enable new approaches to power management by disabling CAM blocks without loss of stored data, reduce power dissipation, and leave scope for speed improvement as the technology matures.

250 citations


Journal ArticleDOI
TL;DR: This paper formalizes the temperature-aware real-time MPSoC assignment and scheduling problem and presents an optimal phased steady-state mixed integer linear programming-based solution that considers the impact of scheduling and assignment decisions on MPSoC thermal profiles to directly minimize the chip peak temperature.
Abstract: Increasing integrated circuit (IC) power densities and temperatures may hamper multiprocessor system-on-chip (MPSoC) use in hard real-time systems. This paper formalizes the temperature-aware real-time MPSoC assignment and scheduling problem and presents an optimal phased steady-state mixed integer linear programming-based solution that considers the impact of scheduling and assignment decisions on MPSoC thermal profiles to directly minimize the chip peak temperature. We also introduce a flexible heuristic framework for task assignment and scheduling that permits system designers to trade off accuracy for running time when solving large problem instances. Finally, for task sets with sufficient slack, we show that inserting idle times between task executions can further reduce the peak temperature of the MPSoC quite significantly.

179 citations


Journal ArticleDOI
TL;DR: This paper quantitatively studies how different memory cell sizing may impact the overall computing system performance, and shows that different computing workloads may have conflicting expectations on memory cell size.
Abstract: Because of its high storage density, superior scalability, low integration cost, and reasonably high access speed, spin-torque transfer random access memory (STT RAM) appears to have a promising potential to replace SRAM as the last-level on-chip cache (e.g., L2 or L3 cache) for microprocessors. Due to the unique operational characteristics of its storage device, the magnetic tunneling junction (MTJ), STT RAM is inherently subject to a write-latency versus read-latency tradeoff that is determined by the memory cell size. This paper first quantitatively studies how different memory cell sizing may impact the overall computing system performance, and shows that different computing workloads may have conflicting expectations on memory cell sizing. Leveraging MTJ device switching characteristics, we further propose an STT RAM architecture design method that can make an STT RAM cache with relatively small memory cell size perform well over a wide spectrum of computing benchmarks. This is demonstrated using CACTI-based memory modeling and computing system performance simulations in SimpleScalar. Moreover, we show that this design method can also reduce STT RAM cache energy consumption by up to 30% over a variety of benchmarks.

144 citations


Journal ArticleDOI
TL;DR: Experimental results on two real-life applications demonstrate that the proposed fixed-width modified Booth multipliers can improve the average peak signal-to-noise ratio of output images by at least 2.0 dB and 1.1 dB, respectively.
Abstract: The fixed-width multiplier is attractive for many multimedia and digital signal processing systems that need to maintain a fixed data format and can tolerate a small accuracy loss in the output. This paper presents the design of high-accuracy fixed-width modified Booth multipliers. To reduce the truncation error, we first slightly modify the partial product matrix of Booth multiplication and then derive an effective error compensation function that makes the error distribution more symmetric about, and concentrated at, zero, giving the fixed-width modified Booth multiplier very small mean and mean-square errors. In addition, a simple compensation circuit, mainly composed of a simplified sorting network, is also proposed. Compared to previous circuits, the proposed error compensation circuit achieves a tiny mean error and a significant reduction in mean-square error (e.g., at least a 12.3% reduction for the 16-bit fixed-width multiplier) while incurring approximately the same hardware overhead. Furthermore, experimental results on two real-life applications demonstrate that the proposed fixed-width multipliers can improve the average peak signal-to-noise ratio of output images by at least 2.0 dB and 1.1 dB, respectively.
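To make the truncation-versus-compensation tradeoff concrete, the Python sketch below compares a plain truncated fixed-width product against one with a simple constant compensation term, reporting mean and mean-square errors over all 8-bit operand pairs. It only illustrates the error statistics being optimized; the paper derives a data-dependent compensation function from the modified Booth partial-product matrix, not the constant used here.

import statistics

N = 8                              # operand width; the lower N product bits are discarded

def fixed_width(a, b, comp=0):
    # Keep the upper half of the 2N-bit product after adding a compensation
    # value 'comp' into the discarded region (illustrative only).
    return (a * b + comp) >> N

COMP = 1 << (N - 1)                # constant compensation: mean value of the discarded bits
errors_plain, errors_comp = [], []
for a in range(1 << N):
    for b in range(1 << N):
        exact = a * b / (1 << N)   # ideal full-precision result, rescaled
        errors_plain.append(exact - fixed_width(a, b))
        errors_comp.append(exact - fixed_width(a, b, COMP))

for name, errs in (("truncation only", errors_plain), ("constant compensation", errors_comp)):
    print(f"{name}: mean error = {statistics.mean(errs):.4f}, "
          f"mean-square error = {statistics.mean(e * e for e in errs):.4f}")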

127 citations


Journal ArticleDOI
TL;DR: It is shown that it is possible to achieve 2-D-like, or even better, power quality by increasing C4 granularity and by selecting suitable TSV size and spacing; this is the first detailed architectural-level analysis of 3-D power delivery.
Abstract: 3-D integrated circuits promise high bandwidth, low latency, low device power, and a small form factor. Increased device density and asymmetrical packaging, however, render the design of 3-D power delivery a challenge. In this paper, we investigate various methods to improve 3-D power delivery. We analyze the impact of through-silicon via (TSV) size and spacing, of controlled collapse chip connection (C4) spacing, and of dedicated power-delivery TSVs. In addition to considering typical cylindrical or square metal-filled TSVs (core TSVs), we also investigate using coaxial TSVs for power delivery, resulting in reduced routing blockages and added coupling capacitance. Our 3-D evaluation system is composed of a quad-core chip multiprocessor, a memory die, and an accelerator engine, and it is evaluated using representative SPEC benchmark traces. This is the first detailed architectural-level analysis of 3-D power delivery. Our findings provide clear guidelines for 3-D power delivery design. More importantly, we show that it is possible to achieve 2-D-like, or even better, power quality by increasing C4 granularity and by selecting suitable TSV size and spacing.

122 citations


Journal ArticleDOI
TL;DR: In this paper, an extensive comparison of existing flip-flop classes and topologies is carried out, and the analysis explicitly accounts for effects that arise in nanometer technologies and affect the energy-delay-area tradeoff.
Abstract: In this paper (split into Parts I and II), an extensive comparison of existing flip-flop (FF) classes and topologies is carried out. In contrast to previous works, the analysis explicitly accounts for effects that arise in nanometer technologies and affect the energy-delay-area tradeoff (e.g., leakage and the impact of layout and interconnects). Compared to previous papers comparing FFs, the analysis covers a significantly wider range of FF classes and topologies. In particular, this Part I reports the comparison strategy, which includes the simulation setup, the energy-delay estimation methodology, and an overview of an optimum design strategy, together with an introduction to the analyzed FF classes and topologies.

115 citations


Journal ArticleDOI
TL;DR: This paper proposes a scheme of layout-aware as well as coverage-driven ILS design, where the partitioning of the flip-flops into ILS segments is determined by their geometric locations, whereas the set of flip-flops to be placed in parallel is determined by the minimum incompatibility relations among the corresponding bits of a test set.
Abstract: The Illinois Scan Architecture (ILS) consists of several scan path segments and is useful in reducing test application time and test data volume for high-density chips. In this paper, we propose a scheme for layout-aware as well as coverage-driven ILS design. The partitioning of the flip-flops into ILS segments is determined by their geometric locations, whereas the set of flip-flops to be placed in parallel is determined by the minimum incompatibility relations among the corresponding bits of a test set, to enhance fault coverage in broadcast mode. As a result, the number of serial test patterns is also reduced.
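As a rough illustration of the incompatibility idea behind segment construction (a simplified sketch, not the paper's algorithm): two flip-flops can share a broadcast scan-in only if no test pattern requires them to hold conflicting care bits, so a pairwise incompatibility count over the test set can guide which flip-flops are placed in parallel. The test cubes below are made up for the example.

# Test cubes over 4 flip-flops; '0'/'1' are care bits, 'x' is a don't-care.
test_set = ["1x0x", "x10x", "10x1"]

def incompatibility(col_a, col_b, patterns):
    """Count patterns in which two flip-flop columns need opposite care values."""
    return sum(1 for p in patterns
               if p[col_a] != "x" and p[col_b] != "x" and p[col_a] != p[col_b])

n = len(test_set[0])
for i in range(n):
    for j in range(i + 1, n):
        print(f"FF{i} vs FF{j}: {incompatibility(i, j, test_set)} conflicting pattern(s)")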

113 citations


Journal ArticleDOI
TL;DR: The proposed method, called Matrix code, combines Hamming and parity codes to improve the reliability and yield of memory chips in the presence of high defect rates and multiple bit upsets, and increases reliability by more than 50% in some cases.
Abstract: This paper presents a method to protect memories against multiple bit upsets and to improve manufacturing yield. The proposed method, called Matrix code, combines Hamming and parity codes to improve the reliability and yield of memory chips in the presence of high defect rates and multiple bit upsets. The method is evaluated using fault injection experiments. The results are compared to well-known techniques such as Reed-Muller and Hamming codes. The proposed technique performs better than Hamming codes and achieves performance comparable to Reed-Muller codes, with very favorable implementation gains such as a 25% reduction in area and power consumption. It also increases reliability by more than 50% in some cases. Further, the yield benefit provided by the proposed method, measured by a yield-improvement-per-cost metric, is up to 300% better than that provided by Reed-Muller codes.
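The sketch below illustrates the general idea of such a matrix arrangement, with each data row protected by a Hamming(7,4) code and an extra row of column parities for additional detection. It is a simplified stand-in under assumed parameters, not the paper's exact code construction or bit layout.

def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1, p2, p3 = d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]           # codeword positions 1..7

def hamming74_correct(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3                    # 0 means no single-bit error found
    if pos:
        c[pos - 1] ^= 1
    return c

def data_bits(c):
    return [c[2], c[4], c[5], c[6]]

rows = [[1, 0, 1, 1], [0, 1, 1, 0]]               # two 4-bit data words
encoded = [hamming74_encode(r) for r in rows]
col_parity = [encoded[0][i] ^ encoded[1][i] for i in range(7)]

encoded[1][5] ^= 1                                # inject a single bit upset in row 1
mismatch = [i for i in range(7) if (encoded[0][i] ^ encoded[1][i]) != col_parity[i]]
print("column parity flags column(s):", mismatch)
decoded = [data_bits(hamming74_correct(list(c))) for c in encoded]
print("recovered data:", decoded, "matches original:", decoded == rows)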

110 citations


Journal ArticleDOI
TL;DR: Novel RRAM routing switches are developed to replace the CMOS routing switches, achieving significant density enhancement and power reduction; simulations demonstrate that 2-D and 3-D rFPGAs provide at least a 2×-3× overall improvement in area with 20% lower power consumption compared with the corresponding CMOS FPGAs.
Abstract: In this paper, a novel CMOS-nano hybrid reconfigurable field-programmable gate array architecture (rFPGA) is introduced based on resistive memory (RRAM) devices. Different from the existing crossbar-based CMOS-nano architectures, the rFPGA consists mainly of 1T1R RRAM structures that can be fabricated using a CMOS-compatible process. These devices can efficiently establish FPGA block memories. More importantly, novel RRAM routing switches are developed to replace the CMOS routing switches to achieve significant density enhancement and power reduction. The simulation results demonstrate that 2-D and 3-D rFPGAs provide at least a 2×-3× overall improvement in terms of area with 20% lower power consumption, compared with the corresponding CMOS FPGAs.

106 citations


Journal ArticleDOI
TL;DR: A novel maximum power point (MPP) tracking scheme is proposed to harvest the maximum power from the vibration system based on piezoelectric conversion for micropower applications.
Abstract: In this work, we present a vibration-based energy scavenging system based on piezoelectric conversion for micropower applications. A novel maximum power point (MPP) tracking scheme is proposed to harvest the maximum power from the vibration system. A time-multiplexing mechanism is employed to perform energy harvesting and MPP tracking alternately. In the MPP tracking mode, a voltage reference that represents the optimal output voltage at the MPP is generated. A control unit then uses this reference to keep the system operating around the MPP. The proposed system is capable of starting up on its own without the help of an energy buffer. As a result, it is suitable for battery-less applications or for cases where the energy buffer is completely drained. The tracking scheme has very small power overhead and is simple to implement in VLSI; hence, it is especially applicable to micropower systems. The entire design was fabricated in a 0.35-μm CMOS process. Experimental results verify the proposed MPP tracking scheme and demonstrate the system operation. Measurement results show that the power harvesting efficiency of the electrical circuitry is higher than 90%.

Journal ArticleDOI
TL;DR: The proposed reconfigurable NoC architecture supports multiple applications by appropriately configuring itself to a topology that matches the traffic pattern of the currently running application.
Abstract: In this paper, we present a reconfigurable architecture for networks-on-chip (NoC) on which arbitrary application-specific topologies can be implemented. When a new application starts, the proposed NoC tailors its topology to the application traffic pattern by changing the inter-router connections to some predefined configuration corresponding to the application. It addresses one of the main drawbacks of the existing application-specific NoC optimization methods, i.e., optimization of NoCs based on the traffic pattern of a single application. Supporting multiple applications is a critical feature of an NoC when several different applications are integrated into a single modern and complex multicore system-on-chip or chip multiprocessor. The proposed reconfigurable NoC architecture supports multiple applications by appropriately configuring itself to a topology that matches the traffic pattern of the currently running application. This paper first introduces the proposed reconfigurable topology and then addresses the problems of core to network mapping and topology exploration. Further on, we evaluate the impact of different architectural attributes on the performance of the proposed NoC. Evaluations consider network latency, power consumption, and area complexity.

Journal ArticleDOI
TL;DR: This paper presents a lightweight concurrent fault detection scheme for the AES, and shows that both the application-specific integrated circuit and field-programmable gate-array implementations of the fault detection structures using the obtained optimum composite fields have better hardware and time complexities compared to their counterparts.
Abstract: The faults that accidentally or maliciously occur in the hardware implementations of the Advanced Encryption Standard (AES) may cause erroneous encrypted/decrypted output. The use of appropriate fault detection schemes for the AES makes it robust to internal defects and fault attacks. In this paper, we present a lightweight concurrent fault detection scheme for the AES. In the proposed approach, the composite field S-box and inverse S-box are divided into blocks and the predicted parities of these blocks are obtained. Through exhaustive searches among all available composite fields, we have found the optimum solutions for the least-overhead parity-based fault detection structures. Moreover, through our error injection simulations for one S-box (respectively, inverse S-box), we show that a total error coverage of almost 100% for 16 S-boxes (respectively, inverse S-boxes) can be achieved. Finally, it is shown that both the application-specific integrated circuit and field-programmable gate-array implementations of the fault detection structures using the obtained optimum composite fields have better hardware and time complexities compared to their counterparts.
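A minimal sketch of the parity-prediction idea, using a hypothetical 4-bit S-box rather than the AES S-box and its composite-field decomposition: the predicted output parity is stored alongside the table, and any mismatch with the parity of the actual output flags a fault (odd-weight errors only, as with any single parity bit).

# Hypothetical 4-bit S-box, for illustration only.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def parity(x):
    return bin(x).count("1") & 1

# Predicted output parity for every input, computed offline from the fault-free table.
PREDICTED_PARITY = [parity(y) for y in SBOX]

def sbox_with_check(x, faulty_output=None):
    y = SBOX[x] if faulty_output is None else faulty_output
    fault_detected = parity(y) != PREDICTED_PARITY[x]
    return y, fault_detected

print(sbox_with_check(0x3))                     # fault-free: (11, False)
print(sbox_with_check(0x3, faulty_output=0xA))  # single-bit flip in the output: (10, True)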

Journal ArticleDOI
TL;DR: An architectural template to enable design space exploration of different possible CGRA designs is proposed, called the expression-grained reconfigurable array (EGRA); its ability to generate complex computational cells that execute expressions, as opposed to single operations, is a defining feature.
Abstract: Reconfigurable arrays combine the benefit of spatial execution, typical of hardware solutions, with that of programmability, present in microprocessors. When mapping software applications (or parts of them) onto hardware, however, fine-grain arrays, such as field-programmable gate arrays (FPGAs), often provide more flexibility than is needed, and do not implement coarser-level operations efficiently. Therefore, coarse-grained reconfigurable arrays (CGRAs) have been proposed for this purpose. Most CGRA designs that have emerged from research present ad hoc solutions in many respects; in this paper, we propose an architectural template to enable design space exploration of different possible CGRA designs. We call the template the expression-grained reconfigurable array (EGRA), as its ability to generate complex computational cells, executing expressions as opposed to single operations, is a defining feature. Other notable EGRA characteristics include the ability to support heterogeneous cells and different storage requirements through various memory interfaces. The design explorations performed, as shown through the experimental data provided, can effectively guide designers in further closing the performance gap between reconfigurable and hardwired logic by providing guidelines on architectural design choices. Performance results on a number of embedded applications show that EGRA instances can be used as a reconfigurable fabric for customizable processors, outperforming more traditional CGRA designs.

Journal ArticleDOI
TL;DR: This paper presents a systematic design approach to provide the optimized Rivest-Shamir-Adleman (RSA) processors based on high-radix Montgomery multipliers satisfying various user requirements, such as circuit area, operating time, and resistance against side-channel attacks.
Abstract: This paper presents a systematic design approach to provide optimized Rivest-Shamir-Adleman (RSA) processors based on high-radix Montgomery multipliers satisfying various user requirements, such as circuit area, operating time, and resistance against side-channel attacks. In order to explore the tradeoff between performance and resistance, we apply four types of exponentiation algorithms: two variants of the binary method with and without the Chinese Remainder Theorem (CRT). We also introduce three multiplier-based datapath architectures using different intermediate data forms: 1) single form, 2) semi carry-save form, and 3) carry-save form, and combine them with a wide variety of arithmetic components. Their radices are parameterized from 2^8 to 2^128. A total of 242 datapaths for 1024-bit RSA processors were obtained for each radix. The potential of the proposed approach is demonstrated through an experimental synthesis of all possible processors with a 90-nm CMOS standard cell library. As a result, designs ranging from the smallest, at 861 gates and 118.47 ms per RSA operation, to the fastest, at 0.67 ms per RSA operation and 153,862 gates, were obtained. In addition, the use of the CRT technique reduced the RSA operation time of the fastest design to 0.24 ms. Even when the exponentiation algorithm resistant to typical side-channel attacks is employed, the fastest design can perform the RSA operation in less than 1.0 ms.
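For reference, a minimal word-level sketch of the Montgomery reduction underlying such high-radix multipliers is given below. It is a generic textbook formulation with an illustrative radix of 2^32 and made-up operands; the paper's datapaths differ in radix and in how the intermediate result is represented.

# Generic Montgomery multiplication, radix 2^W (requires Python 3.8+ for pow(x, -1, m)).
W = 32                                   # word size, i.e., radix 2^32

def montgomery_mul(a, b, n, n_words):
    """Return a*b*R^-1 mod n, with R = 2^(W*n_words), for odd n and a, b < n."""
    n_inv = pow(-n, -1, 1 << W)          # -n^-1 mod 2^W
    t = a * b
    for _ in range(n_words):
        m = ((t & ((1 << W) - 1)) * n_inv) & ((1 << W) - 1)
        t = (t + m * n) >> W             # one low-order word is cleared per iteration
    return t - n if t >= n else t

# Usage: map operands into the Montgomery domain, multiply, then map back.
n = 0xF123456789ABCDEF0123456789ABCDD1   # odd 128-bit modulus (illustrative)
n_words = 4
R = 1 << (W * n_words)
a, b = 0x1234567, 0x89ABCDE
a_bar, b_bar = (a * R) % n, (b * R) % n
prod_bar = montgomery_mul(a_bar, b_bar, n, n_words)
print(montgomery_mul(prod_bar, 1, n, n_words) == (a * b) % n)   # True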

Journal ArticleDOI
Yun Ye, Frank Liu, Min Chen, Sani R. Nassif, Yu Cao
TL;DR: An efficient SPICE simulation method and statistical variation model that accurately predict threshold variation as a function of dopant fluctuations and gate length change caused by lithography and the etching process is developed.
Abstract: The threshold voltage (Vth) of a nanoscale transistor is severely affected by random dopant fluctuations and line-edge roughness. The analysis of these effects usually requires atomistic simulations which are too expensive in computation for statistical design. In this work, we develop an efficient SPICE simulation method and statistical variation model that accurately predict threshold variation as a function of dopant fluctuations and gate length change caused by lithography and the etching process. By understanding the physical principles of atomistic simulations, we: 1) identify the appropriate method to divide a nonuniform gate into slices in order to map those fluctuations into the device model; 2) extract the variation of Vth from the strong-inversion region instead of the leakage current, benefiting from the linearity of the saturation current with respect to Vth ; 3) propose a compact model of Vth variation that is scalable with gate size and the amount of dopant and gate length fluctuations; and 4) investigate the interaction with non-rectangular gate (NRG) and reverse narrow width effect (RNWE). The proposed SPICE simulation method is validated with atomistic simulation results. Given the post-lithography gate geometry, this approach correctly models the variation of device output current in all operating regions. Based on the new results, we further project the amount of Vth variation at advanced technology nodes, helping shed light on the challenges of future robust circuit design.

Journal ArticleDOI
TL;DR: This work presents a generic, flexible parallel architecture, which is suitable for all ranges of object detection applications and image sizes, and implements the AdaBoost-based detection algorithm, considered one of the most efficient object detection algorithms.
Abstract: Real-time object detection is becoming necessary for a wide number of applications related to computer vision and image processing, security, bioinformatics, and several other areas. Existing software implementations of object detection algorithms are constrained to small image sizes and rely on favorable conditions in the image frame to achieve real-time detection frame rates. Efforts to design hardware architectures have yielded encouraging results, yet are mostly directed towards a single application, targeting specific operating environments. Consequently, there is a need for hardware architectures capable of detecting several objects in large image frames, and which can be used under several object detection scenarios. In this work, we present a generic, flexible parallel architecture, which is suitable for all ranges of object detection applications and image sizes. The architecture implements the AdaBoost-based detection algorithm, which is considered one of the most efficient object detection algorithms. Through field-programmable gate array emulation and large-scale implementation, as well as register transfer level synthesis and simulation, we illustrate that the architecture can detect objects in large images (up to 1024 × 768 pixels) with frame rates that vary between 64 and 139 frames/s for various applications and input image frame sizes.
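As a sketch of the classification kernel that such architectures parallelize, the snippet below evaluates a boosted strong classifier on a candidate window: a weighted vote of simple threshold tests on precomputed features. The features, thresholds, and weights are hypothetical placeholders, not the paper's trained detector.

# (feature_index, threshold, polarity, weight) for each weak classifier.
weak_classifiers = [
    (0, 0.30, +1, 0.9),
    (1, 0.55, -1, 0.6),
    (2, 0.10, +1, 0.4),
]

def strong_classify(features, stage_threshold=0.95):
    """Weighted vote of the weak classifiers against a stage threshold."""
    score = 0.0
    for idx, thr, pol, weight in weak_classifiers:
        vote = 1 if pol * features[idx] >= pol * thr else 0
        score += weight * vote
    return score >= stage_threshold

windows = {"window_a": [0.42, 0.20, 0.15], "window_b": [0.05, 0.90, 0.02]}
for name, feats in windows.items():
    print(name, "-> object" if strong_classify(feats) else "-> background")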

Journal ArticleDOI
TL;DR: A comparison of the most representative flip-flop classes and topologies in a 65-nm CMOS technology is carried out to derive several considerations on each FF class and to identify the best topologies for a targeted application.
Abstract: In Part II of this paper, a comparison of the most representative flip-flop (FF) classes and topologies in a 65-nm CMOS technology is carried out. The comparison, which is performed in the energy-delay-area domain, exploits the strategies and methodologies for FF analysis and design reported in Part I. In particular, the analysis accounts for the impact of leakage and layout parasitics on the optimization of the circuits. The tradeoffs between leakage, area, clock load, delay, and other interesting properties are extensively discussed. The investigation allows several considerations to be derived for each FF class and the best topologies for a targeted application to be identified.

Journal ArticleDOI
TL;DR: A novel clocked pair shared flip-flop is proposed which reduces the number of local clocked transistors by approximately 40% and a 24% reduction of clock driving power is achieved.
Abstract: Power consumption is a major bottleneck of system performance and is listed as one of the top three challenges in the International Technology Roadmap for Semiconductors 2008. In practice, a large portion of the on-chip power is consumed by the clock system, which is made up of the clock distribution network and flip-flops. In this paper, various design techniques for a low-power clocking system are surveyed. Among them, an effective way to reduce the capacitance of the clock load is to minimize the number of clocked transistors. To this end, we propose a novel clocked-pair-shared flip-flop which reduces the number of local clocked transistors by approximately 40%, achieving a 24% reduction in clock driving power. In addition, low-swing and double-edge clocking can easily be incorporated into the new flip-flop to build low-power clocking systems.

Journal ArticleDOI
TL;DR: It is shown that the hybrid adder has better performance (in terms of latency) in QCA than a Ladner-Fischer or a ripple carry adder, and has a smaller area-delay product than existing adder designs inQCA.
Abstract: This paper first presents an efficient quantum-dot cellular automata (QCA) design for the Ladner-Fischer prefix adder. We then present an efficient QCA design of a hybrid adder that combines the Ladner-Fischer adder with a ripple carry adder. We show that the hybrid adder has better performance (in terms of latency) in QCA than a Ladner-Fischer or a ripple carry adder. We also show that the hybrid adder has a smaller area-delay product than existing adder designs in QCA. Results of simulation in QCADesigner are also given and they support the theory presented.
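The hybrid idea can be modeled at the bit level as below: operands are split into blocks, block-level carries are obtained from (generate, propagate) pairs as a prefix stage would compute them, and each block then ripples internally from its prefix-computed carry-in. This is only an arithmetic sketch under assumed block sizes; the paper's contribution is the QCA realization, and a real Ladner-Fischer network evaluates the block combines in logarithmic depth rather than the serial scan used here for clarity.

BLOCK = 4

def block_gp(a_bits, b_bits):
    """Generate/propagate of a block, folded bit by bit."""
    g, p = 0, 1
    for ai, bi in zip(a_bits, b_bits):
        gi, pi = ai & bi, ai ^ bi
        g, p = gi | (pi & g), pi & p
    return g, p

def hybrid_add(a, b, width=16):
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    blocks = [(a_bits[i:i + BLOCK], b_bits[i:i + BLOCK]) for i in range(0, width, BLOCK)]

    carries_in, carry = [], 0                 # carry into each block (prefix stage)
    for ab, bb in blocks:
        carries_in.append(carry)
        g, p = block_gp(ab, bb)
        carry = g | (p & carry)

    result = 0                                # ripple-carry stage inside each block
    for blk, (ab, bb) in enumerate(blocks):
        c = carries_in[blk]
        for k, (ai, bi) in enumerate(zip(ab, bb)):
            result |= (ai ^ bi ^ c) << (blk * BLOCK + k)
            c = (ai & bi) | (ai & c) | (bi & c)
    return result | (carry << width)

print(hex(hybrid_add(0xBEEF, 0x1234)), hex(0xBEEF + 0x1234))  # both 0xd123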

Journal ArticleDOI
TL;DR: This paper begins by proposing the use of adaptive body bias (ABB) and adaptive supply voltage (ASV) to maintain optimal performance of an aged circuit, and demonstrates its advantages over a guard banding technique such as synthesis.
Abstract: Negative bias temperature instability (NBTI) in pMOS transistors has become a major reliability concern in present-day digital circuit design. Further, with the recent introduction of Hf-based high-k dielectrics for gate leakage reduction, positive bias temperature instability (PBTI), the dual effect in nMOS transistors, has also reached significant levels. Consequently, designs are required to build in substantial guardbands in order to guarantee reliable operation over the lifetime of a chip, and these involve large area and power overheads. In this paper, we begin by proposing the use of adaptive body bias (ABB) and adaptive supply voltage (ASV) to maintain optimal performance of an aged circuit, and demonstrate its advantages over a guard banding technique such as synthesis. We then present a hybrid approach, utilizing the merits of both ABB and synthesis, to ensure that the resultant circuit meets the performance constraints over its lifetime, and has a minimal area and power overhead, as compared with a nominally designed circuit.

Journal ArticleDOI
TL;DR: A novel check node processing scheme and corresponding VLSI architectures are proposed for the Min-max NB-LDPC decoding algorithm, which aims to store the most reliable v-to-c messages as “compressed” c- to-v messages so the memory requirement of the overall decoder can be substantially reduced.
Abstract: Non-binary low-density parity-check (NB-LDPC) codes can achieve better error-correcting performance than binary LDPC codes when the code length is moderate, at the cost of higher decoding complexity. The high complexity is mainly caused by the complicated computations in the check node processing and the large memory requirement. In this paper, a novel check node processing scheme and corresponding VLSI architectures are proposed for the Min-max NB-LDPC decoding algorithm. The proposed scheme first sorts out a limited number of the most reliable variable-to-check (v-to-c) messages; the check-to-variable (c-to-v) messages to all connected variable nodes are then derived independently from the sorted messages without noticeable performance loss. Compared to the previous iterative forward-backward check node processing, the proposed scheme not only significantly reduces the computational complexity, but also eliminates the memory required for storing the intermediate messages generated by the forward and backward processes. Inspired by this novel c-to-v message computation method, we propose to store the most reliable v-to-c messages as “compressed” c-to-v messages, which are recovered from the compressed format when needed. Accordingly, the memory requirement of the overall decoder can be substantially reduced. Compared to the previous Min-max decoder architecture, the proposed design for an (837, 726) code over GF(2^5) can achieve the same throughput with only 46% of the area.

Journal ArticleDOI
TL;DR: Three new hardened designs for CMOS latches at a 32 nm feature size are proposed; these circuits are Schmitt trigger based, with the third one additionally utilizing a cascode configuration in the feedback loop.
Abstract: Deep sub-micrometer/nano CMOS circuits are more sensitive to externally induced radiation phenomena that are likely to cause the occurrence of so-called soft errors. Therefore, tolerance to soft errors is a strict requirement in nanoscale circuit designs. Since traditional error-tolerance methods result in significant cost penalties in terms of power, area, and performance, the development of low-cost hardened designs for storage cells (such as latches and memories) is of increasing importance. This paper proposes three new hardened designs for CMOS latches at a 32 nm feature size; these circuits are Schmitt trigger (ST) based, with the third one additionally utilizing a cascode configuration in the feedback loop. The cascode ST latch has a 112% higher critical charge than the conventional reference latch with only a 10% area increase. A novel design metric (QPAR) for latches is introduced to assess the overall design effectiveness in terms of area, performance, power, and soft error tolerance. Under this metric, the proposed cascode ST latch achieves up to a 36% improvement compared with existing hardened designs. Monte Carlo analysis has confirmed the robustness of the proposed hardened latches to process, voltage, and temperature (PVT) variations.

Journal ArticleDOI
TL;DR: A two-stage 3-D fixed-outline floorplanning algorithm is proposed that simultaneously plans hard macros and TSV blocks for wirelength reduction and further improves the wirelength by reassigning signal TSVs.
Abstract: In this paper, we study floorplanning in 3-D integrated circuits (3D-ICs). Although the literature on 3D-IC floorplanning is abundant, none of it considers the areas and positions of signal through-silicon vias (TSVs). In previous research, signal TSVs are viewed as points during the floorplanning stage. Ignoring the areas, positions, and connections of signal TSVs, previous research estimates wirelength by measuring only the half-perimeter wirelength of the pins in a net. Experimental results reveal that 29.7% of nets possess signal TSVs that cannot be put into the white space within the bounding boxes of their pins. Moreover, the total wirelength is underestimated by 26.8% when the positions of signal TSVs are not considered. The considerable error in wirelength estimation severely degrades the optimality of the floorplan result. Therefore, in this paper, we propose a two-stage 3-D fixed-outline floorplanning algorithm. Stage one simultaneously plans hard macros and TSV blocks for wirelength reduction. Stage two improves the wirelength by reassigning signal TSVs. Experimental results show that stage one outperforms a post-processing TSV planning algorithm in success rate by 57%. Compared to the post-processing TSV planning algorithm, the average wirelength of our result is shorter by 22.3%. In addition, stage two further reduces the wirelength by 3.45% without any area overhead.

Journal ArticleDOI
TL;DR: An adaptive quantization scheme based on the fast boundary adaptation rule (FBAR) and a differential pulse code modulation (DPCM) procedure, followed by online, least-storage quadrant tree decomposition (QTD) processing, is proposed, enabling a robust and compact image compression processor.
Abstract: This paper presents the architecture, algorithm, and VLSI hardware of image acquisition, storage, and compression on a single-chip CMOS image sensor. The image array is based on time-domain digital pixel sensor technology equipped with nondestructive storage capability, using an 8-bit static RAM embedded at the pixel level. The pixel-level memory is used to store the uncompressed illumination data during the integration mode as well as the compressed illumination data obtained after the compression stage. An adaptive quantization scheme based on the fast boundary adaptation rule (FBAR) and a differential pulse code modulation (DPCM) procedure, followed by online, least-storage quadrant tree decomposition (QTD) processing, is proposed, enabling a robust and compact image compression processor. A prototype chip including 64×64 pixels, read-out and control circuitry, as well as an on-chip compression processor was implemented in a 0.35 μm CMOS technology with a silicon area of 3.2×3.0 mm² and an overall power of 17 mW. Simulation and measurement results show compression figures corresponding to 0.6-1 bits per pixel (BPP), while maintaining reasonable peak signal-to-noise ratio levels.
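As a small illustration of the DPCM step in such a pipeline (a generic first-order predictor with a uniform quantizer, not the chip's FBAR-driven adaptive quantization), each pixel is coded as the quantized difference from the previously reconstructed value.

STEP = 4                                     # quantization step size (illustrative)

def dpcm_encode(samples):
    codes, prediction = [], 0
    for s in samples:
        q = round((s - prediction) / STEP)   # quantized prediction error
        codes.append(q)
        prediction += q * STEP               # track the decoder's reconstruction
    return codes

def dpcm_decode(codes):
    out, prediction = [], 0
    for q in codes:
        prediction += q * STEP
        out.append(prediction)
    return out

pixels = [120, 123, 130, 128, 140, 180, 181]
codes = dpcm_encode(pixels)
print(codes)                                 # small differences dominate after the first sample
print(dpcm_decode(codes))                    # reconstruction within about STEP/2 of the input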

Journal ArticleDOI
TL;DR: A generalized conflict-free memory addressing scheme is presented for memory-based fast Fourier transform (FFT) processors with parallel arithmetic processing units made up of radix-2^q multi-path delay commutator (MDC) units.
Abstract: This paper presents a generalized conflict-free memory addressing scheme for memory-based fast Fourier transform (FFT) processors with parallel arithmetic processing units made up of radix-2^q multi-path delay commutators (MDCs). The proposed addressing scheme supports continuous-flow operation with minimum shared memory requirements. To improve throughput, parallel high-radix processing units are employed. We prove that a solution for conflict-free memory access satisfying the constraints of continuous-flow, variable-size, higher-radix, and parallel-processing operation indeed exists. In addition, a rescheduling technique for twiddle-factor multiplication is developed to reduce hardware complexity and to enhance hardware efficiency. The results show that the proposed processor has high utilization and efficiency, supporting flexible configurability for various FFT sizes with fewer computation cycles than conventional radix-2/radix-4 memory-based FFT processors.
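The flavor of arithmetic behind conflict-free addressing can be seen in the classic digit-sum bank assignment for an in-place radix-r FFT, sketched below for radix 4: the r data points of any butterfly differ in exactly one radix-r digit, so taking the digit sum modulo r always spreads them over r distinct banks. This is a well-known baseline scheme, not the paper's generalized addressing for parallel radix-2^q MDC units with continuous-flow operation.

R, N = 4, 64                       # radix and FFT size (N = R**3 here)

def bank(i):
    """Bank index = sum of the radix-R digits of i, modulo R."""
    s = 0
    while i:
        s += i % R
        i //= R
    return s % R

stages, n = 0, N                   # number of radix-R stages
while n > 1:
    n //= R
    stages += 1

for stage in range(stages):
    stride = R ** stage
    for group in range(N // R):
        base = (group // stride) * stride * R + (group % stride)
        butterfly = [base + k * stride for k in range(R)]
        assert len({bank(i) for i in butterfly}) == R   # all R operands hit different banks
print("no bank conflicts in any stage")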

Journal ArticleDOI
TL;DR: A transient negative bit-line voltage technique is presented to improve the write-ability of SRAM cells and shows a 10^3X reduction in the write-failure probability with the proposed method.
Abstract: Increasing variations in device parameters significantly degrade the write-ability of SRAM cells in deep sub-100 nm CMOS technology. In this paper, a transient negative bit-line voltage technique is presented to improve the write-ability of SRAM cells. Capacitive coupling is used to generate a transient negative voltage at the low-going bit-line during the write operation without using any on-chip or off-chip negative voltage source. Statistical simulations in a 45-nm PD/SOI technology show a 10^3X reduction in the write-failure probability with the proposed method.

Journal ArticleDOI
TL;DR: Results show that 3T and MT FinFET standard cells can match the layout density of bulk cells, that 4T standard cells have an unacceptably worse layout density, and that MT standard cells are the only viable option for applying back biasing in FinFET standard cell circuits.
Abstract: In this paper, issues related to the physical design and layout density of FinFET standard cells are discussed. The analysis significantly extends previous analyses, which considered the simplistic case of a single FinFET device or extremely simple circuits. Results show that analysis of a single device cannot predict the layout density of FinFET cells, due to the additional spacing constraints imposed by the standard cell structure. Results on the layout density of FinFET standard cell circuits are derived by building and analyzing various cell libraries in a 32-nm technology, based on three-terminal (3T) and four-terminal (4T) devices, as well as on the recently proposed cells with mixed 3T-4T devices (MT). The results obtained for spacer- and lithography-defined FinFETs are also examined from the technology scaling perspective by considering 45- and 65-nm libraries. The effect of the fin and cell height on the layout density is studied. Results show that 3T and MT FinFET standard cells can have the same layout density as bulk cells (or better) for low (moderate) fin heights. Instead, 4T standard cells have an unacceptably worse layout density. Hence, MT standard cells turn out to be the only viable option for applying back biasing in FinFET standard cell circuits.

Journal ArticleDOI
TL;DR: A novel explicit-pulsed dual-edge triggered sense-amplifier flip-flop for low-power and high-performance applications is presented and a modification to the proposed circuit has led to an improved common-mode rejection ratio (CMRR) DET-SAFF.
Abstract: A novel explicit-pulsed dual-edge triggered sense-amplifier flip-flop (DET-SAFF) for low-power and high-performance applications is presented in this paper. By incorporating the dual-edge triggering mechanism in the new fast latch and employing conditional precharging, the DET-SAFF is able to achieve low power consumption with small delay. To further reduce the power consumption at low switching activities, a clock-gated sense-amplifier flip-flop (CG-SAFF) is employed. Extensive post-layout simulations show that the proposed DET-SAFF exhibits both low-power and high-speed properties, with delay and power reductions of up to 43.3% and 33.5% relative to the prior art, respectively. When the switching activity is less than 0.5, the proposed CG-SAFF demonstrates its superiority in terms of power reduction; at zero input switching activity, the CG-SAFF can realize up to 86% power saving. Lastly, a modification to the proposed circuit has led to an improved common-mode rejection ratio (CMRR) DET-SAFF.

Journal ArticleDOI
TL;DR: The proposed hardened memory cell overcomes the problems associated with the previous design by utilizing novel access and refreshing mechanisms, and achieves a 55% reduction in power-delay product compared to the DICE cell (with 12 transistors) while providing a significant improvement in soft error tolerance.
Abstract: This paper proposes a new hardened design for an 11-transistor (11T) CMOS memory cell at a 32 nm feature size. The proposed hardened memory cell overcomes the problems associated with the previous design by utilizing novel access and refreshing mechanisms. Simulation shows that the data stored in the proposed hardened memory cell does not change even for a transient pulse with more than twice the charge that would upset a conventional memory cell. Moreover, it achieves a 55% reduction in power-delay product compared to the DICE cell (with 12 transistors), providing a significant improvement in soft error tolerance. Simulation results are provided using the predictive technology file for the 32 nm CMOS feature size.