Showing papers presented at "Asia and South Pacific Design Automation Conference in 2013"


Proceedings Article
29 Apr 2013
TL;DR: This paper provides a framework for floorplanning existing 2D IP blocks into 3D-ICs using monolithic inter-tier vias (MIVs), and shows that while TSV-based 3D cannot improve performance and power unless the TSV capacitance is reduced, MIV-based 3D offers a significant reduction of up to 33% in the longest path delay and 35% in the inter-block net power.
Abstract: Three dimensional integrated circuits (3D-ICs) have emerged as a promising solution to continue device scaling. They can be realized using Through Silicon Vias (TSVs), or monolithic integration using Monolithic Inter-tier vias (MIVs), an emerging alternative that provides much higher via densities. In this paper, we provide a framework for floorplanning existing 2D IP blocks into 3D-ICs using MIVs. We take the floorplanning solution all the way through place-and-route and report post-layout metrics for area, wirelength, timing, and power consumption. Results show that TSV-based 3D designs outperform 2D designs in wirelength by up to 14%, and only in large-scale circuits. MIV-based 3D designs, however, offer an average wirelength improvement of 33% for a wide range of benchmark circuits. We also show that while TSV-based 3D cannot improve the performance and power unless the TSV capacitance is reduced, MIV-based 3D offers a significant reduction of up to 33% in the longest path delay and 35% in the inter-block net power.

74 citations


Proceedings Article
29 Apr 2013
TL;DR: It is shown how the algorithm can be efficiently implemented in OpenCL and optimized for multi-core CPUs, GPUs, and FPGAs, and the result is compared to a hand-coded FPGA implementation to showcase the effectiveness of an OpenCL-to-FPGA compilation tool.
Abstract: Fractal compression is an efficient technique for image and video encoding that uses the concept of self-referential codes. Although offering compression quality that matches or exceeds traditional techniques with a simpler and faster decoding process, fractal techniques have not gained widespread acceptance due to the computationally intensive nature of their encoding algorithm. In this paper, we present a real-time implementation of a fractal compression algorithm in OpenCL [1]. We show how the algorithm can be efficiently implemented in OpenCL and optimized for multi-core CPUs, GPUs, and FPGAs. We demonstrate that the core computation implemented on the FPGA through OpenCL is 3× faster than a high-end GPU and 114× faster than a multi-core CPU, with significant power advantages. We also compare to a hand-coded FPGA implementation to showcase the effectiveness of an OpenCL-to-FPGA compilation tool.

64 citations


Proceedings Article
29 Apr 2013
TL;DR: An extension of reversible gates that allows multiple target lines in a single gate is introduced, enabling a significantly cheaper mapping to quantum circuits.
Abstract: The efficient synthesis of quantum circuits is an active research area. Since many of the known quantum algorithms include a large Boolean component (e.g., the database in the Grover search algorithm), quantum circuits are commonly synthesized in a two-stage approach. First, the desired function is realized as a reversible circuit making use of existing synthesis methods for this domain. Afterwards, each reversible gate is mapped to a functionally equivalent quantum gate cascade. In this paper, we propose an improved mapping of reversible circuits to quantum circuits which exploits a certain structure of many reversible circuits. In fact, it can be observed that reversible circuits are often composed of similar gates which only differ in the position of their target lines. We introduce an extension of reversible gates that allows multiple target lines in a single gate. This enables a significantly cheaper mapping to quantum circuits. Experiments show that considering multiple target lines leads to improvements of up to 85% in the resulting quantum cost.

61 citations


Proceedings Article
29 Apr 2013
TL;DR: A fine-grained partial wear leveling policy is proposed in Curling-PCM, by which only part of the hot region is moved during each request handling period; it distributes write traffic across PCM chips more evenly than previous work.
Abstract: Phase change memory (PCM) has been used as a NOR flash replacement in embedded systems owing to its attractive features. However, the endurance of PCM keeps drifting down and greatly limits its adoption in embedded systems. As most embedded systems are application-oriented, we can better utilize PCM by exploring application-specific features such as fixed access patterns and update frequencies to prolong the lifetime of PCM. In this paper, we propose an application-specific wear leveling technique, called Curling-PCM, to evenly distribute write activities across the PCM chip in order to improve the endurance of PCM. The basic idea is to exploit application-specific features in embedded systems and periodically move the hot region across the whole PCM chip. To further reduce the overhead of moving the hot region and improve the performance of PCM-based embedded systems, a fine-grained partial wear leveling policy is proposed in Curling-PCM, by which only part of the hot region is moved during each request handling period. The experimental results show that Curling-PCM distributes write traffic across PCM chips more evenly than previous work. We expect this work can serve as a first step towards the full exploration of application-specific features in PCM-based embedded systems.

60 citations


Proceedings Article
29 Apr 2013
TL;DR: A minimal and defect-resilient routing algorithm is proposed in order to route packets adaptively through the shortest paths in the presence of a faulty link, as long as a path exists.
Abstract: The communication requirements of many-core embedded systems are met by the emerging Network-on-Chip (NoC) paradigm. As on-chip communication reliability is a crucial factor in many-core systems, the NoC paradigm should address the reliability issues. Using fault-tolerant routing algorithms to reroute packets around faulty regions will increase the packet latency and create congestion around the faulty region. On the other hand, the performance of NoC is highly affected by the network congestion. Congestion in the network can increase the delay of packets to route from a source to a destination, so it should be avoided. In this paper, a minimal and defect-resilient (MD) routing algorithm is proposed in order to route packets adaptively through the shortest paths in the presence of a faulty link, as long as a path exists. To avoid congestion, output channels can be adaptively chosen whenever the distance from the current to destination node is greater than one hop along both directions. In addition, an analytical model is presented to evaluate MD for two-faulty cases.

56 citations


Proceedings Article
29 Apr 2013
TL;DR: This work synthesizes the control logic that is used by the biochip controller to automatically execute the biochemical application, and proposes a control pin count minimization scheme aimed at efficiently utilizing chip area, reducing macro-assembly around the chip and enhancing chip scalability.
Abstract: In this paper we are interested in flow-based microfluidic biochips, which are able to integrate the necessary functions for biochemical analysis on-chip. In these chips, the flow of liquid is manipulated using integrated microvalves. By combining several microvalves, more complex units, such as micropumps, mixers, and multiplexers, can be built. In this paper we propose, for the first time to our knowledge, a top-down control synthesis framework for the flow-based biochips. Starting from a given biochemical application and a biochip architecture, we synthesize the control logic that is used by the biochip controller to automatically execute the biochemical application. We also propose a control pin count minimization scheme aimed at efficiently utilizing chip area, reducing macro-assembly around the chip and enhancing chip scalability. We have evaluated our approach using both real-life applications and synthetic benchmarks.

55 citations


Proceedings Article
29 Apr 2013
TL;DR: A routing protocol is proposed that exchanges network information between all chips in a given SiP to establish efficient deadlock-free routing paths, along with an optimization technique that analyzes the application traffic patterns and selects different spanning tree roots so as to minimize the average hop count and improve application performance.
Abstract: Inductive coupling is yet another 3D integration technique that can be used to stack more than three known-good-dies in a SiP without wire connections. We present a topology-agnostic 3D CMP architecture using inductive coupling that offers great flexibility in customizing the number of processor chips, SRAM chips, and DRAM chips in a SiP after chips have been fabricated. In this paper, first, we propose a routing protocol that exchanges the network information between all chips in a given SiP to establish efficient deadlock-free routing paths. Second, we propose an optimization technique for this protocol that analyzes the application traffic patterns and selects different spanning tree roots so as to minimize the average hop count and improve the application performance.

46 citations


Proceedings Article
29 Apr 2013
TL;DR: A simple but effective method is described that reduces the routing congestion and the final routed wirelength for large-scale mixed-size designs, thereby improving routability.
Abstract: One of the necessary requirements for the placement process is that it should be capable of generating routable solutions. This paper describes a simple but effective method leading to the reduction of the routing congestion and the final routed wirelength for large-scale mixed-size designs. In order to reduce routing congestion and improve routability, we propose blocking narrow regions on the chip. We also propose dummy-cell insertion inside regions characterized by reduced fixed-macro density. Our placer consists of three major components: (i) narrow channel reduction by performing neighbor-based fixed-macro inflation; (ii) dummy-cell insertion inside large regions with reduced fixed-macro density; and (iii) pre-placement inflation by detecting tangled logic structures in the netlist and minimizing the maximum pin density. We evaluated the quality of our placer using the newly released DAC 2012 routability-driven placement contest designs and we compared our results to the top four teams that participated in the placement contest. The experimental results reveal that our placer improves the routability of the DAC 2012 placement contest designs and effectively reduces the routing congestion.

43 citations


Proceedings Article
29 Apr 2013
TL;DR: This paper provides an overview of memristor-based PUF structures and circuits that illustrate the potential for nanoelectronic hardware security solutions.
Abstract: Hardware security has emerged as an important field of study aimed at mitigating issues such as piracy, counterfeiting, and side channel attacks. One popular countermeasure against such attacks is the physical unclonable function (PUF), which provides a hardware-specific unique signature or identification. The uniqueness of a PUF depends on intrinsic process variations within individual integrated circuits. As process variations become more prevalent due to technology scaling into the nanometer regime, novel nanoelectronic technologies such as memristors become viable options for improved security in emerging integrated circuits. In this paper, we provide an overview of memristor-based PUF structures and circuits that illustrate the potential for nanoelectronic hardware security solutions.

42 citations


Proceedings Article
29 Apr 2013
TL;DR: A routing method for generating a feasible wafer image that satisfies the connection requirements is proposed, together with hotspot reduction by dummy pattern flipping, and the effectiveness of the proposed framework is confirmed experimentally.
Abstract: Although Self-Aligned Double and Quadruple Patterning (SADP, SAQP) have become the most promising processes for sub-20 nm and sub-14 nm node advanced technologies, not all wafer images are realized by them. In advanced technologies, feasible wafer images should be generated effectively by utilizing SADP and SAQP where a wafer image is uniquely determined by a selected mandrel pattern. However, predicting the wafer image of a mandrel pattern is not easy. In this paper, we propose a routing method for generating a feasible wafer image satisfying the connection requirements. Routing algorithms comprising simple connecting and cutting rules are performed on a new grid structure where two (SADP) or three (SAQP) colors are assigned alternately to grid-nodes. Then a mandrel pattern is selected without complex coloring or decomposition methods. Also, hotspot reduction by dummy pattern flipping is proposed. In experiments, feasible wafer images meeting the connection requirements are generated and the effectiveness of the proposed framework is confirmed.

42 citations


Proceedings Article
29 Apr 2013
TL;DR: This paper proposes ScanPUF, a novel PUF implementation using a common on-chip structure used for improving circuit testability, namely scan chain, which exploits path delay variations between the scan flip-flops in a scan chain to create high-quality (in terms of uniqueness and robustness) secret keys.
Abstract: Physical Unclonable Functions (PUFs) have emerged as an attractive primitive to address diverse hardware security issues, such as chip authentication, intellectual property (IP) protection and cryptographic key generation. Existing PUFs, typically acquired and integrated in a design as a commodity, often incur considerable hardware overhead. Many of these PUFs also suffer from insufficient challenge-response pairs. In this paper, we propose ScanPUF, a novel PUF implementation using a common on-chip structure used for improving circuit testability, namely the scan chain. It exploits path delay variations between the scan flip-flops in a scan chain to create high-quality (in terms of uniqueness and robustness) secret keys. Furthermore, since a scan chain provides a large pool of scan paths to create a signature, we can achieve a high volume of secret keys from each chip. Since it uses a prevalent on-chip structure, the overhead is extremely small (2.3% area of the RO-PUF), primarily contributed by small additional logic in the signature-generation cycle controller. Circuit-level simulation results with 1000 chips under inter- and intra-die process variations show high uniqueness of 49.9% average inter-die Hamming distance and good reproducibility of 5% intra-die Hamming distance below 85 °C. The temporal variations due to device aging effects, e.g., bias temperature instability (BTI), lead to only 4% unstable bits for ten-year usage. The experimental evaluation on FPGA (Altera Cyclone-III) exhibits 47.1% average inter-Hamming distance, as well as 3.2% unstable bits at room temperature.
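The uniqueness and reproducibility figures in this abstract are Hamming-distance statistics over PUF responses. The following is a minimal sketch of how such metrics are commonly computed; the function names and the toy 8-bit responses are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def inter_die_hd(responses: list[str]) -> float:
    """Average pairwise Hamming fraction across different chips (ideal: 0.5)."""
    pairs = list(combinations(responses, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def intra_die_hd(reference: str, remeasured: list[str]) -> float:
    """Average Hamming fraction between one chip's reference and repeat reads (ideal: 0)."""
    return sum(hamming_fraction(reference, r) for r in remeasured) / len(remeasured)


# Toy 8-bit responses from three hypothetical chips (made-up data).
chips = ["10110010", "01100111", "11011000"]
print(f"inter-die HD: {inter_die_hd(chips):.3f}")
print(f"intra-die HD: {intra_die_hd(chips[0], ['10110010', '10110011']):.3f}")
```

An inter-die value near 50% means two chips are about as different as two random bit strings, which is why the reported 49.9% indicates high uniqueness.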

Proceedings Article
29 Apr 2013
TL;DR: Results show bit-level optimizations in HLS based on static analysis reduce circuit area by 9%, on average, while additional optimizations based on dynamic analysis provide 34% area reduction.
Abstract: We consider the extent to which the bit-level representation of variables can be used to optimize hardware generated by high-level synthesis (HLS). Two approaches to bit-level optimization are considered (individually and together): 1) range analysis, and 2) bitmask analysis. Range analysis aims to predetermine min/max ranges for variables to reduce the bitwidth required to represent variables in hardware. Bitmask analysis characterizes individual bits within a word as either constants (1 or 0), sign bits, or unknowns, where constants/don't-cares permit hardware to be eliminated under certain conditions. Static compiler-based analysis is contrasted with dynamic profiling-based analysis in terms of their potential to impact area and speed of HLS-generated hardware. For a set of benchmarks implemented in the Altera Cyclone II FPGA, results show bit-level optimizations in HLS based on static analysis reduce circuit area by 9%, on average, while additional optimizations based on dynamic analysis provide 34% area reduction.
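Range analysis, as described in this abstract, turns a known min/max range into a reduced bitwidth. A minimal sketch of that final step is shown below; the function name `required_bits` and the choice of unsigned versus two's-complement encoding are assumptions made for illustration, not the paper's implementation.

```python
def required_bits(lo: int, hi: int) -> int:
    """Minimum bitwidth able to represent every integer in [lo, hi].

    Assumption: unsigned encoding when lo >= 0, two's complement otherwise.
    """
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        return max(1, hi.bit_length())
    # n-bit two's complement covers [-2**(n-1), 2**(n-1) - 1].
    magnitude_bits = max((-lo - 1).bit_length(), max(hi, 0).bit_length())
    return magnitude_bits + 1


# A loop counter known to stay in [0, 99] needs 7 bits rather than a 32-bit int.
print(required_bits(0, 99))
# A signed sample known to stay in [-128, 127] needs 8 bits.
print(required_bits(-128, 127))
```

A static analysis must derive the range conservatively from the program text, whereas a profiling-based analysis observes it on sample inputs, which is one plausible reason the dynamic figure in the abstract is larger.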

Proceedings Article
29 Apr 2013
TL;DR: This paper proposes a hybrid approach integrating architecture-level and logic-level techniques to accurately estimate the vulnerability of all regular and irregular structures within a microprocessor, and evaluates the vulnerability of the OR1200 processor using the proposed approach.
Abstract: With continuous technology downscaling, the rate of radiation induced soft errors is rapidly increasing. Fast and accurate soft error vulnerability analysis in early design stages plays an important role in cost-effective reliability improvement. However, existing solutions are suitable for either regular (a.k.a. address-based, such as the memory hierarchy) or irregular (random logic, such as functional units and control logic) structures, failing to provide an accurate system-level analysis. In this paper, we propose a hybrid approach integrating architecture-level and logic-level techniques to accurately estimate the vulnerability of all regular and irregular structures within a microprocessor. All error propagation and masking scenarios are carefully handled among these structures. We have evaluated the vulnerability of the OR1200 processor using the proposed approach. Comparison with statistical fault injection shows an average inaccuracy of less than 7% with five orders of magnitude improvement in runtime.

Proceedings Article
29 Apr 2013
TL;DR: It is shown that circuit performance does not trade off so smoothly with mean time to failure (MTTF) as suggested by Black's Equation, and performance scaling achieved by reducing the EM lifetime requirement depends on the EM slack in the circuit, which in turn depends on factors such as timing constraints.
Abstract: Reliability issues significantly limit performance improvements from Moore's-Law scaling. At 45nm and below, electromigration (EM) is a serious reliability issue which affects global and local interconnects in a chip and limits performance scaling. Traditional IC implementation flows meet a 10-year lifetime requirement by overdesigning and sacrificing performance. At the same time, it is well-known among circuit designers that Black's Equation [2] suggests that lifetime can be traded for performance. In our work, we carefully study the impacts of EM-awareness on IC implementation outcomes, and show that circuit performance does not trade off so smoothly with mean time to failure (MTTF) as suggested by Black's Equation. We conduct two basic studies: EM lifetime versus performance with fixed resource budget, and EM lifetime versus resource with fixed performance. Using design examples implemented in two process nodes, we show that performance scaling achieved by reducing the EM lifetime requirement depends on the EM slack in the circuit, which in turn depends on factors such as timing constraints, length of critical paths and the mix of cell sizes. Depending on these factors, the performance gain can range from 10% to 80% when the lifetime requirement is reduced from 10 years to one year. We show that at a fixed performance requirement, power and area resources are affected by the timing slack and can either decrease by 3% or increase by 7.8% when the MTTF requirement is reduced. We also study how conventional EM fixes using per net Non-Default Rule (NDR) routing, downsizing of drivers, and fanout reduction affect performance at reduced lifetime requirements. Our study indicates, e.g., that NDR routing can increase performance by up to 5% but at the cost of 2% increase in area at a reduced 7-year lifetime requirement.
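For context on the idealized trade-off that this abstract argues against, Black's equation relates lifetime to current density as MTTF = A · J^(-n) · exp(Ea / (k·T)). The sketch below computes only that idealized headroom; the function name, the exponent values, and the assumption that temperature and the prefactor stay constant are illustrative choices, not results from the paper.

```python
def current_density_headroom(mttf_old_years: float,
                             mttf_new_years: float,
                             n: float = 2.0) -> float:
    """Ratio J_new / J_old permitted by Black's equation at fixed temperature.

    Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)).
    With A, Ea and T held constant (an assumption), J scales as MTTF**(-1/n).
    """
    if mttf_old_years <= 0 or mttf_new_years <= 0 or n <= 0:
        raise ValueError("lifetimes and exponent must be positive")
    return (mttf_old_years / mttf_new_years) ** (1.0 / n)


# Relaxing the lifetime requirement from 10 years to 1 year.
for n in (1.0, 2.0):  # assumed exponent values
    ratio = current_density_headroom(10.0, 1.0, n)
    print(f"n = {n}: allowed current density rises by {ratio:.2f}x")
```

This is only the smooth scaling implied by the equation. The abstract reports that the performance actually gained depends on the EM slack in the circuit, so the real gain does not follow this curve.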

Proceedings Article
29 Apr 2013
TL;DR: An analytical model is developed to predict the probability density function and covariance of temperatures and voltage droops of a die in the presence of BTI and process variation, and it is observed that, for benchmark circuits, treating each aspect independently and ignoring their intrinsic interactions results in 16% over-design, translating to unnecessary yield and performance loss.
Abstract: In the nano-scale regime, there are various sources of uncertainty and unpredictability in VLSI designs, such as transistor aging, mainly due to Bias Temperature Instability (BTI), as well as Process-Voltage-Temperature (PVT) variations. BTI varies exponentially with temperature and with the actual supply voltage seen by the transistors within the chip, both of which are functions of leakage power. Leakage power is strongly impacted by PVT and BTI, which in turn results in thermal-voltage variations. Hence, neglecting one or some of these aspects can lead to a considerable inaccuracy in the estimated BTI-induced delay degradation. However, a holistic approach to tackle all these issues and their interdependence is missing. In this paper, we develop an analytical model to predict the probability density function and covariance of temperatures and voltage droops of a die in the presence of BTI and process variation. Based on this model, we propose a statistical method that characterizes the lifetime of the circuit affected by BTI in the presence of process-induced temperature-voltage variations. We observe that for benchmark circuits, treating each aspect independently and ignoring their intrinsic interactions results in 16% over-design, translating to unnecessary yield and performance loss.

Proceedings Article
29 Apr 2013
TL;DR: An overview of temperature-related effects that threaten dependability and a methodology for reducing the dependability concerns through thermal management utilizing the concept of aging budgeting are presented.
Abstract: Dependability has become a growing concern in the nano-CMOS era due to elevated temperatures and an increased susceptibility to temperature of the small structures. We present an overview of temperature-related effects that threaten dependability and a methodology for reducing the dependability concerns through thermal management utilizing the concept of aging budgeting.

Proceedings Article
29 Apr 2013
TL;DR: This paper is the first to present a comprehensive analysis of the profitability of the hybrid electrical energy storage (HEES) systems while further providing a HEES design and control optimization framework to maximize the total return on investment (ROI).
Abstract: This paper is the first to present a comprehensive analysis of the profitability of the hybrid electrical energy storage (HEES) systems while further providing a HEES design and control optimization framework to maximize the total return on investment (ROI). The solution consists of two steps: (i) derivation of an optimal HEES management policy to maximize the daily energy cost saving and (ii) optimal design of the HEES system to maximize the amortized annual profit under budget and system volume constraints. We consider a HEES system comprised of lead-acid and Li-ion batteries for a case study. The optimal HEES system achieves an annual ROI of up to 60% higher than a lead-acid battery-only (Li-ion battery-only) system.

Proceedings Article
29 Apr 2013
TL;DR: This paper proposes to instrument the Android kernel in order to collect and report accurate subsystem activity values based on real-time profiling of the running applications, and describes a novel application design framework, which considers the battery's state-of-charge (SOC), the battery's energy depletion rate, and the service quality of the target application.
Abstract: Emerging mobile systems integrate a lot of functionality into a small form factor with a small energy source in the form of a rechargeable battery. This situation necessitates accurate estimation of the remaining energy in the battery such that user applications can be judicious in how they consume this scarce and precious resource. This paper thus focuses on estimating the remaining battery energy in Android OS-based mobile systems. This paper proposes to instrument the Android kernel in order to collect and report accurate subsystem activity values based on real-time profiling of the running applications. The activity information, along with offline-constructed, regression-based power macro models for major subsystems in the smartphone, yields the power dissipation estimate for the whole system. Next, while accounting for the rate-capacity effect in batteries, the total power dissipation data is translated into the battery's energy depletion rate, and subsequently used to compute the battery's remaining lifetime based on its current state of charge information. Finally, this paper describes a novel application design framework, which considers the battery's state-of-charge (SOC), the battery's energy depletion rate, and the service quality of the target application. The benefits of the design framework are illustrated by examining an archetypical case, involving the design space exploration and optimization of a GPS-based application in an Android OS.

Proceedings Article
Hanhua Qian, Hao Liang, Chip-Hong Chang, Wei Zhang, Hao Yu
29 Apr 2013
TL;DR: This model considers the thermal effect of TSVs at fine-granularity by calculating the anisotropic equivalent thermal conductances of a solid grid cell if TSVs are inserted and is much more accurate than 3D-ICE in its estimation of steady state temperature and thermal distribution.
Abstract: This paper presents a fast and accurate steady state thermal simulator for heatsink- and microfluid-cooled 3D-ICs. This model considers the thermal effect of TSVs at fine granularity by calculating the anisotropic equivalent thermal conductances of a solid grid cell if TSVs are inserted. The entrance effect of microchannels is also investigated for accurate modeling of microfluidic cooling. The proposed thermal simulator is verified against the commercial multiphysics solver COMSOL and compared with Hotspot and 3D-ICE. Simulation results show that for heatsink cooling, the proposed simulator is as accurate as Hotspot but runs much faster at moderate granularity. For microfluidic cooling, our proposed simulator is much more accurate than 3D-ICE in its estimation of steady state temperature and thermal distribution.

Proceedings Article
29 Apr 2013
TL;DR: This paper proposes to reduce the number of refresh operations through re-arranging program data layout at compilation time and proposes an N-refresh scheme, which can reduce the dynamic energy consumption.
Abstract: Spin-Transfer Torque RAM (STT-RAM) has been proposed to build on-chip caches because of its attractive features: high storage density and negligible leakage power. Recently, researchers have proposed to improve the write performance of STT-RAM by relaxing its non-volatility property. To avoid data loss resulting from volatility, refresh schemes have been proposed. However, refresh operations consume additional energy. In this paper, we propose to reduce the number of refresh operations through re-arranging program data layout at compilation time. An N-refresh scheme is also proposed. Experimental results show that, on average, the proposed methods can reduce the number of refresh operations by 73.3%, and reduce the dynamic energy consumption by 27.6%.

Proceedings Article
29 Apr 2013
TL;DR: A new logic synthesis methodology, called MIXSyn, is presented that produces area-efficient results for mixed XOR-AND/OR dominated logic functions and can exploit the novel XOR implementations offered by double-gate ambipolar devices.
Abstract: We present a new logic synthesis methodology, called MIXSyn, that produces area-efficient results for mixed XOR-AND/OR dominated logic functions. MIXSyn is a two-step synthesis process. The first step is a hybrid logic optimization that enables selective and distinct optimization of AND/OR and XOR-intensive portions of the logic circuit. The second step is a library-free technology mapping that enhances design flexibility with a tractable computational cost. MIXSyn has been tested on a set of large MCNC benchmarks. Experimental results indicate that MIXSyn produces CMOS circuits with 18.0% and 9.2% fewer devices, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively. MIXSyn can also exploit the novel XOR implementations offered by the use of double-gate ambipolar devices. Experimental results show that MIXSyn can reduce the number of ambipolar transistors by 20.9% and 15.3%, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively.

Proceedings Article
29 Apr 2013
TL;DR: A Microfluidic Hardware Design Language (MHDL) for LoC specification is introduced, along with software tools that help LoC designers verify the correctness of their specifications and estimate their performance.
Abstract: This paper describes an integrated design, verification, and simulation environment for programmable microfluidic devices called laboratories-on-chip (LoCs). Today's LoCs are architected and laid out by hand, which is time-consuming, tedious, and error-prone. To increase designer productivity, this paper introduces a Microfluidic Hardware Design Language (MHDL) for LoC specification, along with software tools that help LoC designers verify the correctness of their specifications and estimate their performance.

Proceedings Article
29 Apr 2013
TL;DR: Detailed circuit-level models are presented, 2D and 3D main memory latencies are compared, and it is shown that although a 3D memory hierarchy exploiting a wider memory bus can increase performance, this performance increase may not justify the net increase in power consumption.
Abstract: In recent years, 3D technology has been a popular area of study that has allowed researchers to explore a number of novel computer architectures. One of the more popular topics is that of integrating 3D main memory dies below the computing die and connecting them with through-silicon vias (TSVs). This is assumed to reduce off-chip main memory access latencies by roughly 45% to 60%. Our detailed circuit-level models, however, demonstrate that this latency reduction from the TSVs is significantly less. In this paper, we present these models, compare 2D and 3D main memory latencies, and show that the reduction in latency from using 3D main memory is no more than 2.4 ns. We also show that although the wider I/O bus width enabled by using TSVs increases performance, it may do so with an increase in power consumption. Although TSVs consume less power per bit transfer than off-chip metal interconnects (11.2 times less power per bit transfer), TSVs typically use considerably more bits and may result in a net increase in power due to the large number of bits in the memory I/O bus. Our analysis shows that although a 3D memory hierarchy exploiting a wider memory bus can increase performance, this performance increase may not justify the net increase in power consumption.
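The power argument in this abstract comes down to simple arithmetic: an 11.2× per-bit advantage is cancelled once the TSV bus moves more than 11.2× as many bits. The sketch below illustrates that break-even point under a deliberately crude model in which I/O power is proportional to bits transferred times per-bit power; the function name and the bus widths are hypothetical, and only the 11.2 factor comes from the abstract.

```python
TSV_PER_BIT_ADVANTAGE = 11.2  # per-bit power ratio quoted in the abstract


def relative_io_power(bits_3d: float, bits_2d: float,
                      advantage: float = TSV_PER_BIT_ADVANTAGE) -> float:
    """I/O power of the TSV bus relative to the off-chip bus (1.0 = break-even).

    Assumed model: power is proportional to (bits transferred) x (power per bit).
    """
    if bits_3d <= 0 or bits_2d <= 0 or advantage <= 0:
        raise ValueError("all arguments must be positive")
    return (bits_3d / bits_2d) / advantage


# Hypothetical bus widths compared against a hypothetical 64-bit off-chip bus.
for width in (512, 1024):
    rel = relative_io_power(width, 64)
    verdict = "net increase" if rel > 1.0 else "net saving"
    print(f"{width}-bit TSV bus: {rel:.2f}x the off-chip I/O power ({verdict})")
```

Under this model an 8× wider bus still saves I/O power, while a 16× wider bus consumes more than the off-chip bus it replaces.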

Proceedings Article
29 Apr 2013
TL;DR: Experiments show that under the nonlinear electrical-thermal TSV model, insertion of thermal TSVs can effectively reduce temperature-gradient-induced clock skew by 58.4% on average, and achieves 11.6% higher clock-skew reduction than the result under the linear electrical-thermal model.
Abstract: 3D physical design needs an accurate device model of through-silicon vias (TSVs). In this paper, a physics-based electrical-thermal model is introduced for both signal and dummy thermal TSVs with the consideration of nonlinear electrical-thermal dependence. Taking thermal-reliable 3D clock-tree synthesis as a case study to verify the effectiveness of the proposed TSV model, a nonlinear programming-based clock-skew reduction problem is formulated to allocate thermal TSVs for clock-skew reduction under non-uniform temperature distribution. With a number of 3D clock-tree benchmarks, experiments show that under the nonlinear electrical-thermal TSV model, insertion of thermal TSVs can effectively reduce temperature-gradient-induced clock skew by 58.4% on average, and achieves 11.6% higher clock-skew reduction than the result under the linear electrical-thermal model.

Proceedings ArticleDOI
29 Apr 2013
TL;DR: This work presents for the first time a framework that yields provably optimal test cubes by using the theory of quantified Boolean formulas (QBF) and demonstrates the quality gain of the proposed method.
Abstract: Circuits that employ test pattern compression rely on test cubes to achieve high compression ratios. The fewer inputs of a test pattern are specified, the better it can be compacted, and hence the lower the test application time. Although previous approaches exist for generating such test cubes, none of them is optimal. We present for the first time a framework that yields provably optimal test cubes by using the theory of quantified Boolean formulas (QBF). Extensive comparisons with previous methods demonstrate the quality gain of the proposed method.

Proceedings ArticleDOI
29 Apr 2013
TL;DR: Experimental results demonstrate that the proposed thermal management scheme can reduce the number of hot spots by 50% compared to the simple lowest-temperature-based task scheduling method, leading to a more uniform on-chip temperature distribution across the microprocessor cores.
Abstract: Dynamic thermal management is a viable way to effectively mitigate thermal emergencies. In this paper, a new thermal management scheme is proposed to reduce the on-chip temperature variance and the occurrence of hot spots by considering more transient thermal effects. The new method performs task migrations to reduce the temperature variations across the chip. Instead of intuitively assigning heavy tasks to low-temperature cores to balance the thermal profile based on steady-state thermal analysis, the proposed method applies moment-matching-based transient thermal analysis techniques for fast thermal estimation and prediction to guide the migration process. We show that by considering the dominant temperature moment component, the resulting algorithm can lead to a significant reduction of hot spots without full transient thermal simulation. Our experimental results on a 16-core microprocessor demonstrate that the proposed method can reduce the number of hot spots by 50% compared to the simple lowest-temperature-based task scheduling method, leading to a more uniform on-chip temperature distribution across the microprocessor cores.

Proceedings ArticleDOI
29 Apr 2013
TL;DR: The crosstalk noise model is derived from the perspective of the 3D chip, and then ShieldUS, a runtime data-to-TSVs remapping strategy, is proposed, which makes dynamic shielding practical and does not need predefined parameters.
Abstract: 3D IC is a promising technology to meet the demands of high throughput, high scalability, and low power consumption for future-generation integrated circuits. One way to implement a 3D IC is to interconnect layers of two-dimensional (2D) ICs with Through-Silicon Vias (TSVs), which shortens signal lengths. Unfortunately, when TSVs are bundled together as a cluster, crosstalk coupling noise may lead to transmission errors. As a result, the working frequency of the TSVs has to be lowered to avoid the errors, which narrows the bandwidth that the TSVs can provide. In this paper, we first derive the crosstalk noise model from the perspective of the 3D chip and then propose ShieldUS, a runtime data-to-TSVs remapping strategy. With ShieldUS, the transition patterns of data over TSVs are observed at runtime, and relatively stable bits are mapped to TSVs that act as shields to protect the other bits, which fluctuate more. We evaluate the performance of ShieldUS with address lines from real benchmark traces and data lines of different similarities. The results show that ShieldUS is accurate and flexible. We further study dynamic shielding: our Interval Equilibration Unit (IEU) design can intelligently select suitable parameters for dynamic shielding, which makes dynamic shielding practical and removes the need to predefine parameters. This also improves the practicability of ShieldUS.
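
The idea of observing transition patterns and treating the most stable bits as shield candidates can be sketched as follows. This is a minimal illustration only, not the ShieldUS implementation: it simply counts bit flips over a window of observed words and picks the least active bit positions. The function names, the window of words, and the number of shields are all hypothetical, and the physical mapping of bits onto TSV positions is not modeled.

```python
from typing import List


def transition_counts(words: List[int], width: int) -> List[int]:
    """Count how often each bit position toggles between consecutive words."""
    counts = [0] * width
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur  # a 1 marks a bit that changed value
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


def pick_shield_bits(words: List[int], width: int, num_shields: int) -> List[int]:
    """Return the num_shields bit positions with the fewest transitions.

    Ties are broken by the lower bit index so the result is deterministic.
    """
    counts = transition_counts(words, width)
    order = sorted(range(width), key=lambda b: (counts[b], b))
    return sorted(order[:num_shields])


# Hypothetical 4-bit trace: the two low bits toggle, the two high bits are stable.
trace = [0b0000, 0b0001, 0b0011, 0b0001, 0b0000]
print(transition_counts(trace, 4))    # per-bit toggle counts, bit 0 first
print(pick_shield_bits(trace, 4, 2))  # indices of the two most stable bits
```

In this toy trace the upper two bits never toggle, so they would be the natural shield candidates under the stated assumption that stability alone decides the mapping.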

Proceedings ArticleDOI
29 Apr 2013
TL;DR: This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM) to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM.
Abstract: This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM). Our approach dynamically allocates and relocates memory blocks between the 3D-DRAM and the ex-DRAM to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM. Our simulation shows that in workloads that are not memory intensive, our memory management technique transfers all active memory blocks to the 3D-DRAM which runs faster than the ex-DRAM. In memory intensive workloads, our memory management technique utilizes both the 3D-DRAM and the ex-DRAM to increase the memory bandwidth to alleviate bandwidth congestion. Our approach supports Quality of Service (QoS) for “latency sensitive”, “bandwidth sensitive”, and “insensitive” applications. To improve the performance and satisfy a certain level of QoS, memory blocks of different application types are allocated differently. Compared to the scratchpad memory management mechanism, the average memory access latency of our approach decreases by 19% and 23%, while performance improves by up to 5% and 12% in single threaded benchmarks and multi-threaded benchmarks respectively. Moreover, using our approach, applications do not need to manage memory explicitly like in the scratchpad case. Our memory block relocation comes with negligible performance overhead, particularly for applications which have high spatial memory locality.

Proceedings ArticleDOI
29 Apr 2013
TL;DR: Using the proposed gain peaking technique, this transceiver achieves good gain flatness and is capable of more than 7 Gbps in 16QAM wireless communication for all channels of the IEEE 802.11ad standard within an EVM of around -23 dB.
Abstract: This paper presents a 60-GHz direct-conversion transceiver in 65 nm CMOS technology. Using the proposed gain peaking technique, this transceiver achieves good gain flatness and is capable of more than 7 Gbps in 16QAM wireless communication for all channels of the IEEE 802.11ad standard within an EVM of around -23 dB. The transceiver consumes 319 mW in transmitting and 223 mW in receiving, including the PLL consumption.

Proceedings ArticleDOI
29 Apr 2013
TL;DR: It is demonstrated that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU.
Abstract: High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages, and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning, and resource sharing in order to improve performance and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, present a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. This study shows that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.