
Showing papers presented at "Asia and South Pacific Design Automation Conference in 2012"


Proceedings ArticleDOI
Yen-Kuang Chen
09 Mar 2012
TL;DR: An overview is provided of challenges and opportunities presented by the M2M Internet, where hundreds of billions of smart sensors and devices will interact with one another without human intervention, on a Machine-to-Machine (M2M) basis.
Abstract: To date, most Internet applications focus on providing information, interaction, and entertainment for humans. However, with the widespread deployment of networked, intelligent sensor technologies, an Internet of Things (IoT) is steadily evolving, much like the Internet decades ago. In the future, hundreds of billions of smart sensors and devices will interact with one another without human intervention, on a Machine-to-Machine (M2M) basis. They will generate an enormous amount of data at an unprecedented scale and resolution, providing humans with information and control of events and objects even in remote physical environments. The scale of the M2M Internet will be several orders of magnitude larger than the existing Internet, posing serious research challenges. This paper will provide an overview of challenges and opportunities presented by this new paradigm.

270 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: A new synthesis approach which relies on concepts that are complementary to existing ones and exploits Quantum Multiple-valued Decision Diagrams (QMDDs) for this purpose, enabling automatic synthesis of large reversible functions with the minimal number of circuit lines.
Abstract: Reversible circuits are an emerging technology where all computations are performed in an invertible manner. Motivated by their promising applications, e.g., in the domain of quantum computation or in low-power design, the synthesis of such circuits has been intensely studied. However, how to automatically realize reversible circuits with the minimal number of lines for large functions is an open research problem. In this paper, we propose a new synthesis approach which relies on concepts that are complementary to existing ones. While “conventional” function representations have been applied for synthesis so far (such as truth tables, ESOPs, BDDs), we exploit Quantum Multiple-valued Decision Diagrams (QMDDs) for this purpose. An algorithm is presented that performs transformations on this data structure, eventually leading to the desired circuit. Experimental results confirm the novelty of the proposed approach, which enables automatic synthesis of large reversible functions with the minimal number of circuit lines. Furthermore, the quantum cost of the resulting circuits is reduced by 50% on average compared to an existing state-of-the-art synthesis method.

105 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: EPIC is an efficient and effective predictor for IC manufacturing hotspots in deep sub-wavelength lithography and proposes a unified framework to combine different hotspot detection methods together, such as machine learning and pattern matching, using mathematical programming/optimization.
Abstract: In this paper we present EPIC, an efficient and effective predictor for IC manufacturing hotspots in deep sub-wavelength lithography. EPIC proposes a unified framework to combine different hotspot detection methods, such as machine learning and pattern matching, using mathematical programming/optimization. The EPIC algorithm has been tested on a number of industry benchmarks under advanced manufacturing conditions. It demonstrates the best capability so far in selectively combining the desirable features of various hotspot detection methods (3.5–8.2% accuracy improvement) as well as significant suppression of detection noise (e.g., 80% false-alarm reduction). These characteristics make EPIC well suited to high-performance physical verification and to guiding efficient, manufacturability-friendly physical design.

88 citations
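The abstract above describes fusing several hotspot detectors through mathematical programming. A minimal sketch of the score-combination idea, assuming a least-squares weight fit rather than the paper's actual optimization formulation (detector scores and labels below are invented):

```python
# Hedged sketch (not EPIC's actual formulation): combine the scores of several
# hotspot detectors with weights fit by least squares on labeled layout clips.
# Detector scores and labels below are made-up illustrative data.
import numpy as np

# rows = layout clips, columns = [ML detector score, pattern-matching score]
scores = np.array([[0.9, 1.0],
                   [0.8, 0.0],
                   [0.2, 1.0],
                   [0.1, 0.0],
                   [0.7, 1.0],
                   [0.3, 0.0]])
labels = np.array([1, 1, 1, 0, 0, 0])                 # 1 = real hotspot, 0 = not

X = np.hstack([scores, np.ones((len(scores), 1))])    # add a bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)        # least-squares weights

combined = X @ w                                      # fused hotspot score
predicted = (combined > 0.5).astype(int)              # simple threshold
print("weights:", w)
print("predicted hotspots:", predicted, "true:", labels)
```

A real flow would instead optimize a weighted accuracy/false-alarm objective, which is where the paper's mathematical programming comes in.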


Proceedings ArticleDOI
09 Mar 2012
TL;DR: A scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm consisting of a heterogeneous, tile-based manycore structure and a multi-agent management layer underpinned by distributed runtime and OS services is introduced.
Abstract: This paper introduces a scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm. The hardware architecture consists of a heterogeneous, tile-based manycore structure while the software architecture comprises a multi-agent management layer underpinned by distributed runtime and OS services. The necessity for invasive-specific hardware assist functions is analytically shown and their integration into the overall manycore environment is described.

79 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This work is the first successful one to parallelize R-tree queries on the GPU, and it also proposes the first R-tree construction method on the GPU, which does not depend on a partition algorithm and guarantees the same quality as sequential construction.
Abstract: The R-tree is an important spatial data structure used in EDA as well as other fields. Although there is a large literature on parallel R-tree queries, as far as we know, our work is the first successful one to parallelize R-tree queries on the GPU. We also propose the first R-tree construction method on the GPU. Unlike other parallel construction methods, our method does not depend on a partition algorithm and guarantees the same quality as sequential construction. Experiments show that more than 30× speedup on R-tree query and more than 20× speedup on R-tree construction are achieved.

62 citations
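For readers unfamiliar with the data structure, here is a tiny sequential sketch of the R-tree window query that the paper maps to the GPU (two levels only, with an arbitrary grouping of rectangles; the paper's GPU construction and query are far more involved):

```python
# Hedged sketch: a tiny, sequential two-level R-tree-like index and a window
# query, to illustrate the operation the paper parallelizes on the GPU.
# Rectangles are (xmin, ymin, xmax, ymax); the grouping below is arbitrary.

def intersects(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def mbr(rects):
    xs0, ys0, xs1, ys1 = zip(*rects)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

# leaf nodes hold a few data rectangles each; each node stores its MBR
leaves = [
    [(0, 0, 1, 1), (2, 2, 3, 3)],
    [(5, 5, 6, 6), (7, 1, 8, 2)],
    [(4, 4, 9, 9)],
]
nodes = [(mbr(leaf), leaf) for leaf in leaves]

def query(window):
    hits = []
    for node_mbr, leaf in nodes:           # prune whole leaves by their MBR
        if intersects(node_mbr, window):
            hits += [r for r in leaf if intersects(r, window)]
    return hits

print(query((2.5, 2.5, 5.5, 5.5)))   # rectangles overlapping the window
```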


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper proposes to use ECC codes to relax the BER (Bit Error Rate) requirement of a single memory in order to improve the write energy consumption and latency for both the MOS-based and cross-point-based memristor ReRAM designs.
Abstract: The emerging memristor-based Resistive RAM (ReRAM) has shown great potential as one of the most promising memory technologies, with unique properties such as high density, low power, good scalability, and non-volatility. However, as the process technology scales, process variation causes the actual electrical behavior of the memristor to deviate. Recently, researchers have observed that the probability of a single ReRAM cell switching successfully follows a function of the logarithm of the total programming time. As a result, the uncertainty of the electrical behavior results in different degrees of error rates in ReRAM-based memory. Traditional ECC (Error Correcting Code) designs for conventional DRAM memory are used to detect and correct errors in the memory system. In this paper, based on a mathematical analysis of the error patterns in memristor-based ReRAM and a study of ECC designs, we propose to use ECC codes to relax the BER (Bit Error Rate) requirement of a single memory in order to improve the write energy consumption and latency for both the MOS-based and cross-point-based memristor ReRAM designs. In addition, the performance/power/area overhead of the proposed design options is evaluated in detail.

62 citations
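The trade-off the abstract exploits can be illustrated with a standard reliability calculation: with a t-error-correcting code over an n-bit word, the word fails only when more than t cells are in error. The sketch below assumes independent cell errors and uses illustrative BER values, not the paper's measured data:

```python
# Hedged sketch of the ECC/BER trade-off the abstract relies on: with a
# t-error-correcting code over an n-bit word, the word fails only if more
# than t cells flip.  Raw BER values below are illustrative, not measured.
from math import comb

def word_failure_prob(n, t, p):
    """P(more than t of n cells are in error), cells independent with BER p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1, n + 1))

n = 72                      # e.g. 64 data bits + 8 check bits (illustrative)
for p in (1e-3, 1e-4, 1e-5):
    print(f"raw BER {p:.0e}:  no ECC {1-(1-p)**n:.2e}   "
          f"SEC (t=1) {word_failure_prob(n, 1, p):.2e}   "
          f"t=2 {word_failure_prob(n, 2, p):.2e}")
```

The point of the paper is that, because ECC absorbs a higher raw BER, each cell can be written with less energy or for a shorter time.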


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper proposes a novel algorithm to optimally assign cuts to 193i or E-Beam processes with proper modifications on cut distribution in order to maximize the overall throughput and shows that the throughput is dramatically improved by the cut redistribution.
Abstract: Since some major IC industry participants are moving to highly regular 1D gridded designs to enable scaling to sub-20nm nodes, how to manufacture the randomly distributed cuts with reasonable throughput and process variation becomes a big challenge. With the help of hybrid lithography, different types of processes can be applied to a single layer so that the advantages of different technologies can be combined to further benefit manufacturing. In this paper, targeting cut printing difficulties and hybrid lithography with electron beam (E-Beam) and 193 nm immersion (193i) processes, we propose a novel algorithm to optimally assign cuts to the 193i or E-Beam process, with proper modifications of the cut distribution, in order to maximize the overall throughput. To validate our method, we construct our algorithm based on the forbidden patterns obtained from optical simulation; we then formulate the redistribution problem as a well-defined ILP problem and finally call a reliable solver to solve the whole problem. Experimental results show that the throughput is dramatically improved by the cut redistribution. Besides that, for sparser layers the EBL process can be avoided entirely, which largely reduces the fabrication cost.

57 citations
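A heavily simplified sketch of the kind of assignment ILP described above, assuming the PuLP package as the solver front end; the cuts, forbidden 193i pairs, and throughput proxy are invented and do not reflect the paper's actual formulation:

```python
# Hedged sketch of the kind of ILP the paper formulates (simplified, not the
# authors' actual model).  Assumes the PuLP package; cuts and "forbidden"
# 193i neighbor pairs below are invented for illustration.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

cuts = ["c1", "c2", "c3", "c4", "c5"]
# pairs of cuts that cannot both be printed with 193i (e.g. too close together)
forbidden_193i_pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]

prob = LpProblem("cut_assignment", LpMaximize)
x = {c: LpVariable(f"x_{c}", cat=LpBinary) for c in cuts}  # 1 = 193i, 0 = E-Beam

# throughput proxy: print as many cuts as possible with the faster 193i process
prob += lpSum(x[c] for c in cuts)
for a, b in forbidden_193i_pairs:
    prob += x[a] + x[b] <= 1            # at most one of a forbidden pair on 193i

prob.solve()
print({c: ("193i" if value(x[c]) > 0.5 else "E-Beam") for c in cuts})
```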


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This work presents a methodology that parallelizes the simulation of mixed-abstraction level SystemC models across multicore CPUs, and graphics processing units (GPUs) for improved simulation performance.
Abstract: This work presents a methodology that parallelizes the simulation of mixed-abstraction level SystemC models across multicore CPUs and graphics processing units (GPUs) for improved simulation performance. Given a SystemC model, we partition it into processes suitable for GPU execution and CPU execution. We convert the processes identified for GPU execution into GPU kernels with additional SystemC wrapper processes that invoke these kernels. The wrappers enable seamless communication of events in all directions between the GPUs and CPUs. We alter the OSCI SystemC simulation kernel to allow parallel execution of processes. Hence, we co-simulate the SystemC processes on multiple CPUs and the GPU kernels on the GPUs in parallel, exploiting both the CPUs and the GPUs for faster simulation. We experiment with synthetic benchmarks and a set-top box case study.

55 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: A modular framework is proposed that enables scheduling of time-triggered distributed embedded systems and provides a symbolic representation used by an Integer Linear Programming (ILP) solver to determine a schedule that respects all bus and processor constraints as well as end-to-end timing constraints.
Abstract: This paper proposes a modular framework that enables scheduling of time-triggered distributed embedded systems. The framework provides a symbolic representation that is used by an Integer Linear Programming (ILP) solver to determine a schedule that respects all bus and processor constraints as well as end-to-end timing constraints. Unlike other approaches, the proposed technique complies with automotive-specific requirements at the system level and is fully extensible. Formulations for common time-triggered automotive operating systems and bus systems are presented. The proposed model supports the automotive bus systems FlexRay 2.1 and 3.0. For the operating systems, formulations for an eCos-based non-preemptive component and a preemptive OSEKtime operating system are introduced. A case study from the automotive domain gives evidence of the applicability of the proposed approach by scheduling multiple distributed control functions concurrently. Finally, a scalability analysis is carried out with synthetic test cases.

53 citations
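As a much simpler companion to the ILP described above, the sketch below only checks a given time-triggered bus schedule for slot overlaps over the hyperperiod; the frame parameters are invented:

```python
# Hedged sketch: a feasibility check for a time-triggered bus schedule
# (much simpler than the paper's ILP).  Each frame is (offset, duration,
# period) in the same time unit; the example values are invented.
from math import lcm

frames = {
    "brake_msg":    (0, 2, 10),
    "steering_msg": (3, 2, 5),
    "body_msg":     (5, 3, 20),
}

hyperperiod = lcm(*(p for _, _, p in frames.values()))

# expand every periodic instance into concrete (start, end) slots
slots = []
for name, (offset, dur, period) in frames.items():
    for k in range(hyperperiod // period):
        start = offset + k * period
        slots.append((start, start + dur, name))

slots.sort()
for (s1, e1, n1), (s2, e2, n2) in zip(slots, slots[1:]):
    if s2 < e1:                          # next slot starts before the previous ends
        print(f"conflict: {n1} and {n2} overlap at t={s2}")
        break
else:
    print(f"schedule is conflict-free over hyperperiod {hyperperiod}")
```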


Proceedings ArticleDOI
09 Mar 2012
TL;DR: The proposed dual-mode architecture achieves both the low startup voltage in a startup mode and high conversion efficiency in a normal operation mode without off-chip inductors and capacitors.
Abstract: In this paper, a fully integrated low voltage charge pump for thermoelectric energy harvesters is presented. The proposed dual-mode architecture achieves both the low startup voltage in a startup mode and high conversion efficiency in a normal operation mode without off-chip inductors and capacitors. In the measurement, the proposed circuit successfully converts 120-mV input to 770-mV output with 38.8% conversion efficiency.

51 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: The focus of the methodology is the virtual prototyping of the embedded software combined with prototypes of the physical environment, in order to capture the complete closed control loop of the software over the hardware, via sensors/actuators, with the physical objects.
Abstract: The modeling and analysis of Cyber-Physical Systems (CPS) is one of the key challenges in complex system design, as heterogeneous components are combined and their close interaction with the physical environment has to be considered. This article presents a methodology and an open toolset for the virtual prototyping of CPS. The focus of the methodology is the virtual prototyping of the embedded software combined with the prototyping of the physical environment, in order to capture the complete closed control loop of the software over the hardware, via sensors/actuators, with the physical objects. The methodology is based on the application of integrated open source tools and standard languages, i.e., C/C++, SystemC, and the Open Dynamics Engine, which are combined into a powerful simulation framework. Key activities of the methodology are outlined using the example of an electric two-wheel vehicle.
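A toy stand-in for the closed control loop the methodology targets, assuming a first-order plant integrated with forward Euler and a periodic proportional controller (the real toolset couples C/C++/SystemC software with the Open Dynamics Engine; all constants below are illustrative):

```python
# Hedged, minimal stand-in for the closed control loop described above.
# Here: a first-order "physical" plant integrated with forward Euler,
# sampled and actuated by a periodic software controller.
dt = 0.001            # physics time step [s]
ctrl_period = 0.010   # controller period [s]
target = 1.0          # set point (e.g. desired speed), illustrative
kp = 2.0              # proportional gain, illustrative

state = 0.0           # plant state (e.g. vehicle speed)
actuation = 0.0       # value written by the controller ("actuator")

t = 0.0
while t < 1.0:
    # embedded-software step: runs every ctrl_period (sensor -> control law -> actuator)
    if round(t / dt) % round(ctrl_period / dt) == 0:
        sensor = state                       # sample the plant
        actuation = kp * (target - sensor)   # proportional control law

    # physical-environment step: simple first-order dynamics with damping
    state += dt * (actuation - 0.5 * state)
    t += dt

print(f"plant state after 1 s: {state:.3f} (set point {target})")
```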

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Various circuit structures for nvLogic and nvSRAM are explored, taking into account memristor endurance, especially for low-voltage applications.
Abstract: The use of low voltage circuits and power-off mode help to reduce the power consumption of chips. Non-volatile logic (nvLogic) and nonvolatile SRAM (nvSRAM) enable a chip to preserve its key local states and data, while providing faster power-on/off speeds than those available with conventional two-macro schemes. Resistive memory (memristor) devices feature fast write speed and low write power. Applying memristors to nvLogic and nvSRAMs not only enables chips to achieve low power consumption for store operations, but also achieve fast power-on/off processes and reliable operation even in the event of sudden power failure. However, current memristor devices suffer from limited endurance, which influences the design of the circuit structure for memristor-based nvLogic and nvSRAM. Moreover, previous nvLogic/nvSRAM circuits cannot achieve low voltage operation. This paper explores various circuit structures for nvLogic and nvSRAM, taking into account memristor endurance, especially for low-voltage applications.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A heuristic method is shown for reducing the number of variables needed to represent incompletely specified index generation functions using linear decompositions, guided by an imbalance measure and an ambiguity measure.
Abstract: This paper shows a heuristic method to reduce the number of variables to represent incompletely specified index generation functions using linear decompositions. To find good linear transformations, two measures are introduced: the imbalance measure and the ambiguity measure. Experimental results using m-out-of-n code to binary converters, randomly generated functions, IP address tables, and lists of English words show the usefulness of the approach.
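A rough sketch of variable reduction via linear decompositions, using a plain greedy split count rather than the paper's imbalance and ambiguity measures; the registered vectors and the restriction to single variables and pairwise XORs are illustrative assumptions:

```python
# Hedged sketch of variable reduction by linear decomposition (a plain greedy
# split count, not the paper's imbalance/ambiguity measures).  Candidate
# linear functions are single variables and pairwise XORs; registered vectors
# are invented 6-bit examples.
from itertools import combinations

registered = ["101100", "110010", "011001", "000111", "111111", "100001"]
n = len(registered[0])

def lin_val(vec, subset):                      # XOR of the selected input bits
    return sum(int(vec[i]) for i in subset) % 2

def confused_pairs(signatures):                # pairs of vectors not yet distinguished
    return sum(a == b for a, b in combinations(signatures, 2))

candidates = [(i,) for i in range(n)] + list(combinations(range(n), 2))

chosen = []
signatures = [tuple()] * len(registered)
while confused_pairs(signatures) > 0:          # until all registered vectors distinguished
    best = min(candidates, key=lambda c: confused_pairs(
        [s + (lin_val(v, c),) for s, v in zip(signatures, registered)]))
    chosen.append(best)
    signatures = [s + (lin_val(v, best),) for s, v in zip(signatures, registered)]

print("chosen linear variables (input index tuples):", chosen)
print(f"{n} original inputs reduced to {len(chosen)} linear variables")
```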

Proceedings ArticleDOI
09 Mar 2012
TL;DR: The proposed AR model can serve as a fast alternative for predicting the transient temperature of a CMP with reasonably good accuracy and achieve approximately 113X speed-up over existing thermal profile estimation methods, while introducing an error of only 0.8°C on average.
Abstract: Thermal issues have become critical roadblocks for the development of advanced chip-multiprocessors (CMPs). In this paper, we introduce a new angle to view transient thermal analysis - based on predicting thermal profile, instead of calculating it. We develop a systematic framework that can learn different thermal profiles of a CMP by using an autoregressive (AR) model. The proposed AR model can serve as a fast alternative for predicting the transient temperature of a CMP with reasonably good accuracy. Experimental results show that the proposed AR model can achieve approximately 113X speed-up over existing thermal profile estimation methods, while introducing an error of only 0.8°C on average.
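The core idea lends itself to a short sketch: fit an AR(p) model to a temperature trace by least squares and roll it forward. The trace, AR order, and noise level below are synthetic stand-ins for the paper's learned thermal profiles:

```python
# Hedged sketch of the idea: fit an autoregressive AR(p) model to a core's
# temperature trace by least squares, then predict ahead.  The trace below is
# synthetic; the paper's framework learns from real thermal profiles.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(400)
# synthetic "temperature" trace: slow drift + periodic activity + noise
temp = 55 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.2, t.size)

p = 8                                          # AR order (illustrative)
X = np.column_stack([temp[i:len(temp) - p + i] for i in range(p)])
y = temp[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # AR coefficients

# roll the model forward to predict the next 20 samples
history = list(temp[-p:])
for _ in range(20):
    history.append(np.dot(coef, history[-p:]))

print("last measured:", round(temp[-1], 2), "predicted +20 steps:", round(history[-1], 2))
```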

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A fine-grained dynamic voltage scaling (FDVS) technique is proposed to reduce OLED power and to effectively reduce the color remapping cost when color compensation is required to improve the image quality of an OLED panel operating at a scaled voltage.
Abstract: Organic Light Emitting Diode (OLED) displays have emerged as the new-generation display technology for mobile multimedia devices. Compared to existing technologies, OLEDs are thinner, brighter, lighter, and cheaper. However, OLED panels are still the biggest contributor to the total power consumption of mobile devices. In this work, we propose a fine-grained dynamic voltage scaling (FDVS) technique to reduce OLED power. An OLED panel is partitioned into multiple display areas whose supply voltages are adaptively adjusted based on the displayed content. A DVS-friendly OLED driver design is also proposed to enhance the color accuracy of the OLED pixels at the scaled supply voltage. Our experimental results show that, compared to the existing global DVS technique, the FDVS technique can achieve 25.9%∼43.1% more OLED power saving while maintaining high image quality as measured by the Structural Similarity Index (SSIM=0.98). Further analysis shows that the FDVS technique can also effectively reduce the color remapping cost when color compensation is required to improve the image quality of an OLED panel operating at a scaled voltage.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Experimental results show that the general approach to synthesizing a linear FSM-based SCE for a target function produces circuits that are much more tolerant of soft errors than deterministic implementations, while the area-delay product of the circuits is less than that of deterministic implementations.
Abstract: The Stochastic Computational Element (SCE) uses streams of random bits (stochastic bit streams) to perform computation with conventional digital logic gates. It can guarantee reliable computation using unreliable devices. In stochastic computing, the linear Finite State Machine (FSM) can be used to implement some sophisticated functions, such as the exponentiation and tanh functions, more efficiently than combinational logic. However, a general approach for synthesizing a linear FSM-based SCE for a target function has not been available. In this paper, we introduce three properties of the linear FSM used in stochastic computing and demonstrate a general approach to synthesizing a linear FSM-based SCE for a target function. Experimental results show that our approach produces circuits that are much more tolerant of soft errors than deterministic implementations, while the area-delay product of the circuits is less than that of deterministic implementations.
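As background, here is a small sketch of the kind of linear FSM the abstract refers to, in the spirit of the classic saturating up/down-counter stochastic tanh element (this is not the paper's synthesis procedure; the state count and stream length are arbitrary):

```python
# Hedged sketch of a linear FSM used in stochastic computing (a saturating
# up/down counter, in the spirit of the classic stochastic tanh element),
# not the paper's synthesis procedure.  Input value x in [-1, 1] is encoded
# as a bitstream with P(1) = (x + 1) / 2; the output stream's bipolar value
# approximates a tanh-like saturating function of x.
import random, math

def stochastic_fsm(x, n_states=8, length=100_000, seed=1):
    rng = random.Random(seed)
    p1 = (x + 1) / 2
    state, ones = n_states // 2, 0
    for _ in range(length):
        bit = 1 if rng.random() < p1 else 0
        state = min(state + 1, n_states - 1) if bit else max(state - 1, 0)
        ones += 1 if state >= n_states // 2 else 0     # output bit of the FSM
    return 2 * ones / length - 1                       # back to bipolar value

for x in (-0.8, -0.3, 0.0, 0.3, 0.8):
    print(f"x={x:+.1f}  FSM output={stochastic_fsm(x):+.3f}  tanh(4x)={math.tanh(4 * x):+.3f}")
```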

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper integrates data reuse, loop pipelining, memory partitioning, and memory merging into an automated optimization flow (AMO) for FPGA behavioral synthesis, and develops memory padding to help in the memory partitions of indices with modulo operations.
Abstract: Behavioral synthesis tools have made significant progress in compiling high-level programs into register-transfer level (RTL) specifications. But manually rewriting code is still necessary in order to obtain better quality of results in memory system optimization. In recent years different automated memory optimization techniques have been proposed and implemented, such as data reuse and memory partitioning, but the problem of integrating these techniques into an applicable flow to obtain a better performance has become a challenge. In this paper we integrate data reuse, loop pipelining, memory partitioning, and memory merging into an automated optimization flow (AMO) for FPGA behavioral synthesis. We develop memory padding to help in the memory partitioning of indices with modulo operations. Experimental results on Xilinx Virtex-6 FPGAs show that our integrated approach can gain an average 5.8× throughput and 4.55× latency improvement compared to the approach without memory partitioning. Moreover, memory merging saves up to 44.32% of block RAM (BRAM).
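A minimal sketch of why memory padding helps cyclic partitioning, assuming a 3-point vertical stencil whose row-major addresses differ by the row width; the widths and bank count are illustrative, not from the paper:

```python
# Hedged sketch of cyclic memory partitioning with padding (the flow's actual
# analysis is more general).  A pipelined 3-point vertical stencil reads
# A[r-1][c], A[r][c], A[r+1][c] in the same cycle; after row-major
# linearization the three addresses differ by the row width C.  With NB
# cyclic banks the accesses collide unless the addresses fall in distinct
# banks, which padding the row width can guarantee.

def banks_conflict_free(row_width, n_banks):
    r, c = 5, 2                                  # any interior element (illustrative)
    addrs = [(r + dr) * row_width + c for dr in (-1, 0, 1)]
    banks = [a % n_banks for a in addrs]         # cyclic partitioning
    return len(set(banks)) == len(banks), banks

for width in (6, 7):                             # 7 = original width 6 plus 1 padding column
    ok, banks = banks_conflict_free(width, n_banks=3)
    print(f"row width {width}: banks {banks} -> {'conflict-free' if ok else 'bank conflict'}")
```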

Proceedings ArticleDOI
09 Mar 2012
TL;DR: GLOW is presented, a hybrid global router that provides low-power opto-electronic interconnect synthesis under considerations of thermal reliability and various physical design constraints such as optical power, delay, and signal quality.
Abstract: In this paper, we examine the integration potential and explore the design space of low-power, thermally reliable on-chip interconnect synthesis featuring nanophotonic Wavelength Division Multiplexing (WDM). With recent advancements, it is foreseen that nanophotonics holds the promise to be employed for future on-chip data signalling due to its unique power efficiency, signal delay, and huge multiplexing potential. However, there are major challenges to address before feasible on-chip integration can be reached. In this paper, we present GLOW, a hybrid global router that provides low-power opto-electronic interconnect synthesis under the considerations of thermal reliability and various physical design constraints such as optical power, delay, and signal quality. GLOW is evaluated with test cases derived from the ISPD07-08 global routing benchmarks. Compared with a greedy approach, GLOW demonstrates around 23%–50% reduction in total optical power, revealing the great potential of on-chip WDM interconnect synthesis.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: The global charge replacement (GCR) optimization problem is formally described and an algorithm to find the near-optimal GCR control policy is provided and significant improvements in the charge replacement efficiency are demonstrated.
Abstract: Hybrid electrical energy storage (HEES) systems are composed of multiple banks of heterogeneous electrical energy storage (EES) elements with distinctive properties. Charge replacement in a HEES system (i.e., dynamic assignment of load demands to EES banks) is one of the key operations in the system. This paper formally describes the global charge replacement (GCR) optimization problem and provides an algorithm to find the near-optimal GCR control policy. The optimization problem is formulated as a mixed-integer nonlinear programming problem, where the objective function is the charge replacement efficiency. The constraints account for the energy conservation law, efficiency of the charger/converter, the rate capacity effect, and self-discharge rates plus internal resistances of the EES element arrays. The near-optimal solution to this problem is obtained while considering the state of charges (SoCs) of the EES element arrays, characteristics of the load devices, and estimates of energy contributions by the EES element arrays. Experimental results demonstrate significant improvements in the charge replacement efficiency in an example HEES system comprised of banks of battery and supercapacitor elements with a high-power pulsed military radio transceiver as the load device.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Experimental results show that the signal TSV planner outperforms the state-of-the-art TSV-aware 3D floorplanner by 7% to 38% with respect to wirelength, and the multiple TSV insertion algorithm outperforms a single TSV insertion algorithm by 27% to 37%.
Abstract: Since re-designing and re-optimizing existing logic, memory, and IP blocks in a 3D fashion significantly increases design cost, near-term three-dimensional integrated circuit (3D IC) design will focus on reusing existing 2D blocks. One way to reuse 2D blocks in the 3D IC design is to first perform 3D floorplanning, insert signal through-silicon vias (TSVs) for 3D inter-block connections, and then route the blocks. In this paper, we propose algorithms (finding signal TSV locations, assigning TSVs to whitespace blocks, and manipulating whitespace blocks) for post-floorplanning signal TSV planning in the block-level 3D IC design. Experimental results show that our signal TSV planner outperforms the state-of-the-art TSV-aware 3D floorplanner by 7% to 38% with respect to wirelength. In addition, our multiple TSV insertion algorithm outperforms a single TSV insertion algorithm by 27% to 37%.
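One of the sub-problems above, TSV-to-whitespace assignment, can be sketched with a simple capacity-aware greedy based on Manhattan distance (the paper's planner is more sophisticated; coordinates and capacities below are invented):

```python
# Hedged sketch of one sub-problem above: assigning signal TSVs to whitespace
# blocks with limited capacity, greedily by Manhattan distance (the paper's
# planner is more sophisticated).  Coordinates and capacities are invented.
tsv_targets = [(2, 3), (2, 4), (8, 1), (9, 9), (5, 5)]     # ideal TSV spots
whitespace = {                                              # block -> (center, capacity)
    "W1": ((2, 2), 2),
    "W2": ((8, 2), 1),
    "W3": ((7, 8), 2),
}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

remaining = {name: cap for name, (_, cap) in whitespace.items()}
assignment = {}
for tsv in tsv_targets:
    candidates = [n for n in whitespace if remaining[n] > 0]
    best = min(candidates, key=lambda n: manhattan(tsv, whitespace[n][0]))
    assignment[tsv] = best
    remaining[best] -= 1

total = sum(manhattan(t, whitespace[b][0]) for t, b in assignment.items())
print(assignment)
print("total added Manhattan wirelength:", total)
```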

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Some key design for manufacturability and reliability challenges and possible solutions for TSV-based 3D IC integration, as well as future research directions are discussed.
Abstract: The 3D IC integration using through-silicon-vias (TSV) has gained tremendous momentum recently for industry adoption. However, as TSV involves disruptive manufacturing technologies, new modeling and design techniques need to be developed for 3D IC manufacturability and reliability. In particular, TSVs in 3D IC may cause significant thermal mechanical stress, which not only results in systematic mobility/performance variations, but also leads to mechanical reliability concerns such as interfacial cracking. Meanwhile, the huge dimensional gaps between TSV, on-chip wires, and bonding/packaging all lead to new electromigration concerns. Thus full-chip/package modeling and physical design tools need to be developed to achieve more reliable 3D IC integration. In this paper, we will discuss some key design for manufacturability and reliability challenges and possible solutions for TSV-based 3D IC integration, as well as future research directions.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This study chooses medical imaging as the application domain for investigation, and studies the application performance and energy efficiency across a diverse set of commodity hardware platforms, such as general-purpose multi-core CPUs, massive parallel many-core GPUs, low-power mobile CPUs and fine-grain customizable FPGAs.
Abstract: We believe that by adapting architectures to fit the requirements of a given application domain, we can significantly improve the efficiency of computation. To validate the idea for our application domain, we evaluate a wide spectrum of commodity computing platforms to quantify the potential benefits of heterogeneity and customization for the domain-specific applications. In particular, we choose medical imaging as the application domain for investigation, and study the application performance and energy efficiency across a diverse set of commodity hardware platforms, such as general-purpose multi-core CPUs, massive parallel many-core GPUs, low-power mobile CPUs and fine-grain customizable FPGAs. This study leads to a number of interesting observations that can be used to guide further development of domain-specific architectures.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A systolic array-based architecture is presented that includes a run-time reconfigurable convolution engine able to perform multiple variable-sized convolutions in parallel; it leverages spatial parallelism and dedicated wide data buses with on-chip memories to provide an energy-efficient solution suitable for adoption in embedded systems.
Abstract: Advances in neuroscience have enabled researchers to develop computational models of auditory, visual and learning perceptions in the human brain. HMAX, which is a biologically inspired model of the visual cortex, has been shown to outperform standard computer vision approaches for multi-class object recognition. HMAX, while computationally demanding, can be potentially applied in various applications such as autonomous vehicle navigation, unmanned surveillance and robotics. In this paper, we present a reconfigurable hardware accelerator for the time-consuming S2 stage of the HMAX model. The accelerator leverages spatial parallelism, dedicated wide data buses with on-chip memories to provide an energy efficient solution to enable adoption into embedded systems. We present a systolic array-based architecture which includes a run-time reconfigurable convolution engine which can perform multiple variable-sized convolutions in parallel. An automation flow is described for this accelerator which can generate optimal hardware configurations for a given algorithmic specification and also perform run-time configuration and execution seamlessly. Experimental results on Virtex-6 FPGA platforms show 5X to 11X speedups and 14X to 33X higher performance-per-Watt over a CNS-based implementation on a Tesla GPU.
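A software-only sketch of the S2 workload the accelerator targets, assuming plain valid 2D convolutions against several variable-sized templates on random data (the hardware described above streams these computations through a systolic array):

```python
# Hedged sketch of the S2-stage workload (software model only, nothing like
# the systolic hardware): responses of an input map to several variable-sized
# templates, computed as plain valid 2D convolutions.  Data are random.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))                            # stand-in for a feature map
templates = [rng.random((k, k)) for k in (4, 8, 12)]    # variable-sized patches

def conv2d_valid(img, ker):
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):                                 # the accelerator unrolls these loops
        for x in range(ow):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * ker)
    return out

for ker in templates:
    resp = conv2d_valid(image, ker)
    print(f"{ker.shape[0]}x{ker.shape[1]} template -> response map {resp.shape}, "
          f"max {resp.max():.2f}")
```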

Proceedings ArticleDOI
09 Mar 2012
TL;DR: In this article, the authors present ingredients for a class of abstract, high-level platform models that enable fast yet accurate performance and power simulation of application execution on heterogeneous multi-core/processor architectures.
Abstract: With the increasing complexity of today's embedded systems, research has focused on developing fast yet accurate high-level and executable models of complete platforms. These models address the need for hardware/software co-simulation of the entire system at early stages of the design. Traditional models tend to be either slow or inaccurate. In this paper, we present ingredients for a class of abstract, high-level platform models that enable fast yet accurate performance and power simulation of application execution on heterogeneous multi-core/multi-processor architectures. Models are based on host-compiled simulation of the application code, which is instrumented with timing and power information. Back-annotated source code is further augmented with abstract OS and processor models that are integrated into standard co-simulation backplanes. The efficiency of the modeling platform has been evaluated by applying an industrial-strength benchmark, demonstrating the feasibility and benefits of such models for rapid, early exploration of the power, performance, and cost design space. Results show that an accurate Pareto set of solutions can be obtained in a fraction of the time needed with traditional simulation and modeling approaches.
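A toy illustration of host-compiled, back-annotated simulation: application functions run natively but advance a simulated clock through annotated cycle counts, and an abstract processor model converts cycles to time and energy. All cycle counts, the clock frequency, and the power figure are invented:

```python
# Hedged, toy stand-in for host-compiled simulation with back-annotated
# timing: application code runs natively but advances a simulated clock by
# annotated cycle counts; an abstract processor model converts cycles to time
# and energy.  Cycle counts, frequency, and power numbers are invented.
class AbstractCore:
    def __init__(self, freq_hz=400e6, active_power_w=0.25):
        self.freq_hz = freq_hz
        self.active_power_w = active_power_w
        self.cycles = 0

    def consume(self, cycles):          # called by back-annotated application code
        self.cycles += cycles

    @property
    def sim_time_s(self):
        return self.cycles / self.freq_hz

    @property
    def energy_j(self):
        return self.sim_time_s * self.active_power_w

core = AbstractCore()

def read_sensor():
    core.consume(120)                   # annotation: ~120 cycles on the target
    return 42

def filter_sample(x):
    core.consume(850)                   # annotation: ~850 cycles on the target
    return x * 0.9

for _ in range(1000):                   # host-compiled: runs natively, fast
    filter_sample(read_sensor())

print(f"simulated time: {core.sim_time_s * 1e3:.3f} ms, energy: {core.energy_j * 1e3:.3f} mJ")
```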

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This research attacks the Dark Silicon problem directly through a set of energy-saving accelerators, called Conservation Cores, or c-cores, a post-multicore approach that constructively uses dark silicon to reduce the energy consumption of an application by 10× or more.
Abstract: The Dark Silicon Age kicked off with the transition to multicore and will be characterized by a wild chase for seemingly ever-more insane architectural designs. At the heart of this transformation is the Utilization Wall, which states that, with each new process generation, the percentage of transistors that a chip can switch at full frequency is dropping exponentially due to power constraints. This has led to increasingly larger and larger fractions of a chip's silicon area that must remain passive, or dark. Since Dark Silicon is an exponentially-worsening phenomenon, getting worse at the same rate that Moore's Law is ostensibly making process technology better, we need to seek out fundamentally new approaches to designing processors for the Dark Silicon Age. Simply tweaking existing designs is not enough. Our research attacks the Dark Silicon problem directly through a set of energy-saving accelerators, called Conservation Cores, or c-cores. C-cores are a post-multicore approach that constructively uses dark silicon to reduce the energy consumption of an application by 10× or more. To examine the utility of c-cores, we are developing GreenDroid, a multicore chip that targets the Android mobile software stack. Our mobile application processor prototype targets a 32-nm process and is comprised of hundreds of automatically generated, specialized, patchable c-cores. These cores target specific Android hotspots, including the kernel. Our preliminary results suggest that we can attain up to 11× improvement in energy efficiency using a modest amount of silicon.

Proceedings ArticleDOI
Hao Zhuang, Wenjian Yu, Gang Hu, Zhi Liu, Zuochang Ye
09 Mar 2012
TL;DR: A space management technique with an Octree data structure is presented to reduce the time of each hop, and the whole FRW is parallelized with multi-threaded programming; results show a large speedup from the proposed techniques for structures in VLSI technologies with thin dielectric layers.
Abstract: The floating random walk (FRW) algorithm has several advantages for extracting 3D interconnect capacitance. However, for the multi-layer dielectrics in VLSI technology, the efficiency of the FRW algorithm is degraded due to the frequent stopping of walks at dielectric interfaces and the constraint on the first-hop length, especially in thin dielectrics. In this paper, we tackle these problems with the numerical characterization of Green's function for cross-interface transition probabilities and the corresponding weight values. We also present a space management technique with an Octree data structure to reduce the time of each hop, and we parallelize the whole FRW with multi-threaded programming. Numerical results show a large speedup brought by the proposed techniques for structures under VLSI technologies with thin dielectric layers.
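A miniature, single-dielectric version of the random-walk idea, assuming a uniform grid walk and a simple parallel-plate-like boundary condition (nothing like the paper's multi-dielectric Green's-function machinery):

```python
# Hedged, miniature version of the random-walk idea (single uniform
# dielectric, simple grid walk): estimate the potential at an interior point
# of a square region by averaging the boundary potentials hit by random walks.
import random

N = 20                                   # grid is (N+1) x (N+1); boundary at 0 and N
def boundary_potential(x, y):
    return 1.0 if y == N else 0.0        # top plate at 1 V, other walls grounded

def walk(x, y, rng):
    while 0 < x < N and 0 < y < N:
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = x + dx, y + dy
    return boundary_potential(x, y)

rng = random.Random(0)
x0, y0 = N // 2, 3 * N // 4              # query point, closer to the top plate
n_walks = 5_000
estimate = sum(walk(x0, y0, rng) for _ in range(n_walks)) / n_walks
print(f"estimated potential at ({x0},{y0}): {estimate:.3f}")
```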

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper presents an algorithm that can optimally solve the pattern relocation problem; relocation results with full-scale layouts generated from the Nangate Open Cell Library show great advantages and competitive runtimes compared to an existing commercial tool.
Abstract: Blank defect mitigation is a critical step for extreme ultraviolet (EUV) lithography. Targeting the defective blank, a layout relocation method, which shifts and rotates the whole layout pattern to a proper position, has been proved to be an effective way to reduce defect impact. Yet there is still no published work on how to find the best pattern location to minimize the impact of the buried defects with a reasonable defect model and adequate process variation control. In this paper, we present an algorithm that can optimally solve this pattern relocation problem. Experimental results validate our method, and relocation results with full-scale layouts generated from the Nangate Open Cell Library show great advantages and competitive runtimes compared to an existing commercial tool.
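A brute-force sketch of the relocation idea: enumerate candidate shifts and a 180-degree rotation, and count how many blank defects land under critical pattern regions. The geometry, defect positions, and coarse search grid are invented; the paper solves the problem optimally with a realistic defect model:

```python
# Hedged sketch of the relocation idea (brute-force search over a coarse grid
# of shifts and a 180-degree rotation; the paper solves the problem optimally
# with a real defect model).  Critical pattern rectangles and blank defect
# positions below are invented.
critical_rects = [(2, 2, 6, 3), (10, 5, 14, 9), (4, 12, 12, 13)]   # (x1, y1, x2, y2)
defects = [(5, 2.5), (11, 7), (18, 18)]                            # buried defect centers
FIELD = 20                                                          # usable blank extent

def covered(defect, rect):
    x, y = defect
    x1, y1, x2, y2 = rect
    return x1 <= x <= x2 and y1 <= y <= y2

def cost(dx, dy, rot):
    hits = 0
    for x1, y1, x2, y2 in critical_rects:
        if rot:                                   # rotate pattern 180 deg about field center
            x1, y1, x2, y2 = FIELD - x2, FIELD - y2, FIELD - x1, FIELD - y1
        rect = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        hits += sum(covered(d, rect) for d in defects)
    return hits

best = min(
    ((cost(dx, dy, rot), dx, dy, rot)
     for dx in range(0, 5) for dy in range(0, 5) for rot in (False, True)),
    key=lambda t: t[0],
)
print(f"defects under critical patterns: {best[0]} at shift ({best[1]},{best[2]}), rotated={best[3]}")
```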

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs is proposed to accelerate circuit simulation, along with a predictive method to decide whether a matrix should use the parallel or the sequential algorithm.
Abstract: The sparse matrix solver has become the bottleneck in SPICE simulators. It is difficult to parallelize the solver because of the high data dependency during the numerical LU factorization. This paper proposes a parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs to accelerate circuit simulation. Since not every matrix is suitable for the parallel algorithm, a predictive method is proposed to decide whether a matrix should use the parallel or the sequential algorithm. The experimental results on 35 circuit matrices reveal that the developed algorithm achieves speedups of 2.11×∼8.38× (geometric average), compared with KLU, with 1∼8 threads, on the matrices which are suitable for the parallel algorithm. Our solver can be downloaded from http://nicslu.weebly.com.
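For context, the kernel being parallelized is ordinary LU factorization with partial pivoting; a plain sequential dense version is sketched below (the paper works on sparse matrices and exploits column-level parallelism, which this sketch does not attempt):

```python
# Hedged sketch of the kernel being parallelized: a plain, sequential, dense
# LU factorization with partial pivoting (the paper targets sparse matrices
# and exploits parallelism; this is only the baseline operation).
import numpy as np

def lu_partial_pivot(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))      # partial pivoting: pick largest pivot
        if p != k:
            A[[k, p]] = A[[p, k]]
            perm[[k, p]] = perm[[p, k]]
        A[k + 1:, k] /= A[k, k]                  # multipliers stored in the lower part
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return perm, L, U

M = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
perm, L, U = lu_partial_pivot(M)
print("max |P·M - L·U| =", np.max(np.abs(M[perm] - L @ U)))
```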

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A level one data cache tuning heuristic for a heterogeneous multi-core system is presented, which classifies applications based on data sharing and cache behavior, and uses this classification to guide cache tuning and reduce the number of cores that need to be tuned.
Abstract: Since multi-core architectures are becoming more popular, recent multi-core optimizations focus on energy consumption. We present a level one data cache tuning heuristic for a heterogeneous multi-core system, which classifies applications based on data sharing and cache behavior, and uses this classification to guide cache tuning and reduce the number of cores that need to be tuned. Results reveal average energy savings of 25% for 2-, 4-, 8-, and 16-core systems while searching only 1% of the design space.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Recent advances in 3D stack yield techniques are surveyed and challenges to be resolved in the future are pointed out.
Abstract: Three-dimensional (3D) integrated circuits (ICs) that stack multiple dies vertically using through-silicon vias (TSVs) have gained wide interest in the semiconductor industry. The shift towards volume production of 3D-stacked ICs, however, requires their manufacturing yield to be commercially viable. Various techniques have been presented in the literature to address this important problem, including pre-bond testing techniques to tackle the “known good die” problem, TSV redundancy designs to provide defect tolerance, and wafer/die matching solutions to improve the overall stack yield. In this paper, we survey recent advances in this field and point out challenges to be resolved in the future.
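A back-of-the-envelope illustration of the "known good die" problem mentioned above: stacking untested dies multiplies their yields, while ideal pre-bond testing leaves only the bonding yield. The yield numbers are illustrative:

```python
# Hedged, back-of-the-envelope illustration of the "known good die" problem:
# stacking d untested dies multiplies their yields, while pre-bond testing
# lets only good dies be bonded.  Yield numbers below are illustrative.
die_yield = 0.90        # per-die manufacturing yield (illustrative)
bond_yield = 0.99       # per-bond assembly yield (illustrative)

for d in (2, 4, 8):
    no_kgd = die_yield ** d * bond_yield ** (d - 1)          # blind stacking
    with_kgd = bond_yield ** (d - 1)                         # only tested-good dies bonded
    print(f"{d}-die stack: without pre-bond test {no_kgd:.1%}, "
          f"with ideal pre-bond test {with_kgd:.1%}")
```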