
Showing papers presented at "Asia and South Pacific Design Automation Conference in 2012"


Proceedings ArticleDOI
Yen-Kuang Chen
09 Mar 2012
TL;DR: An overview is provided of challenges and opportunities presented by the M2M Internet, where hundreds of billions of smart sensors and devices will interact with one another without human intervention, on a Machine-to-Machine (M2M) basis.
Abstract: To date, most Internet applications focus on providing information, interaction, and entertainment for humans. However, with the widespread deployment of networked, intelligent sensor technologies, an Internet of Things (IoT) is steadily evolving, much like the Internet decades ago. In the future, hundreds of billions of smart sensors and devices will interact with one another without human intervention, on a Machine-to-Machine (M2M) basis. They will generate an enormous amount of data at an unprecedented scale and resolution, providing humans with information and control of events and objects even in remote physical environments. The scale of the M2M Internet will be several orders of magnitude larger than the existing Internet, posing serious research challenges. This paper will provide an overview of challenges and opportunities presented by this new paradigm.

270 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: A new synthesis approach which relies on concepts that are complementary to existing ones and exploits Quantum Multiple-valued Decision Diagrams (QMDDs) for this purpose, enabling automatic synthesis of large reversible functions with the minimal number of circuit lines.
Abstract: Reversible circuits are an emerging technology where all computations are performed in an invertible manner. Motivated by their promising applications, e.g., in the domain of quantum computation or in low-power design, the synthesis of such circuits has been intensely studied. However, how to automatically realize reversible circuits with the minimal number of lines for large functions is an open research problem. In this paper, we propose a new synthesis approach which relies on concepts that are complementary to existing ones. While “conventional” function representations have been applied for synthesis so far (such as truth tables, ESOPs, BDDs), we exploit Quantum Multiple-valued Decision Diagrams (QMDDs) for this purpose. An algorithm is presented that performs transformations on this data structure, eventually leading to the desired circuit. Experimental results confirm the novelty of the proposed approach, which enables automatic synthesis of large reversible functions with the minimal number of circuit lines. Furthermore, the quantum cost of the resulting circuits is reduced by 50% on average compared to an existing state-of-the-art synthesis method.

105 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: EPIC is an efficient and effective predictor for IC manufacturing hotspots in deep sub-wavelength lithography and proposes a unified framework to combine different hotspot detection methods together, such as machine learning and pattern matching, using mathematical programming/optimization.
Abstract: In this paper we present EPIC, an efficient and effective predictor for IC manufacturing hotspots in deep sub-wavelength lithography. EPIC proposes a unified framework to combine different hotspot detection methods, such as machine learning and pattern matching, using mathematical programming/optimization. The EPIC algorithm has been tested on a number of industry benchmarks under advanced manufacturing conditions. It demonstrates the best capability so far in selectively combining the desirable features of various hotspot detection methods (3.5–8.2% accuracy improvement) as well as significant suppression of detection noise (e.g., 80% false-alarm reduction). These characteristics make EPIC well suited to high-performance physical verification and to guiding efficient, manufacturability-friendly physical design.

88 citations
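The abstract above describes fusing several hotspot detectors through mathematical programming. A minimal sketch of the score-combination idea, assuming a least-squares weight fit rather than the paper's actual optimization formulation (detector scores and labels below are invented):

```python
# Hedged sketch (not EPIC's actual formulation): combine the scores of several
# hotspot detectors with weights fit by least squares on labeled layout clips.
# Detector scores and labels below are made-up illustrative data.
import numpy as np

# rows = layout clips, columns = [ML detector score, pattern-matching score]
scores = np.array([[0.9, 1.0],
                   [0.8, 0.0],
                   [0.2, 1.0],
                   [0.1, 0.0],
                   [0.7, 1.0],
                   [0.3, 0.0]])
labels = np.array([1, 1, 1, 0, 0, 0])                 # 1 = real hotspot, 0 = not

X = np.hstack([scores, np.ones((len(scores), 1))])    # add a bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)        # least-squares weights

combined = X @ w                                      # fused hotspot score
predicted = (combined > 0.5).astype(int)              # simple threshold
print("weights:", w)
print("predicted hotspots:", predicted, "true:", labels)
```

A real flow would instead optimize a weighted accuracy/false-alarm objective, which is where the paper's mathematical programming comes in.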


Proceedings ArticleDOI
09 Mar 2012
TL;DR: A scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm consisting of a heterogeneous, tile-based manycore structure and a multi-agent management layer underpinned by distributed runtime and OS services is introduced.
Abstract: This paper introduces a scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm. The hardware architecture consists of a heterogeneous, tile-based manycore structure while the software architecture comprises a multi-agent management layer underpinned by distributed runtime and OS services. The necessity for invasive-specific hardware assist functions is analytically shown and their integration into the overall manycore environment is described.

79 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This work is the first successful one to parallelize R-tree queries on the GPU, and it also proposes the first R-tree construction method on the GPU, which does not depend on a partition algorithm and guarantees the same quality as sequential construction.
Abstract: The R-tree is an important spatial data structure used in EDA as well as other fields. Although there is a large literature on parallel R-tree queries, as far as we know, our work is the first successful one to parallelize R-tree queries on the GPU. We also propose the first R-tree construction method on the GPU. Unlike other parallel construction methods, our method does not depend on a partition algorithm and guarantees the same quality as sequential construction. Experiments show that more than 30× speedup on R-tree query and more than 20× speedup on R-tree construction are achieved.

62 citations
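For readers unfamiliar with the data structure, here is a tiny sequential sketch of the R-tree window query that the paper maps to the GPU (two levels only, with an arbitrary grouping of rectangles; the paper's GPU construction and query are far more involved):

```python
# Hedged sketch: a tiny, sequential two-level R-tree-like index and a window
# query, to illustrate the operation the paper parallelizes on the GPU.
# Rectangles are (xmin, ymin, xmax, ymax); the grouping below is arbitrary.

def intersects(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def mbr(rects):
    xs0, ys0, xs1, ys1 = zip(*rects)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

# leaf nodes hold a few data rectangles each; each node stores its MBR
leaves = [
    [(0, 0, 1, 1), (2, 2, 3, 3)],
    [(5, 5, 6, 6), (7, 1, 8, 2)],
    [(4, 4, 9, 9)],
]
nodes = [(mbr(leaf), leaf) for leaf in leaves]

def query(window):
    hits = []
    for node_mbr, leaf in nodes:           # prune whole leaves by their MBR
        if intersects(node_mbr, window):
            hits += [r for r in leaf if intersects(r, window)]
    return hits

print(query((2.5, 2.5, 5.5, 5.5)))   # rectangles overlapping the window
```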


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper proposes to use ECC codes to relax the BER (Bit Error Rate) requirement of a single memory in order to improve the write energy consumption and latency for both the MOS-based and cross-point-based memristor ReRAM designs.
Abstract: The emerging memristor-based Resistive RAM (ReRAM) has shown great potential as one of the most promising memory technologies, with unique properties such as high density, low power, good scalability, and non-volatility. However, as the process technology scales, process variation causes the actual electrical behavior of the memristor to deviate. Recently, researchers have observed that the probability of a single ReRAM cell switching successfully follows a function of the logarithm of the total programming time. As a result, the uncertainty of the electrical behavior results in different degrees of error rates in ReRAM-based memory. Traditional ECC (Error Correcting Code) designs for conventional DRAM memory are used to detect and correct errors in the memory system. In this paper, based on a mathematical analysis of the error patterns in memristor-based ReRAM and a study of ECC designs, we propose to use ECC codes to relax the BER (Bit Error Rate) requirement of a single memory in order to improve the write energy consumption and latency for both the MOS-based and cross-point-based memristor ReRAM designs. In addition, the performance/power/area overhead of the proposed design options is evaluated in detail.

62 citations
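The trade-off the abstract exploits can be illustrated with a standard reliability calculation: with a t-error-correcting code over an n-bit word, the word fails only when more than t cells are in error. The sketch below assumes independent cell errors and uses illustrative BER values, not the paper's measured data:

```python
# Hedged sketch of the ECC/BER trade-off the abstract relies on: with a
# t-error-correcting code over an n-bit word, the word fails only if more
# than t cells flip.  Raw BER values below are illustrative, not measured.
from math import comb

def word_failure_prob(n, t, p):
    """P(more than t of n cells are in error), cells independent with BER p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1, n + 1))

n = 72                      # e.g. 64 data bits + 8 check bits (illustrative)
for p in (1e-3, 1e-4, 1e-5):
    print(f"raw BER {p:.0e}:  no ECC {1-(1-p)**n:.2e}   "
          f"SEC (t=1) {word_failure_prob(n, 1, p):.2e}   "
          f"t=2 {word_failure_prob(n, 2, p):.2e}")
```

The point of the paper is that, because ECC absorbs a higher raw BER, each cell can be written with less energy or for a shorter time.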


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper proposes a novel algorithm to optimally assign cuts to 193i or E-Beam processes with proper modifications on cut distribution in order to maximize the overall throughput and shows that the throughput is dramatically improved by the cut redistribution.
Abstract: Since some major IC industry participants are moving to highly regular 1D gridded designs to enable scaling to sub-20nm nodes, how to manufacture the randomly distributed cuts with reasonable throughput and process variation becomes a big challenge. With the help of hybrid lithography, different types of processes can be applied to a single layer so that the advantages of different technologies can be combined to further benefit manufacturing. In this paper, targeting cut printing difficulties and hybrid lithography with electron beam (E-Beam) and 193 nm immersion (193i) processes, we propose a novel algorithm to optimally assign cuts to the 193i or E-Beam process, with proper modifications of the cut distribution, in order to maximize the overall throughput. To validate our method, we construct our algorithm based on the forbidden patterns obtained from optical simulation; we then formulate the redistribution problem as a well-defined ILP problem and finally call a reliable solver to solve the whole problem. Experimental results show that the throughput is dramatically improved by the cut redistribution. Besides that, for sparser layers the EBL process can be avoided entirely, which largely reduces the fabrication cost.

57 citations
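A heavily simplified sketch of the kind of assignment ILP described above, assuming the PuLP package as the solver front end; the cuts, forbidden 193i pairs, and throughput proxy are invented and do not reflect the paper's actual formulation:

```python
# Hedged sketch of the kind of ILP the paper formulates (simplified, not the
# authors' actual model).  Assumes the PuLP package; cuts and "forbidden"
# 193i neighbor pairs below are invented for illustration.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

cuts = ["c1", "c2", "c3", "c4", "c5"]
# pairs of cuts that cannot both be printed with 193i (e.g. too close together)
forbidden_193i_pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]

prob = LpProblem("cut_assignment", LpMaximize)
x = {c: LpVariable(f"x_{c}", cat=LpBinary) for c in cuts}  # 1 = 193i, 0 = E-Beam

# throughput proxy: print as many cuts as possible with the faster 193i process
prob += lpSum(x[c] for c in cuts)
for a, b in forbidden_193i_pairs:
    prob += x[a] + x[b] <= 1            # at most one of a forbidden pair on 193i

prob.solve()
print({c: ("193i" if value(x[c]) > 0.5 else "E-Beam") for c in cuts})
```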


Proceedings ArticleDOI
09 Mar 2012
TL;DR: This work presents a methodology that parallelizes the simulation of mixed-abstraction level SystemC models across multicore CPUs, and graphics processing units (GPUs) for improved simulation performance.
Abstract: This work presents a methodology that parallelizes the simulation of mixed-abstraction level SystemC models across multicore CPUs and graphics processing units (GPUs) for improved simulation performance. Given a SystemC model, we partition it into processes suitable for GPU execution and CPU execution. We convert the processes identified for GPU execution into GPU kernels with additional SystemC wrapper processes that invoke these kernels. The wrappers enable seamless communication of events in all directions between the GPUs and CPUs. We alter the OSCI SystemC simulation kernel to allow parallel execution of processes. Hence, we co-simulate the SystemC processes on multiple CPUs and the GPU kernels on the GPUs in parallel, exploiting both the CPUs and the GPUs for faster simulation. We experiment with synthetic benchmarks and a set-top box case study.

55 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: A modular framework is proposed that enables scheduling of time-triggered distributed embedded systems and provides a symbolic representation used by an Integer Linear Programming (ILP) solver to determine a schedule that respects all bus and processor constraints as well as end-to-end timing constraints.
Abstract: This paper proposes a modular framework that enables scheduling of time-triggered distributed embedded systems. The framework provides a symbolic representation that is used by an Integer Linear Programming (ILP) solver to determine a schedule that respects all bus and processor constraints as well as end-to-end timing constraints. Unlike other approaches, the proposed technique complies with automotive-specific requirements at the system level and is fully extensible. Formulations for common time-triggered automotive operating systems and bus systems are presented. The proposed model supports the automotive bus systems FlexRay 2.1 and 3.0. For the operating systems, formulations for an eCos-based non-preemptive component and a preemptive OSEKtime operating system are introduced. A case study from the automotive domain gives evidence of the applicability of the proposed approach by scheduling multiple distributed control functions concurrently. Finally, a scalability analysis is carried out with synthetic test cases.

53 citations
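As a much simpler companion to the ILP described above, the sketch below only checks a given time-triggered bus schedule for slot overlaps over the hyperperiod; the frame parameters are invented:

```python
# Hedged sketch: a feasibility check for a time-triggered bus schedule
# (much simpler than the paper's ILP).  Each frame is (offset, duration,
# period) in the same time unit; the example values are invented.
from math import lcm

frames = {
    "brake_msg":    (0, 2, 10),
    "steering_msg": (3, 2, 5),
    "body_msg":     (5, 3, 20),
}

hyperperiod = lcm(*(p for _, _, p in frames.values()))

# expand every periodic instance into concrete (start, end) slots
slots = []
for name, (offset, dur, period) in frames.items():
    for k in range(hyperperiod // period):
        start = offset + k * period
        slots.append((start, start + dur, name))

slots.sort()
for (s1, e1, n1), (s2, e2, n2) in zip(slots, slots[1:]):
    if s2 < e1:                          # next slot starts before the previous ends
        print(f"conflict: {n1} and {n2} overlap at t={s2}")
        break
else:
    print(f"schedule is conflict-free over hyperperiod {hyperperiod}")
```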


Proceedings ArticleDOI
09 Mar 2012
TL;DR: The proposed dual-mode architecture achieves both the low startup voltage in a startup mode and high conversion efficiency in a normal operation mode without off-chip inductors and capacitors.
Abstract: In this paper, a fully integrated low voltage charge pump for thermoelectric energy harvesters is presented. The proposed dual-mode architecture achieves both the low startup voltage in a startup mode and high conversion efficiency in a normal operation mode without off-chip inductors and capacitors. In the measurement, the proposed circuit successfully converts 120-mV input to 770-mV output with 38.8% conversion efficiency.

51 citations


Proceedings ArticleDOI
09 Mar 2012
TL;DR: The focus of the methodology is the virtual prototyping of the embedded software combined with prototypes of the physical environment, in order to capture the complete closed control loop of the software over the hardware, via sensors/actuators, with the physical objects.
Abstract: The modeling and analysis of Cyber-Physical Systems (CPS) is one of the key challenges in complex system design, as heterogeneous components are combined and their close interaction with the physical environment has to be considered. This article presents a methodology and an open toolset for the virtual prototyping of CPS. The focus of the methodology is the virtual prototyping of the embedded software combined with the prototyping of the physical environment, in order to capture the complete closed control loop of the software over the hardware, via sensors/actuators, with the physical objects. The methodology is based on the application of integrated open source tools and standard languages, i.e., C/C++, SystemC, and the Open Dynamics Engine, which are combined into a powerful simulation framework. Key activities of the methodology are outlined using the example of an electric two-wheel vehicle.
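A toy stand-in for the closed control loop the methodology targets, assuming a first-order plant integrated with forward Euler and a periodic proportional controller (the real toolset couples C/C++/SystemC software with the Open Dynamics Engine; all constants below are illustrative):

```python
# Hedged, minimal stand-in for the closed control loop described above.
# Here: a first-order "physical" plant integrated with forward Euler,
# sampled and actuated by a periodic software controller.
dt = 0.001            # physics time step [s]
ctrl_period = 0.010   # controller period [s]
target = 1.0          # set point (e.g. desired speed), illustrative
kp = 2.0              # proportional gain, illustrative

state = 0.0           # plant state (e.g. vehicle speed)
actuation = 0.0       # value written by the controller ("actuator")

t = 0.0
while t < 1.0:
    # embedded-software step: runs every ctrl_period (sensor -> control law -> actuator)
    if round(t / dt) % round(ctrl_period / dt) == 0:
        sensor = state                       # sample the plant
        actuation = kp * (target - sensor)   # proportional control law

    # physical-environment step: simple first-order dynamics with damping
    state += dt * (actuation - 0.5 * state)
    t += dt

print(f"plant state after 1 s: {state:.3f} (set point {target})")
```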

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Various circuit structures for nvLogic and nvSRAM are explored, taking into account memristor endurance, especially for low-voltage applications.
Abstract: The use of low voltage circuits and power-off mode help to reduce the power consumption of chips. Non-volatile logic (nvLogic) and nonvolatile SRAM (nvSRAM) enable a chip to preserve its key local states and data, while providing faster power-on/off speeds than those available with conventional two-macro schemes. Resistive memory (memristor) devices feature fast write speed and low write power. Applying memristors to nvLogic and nvSRAMs not only enables chips to achieve low power consumption for store operations, but also achieve fast power-on/off processes and reliable operation even in the event of sudden power failure. However, current memristor devices suffer from limited endurance, which influences the design of the circuit structure for memristor-based nvLogic and nvSRAM. Moreover, previous nvLogic/nvSRAM circuits cannot achieve low voltage operation. This paper explores various circuit structures for nvLogic and nvSRAM, taking into account memristor endurance, especially for low-voltage applications.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A heuristic method is shown for reducing the number of variables needed to represent incompletely specified index generation functions using linear decompositions, guided by an imbalance measure and an ambiguity measure.
Abstract: This paper shows a heuristic method to reduce the number of variables to represent incompletely specified index generation functions using linear decompositions. To find good linear transformations, two measures are introduced: the imbalance measure and the ambiguity measure. Experimental results using m-out-of-n code to binary converters, randomly generated functions, IP address tables, and lists of English words show the usefulness of the approach.
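A rough sketch of variable reduction via linear decompositions, using a plain greedy split count rather than the paper's imbalance and ambiguity measures; the registered vectors and the restriction to single variables and pairwise XORs are illustrative assumptions:

```python
# Hedged sketch of variable reduction by linear decomposition (a plain greedy
# split count, not the paper's imbalance/ambiguity measures).  Candidate
# linear functions are single variables and pairwise XORs; registered vectors
# are invented 6-bit examples.
from itertools import combinations

registered = ["101100", "110010", "011001", "000111", "111111", "100001"]
n = len(registered[0])

def lin_val(vec, subset):                      # XOR of the selected input bits
    return sum(int(vec[i]) for i in subset) % 2

def confused_pairs(signatures):                # pairs of vectors not yet distinguished
    return sum(a == b for a, b in combinations(signatures, 2))

candidates = [(i,) for i in range(n)] + list(combinations(range(n), 2))

chosen = []
signatures = [tuple()] * len(registered)
while confused_pairs(signatures) > 0:          # until all registered vectors distinguished
    best = min(candidates, key=lambda c: confused_pairs(
        [s + (lin_val(v, c),) for s, v in zip(signatures, registered)]))
    chosen.append(best)
    signatures = [s + (lin_val(v, best),) for s, v in zip(signatures, registered)]

print("chosen linear variables (input index tuples):", chosen)
print(f"{n} original inputs reduced to {len(chosen)} linear variables")
```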

Proceedings ArticleDOI
09 Mar 2012
TL;DR: The proposed AR model can serve as a fast alternative for predicting the transient temperature of a CMP with reasonably good accuracy and achieve approximately 113X speed-up over existing thermal profile estimation methods, while introducing an error of only 0.8°C on average.
Abstract: Thermal issues have become critical roadblocks for the development of advanced chip-multiprocessors (CMPs). In this paper, we introduce a new angle to view transient thermal analysis - based on predicting thermal profile, instead of calculating it. We develop a systematic framework that can learn different thermal profiles of a CMP by using an autoregressive (AR) model. The proposed AR model can serve as a fast alternative for predicting the transient temperature of a CMP with reasonably good accuracy. Experimental results show that the proposed AR model can achieve approximately 113X speed-up over existing thermal profile estimation methods, while introducing an error of only 0.8°C on average.
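The core idea lends itself to a short sketch: fit an AR(p) model to a temperature trace by least squares and roll it forward. The trace, AR order, and noise level below are synthetic stand-ins for the paper's learned thermal profiles:

```python
# Hedged sketch of the idea: fit an autoregressive AR(p) model to a core's
# temperature trace by least squares, then predict ahead.  The trace below is
# synthetic; the paper's framework learns from real thermal profiles.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(400)
# synthetic "temperature" trace: slow drift + periodic activity + noise
temp = 55 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.2, t.size)

p = 8                                          # AR order (illustrative)
X = np.column_stack([temp[i:len(temp) - p + i] for i in range(p)])
y = temp[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # AR coefficients

# roll the model forward to predict the next 20 samples
history = list(temp[-p:])
for _ in range(20):
    history.append(np.dot(coef, history[-p:]))

print("last measured:", round(temp[-1], 2), "predicted +20 steps:", round(history[-1], 2))
```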

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A fine-grained dynamic voltage scaling (FDVS) technique is proposed to reduce OLED power and to effectively reduce the color remapping cost when color compensation is required to improve the image quality of an OLED panel operating at a scaled voltage.
Abstract: Organic Light Emitting Diode (OLED) displays have emerged as the new-generation display technology for mobile multimedia devices. Compared to existing technologies, OLEDs are thinner, brighter, lighter, and cheaper. However, OLED panels are still the biggest contributor to the total power consumption of mobile devices. In this work, we propose a fine-grained dynamic voltage scaling (FDVS) technique to reduce OLED power. An OLED panel is partitioned into multiple display areas whose supply voltages are adaptively adjusted based on the displayed content. A DVS-friendly OLED driver design is also proposed to enhance the color accuracy of the OLED pixels at the scaled supply voltage. Our experimental results show that, compared to the existing global DVS technique, the FDVS technique can achieve 25.9%∼43.1% more OLED power saving while maintaining high image quality as measured by the Structural Similarity Index (SSIM=0.98). Further analysis shows that the FDVS technique can also effectively reduce the color remapping cost when color compensation is required to improve the image quality of an OLED panel operating at a scaled voltage.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Experimental results show that the general approach to synthesizing a linear FSM-based SCE for a target function produces circuits that are much more tolerant of soft errors than deterministic implementations, while the area-delay product of the circuits is less than that of deterministic implementations.
Abstract: The Stochastic Computational Element (SCE) uses streams of random bits (stochastic bit streams) to perform computation with conventional digital logic gates. It can guarantee reliable computation using unreliable devices. In stochastic computing, the linear Finite State Machine (FSM) can be used to implement some sophisticated functions, such as the exponentiation and tanh functions, more efficiently than combinational logic. However, a general approach for synthesizing a linear FSM-based SCE for a target function has not been available. In this paper, we introduce three properties of the linear FSM used in stochastic computing and demonstrate a general approach to synthesizing a linear FSM-based SCE for a target function. Experimental results show that our approach produces circuits that are much more tolerant of soft errors than deterministic implementations, while the area-delay product of the circuits is less than that of deterministic implementations.
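As background, here is a small sketch of the kind of linear FSM the abstract refers to, in the spirit of the classic saturating up/down-counter stochastic tanh element (this is not the paper's synthesis procedure; the state count and stream length are arbitrary):

```python
# Hedged sketch of a linear FSM used in stochastic computing (a saturating
# up/down counter, in the spirit of the classic stochastic tanh element),
# not the paper's synthesis procedure.  Input value x in [-1, 1] is encoded
# as a bitstream with P(1) = (x + 1) / 2; the output stream's bipolar value
# approximates a tanh-like saturating function of x.
import random, math

def stochastic_fsm(x, n_states=8, length=100_000, seed=1):
    rng = random.Random(seed)
    p1 = (x + 1) / 2
    state, ones = n_states // 2, 0
    for _ in range(length):
        bit = 1 if rng.random() < p1 else 0
        state = min(state + 1, n_states - 1) if bit else max(state - 1, 0)
        ones += 1 if state >= n_states // 2 else 0     # output bit of the FSM
    return 2 * ones / length - 1                       # back to bipolar value

for x in (-0.8, -0.3, 0.0, 0.3, 0.8):
    print(f"x={x:+.1f}  FSM output={stochastic_fsm(x):+.3f}  tanh(4x)={math.tanh(4 * x):+.3f}")
```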

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper integrates data reuse, loop pipelining, memory partitioning, and memory merging into an automated optimization flow (AMO) for FPGA behavioral synthesis, and develops memory padding to help in the memory partitions of indices with modulo operations.
Abstract: Behavioral synthesis tools have made significant progress in compiling high-level programs into register-transfer level (RTL) specifications. But manually rewriting code is still necessary in order to obtain better quality of results in memory system optimization. In recent years different automated memory optimization techniques have been proposed and implemented, such as data reuse and memory partitioning, but the problem of integrating these techniques into an applicable flow to obtain a better performance has become a challenge. In this paper we integrate data reuse, loop pipelining, memory partitioning, and memory merging into an automated optimization flow (AMO) for FPGA behavioral synthesis. We develop memory padding to help in the memory partitioning of indices with modulo operations. Experimental results on Xilinx Virtex-6 FPGAs show that our integrated approach can gain an average 5.8× throughput and 4.55× latency improvement compared to the approach without memory partitioning. Moreover, memory merging saves up to 44.32% of block RAM (BRAM).
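A minimal sketch of why memory padding helps cyclic partitioning, assuming a 3-point vertical stencil whose row-major addresses differ by the row width; the widths and bank count are illustrative, not from the paper:

```python
# Hedged sketch of cyclic memory partitioning with padding (the flow's actual
# analysis is more general).  A pipelined 3-point vertical stencil reads
# A[r-1][c], A[r][c], A[r+1][c] in the same cycle; after row-major
# linearization the three addresses differ by the row width C.  With NB
# cyclic banks the accesses collide unless the addresses fall in distinct
# banks, which padding the row width can guarantee.

def banks_conflict_free(row_width, n_banks):
    r, c = 5, 2                                  # any interior element (illustrative)
    addrs = [(r + dr) * row_width + c for dr in (-1, 0, 1)]
    banks = [a % n_banks for a in addrs]         # cyclic partitioning
    return len(set(banks)) == len(banks), banks

for width in (6, 7):                             # 7 = original width 6 plus 1 padding column
    ok, banks = banks_conflict_free(width, n_banks=3)
    print(f"row width {width}: banks {banks} -> {'conflict-free' if ok else 'bank conflict'}")
```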

Proceedings ArticleDOI
09 Mar 2012
TL;DR: GLOW is presented, a hybrid global router that provides low-power opto-electronic interconnect synthesis under considerations of thermal reliability and various physical design constraints such as optical power, delay, and signal quality.
Abstract: In this paper, we examine the integration potential and explore the design space of low-power, thermally reliable on-chip interconnect synthesis featuring nanophotonic Wavelength Division Multiplexing (WDM). With recent advancements, it is foreseen that nanophotonics holds the promise to be employed for future on-chip data signalling due to its unique power efficiency, signal delay, and huge multiplexing potential. However, there are major challenges to address before feasible on-chip integration can be reached. In this paper, we present GLOW, a hybrid global router that provides low-power opto-electronic interconnect synthesis under the considerations of thermal reliability and various physical design constraints such as optical power, delay, and signal quality. GLOW is evaluated with test cases derived from the ISPD07-08 global routing benchmarks. Compared with a greedy approach, GLOW demonstrates around 23%–50% reduction in total optical power, revealing the great potential of on-chip WDM interconnect synthesis.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: The global charge replacement (GCR) optimization problem is formally described and an algorithm to find the near-optimal GCR control policy is provided and significant improvements in the charge replacement efficiency are demonstrated.
Abstract: Hybrid electrical energy storage (HEES) systems are composed of multiple banks of heterogeneous electrical energy storage (EES) elements with distinctive properties. Charge replacement in a HEES system (i.e., dynamic assignment of load demands to EES banks) is one of the key operations in the system. This paper formally describes the global charge replacement (GCR) optimization problem and provides an algorithm to find the near-optimal GCR control policy. The optimization problem is formulated as a mixed-integer nonlinear programming problem, where the objective function is the charge replacement efficiency. The constraints account for the energy conservation law, efficiency of the charger/converter, the rate capacity effect, and self-discharge rates plus internal resistances of the EES element arrays. The near-optimal solution to this problem is obtained while considering the state of charges (SoCs) of the EES element arrays, characteristics of the load devices, and estimates of energy contributions by the EES element arrays. Experimental results demonstrate significant improvements in the charge replacement efficiency in an example HEES system comprised of banks of battery and supercapacitor elements with a high-power pulsed military radio transceiver as the load device.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Experimental results show that the signal TSV planner outperforms the state-of-the-art TSV-aware 3D floorplanner by 7% to 38% with respect to wirelength, and the multiple TSV insertion algorithm outperforms a single TSV insertion algorithm by 27% to 37%.
Abstract: Since re-designing and re-optimizing existing logic, memory, and IP blocks in a 3D fashion significantly increases design cost, near-term three-dimensional integrated circuit (3D IC) design will focus on reusing existing 2D blocks. One way to reuse 2D blocks in the 3D IC design is to first perform 3D floorplanning, insert signal through-silicon vias (TSVs) for 3D inter-block connections, and then route the blocks. In this paper, we propose algorithms (finding signal TSV locations, assigning TSVs to whitespace blocks, and manipulating whitespace blocks) for post-floorplanning signal TSV planning in the block-level 3D IC design. Experimental results show that our signal TSV planner outperforms the state-of-the-art TSV-aware 3D floorplanner by 7% to 38% with respect to wirelength. In addition, our multiple TSV insertion algorithm outperforms a single TSV insertion algorithm by 27% to 37%.
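One of the sub-problems above, TSV-to-whitespace assignment, can be sketched with a simple capacity-aware greedy based on Manhattan distance (the paper's planner is more sophisticated; coordinates and capacities below are invented):

```python
# Hedged sketch of one sub-problem above: assigning signal TSVs to whitespace
# blocks with limited capacity, greedily by Manhattan distance (the paper's
# planner is more sophisticated).  Coordinates and capacities are invented.
tsv_targets = [(2, 3), (2, 4), (8, 1), (9, 9), (5, 5)]     # ideal TSV spots
whitespace = {                                              # block -> (center, capacity)
    "W1": ((2, 2), 2),
    "W2": ((8, 2), 1),
    "W3": ((7, 8), 2),
}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

remaining = {name: cap for name, (_, cap) in whitespace.items()}
assignment = {}
for tsv in tsv_targets:
    candidates = [n for n in whitespace if remaining[n] > 0]
    best = min(candidates, key=lambda n: manhattan(tsv, whitespace[n][0]))
    assignment[tsv] = best
    remaining[best] -= 1

total = sum(manhattan(t, whitespace[b][0]) for t, b in assignment.items())
print(assignment)
print("total added Manhattan wirelength:", total)
```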

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Some key design for manufacturability and reliability challenges and possible solutions for TSV-based 3D IC integration, as well as future research directions are discussed.
Abstract: The 3D IC integration using through-silicon-vias (TSV) has gained tremendous momentum recently for industry adoption. However, as TSV involves disruptive manufacturing technologies, new modeling and design techniques need to be developed for 3D IC manufacturability and reliability. In particular, TSVs in 3D IC may cause significant thermal mechanical stress, which not only results in systematic mobility/performance variations, but also leads to mechanical reliability concerns such as interfacial cracking. Meanwhile, the huge dimensional gaps between TSV, on-chip wires, and bonding/packaging all lead to new electromigration concerns. Thus full-chip/package modeling and physical design tools need to be developed to achieve more reliable 3D IC integration. In this paper, we will discuss some key design for manufacturability and reliability challenges and possible solutions for TSV-based 3D IC integration, as well as future research directions.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This study chooses medical imaging as the application domain for investigation, and studies the application performance and energy efficiency across a diverse set of commodity hardware platforms, such as general-purpose multi-core CPUs, massive parallel many-core GPUs, low-power mobile CPUs and fine-grain customizable FPGAs.
Abstract: We believe that by adapting architectures to fit the requirements of a given application domain, we can significantly improve the efficiency of computation. To validate the idea for our application domain, we evaluate a wide spectrum of commodity computing platforms to quantify the potential benefits of heterogeneity and customization for the domain-specific applications. In particular, we choose medical imaging as the application domain for investigation, and study the application performance and energy efficiency across a diverse set of commodity hardware platforms, such as general-purpose multi-core CPUs, massive parallel many-core GPUs, low-power mobile CPUs and fine-grain customizable FPGAs. This study leads to a number of interesting observations that can be used to guide further development of domain-specific architectures.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A systolic array-based architecture is presented that includes a run-time reconfigurable convolution engine able to perform multiple variable-sized convolutions in parallel; it leverages spatial parallelism and dedicated wide data buses with on-chip memories to provide an energy-efficient solution suitable for adoption in embedded systems.
Abstract: Advances in neuroscience have enabled researchers to develop computational models of auditory, visual and learning perceptions in the human brain. HMAX, which is a biologically inspired model of the visual cortex, has been shown to outperform standard computer vision approaches for multi-class object recognition. HMAX, while computationally demanding, can be potentially applied in various applications such as autonomous vehicle navigation, unmanned surveillance and robotics. In this paper, we present a reconfigurable hardware accelerator for the time-consuming S2 stage of the HMAX model. The accelerator leverages spatial parallelism, dedicated wide data buses with on-chip memories to provide an energy efficient solution to enable adoption into embedded systems. We present a systolic array-based architecture which includes a run-time reconfigurable convolution engine which can perform multiple variable-sized convolutions in parallel. An automation flow is described for this accelerator which can generate optimal hardware configurations for a given algorithmic specification and also perform run-time configuration and execution seamlessly. Experimental results on Virtex-6 FPGA platforms show 5X to 11X speedups and 14X to 33X higher performance-per-Watt over a CNS-based implementation on a Tesla GPU.
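A software-only sketch of the S2 workload the accelerator targets, assuming plain valid 2D convolutions against several variable-sized templates on random data (the hardware described above streams these computations through a systolic array):

```python
# Hedged sketch of the S2-stage workload (software model only, nothing like
# the systolic hardware): responses of an input map to several variable-sized
# templates, computed as plain valid 2D convolutions.  Data are random.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))                            # stand-in for a feature map
templates = [rng.random((k, k)) for k in (4, 8, 12)]    # variable-sized patches

def conv2d_valid(img, ker):
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):                                 # the accelerator unrolls these loops
        for x in range(ow):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * ker)
    return out

for ker in templates:
    resp = conv2d_valid(image, ker)
    print(f"{ker.shape[0]}x{ker.shape[1]} template -> response map {resp.shape}, "
          f"max {resp.max():.2f}")
```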

Proceedings ArticleDOI
09 Mar 2012
TL;DR: In this article, the authors present ingredients for a class of abstract, high-level platform models that enable fast yet accurate performance and power simulation of application execution on heterogeneous multi-core/processor architectures.
Abstract: With the increasing complexity of today's embedded systems, research has focused on developing fast yet accurate high-level and executable models of complete platforms. These models address the need for hardware/software co-simulation of the entire system at early stages of the design. Traditional models tend to be either slow or inaccurate. In this paper, we present ingredients for a class of abstract, high-level platform models that enable fast yet accurate performance and power simulation of application execution on heterogeneous multi-core/multi-processor architectures. Models are based on host-compiled simulation of the application code, which is instrumented with timing and power information. Back-annotated source code is further augmented with abstract OS and processor models that are integrated into standard co-simulation backplanes. The efficiency of the modeling platform has been evaluated by applying an industrial-strength benchmark, demonstrating the feasibility and benefits of such models for rapid, early exploration of the power, performance, and cost design space. Results show that an accurate Pareto set of solutions can be obtained in a fraction of the time needed with traditional simulation and modeling approaches.
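A toy illustration of host-compiled, back-annotated simulation: application functions run natively but advance a simulated clock through annotated cycle counts, and an abstract processor model converts cycles to time and energy. All cycle counts, the clock frequency, and the power figure are invented:

```python
# Hedged, toy stand-in for host-compiled simulation with back-annotated
# timing: application code runs natively but advances a simulated clock by
# annotated cycle counts; an abstract processor model converts cycles to time
# and energy.  Cycle counts, frequency, and power numbers are invented.
class AbstractCore:
    def __init__(self, freq_hz=400e6, active_power_w=0.25):
        self.freq_hz = freq_hz
        self.active_power_w = active_power_w
        self.cycles = 0

    def consume(self, cycles):          # called by back-annotated application code
        self.cycles += cycles

    @property
    def sim_time_s(self):
        return self.cycles / self.freq_hz

    @property
    def energy_j(self):
        return self.sim_time_s * self.active_power_w

core = AbstractCore()

def read_sensor():
    core.consume(120)                   # annotation: ~120 cycles on the target
    return 42

def filter_sample(x):
    core.consume(850)                   # annotation: ~850 cycles on the target
    return x * 0.9

for _ in range(1000):                   # host-compiled: runs natively, fast
    filter_sample(read_sensor())

print(f"simulated time: {core.sim_time_s * 1e3:.3f} ms, energy: {core.energy_j * 1e3:.3f} mJ")
```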

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This research attacks the Dark Silicon problem directly through a set of energy-saving accelerators, called Conservation Cores, or c-cores, a post-multicore approach that constructively uses dark silicon to reduce the energy consumption of an application by 10× or more.
Abstract: The Dark Silicon Age kicked off with the transition to multicore and will be characterized by a wild chase for seemingly ever-more insane architectural designs. At the heart of this transformation is the Utilization Wall, which states that, with each new process generation, the percentage of transistors that a chip can switch at full frequency is dropping exponentially due to power constraints. This has led to increasingly larger and larger fractions of a chip's silicon area that must remain passive, or dark. Since Dark Silicon is an exponentially-worsening phenomenon, getting worse at the same rate that Moore's Law is ostensibly making process technology better, we need to seek out fundamentally new approaches to designing processors for the Dark Silicon Age. Simply tweaking existing designs is not enough. Our research attacks the Dark Silicon problem directly through a set of energy-saving accelerators, called Conservation Cores, or c-cores. C-cores are a post-multicore approach that constructively uses dark silicon to reduce the energy consumption of an application by 10× or more. To examine the utility of c-cores, we are developing GreenDroid, a multicore chip that targets the Android mobile software stack. Our mobile application processor prototype targets a 32-nm process and is comprised of hundreds of automatically generated, specialized, patchable c-cores. These cores target specific Android hotspots, including the kernel. Our preliminary results suggest that we can attain up to 11× improvement in energy efficiency using a modest amount of silicon.

Proceedings ArticleDOI
Hao Zhuang, Wenjian Yu, Gang Hu, Zhi Liu, Zuochang Ye
09 Mar 2012
TL;DR: A space management technique with an Octree data structure is presented to reduce the time of each hop, and the whole FRW is parallelized with multi-threaded programming; results show a large speedup from the proposed techniques for structures in VLSI technologies with thin dielectric layers.
Abstract: The floating random walk (FRW) algorithm has several advantages for extracting 3D interconnect capacitance. However, for the multi-layer dielectrics in VLSI technology, the efficiency of the FRW algorithm is degraded due to the frequent stopping of walks at dielectric interfaces and the constraint on the first-hop length, especially in thin dielectrics. In this paper, we tackle these problems with the numerical characterization of Green's function for cross-interface transition probabilities and the corresponding weight values. We also present a space management technique with an Octree data structure to reduce the time of each hop, and we parallelize the whole FRW with multi-threaded programming. Numerical results show a large speedup brought by the proposed techniques for structures under VLSI technologies with thin dielectric layers.
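A miniature, single-dielectric version of the random-walk idea, assuming a uniform grid walk and a simple parallel-plate-like boundary condition (nothing like the paper's multi-dielectric Green's-function machinery):

```python
# Hedged, miniature version of the random-walk idea (single uniform
# dielectric, simple grid walk): estimate the potential at an interior point
# of a square region by averaging the boundary potentials hit by random walks.
import random

N = 20                                   # grid is (N+1) x (N+1); boundary at 0 and N
def boundary_potential(x, y):
    return 1.0 if y == N else 0.0        # top plate at 1 V, other walls grounded

def walk(x, y, rng):
    while 0 < x < N and 0 < y < N:
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = x + dx, y + dy
    return boundary_potential(x, y)

rng = random.Random(0)
x0, y0 = N // 2, 3 * N // 4              # query point, closer to the top plate
n_walks = 5_000
estimate = sum(walk(x0, y0, rng) for _ in range(n_walks)) / n_walks
print(f"estimated potential at ({x0},{y0}): {estimate:.3f}")
```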

Proceedings ArticleDOI
09 Mar 2012
TL;DR: This paper presents an algorithm that can optimally solve the pattern relocation problem; relocation results with full-scale layouts generated from the Nangate Open Cell Library show great advantages and competitive runtimes compared to an existing commercial tool.
Abstract: Blank defect mitigation is a critical step for extreme ultraviolet (EUV) lithography. Targeting the defective blank, a layout relocation method, which shifts and rotates the whole layout pattern to a proper position, has been proved to be an effective way to reduce defect impact. Yet there is still no published work on how to find the best pattern location to minimize the impact of the buried defects with a reasonable defect model and adequate process variation control. In this paper, we present an algorithm that can optimally solve this pattern relocation problem. Experimental results validate our method, and relocation results with full-scale layouts generated from the Nangate Open Cell Library show great advantages and competitive runtimes compared to an existing commercial tool.
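A brute-force sketch of the relocation idea: enumerate candidate shifts and a 180-degree rotation, and count how many blank defects land under critical pattern regions. The geometry, defect positions, and coarse search grid are invented; the paper solves the problem optimally with a realistic defect model:

```python
# Hedged sketch of the relocation idea (brute-force search over a coarse grid
# of shifts and a 180-degree rotation; the paper solves the problem optimally
# with a real defect model).  Critical pattern rectangles and blank defect
# positions below are invented.
critical_rects = [(2, 2, 6, 3), (10, 5, 14, 9), (4, 12, 12, 13)]   # (x1, y1, x2, y2)
defects = [(5, 2.5), (11, 7), (18, 18)]                            # buried defect centers
FIELD = 20                                                          # usable blank extent

def covered(defect, rect):
    x, y = defect
    x1, y1, x2, y2 = rect
    return x1 <= x <= x2 and y1 <= y <= y2

def cost(dx, dy, rot):
    hits = 0
    for x1, y1, x2, y2 in critical_rects:
        if rot:                                   # rotate pattern 180 deg about field center
            x1, y1, x2, y2 = FIELD - x2, FIELD - y2, FIELD - x1, FIELD - y1
        rect = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        hits += sum(covered(d, rect) for d in defects)
    return hits

best = min(
    ((cost(dx, dy, rot), dx, dy, rot)
     for dx in range(0, 5) for dy in range(0, 5) for rot in (False, True)),
    key=lambda t: t[0],
)
print(f"defects under critical patterns: {best[0]} at shift ({best[1]},{best[2]}), rotated={best[3]}")
```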

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs is proposed to accelerate circuit simulation, along with a predictive method to decide whether a matrix should use the parallel or the sequential algorithm.
Abstract: The sparse matrix solver has become the bottleneck in SPICE simulators. It is difficult to parallelize the solver because of the high data dependency during the numerical LU factorization. This paper proposes a parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs to accelerate circuit simulation. Since not every matrix is suitable for the parallel algorithm, a predictive method is proposed to decide whether a matrix should use the parallel or the sequential algorithm. The experimental results on 35 circuit matrices reveal that the developed algorithm achieves speedups of 2.11×∼8.38× (geometric average), compared with KLU, with 1∼8 threads, on the matrices which are suitable for the parallel algorithm. Our solver can be downloaded from http://nicslu.weebly.com.
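For context, the kernel being parallelized is ordinary LU factorization with partial pivoting; a plain sequential dense version is sketched below (the paper works on sparse matrices and exploits column-level parallelism, which this sketch does not attempt):

```python
# Hedged sketch of the kernel being parallelized: a plain, sequential, dense
# LU factorization with partial pivoting (the paper targets sparse matrices
# and exploits parallelism; this is only the baseline operation).
import numpy as np

def lu_partial_pivot(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))      # partial pivoting: pick largest pivot
        if p != k:
            A[[k, p]] = A[[p, k]]
            perm[[k, p]] = perm[[p, k]]
        A[k + 1:, k] /= A[k, k]                  # multipliers stored in the lower part
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return perm, L, U

M = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
perm, L, U = lu_partial_pivot(M)
print("max |P·M - L·U| =", np.max(np.abs(M[perm] - L @ U)))
```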

Proceedings ArticleDOI
09 Mar 2012
TL;DR: A level one data cache tuning heuristic for a heterogeneous multi-core system is presented, which classifies applications based on data sharing and cache behavior, and uses this classification to guide cache tuning and reduce the number of cores that need to be tuned.
Abstract: Since multi-core architectures are becoming more popular, recent multi-core optimizations focus on energy consumption. We present a level one data cache tuning heuristic for a heterogeneous multi-core system, which classifies applications based on data sharing and cache behavior, and uses this classification to guide cache tuning and reduce the number of cores that need to be tuned. Results reveal average energy savings of 25% for 2-, 4-, 8-, and 16-core systems while searching only 1% of the design space.

Proceedings ArticleDOI
09 Mar 2012
TL;DR: Recent advances in 3D stack yield techniques are surveyed and challenges to be resolved in the future are pointed out.
Abstract: Three-dimensional (3D) integrated circuits (ICs) that stack multiple dies vertically using through-silicon vias (TSVs) have gained wide interest in the semiconductor industry. The shift towards volume production of 3D-stacked ICs, however, requires their manufacturing yield to be commercially viable. Various techniques have been presented in the literature to address this important problem, including pre-bond testing techniques to tackle the “known good die” problem, TSV redundancy designs to provide defect tolerance, and wafer/die matching solutions to improve the overall stack yield. In this paper, we survey recent advances in this field and point out challenges to be resolved in the future.
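A back-of-the-envelope illustration of the "known good die" problem mentioned above: stacking untested dies multiplies their yields, while ideal pre-bond testing leaves only the bonding yield. The yield numbers are illustrative:

```python
# Hedged, back-of-the-envelope illustration of the "known good die" problem:
# stacking d untested dies multiplies their yields, while pre-bond testing
# lets only good dies be bonded.  Yield numbers below are illustrative.
die_yield = 0.90        # per-die manufacturing yield (illustrative)
bond_yield = 0.99       # per-bond assembly yield (illustrative)

for d in (2, 4, 8):
    no_kgd = die_yield ** d * bond_yield ** (d - 1)          # blind stacking
    with_kgd = bond_yield ** (d - 1)                         # only tested-good dies bonded
    print(f"{d}-die stack: without pre-bond test {no_kgd:.1%}, "
          f"with ideal pre-bond test {with_kgd:.1%}")
```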