Showing papers presented at "Asia and South Pacific Design Automation Conference in 2015"

Proceedings Article•DOI•

Technological exploration of RRAM crossbar array for matrix-vector multiplication

Peng Gu¹, Boxun Li¹, Tianqi Tang¹, Shimeng Yu², Yu Cao², Yu Wang¹, Huazhong Yang¹•Institutions (2)

Tsinghua University¹, Arizona State University²

12 Mar 2015

TL;DR: The impact of nonlinear voltage-current relationship of RRAM devices and the interconnect resistance as well as other crossbar array parameters on the circuit performance is analyzed and a design guide is presented to achieve better trade-off among performance, energy and reliability for each specific application.

Abstract: Matrix-vector multiplication is the key operation in many computationally intensive algorithms. In recent years, the emerging metal oxide resistive switching random access memory (RRAM) device and RRAM crossbar array have demonstrated a promising hardware realization of analog matrix-vector multiplication with ultra-high energy efficiency. In this paper, we analyze the impact of the nonlinear voltage-current relationship of RRAM devices, the interconnect resistance, and other crossbar array parameters on circuit performance, and present a design guide. On top of that, we propose a technological exploration flow for device parameter configuration to overcome the impact of nonideal factors and achieve a better trade-off among performance, energy and reliability for each specific application. Simulation results with a support vector machine (SVM) and the MNIST pattern recognition dataset show that the RRAM crossbar array-based SVM is robust to input signal fluctuation but sensitive to tunneling gap deviation. A further resistance resolution test shows that a 4-bit RRAM device is able to realize a recognition accuracy of ~90%, indicating the physical feasibility of the RRAM crossbar array-based SVM. In addition, the proposed technological exploration flow achieves a 10.98% improvement in recognition accuracy on the MNIST dataset and 26.4% energy savings compared with previous work.

99 citations
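
The analog matrix-vector multiplication described in the abstract above can be illustrated with a minimal sketch. It assumes an idealized crossbar (linear devices, zero interconnect resistance), which is exactly the simplification the paper argues does not hold in practice. The matrix, the input voltages, and the helper names `quantize` and `crossbar_mvm` are hypothetical and not taken from the paper; the 4-bit setting mirrors the resistance resolution mentioned in the abstract.

```python
# Idealized RRAM crossbar: the current on column j is I_j = sum_i V_i * G[i][j]
# (Ohm's law per cell, Kirchhoff's current law per column).
# Assumptions: linear devices, no wire resistance, normalized conductances in [0, 1].

def quantize(weight, bits, w_max=1.0):
    """Round a weight in [0, w_max] to one of 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    clipped = min(max(weight, 0.0), w_max)
    return round(clipped / w_max * levels) / levels * w_max

def crossbar_mvm(voltages, conductances):
    """Return the column currents for row voltages applied to a conductance matrix."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]

weights = [[0.10, 0.80], [0.55, 0.30], [0.95, 0.05]]  # hypothetical 3x2 matrix
inputs = [1.0, 0.5, 0.25]                             # hypothetical row voltages

ideal = crossbar_mvm(inputs, weights)
four_bit = [[quantize(w, 4) for w in row] for row in weights]
quantized = crossbar_mvm(inputs, four_bit)

# Worst-case deviation introduced by storing each weight with 4-bit resolution.
max_error = max(abs(a - b) for a, b in zip(ideal, quantized))
print(ideal, quantized, max_error)
```

In this toy setting the quantization error per weight is at most half a level (1/30), so the column error is bounded by the sum of the input magnitudes divided by 30. Nonlinearity and interconnect resistance, the paper's actual subject, would add further deviation that this sketch does not model.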

Proceedings Article•DOI•

Determining the minimal number of swap gates for multi-dimensional nearest neighbor quantum circuits

Aaron Lye¹, Robert Wille¹, Rolf Drechsler¹•Institutions (1)

University of Bremen¹

01 Jan 2015

TL;DR: This work proposes an exact scheme for nearest neighbor optimization in multi-dimensional quantum circuits and shows that the proposed solution is sufficient to allow for a qualitative evaluation of the respective optimization steps.

Abstract: Motivated by the promise of significant speed-ups for certain problems, quantum computing has received significant attention in the past. While much progress has been made in the development of synthesis methods for quantum circuits, new physical developments constantly lead to new constraints to be addressed. The limited interaction distance between the respective qubits (i.e. nearest neighbor optimization) has already been considered intensely. But with the emergence of multi-dimensional quantum architectures, new physical requirements came up for which only a few automatic synthesis solutions exist, all of them heuristic in nature. In this work, we propose an exact scheme for nearest neighbor optimization in multi-dimensional quantum circuits. Although the complexity of the problem is a serious obstacle, our experimental evaluation shows that the proposed solution is sufficient to allow for a qualitative evaluation of the respective optimization steps. Besides that, this enabled an exact comparison to heuristic results for the first time.

92 citations

Proceedings Article•DOI•

Data sensing and analysis: Challenges for wearables

James Williamson¹, Qi Liu¹, Fenglong Lu¹, Wyatt Mohrman¹, Kun Li¹, Robert P. Dick², Li Shang¹•Institutions (2)

University of Colorado Boulder¹, University of Michigan²

12 Mar 2015

TL;DR: The energy challenges for wearable sensing technologies are described, with a primary focus on the most widely used wearable sensors, MEMS-based inertial measurement units (IMUs), covering data sensing, analysis, and wireless communication.

Abstract: Wearables are a leading category in the Internet of Things. Compared with mainstream mobile phones, wearables target one order of magnitude form factor reduction, and offer the potential of providing ubiquitous, personalized services to end users. Aggressive reduction in size imposes serious limits on battery capacity. Wearables are equipped with a range of sensors, such as accelerometers and gyroscopes. Most economical sensors were developed for mobile phones, with energy consumption more appropriate for phones than for ultra-compact wearables. This article describes the energy challenges for wearable sensing technologies, with a primary focus on the most widely used wearable sensors: MEMS-based inertial measurement units. Using sports and fitness wearables as the pilot application, we analyze the energy characteristics of MEMS IMU data sensing, analysis, and wireless communication. We then discuss the technologies needed to solve the power and energy consumption challenges for wearables.

79 citations

Proceedings Article•DOI•

Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power

Chao Zhang¹, Guangyu Sun¹, Weiqi Zhang¹, Fan Mi², Hai Li², Weisheng Zhao³•Institutions (3)

Peking University¹, University of Pittsburgh², Beihang University³

12 Mar 2015

TL;DR: This model introduces Macro Unit (MU) as the building block of RM, and analyzes the interaction of its attributes, and integrates the model into NVsim to enable the automatic exploration of its huge design space.

Abstract: Recently, an emerging non-volatile memory called Racetrack Memory (RM) has become a promising way to satisfy the requirement of increasing on-chip memory capacity. RM can achieve ultra-high storage density by integrating many bits in a tape-like racetrack, and also provide read/write speed comparable to SRAM. However, the lack of circuit-level modeling has limited the design exploration of RM, especially at the system level. To overcome this limitation, we develop an RM circuit-level model, with careful study of device configurations and circuit layouts. This model introduces Macro Unit (MU) as the building block of RM, and analyzes the interaction of its attributes. Moreover, we integrate the model into NVsim to enable the automatic exploration of its huge design space. Our case study of an RM cache demonstrates significant variance under different optimization targets, with respect to area, performance, and energy. In addition, we show that cross-layer optimization is critical for the adoption of RM as on-chip memory.

58 citations

Proceedings Article•DOI•

Implementation of double arbiter PUF and its performance evaluation on FPGA

Takanori Machida¹, Dai Yamamoto², Mitsugu Iwamoto¹, Kazuo Sakiyama¹•Institutions (2)

University of Electro-Communications¹, Fujitsu²

12 Mar 2015

TL;DR: This paper implements Double APUF (DAPUF) that duplicates the original APUF in order to overcome the problems of low uniqueness and vulnerability to machine-learning attacks.

Abstract: Low uniqueness and vulnerability to machine-learning attacks are known as two major problems of Arbiter-Based Physically Unclonable Function (APUF) implemented on FPGAs. In this paper, we implement Double APUF (DAPUF) that duplicates the original APUF in order to overcome the problems. From the experimental results on Xilinx Virtex-5, we show that the uniqueness of DAPUF becomes almost ideal, and the prediction rate of the machine-learning attack decreases from 86% to 57%.

48 citations

Proceedings Article•DOI•

Powering the IoT: Storage-less and converter-less energy harvesting

Hyung Gyu Lee¹, Naehyuck Chang²•Institutions (2)

Daegu University¹, KAIST²

12 Mar 2015

TL;DR: This paper introduces a novel energy harvesting and management technique to power the IoT which, unlike traditional energy harvesting systems, requires neither long-term energy storage nor voltage converters.

Abstract: Widespread adoption of the Internet of Things (IoT) still faces hurdles in cost and maintenance. Energy harvesting is a promising option to mitigate battery replacement, but current energy harvesting methods still rely on batteries or equivalent storage and on power converters for maximum power point tracking (MPPT). Unfortunately, batteries are subject to wear and tear, which is a primary factor preventing such systems from being maintenance free. Power converters are expensive, heavy and lossy as well. In this paper, we introduce a novel energy harvesting and management technique to power the IoT which, unlike traditional energy harvesting systems, requires neither long-term energy storage nor voltage converters. Extensive simulations and measurements from our prototype demonstrate that the proposed method harvests 8% more energy and extends the operation time of the device by 60% over a day. This paper also demonstrates a UV (ultraviolet) level meter for skin protection, named SmartPatch, using the proposed energy harvesting method. The proposed method is not limited to photovoltaic energy harvesting but is applicable to most energy harvesting IoT power supplies that require impedance tracking.

46 citations

Proceedings Article•DOI•

Machine learning and pattern matching in physical design

Bei Yu¹, David Z. Pan¹, Tetsuaki Matsunawa², Xuan Zeng³•Institutions (3)

University of Texas at Austin¹, Toshiba², Fudan University³

12 Mar 2015

TL;DR: This paper will discuss key techniques and recent results of machine learning and pattern matching, with their applications in physical design.

Abstract: Machine learning (ML) and pattern matching (PM) are powerful computer science techniques which can derive knowledge from big data, and provide prediction and matching. Since nanometer VLSI design and manufacturing have extremely high complexity and gigantic data, there has been a surge recently in applying and adapting machine learning and pattern matching techniques in VLSI physical design (including physical verification), e.g., lithography hotspot detection and data/pattern-driven physical design, as ML and PM can raise the level of abstraction from detailed physics-based simulations and provide reasonably good quality-of-result. In this paper, we will discuss key techniques and recent results of machine learning and pattern matching, with their applications in physical design.

45 citations

Proceedings Article•DOI•

Aging mitigation in memory arrays using self-controlled bit-flipping technique

Anteneh Gebregiorgis, Mojtaba Ebrahimi¹, Saman Kiamehr¹, Fabian Oboril¹, Said Hamdioui, Mehdi B. Tahoori¹•Institutions (1)

Karlsruhe Institute of Technology¹

12 Mar 2015

TL;DR: A low-cost self-controlled bit-flipping technique, which inverts all bit positions with respect to an existing bit, is proposed and applied to the register file and cache units of an embedded microprocessor; results show that its reliability is similar to that of existing bit-flipping techniques while imposing 64% less area overhead.

Abstract: With CMOS technology downscaling into the nanometer regime, the reliability of SRAM memories is threatened by accelerated transistor aging mechanisms such as Bias Temperature Instability (BTI). BTI leads to a considerable degradation of SRAM cell Static Noise Margin (SNM), which increases the memory failure rate. Since BTI is workload dependent, the aging rates of different cells in a memory array are quite non-uniform. To address this issue, a variety of bit-flipping techniques has been proposed to decrease the SNM degradation by balancing the signal probabilities of the cells. However, existing bit-flipping techniques impose too much area and power overhead as at least an additional column is required to store the inversion flags. In this paper, we propose a low cost self-controlled bit-flipping technique which inverts all bit positions with respect to an existing bit. This technique is applied to a register-file and cache units of an embedded microprocessor. Our simulation results show that the reliability of the proposed technique is similar to that of existing bit-flipping techniques, while imposing 64% less area overhead.

41 citations
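
One way to read "inverts all bit positions with respect to an existing bit" is sketched below. This is an illustrative assumption, not the paper's circuit: the most significant bit of each word is treated as the control bit, and the remaining bits are stored inverted whenever that bit is 1, so no extra flag column is needed. The word width and the function names `encode` and `decode` are hypothetical.

```python
# Hypothetical sketch of flag-free bit flipping: the word's own MSB decides
# whether the remaining bits are stored inverted. Not taken from the paper.

WIDTH = 8
LOW_MASK = (1 << (WIDTH - 1)) - 1  # selects the lower WIDTH-1 bits

def encode(word):
    """Invert the lower bits when the control bit (MSB) is 1; keep the MSB as-is."""
    control = (word >> (WIDTH - 1)) & 1
    low = word & LOW_MASK
    if control:
        low ^= LOW_MASK
    return (control << (WIDTH - 1)) | low

def decode(stored):
    """The control bit is stored unchanged, so applying the same transform undoes it."""
    return encode(stored)

stored = encode(0b10000001)
print(format(stored, "08b"))          # 11111110
print(format(decode(stored), "08b"))  # 10000001
```

Because the control bit is part of the data, the transform is its own inverse. How well such a scheme balances the signal probabilities of the cells depends on how often the control bit toggles in the workload, which this sketch does not evaluate.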

Proceedings Article•DOI•

IC Piracy prevention via Design Withholding and Entanglement

Soroush Khaleghi¹, Kai Da Zhao¹, Wenjing Rao¹•Institutions (1)

University of Illinois at Chicago¹

12 Mar 2015

TL;DR: A novel protection scheme, called Entanglement, which can substantially strengthen the Design Withholding framework, and is distinguished from the previous works by not relying on the difficulty of finding the solution for some NP-Complete/NP-Hard problems, but rather, on the exponentially boosted number of problems that an attacker has to solve.

Abstract: Globalization of the semiconductor industry has raised serious concerns about trustworthy hardware. Particularly, an untrusted manufacturer can steal the information of a design (Reverse Engineering), and/or produce extra chips illegally (IC Piracy). Among many candidates that address these attacks, Design Withholding techniques work by replacing a part of the design with a reconfigurable block on chip, so that none of the manufactured chips will function properly until they are activated in a trusted facility, where the withheld function is restored back into the reconfigurable block on chip. However, most existing approaches are ad-hoc based, and are facing two major challenges: 1) susceptibility to a category of algorithmic attacks, from attackers in a strong position, such as a manufacturer; and 2) scaling up the defense level is checkmated by the explosion of hardware cost that has to be paid at the designer's side. In this paper, we propose a novel protection scheme, called Entanglement, which can substantially strengthen the Design Withholding framework: 1) the algorithmic attacks are prevented by forcing the attacker to solve a huge number of problems of high computational complexity; 2) the attack cost (in terms of computational complexity) is quantitatively controllable at the designer's end, with low hardware overhead: while the cost of attack can be increased exponentially, the hardware overhead imposed on the designer's side grows only linearly. The proposed work distinguishes itself from the previous works by not relying on the difficulty of finding the solution for some NP-Complete/NP-Hard problems, but rather, on the exponentially boosted number of such problems that an attacker has to solve, while carefully maintaining the growth of the hardware overhead to be scalable via Entanglement.

34 citations

Proceedings Article•DOI•

Minimizing MLC PCM write energy for free through profiling-based state remapping

Mengying Zhao¹, Yuan Xue², Chengmo Yang², Chun Jason Xue¹•Institutions (2)

City University of Hong Kong¹, University of Delaware²

12 Mar 2015

TL;DR: This paper first compares dynamic and static state remapping strategies in terms of their efficacy in reducing energy, and then proposes an effective and low-cost static state remapping algorithm.

Abstract: Phase change memory is becoming one of the most promising candidates to replace DRAM as main memory in the deep sub-micron regime. Multi-level cell (MLC) PCM outperforms single level cell (SLC) PCM in terms of storage capacity but requires an iterative programming-and-verifying scheme to program cells to different resistance levels. The energy consumed in programming different MLC states varies significantly, thus motivating a state remapping technique to minimize the overall write energy. In this paper, we first compare dynamic and static state remapping strategies in terms of their efficacy in reducing energy, and then propose an effective and low-cost static state remapping algorithm. The experimental studies show a 10.6% average (up to 16.9%) reduction in MLC PCM write energy, achieved with negligible hardware and performance overhead. Compared with the most related work, the proposed scheme saves more write energy on average, with near-zero performance, area and energy overhead.

30 citations
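
The idea of profiling-based static state remapping from the abstract above can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a fixed energy cost per written state, ignores the iterative program-and-verify behaviour, and uses made-up energy values and a made-up write profile. The names `STATE_ENERGY`, `build_static_remap`, and `write_energy` are hypothetical.

```python
from collections import Counter

# Hypothetical per-state write energies (arbitrary units) for a 2-bit MLC cell.
STATE_ENERGY = {"00": 1.0, "01": 4.0, "10": 6.0, "11": 2.0}

def build_static_remap(profile):
    """Map the most frequently written logical states to the cheapest physical states."""
    counts = Counter(profile)
    by_frequency = sorted(STATE_ENERGY, key=lambda s: (-counts[s], s))
    by_energy = sorted(STATE_ENERGY, key=lambda s: (STATE_ENERGY[s], s))
    return dict(zip(by_frequency, by_energy))

def write_energy(states, remap=None):
    """Total energy for writing the given logical states, optionally remapped."""
    return sum(STATE_ENERGY[remap[s] if remap else s] for s in states)

# Hypothetical profile: "10" is written most often but is the most expensive state.
profile = ["10"] * 5 + ["01"] * 3 + ["00"] * 1 + ["11"] * 1
remap = build_static_remap(profile)

print(remap)                         # {'10': '00', '01': '11', '00': '01', '11': '10'}
print(write_energy(profile))         # 45.0
print(write_energy(profile, remap))  # 21.0
```

The mapping is a bijection over the four states, so reads can apply the inverse mapping to recover the original data. The savings in this toy profile are exaggerated by construction and say nothing about the 10.6% average reported in the abstract.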

Proceedings Article•DOI•

Improving performance and lifetime of DRAM-PCM hybrid main memory through a proactive page allocation strategy

Hoda Aghaei Khouzani¹, Chengmo Yang¹, Jingtong Hu²•Institutions (2)

University of Delaware¹, Oklahoma State University–Stillwater²

12 Mar 2015

TL;DR: This work exploits the flexibility of mapping virtual pages to physical pages and proposes a page allocation algorithm that considers both segment information and conflict misses in DRAM to distribute heavily written pages across different DRAM sets, simultaneously improving performance and lifetime of DRAM-PCM hybrid main memory.

Abstract: Phase change memory (PCM), given its non-volatility and low static energy consumption, is a promising candidate to be used as main memory. However, due to its limited endurance and slow write performance, state-of-the-art solutions tend to construct a DRAM-PCM hybrid memory instead of using PCM exclusively. While existing optimizations to this hybrid architecture focus on tuning DRAM configurations to further reduce writes to PCM, we aim at developing a proactive solution. Specifically, we exploit the flexibility of mapping virtual pages to physical pages, and propose a page allocation algorithm that considers both segment information and conflict misses in DRAM to distribute heavily written pages across different DRAM sets. Trace-driven experiments confirm the effectiveness of the proposed technique in reducing both DRAM misses and PCM writes, thus simultaneously improving performance and lifetime of DRAM-PCM hybrid main memory.

Proceedings Article•DOI•

Hardware Trojan detection using exhaustive testing of k-bit subspaces

Nicole Lesperance¹, Shrikant Kulkarni¹, Kwang-Ting Cheng¹•Institutions (1)

University of California, Santa Barbara¹

12 Mar 2015

TL;DR: By aiming to exhaustively cover all possible k subsets of signals, this work guarantees detection of Trojans using fewer than k plaintext bits in the trigger.

Abstract: Post-silicon hardware Trojan detection is challenging because the attacker only needs to implement one of many possible design modifications, while the verification effort must guarantee the absence of all imaginable malicious circuitry. Existing test generation strategies for Trojan detection use controllability and observability metrics to limit the modifications targeted. However, for cryptographic hardware, the n plaintext bits are ideal for an attacker to use in Trojan triggering because the size of n prohibits exhaustive testing, and all n bits have identical controllability, making it impossible to bias testing using existing methods. Our detection method addresses this difficult case by observing that an attacker can realistically only afford to use a small subset, k, of all n possible signals for triggering. By aiming to exhaustively cover all possible k subsets of signals, we guarantee detection of Trojans using fewer than k plaintext bits in the trigger. We provide suggestions on how to determine k, and validate our approach using an AES design.
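
The coverage goal stated in the abstract above, exercising every value combination on every k-subset of the n input bits, can be checked with a short sketch. It is only a coverage checker for a given test set, not the paper's test generation method; the function name `covers_all_k_subsets` and the 4-bit example are assumptions chosen for illustration.

```python
from itertools import combinations

def covers_all_k_subsets(tests, n, k):
    """Return True if every k-subset of the n bit positions sees all 2**k patterns."""
    for positions in combinations(range(n), k):
        seen = {tuple(test[p] for p in positions) for test in tests}
        if len(seen) < 2 ** k:
            return False
    return True

# Five 4-bit test vectors that exercise all four value pairs on every pair of bits.
tests = [
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (1, 1, 1, 0),
]

print(covers_all_k_subsets(tests, n=4, k=2))  # True
print(covers_all_k_subsets(tests, n=4, k=3))  # False
```

Five vectors suffice for k = 2 here, compared with 16 for fully exhaustive testing of 4 bits. For k = 3 the same set fails, since five vectors cannot produce all eight patterns on any triple of bits.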

Proceedings Article•DOI•

A trace-driven approach for fast and accurate simulation of manycore architectures

Anastasiia Butko¹, Rafael Garibotti¹, Luciano Ost¹, Vianney Lapotre¹, Abdoulaye Gamatié¹, Gilles Sassatelli¹, Chris Adeniyi-Jones•Institutions (1)

University of Montpellier¹

19 Jan 2015

TL;DR: A novel trace-driven simulation approach is proposed for efficient exploration of manycore architectures, addressing the prohibitive simulation time of gem5 that limits explorations to configurations of tens of cores.

Abstract: The evolution of manycore systems, forecasted to feature hundreds of cores by the end of the decade, calls for efficient solutions for design space exploration and debugging. Among the relevant existing solutions, the well-known gem5 simulator provides a rich architecture description framework. However, these features come at the price of prohibitive simulation time that limits the scope of possible explorations to configurations made of tens of cores. To address this limitation, this paper proposes a novel trace-driven simulation approach for efficient exploration of manycore architectures.

Proceedings Article•DOI•

Multilane Racetrack caches: Improving efficiency through compression and independent shifting

Haifeng Xu¹, Yong Li², Rami Melhem¹, Alex K. Jones¹•Institutions (2)

University of Pittsburgh¹, VMware²

12 Mar 2015

TL;DR: This work proposes multilane Racetrack caches (MRC), an RM last level cache design utilizing lightweight compression combined with independent shifting, demonstrating that an isocapacity MRC cache replacement can outperform SRAM caches while providing energy improvement over STT-MRAM caches.

Abstract: Racetrack memory (RM), a spintronic domain-wall non-volatile memory, has recently received attention as a high-capacity replacement for various structures in the memory system, from secondary storage through caches. The main advantage of RM is improved density and, like other non-volatile memory structures, the static power of RM is dramatically lower than that of conventional CMOS memories. However, a major challenge of employing RM in universal memory components is the added access latency and dynamic energy consumption caused by shifts to align the data of interest with an access port. We propose multilane Racetrack caches (MRC), an RM last level cache design utilizing lightweight compression combined with independent shifting. MRC allows cache lines mapped to the same Racetrack structure to be accessed in parallel when compressed, mitigating potential shifting stalls in the RM cache. Our results demonstrate that, unlike previously proposed RM caches, an isocapacity MRC cache replacement can outperform SRAM caches while providing energy improvement over STT-MRAM caches. In particular, MRC improves performance by 5% and reduces energy by 19% compared to an isocapacity baseline RM cache, resulting in an energy delay product improvement of 25%.

Proceedings Article•DOI•

Reverse BDD-based synthesis for splitter-free optical circuits

Robert Wille¹, Oliver Keszocze¹, Clemens Hopfmuller¹, Rolf Drechsler¹•Institutions (1)

University of Bremen¹

12 Mar 2015

TL;DR: This work presents a synthesis approach based on Binary Decision Diagrams (BDDs) that overcomes obstacles and yields circuits that rely on a total of zero splitters - at the expense of a moderate increase in the number of optical gates.

Abstract: With the advancements in silicon photonics, optical devices have found applications, e.g., in ultra-high-speed and low-power interconnects as well as in functional computations realized on-chip. With the increasing complexity of the underlying functionality, the need for computer-aided design methods for this technology also rises. Motivated by that, initial work on the development of synthesis methods for optical circuits has been performed. But all approaches proposed thus far suffer, e.g., from unsatisfactory synthesis results or restricted scalability. In particular, splittings in the resulting circuits, which degrade the optical signals into hardly measurable fractions, prevent an efficient and scalable synthesis for optical circuits. In this work, we present a synthesis approach based on Binary Decision Diagrams (BDDs) that overcomes these obstacles. The approach yields circuits that rely on a total of zero splitters - at the expense of a moderate increase in the number of optical gates. Experiments confirm that, by this, an efficient and scalable synthesis scheme for optical circuits eventually becomes available.

Proceedings Article•DOI•

An efficient STT-RAM-based register file in GPU architectures

Xiaoxiao Liu¹, Mengjie Mao¹, Xiuyuan Bi¹, Hai Li¹, Yi Chen¹•Institutions (1)

University of Pittsburgh¹

12 Mar 2015

TL;DR: This work proposes a novel GPU RF design based on the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology, which has much smaller cell area and almost zero standby power due to its non-volatility.

Abstract: Modern GPGPUs employ a large register file (RF) to efficiently process heavily parallel threads in single instruction multiple thread (SIMT) fashion. The up-scaling of RF capacity, however, is greatly constrained by the large cell area and high leakage power consumption of the SRAM implementation. In this work, we propose a novel GPU RF design based on the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology. Compared to SRAM, MLC STT-RAM (or MLC-STT) has much smaller cell area and almost zero standby power due to its non-volatility. Moreover, by leveraging the asymmetric performance of the soft and the hard bits of an MLC-STT cell, we propose a remapping strategy to perform a flexible tradeoff between the access time and the capacity of the RF based on run-time access patterns. A novel rescheduling scheme is also developed to minimize the waiting time of the issued warps to access register banks. Experimental results over ISPASS2009 and CUDA benchmarks show that on average, our proposed MLC-STT RF can achieve 3.28% performance improvement, 9.48% energy reduction, and 38.9% energy efficiency enhancement compared to a conventional SRAM-based design.

Proceedings Article•DOI•

Checkpoint-aware instruction scheduling for nonvolatile processor with multiple functional units

Mimi Xie¹, Chen Pan¹, Jingtong Hu¹, Chengmo Yang², Yi Chen³•Institutions (3)

Oklahoma State University–Stillwater¹, University of Delaware², University of Pittsburgh³

12 Mar 2015

TL;DR: The checkpoint aware instruction scheduling (CAIS) algorithm is proposed to reduce the writes to NV registers to improve performance and reduce power consumption in embedded systems powered with harvested energy.

Abstract: Embedded systems powered by harvested energy experience frequent execution interruptions due to the unstable energy source. Nonvolatile (NV) register-based processors have been proposed to realize fast resume after power failure. The states in the volatile registers are checkpointed to NV registers. However, frequent checkpointing causes performance degradation and consumes excessive power. In this paper, we propose the checkpoint aware instruction scheduling (CAIS) algorithm to reduce the writes to NV registers. Experiments show that CAIS can improve performance and reduce power consumption.

Proceedings Article•DOI•

Distributed computing in IoT: System-on-a-chip for smart cameras as an example

Shao-Yi Chien¹, Wei-Kai Chan¹, Yu-Hsiang Tseng¹, Chia-Han Lee², V. Srinivasa Somayazulu³, Yen-Kuang Chen³•Institutions (3)

National Taiwan University¹, Center for Information Technology², Intel³

12 Mar 2015

TL;DR: This paper takes a video sensing network as an example to show the idea of distributed computing in IoT, and proposes a system-on-a-chip solution for distributed smart cameras with a coarse-grained reconfigurable image stream processing architecture that can accelerate various computer vision algorithms.

Abstract: There are four major components in application systems with internet-of-things (IoT): sensors, communications, computation and service, where large amounts of data are acquired for ultra-big data analysis to discover the context information and knowledge behind signals. To support such large-scale data sizes and computation tasks, it is not feasible to employ centralized solutions on cloud servers. Thanks to the advances of silicon technology, the cost of computation has become lower, and it is possible to distribute computation on every node in IoT. In this paper, we take a video sensing network as an example to show the idea of distributed computing in IoT. Existing related works are reviewed and the architecture of a system-on-a-chip solution for distributed smart cameras is proposed with a coarse-grained reconfigurable image stream processing architecture. It can accelerate various computer vision algorithms for distributed smart cameras in IoT.

Proceedings Article•DOI•

Early stage real-time SoC power estimation using RTL instrumentation

Jianlei Yang¹, Liwei Ma², Kang Zhao², Yici Cai¹, Tin-Fook Ngai²•Institutions (2)

Tsinghua University¹, Intel²

12 Mar 2015

TL;DR: A model abstraction approach for real-time power estimation in the manner of machine learning is proposed, and the singular value decomposition (SVD) technique is exploited to abstract the principal components of the relationship between the register toggling profile and the accurate power waveform.

Abstract: Early stage power estimation is critical for SoC architecture exploration and validation in modern VLSI design, but real-time, long time interval and accurate estimation is still challenging for system-level estimation and software/hardware tuning. This work proposes a model abstraction approach for real-time power estimation in the manner of machine learning. The singular value decomposition (SVD) technique is exploited to abstract the principal components of the relationship between the register toggling profile and the accurate power waveform. The abstracted power model is automatically instrumented into the RTL implementation and synthesized onto an FPGA platform for real-time power estimation by instrumenting the register toggling profile. The prototype implementation on three IP cores predicts the cycle-by-cycle power dissipation within 5% accuracy loss compared with a commercial power estimation tool.

Proceedings Article•DOI•

Unified non-volatile memory and NAND flash memory architecture in smartphones

Renhai Chen¹, Yi Wang², Jingtong Hu³, Duo Liu⁴, Zili Shao¹, Yong Guan⁵•Institutions (5)

Hong Kong Polytechnic University¹, Shenzhen University², Oklahoma State University–Stillwater³, Chongqing University⁴, Capital Normal University⁵

12 Mar 2015

TL;DR: The proposed unified NVM/flash architecture to improve the I/O performance is evaluated, and the experimental results show that the read and write performance is 2.45 times and 3.37 times better than that of the stock Android 4.2 system, respectively.

Abstract: I/O is becoming one of the major performance bottlenecks in NAND-flash-based smartphones. Novel NVMs (nonvolatile memories), such as PCM (Phase Change Memory) and STT-RAM (Spin-Transfer Torque Random Access Memory), can provide fast read/write operations. In this paper, we propose a unified NVM/flash architecture to improve the I/O performance. A transparent scheme, vFlash (Virtualized Flash), is also proposed to manage the unified architecture. Within vFlash, an inter-app technique is proposed to optimize the application performance by exploiting the historic locality of applications. Since vFlash is at the bottom of the I/O stack, the application features will be lost. Therefore, we also propose a cross-layer technique to transfer the application information from the application layer to the vFlash layer. The proposed scheme is evaluated on a real Android platform, and the experimental results show that the read and write performance of the proposed scheme is 2.45 times and 3.37 times better than that of the stock Android 4.2 system, respectively.

Proceedings Article•DOI•

Approximation-aware scheduling on heterogeneous multi-core architectures

Cheng Tan¹, Thannirmalai Somu Muthukaruppan¹, Tulika Mitra¹, Lei Ju²•Institutions (2)

National University of Singapore¹, Shandong University²

01 Jan 2015

TL;DR: An approximation-aware scheduling framework for soft real-time tasks on the heterogeneous multi-core architectures that considers multiple versions of a task obtained by introducing approximation in the computation to provide different levels of quality of service (QoS) versus performance tradeoffs.

Abstract: The high performance demand of embedded systems, along with a restrictive thermal design power (TDP) constraint, has led to the emergence of heterogeneous multi-core architectures, where cores with the same instruction-set architecture but different power-performance characteristics provide new opportunities for energy-efficient computing. Heterogeneity introduces challenges in scheduling the tasks to the appropriate cores and selecting the frequency assignment of each core. In this paper, we introduce an approximation-aware scheduling framework for soft real-time tasks on heterogeneous multi-core architectures. We consider multiple versions of a task obtained by introducing approximation in the computation to provide different levels of quality of service (QoS) versus performance tradeoffs. The additional choice of approximation allows us more flexibility in meeting the performance and TDP constraints while maximizing QoS per unit of energy.

Proceedings Article•DOI•

Controlled placement of standard cell memory arrays for high density and low power in 28nm FD-SOI

Adam Teman¹, Davide Rossi², Pascal Meinerzhagen¹, Luca Benini², Andreas Burg¹•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, University of Bologna²

12 Mar 2015

TL;DR: This paper presents a controlled placement design methodology for optimizing the physical implementation of SCM macros, leading to a structured, non-congested layout with close to 100% placement utilization and reduced wirelength as compared to unstructured layouts.

Abstract: Standard cell memories (SCMs) are becoming a popular alternative to SRAM IPs due to their design flexibility, ease of implementation, and robust operation at low supply voltages. Exclusively composed of standard cells, these memory arrays are implemented as part of the standard digital design flow. However, the synthesis and place and route (P&R) algorithms employed by this flow do not exploit the distinct and regular structure of an SCM array, leaving room for optimization. In this paper, we present a controlled placement design methodology for optimizing the physical implementation of SCM macros, leading to a structured, non-congested layout with close to 100% placement utilization and reduced wirelength as compared to unstructured layouts. Three sample SCM macro sizes were implemented according to the proposed methodology in a state-of-the-art 28nm FD-SOI technology, and compared with equivalent macros designed with the non-controlled, standard flow, achieving as much as a 22% reduction in area, a 57% reduction in switching power, and a 42% reduction in leakage power. In addition, these macros provide as much as an 88% reduction in switching power, as compared to equivalently sized, foundry provided SRAM IPs, while enabling robust functionality well below the minimum operating voltage of these IPs.

Proceedings Article•DOI•

Enhanced partitioned scheduling of Mixed-Criticality Systems on multicore platforms

Zaid Al-bayati¹, Qingling Zhao², Ahmed Youssef¹, Haibo Zeng³, Zonghua Gu²•Institutions (3)

McGill University¹, Zhejiang University², Virginia Tech³

12 Mar 2015

TL;DR: A novel mixed-criticality partitioning algorithm is presented, the Dual-Partitioned Mixed-Criticality (DPM) algorithm, that allows limited migration of LO-criticality tasks to enhance the efficiency of the partitioning while maintaining many of the advantages of partitioned systems.

Abstract: Mixed-Criticality Systems (MCS) have gained increasing interest in the past few years due to their industrial relevance. When mixed-criticality systems are implemented on multicore architectures, several challenges arise, such as the efficient partitioning of these systems. In this paper, we address this issue by presenting a novel mixed-criticality partitioning algorithm, the Dual-Partitioned Mixed-Criticality (DPM) algorithm, that allows limited migration of LO-criticality tasks to enhance the efficiency of the partitioning while maintaining many of the advantages of partitioned systems. Experimental results show that DPM consistently outperforms existing mixed-criticality partitioning algorithms; for example, at utilizations of 0.8 or higher, DPM is able to schedule 17% more systems.

Proceedings Article•DOI•

Layout decomposition co-optimization for hybrid e-beam and multiple patterning lithography

Yunfeng Yang¹, Wai-Shing Luk¹, Hai Zhou¹, Changhao Yan¹, Xuan Zeng¹, Dian Zhou¹•Institutions (1)

Fudan University¹

12 Mar 2015

TL;DR: This paper proposes a primal-dual (PD) method for solving the underlying minimum odd-cycle cover problem efficiently, and proposes a random-initialized local search method that iteratively applies the PD solver.

Abstract: As the feature size keeps scaling down and the circuit complexity increases rapidly, a more advanced hybrid lithography, which combines multiple patterning and e-beam lithography (EBL), is promising to further enhance the pattern resolution. In this paper, we formulate the layout decomposition problem for this hybrid lithography as a minimum vertex deletion K-partition problem, where K is the number of masks in multiple patterning. Stitch minimization and EBL throughput are considered uniformly by adding a virtual vertex between two feature vertices for each stitch candidate during the conflict graph construction phase. For K = 2, we propose a primal-dual method for solving the underlying minimum odd-cycle cover problem efficiently. In addition, a chain decomposition algorithm is employed for removing all “non-cyclable” edges. For K > 2, we propose a random-initialized local search method that iteratively applies the primal-dual solver. Experimental results show that, compared with a two-stage method, our proposed methods reduce the EBL usage by 64.4% with double patterning and 38.7% with triple patterning on average for the benchmarks.
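
For the K = 2 case described above, a conflict graph can be split across two masks without deleting any vertex exactly when it contains no odd cycle, that is, when it is bipartite. The following is a minimal sketch of that feasibility check using a breadth-first two-coloring; it is not the primal-dual solver from the paper, and the function name `two_color` and the toy graphs are assumptions made for illustration.

```python
from collections import deque

def two_color(num_vertices, edges):
    """Return a mask id (0 or 1) per vertex, or None if an odd cycle exists."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * num_vertices
    for start in range(num_vertices):
        if color[start] is not None:
            continue                      # already reached from an earlier component
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]   # conflicting features go on opposite masks
                    queue.append(v)
                elif color[v] == color[u]:
                    return None               # same mask on both ends: odd cycle
    return color

# Hypothetical conflict graphs: an even cycle and an odd cycle.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
triangle = [(0, 1), (1, 2), (2, 0)]
square_masks = two_color(4, square)        # a valid two-mask assignment
triangle_masks = two_color(3, triangle)    # None: some vertex must be deleted
```

When the check returns `None`, at least one vertex has to be removed from the graph (in the hybrid flow, handed to EBL) before a two-mask assignment exists; choosing a minimum set of such vertices is the odd-cycle cover problem the abstract refers to.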

Proceedings Article•DOI•

Negotiation-based task scheduling and storage control algorithm to minimize user's electric bills under dynamic prices

Ji Li¹, Yanzhi Wang¹, Xue Lin¹, Shahin Nazarian¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

12 Mar 2015

TL;DR: A negotiation-based iterative approach has been proposed for joint residential task scheduling and energy storage control that is inspired by the state-of-the-art Field-Programmable Gate Array (FPGA) routing algorithms, and achieves up to 64.22% in the total energy cost reduction compared with the baseline methods.

Abstract: Dynamic energy pricing is a promising technique in the Smart Grid to alleviate the mismatch between electricity generation and consumption. Energy consumers are incentivized to shape their power demands, or more specifically, schedule their electricity-consuming applications (tasks) more prudently to minimize their electric bills. This has become a particularly interesting problem with the availability of residential photovoltaic (PV) power generation facilities and controllable energy storage systems. This paper addresses the problem of joint task scheduling and energy storage control for energy consumers with PV and energy storage facilities, in order to minimize the electricity bill. A general type of dynamic pricing scenario is assumed where the energy price is both time-of-use and power-dependent, and various energy loss components are considered, including power dissipation in the power conversion circuitries as well as the rate capacity effect in the storage system. A negotiation-based iterative approach is proposed for joint residential task scheduling and energy storage control, inspired by state-of-the-art Field-Programmable Gate Array (FPGA) routing algorithms. In each iteration, it rips up and re-schedules all tasks under a fixed storage control scheme, and then derives a new charging/discharging scheme for the energy storage based on the latest task scheduling. The concept of congestion is introduced to dynamically adjust the schedule of each task based on the historical results as well as the current scheduling status, and a near-optimal storage control algorithm is effectively implemented by solving convex optimization problem(s) with polynomial time complexity. Experimental results demonstrate that the proposed algorithm achieves up to a 64.22% reduction in total energy cost compared with the baseline methods.
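
The rip-up-and-reschedule loop with congestion history described above can be sketched as follows. This is only an illustration of the negotiation idea under strong simplifying assumptions that are not from the paper: every task occupies a single time slot, prices depend on the slot only, each slot has a hypothetical power `capacity`, and the storage control step is omitted entirely. The function name `negotiate_schedule` and the cost formula are likewise made up for the example.

```python
def negotiate_schedule(tasks, prices, capacity, max_iters=50):
    """tasks: list of (power, earliest_slot, latest_slot), one slot per task.
    Returns the chosen slot for each task."""
    num_slots = len(prices)
    history = [0.0] * num_slots            # congestion accumulated over iterations
    schedule = []
    for iteration in range(max_iters):
        load = [0.0] * num_slots           # rip up all tasks
        schedule = []
        for power, earliest, latest in tasks:
            def cost(slot):
                # Present congestion: how far this task would push the slot over capacity.
                overuse = max(0.0, load[slot] + power - capacity)
                present = 1.0 + iteration * overuse   # penalty weight grows per iteration
                return (prices[slot] + history[slot]) * present
            best = min(range(earliest, latest + 1), key=cost)
            load[best] += power
            schedule.append(best)
        overused = [s for s in range(num_slots) if load[s] > capacity]
        if not overused:
            break                          # every slot is within capacity
        for s in overused:
            history[s] += load[s] - capacity   # remember slots that were contested
    return schedule

# Hypothetical data: four slots, three tasks that all prefer the cheapest slot.
prices = [1.0, 3.0, 2.0, 5.0]
capacity = 2.0
tasks = [(1.5, 0, 3), (1.5, 0, 3), (1.0, 0, 1)]
schedule = negotiate_schedule(tasks, prices, capacity)
```

In the first pass all three tasks pile into the cheapest slot; the history term then makes that slot look more expensive, so later passes spread the tasks out, which mirrors how negotiated-congestion routers resolve shared routing resources.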

Proceedings Article•DOI•

Satisfiability Don't Care condition based circuit fingerprinting techniques

Carson Dunbar¹, Gang Qu¹•Institutions (1)

University of Maryland, College Park¹

12 Mar 2015

TL;DR: This paper proposes a novel gate replacement approach to encode fingerprints based on the inherent Satisfiability Don't Care conditions in the circuit and develops a practical method to implement this SDC-based circuit fingerprint.

Abstract: Circuit fingerprints allow the authors of design intellectual properties (IPs) to trace each copy of their IPs by embedding features, known as digital fingerprints, which are unique to each device. In this paper, we propose a novel gate replacement approach to encode fingerprints based on the inherent Satisfiability Don't Care (SDC) conditions in the circuit. Moreover, existing fingerprinting schemes all require redesign of the circuit, which makes them prohibitively expensive for manufacturing. We develop a practical method to implement our SDC-based circuit fingerprint. First, we introduce flexibilities during the logic synthesis phase by replacing certain library cells with versatile multiplexers (MUXs). The MUX can be configured either as the original gate or as one of its replacements with identical functionality except under the SDC conditions. Then, at the post-silicon stage, we configure these MUXs to create distinct fingerprints. We consider standard benchmark circuits and demonstrate that even on these circuits with limited size, we can find sufficient locations to embed fingerprints. Simulation with TSMC 0.35µm technology shows non-trivial design overhead; however, such overhead will become negligible for large real-life circuits.

Proceedings Article•DOI•

Read circuits for resistive memory (ReRAM) and memristor-based nonvolatile Logics

Meng-Fan Chang¹, Albert Lee¹, Chien-Chen Lin¹, Mon-Shu Ho, Ping-Cheng Chen², Chia-Chen Kuo³, Ming-Pin Chen³, Pei-Ling Tseng³, Tzu-Kun Ku³, Chien-Fu Chen, Kai-Shin Li, Jia-Min Shieh•Institutions (3)

National Tsing Hua University¹, I-Shou University², Industrial Technology Research Institute³

01 Jan 2015

TL;DR: Design challenges in read circuits for high-speed, area-efficient, and low-voltage ReRAM and nvLogics are discussed, along with silicon-verified solutions for read schemes and sense amplifiers.

Abstract: The resistive memory device (memristor) is one of the candidates for energy-efficient nonvolatile memory and nonvolatile logics (nvLogics) in the applications of wearable devices, Internet of Things (IoT), cloud computing, and big-data processing. However, resistive RAM (ReRAM) and memristor-based nvLogics suffer from limited performance and low yield due to process variations in transistors and in memristor resistance. This presentation discusses the design challenges in read circuits for high-speed, area-efficient, and low-voltage ReRAM and nvLogics. Memristor-based nvLogics, such as nonvolatile SRAM (nvSRAM), nonvolatile flip-flops (nvFF), and nonvolatile TCAM (nvTCAM), are included in this presentation. Several silicon-verified solutions for read schemes and sense amplifiers are also discussed.

Proceedings Article•DOI•

Polynomial time algorithm for area and power efficient adder synthesis in high-performance designs

Subhendu Roy¹, Mihir Choudhury², Ruchir Puri², David Z. Pan¹•Institutions (2)

University of Texas at Austin¹, IBM²

11 Mar 2015

TL;DR: A polynomial-time algorithm is proposed to synthesize n-bit parallel prefix adders targeting the minimization of the size of the prefix graph with log₂ n logic levels and any arbitrary fan-out restriction to provide pareto-optimal solutions for delay vs. power trade-off.

Abstract: Adders are the most fundamental arithmetic units, and are often on the timing-critical paths of microprocessors. Among various adder configurations, parallel prefix structures provide high-performance adders for higher bit-widths. With aggressive technology scaling, the performance of a parallel prefix adder, in addition to its dependence on the logic level, is determined by wire-length and congestion, which can be mitigated by adjusting fan-out. This paper proposes a polynomial-time algorithm to synthesize n-bit parallel prefix adders targeting the minimization of the size of the prefix graph with log₂ n logic levels and any arbitrary fan-out restriction. The design space exploration by our algorithm provides a set of pareto-optimal solutions for the delay vs. power trade-off, and these pareto-optimal solutions can be used in high-performance designs instead of picking from a fixed library (Kogge-Stone, Sklansky, etc.). Experimental results demonstrate that our approach (i) outperforms the highly competitive, industry-standard Synopsys Design Compiler adder (128 bit) in performance (2%), area (25%) and power (13.3%) in a 32nm technology node, and (ii) improves performance/area over even 64 bit custom designed adders targeting a 22nm technology library and implemented in an industrial high-performance design.
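
To make the terms "prefix graph size" and "logic level" concrete, here is a minimal sketch of one of the fixed library structures named above, a Kogge-Stone adder, modelled bit by bit. It is not the synthesis algorithm proposed in the paper; the function name `kogge_stone_add` and the returned node and level counts are assumptions chosen for illustration.

```python
def kogge_stone_add(a, b, width):
    """Add two unsigned integers with a Kogge-Stone prefix network.
    Returns (sum modulo 2**width, carry_out, levels, prefix_nodes)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # per-bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # per-bit generate
    G, P = g[:], p[:]                  # group generate/propagate, refined per level
    levels = nodes = 0
    dist = 1
    while dist < width:
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            # Prefix operator: merge the group ending at i with the one ending at i - dist.
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
            nodes += 1
        G, P = new_G, new_P
        levels += 1
        dist *= 2
    carries = [0] + G[:-1]             # carry into bit i is the group generate of bits 0..i-1
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, G[-1], levels, nodes

result = kogge_stone_add(200, 100, 8)
```

For a power-of-two width n this structure always uses log₂ n levels, but its node count and wiring are high; structures such as Sklansky trade fewer nodes for higher fan-out, which is the kind of trade-off space the abstract describes exploring.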

Proceedings Article•DOI•

Toward large-scale access-transistor-free memristive crossbars

Amirali Ghofrani¹, Miguel Angel Lastras-Montano¹, Kwang-Ting Cheng¹•Institutions (1)

University of California, Santa Barbara¹

01 Jan 2015

TL;DR: This paper discusses challenges facing the scalability of access-transistor-free (ATF) memristive crossbars and describes some solutions addressing these challenges at multiple levels of design abstraction.

Abstract: Memristive crossbars have been shown to be excellent candidates for building an ultra-dense memory system because a per-cell access-transistor may no longer be necessary. However, the elimination of the access-transistor introduces several parasitic effects due to the existence of partially-selected devices during memory accesses, which could limit the scalability of access-transistor-free (ATF) memristive crossbars. In this paper, we discuss these challenges in detail and describe some solutions addressing these challenges at multiple levels of design abstraction.

Proceedings Article•DOI•

Cut mask optimization with wire planning in self-aligned multiple patterning full-chip routing

Shao-Yun Fang¹•Institutions (1)

National Taiwan University of Science and Technology¹

12 Mar 2015

TL;DR: This paper proposes the first work of cut mask optimization with wire planning in SAMP full-chip routing, and shows that the proposed routing algorithms are effective in generating routing results with optimized cut masks.

Abstract: Because of the delay of next-generation lithography technologies, self-aligned double patterning (SADP) has become one of the major lithography solutions for sub-20nm technology nodes. For advanced sub-10nm nodes, self-aligned quadruple patterning (SAQP) or even self-aligned octuple patterning (SAOP) will be required. Due to considerable design complexity and unmanageable process variation, a one-dimensional grid-based layout structure will be adopted, which can be achieved with a sophisticated self-aligned multiple patterning (SAMP) process with the use of a cut mask. However, cut masks for arbitrary layouts are hardly manufacturable because cut mask rules are limited by conventional 193nm lithography. To the best of our knowledge, existing SADP- and SAQP-aware detailed routers would fail to generate cut mask-friendly routing results for general SAMP. In this paper, we propose the first work of cut mask optimization with wire planning in SAMP full-chip routing. We first identify cut mask-aware routing rules to guide our router. Then, cut mask-aware wire planning, detailed routing, and post-layout modification techniques are proposed in the routing flow. Experimental results show that the proposed routing algorithms are effective in generating routing results with optimized cut masks.
