
Showing papers presented at "Asia and South Pacific Design Automation Conference in 2014"


Proceedings ArticleDOI
20 Feb 2014
TL;DR: In this article, the authors proposed an optimization method that considers qubit-to-qubit interactions in 2D grid architectures to alleviate the latency of quantum circuits mapped to these architectures.
Abstract: Regular, local-neighbor topologies of quantum architectures restrict interactions to adjacent qubits, which in turn increases the latency of quantum circuits mapped to these architectures. To alleviate this effect, optimization methods that consider qubit-to-qubit interactions in 2D grid architectures are presented in this paper. The proposed approaches benefit from Mixed Integer Programming (MIP) formulation for the qubit placement problem. Simulation results on various benchmarks show 27% on average reduction in communication overhead between qubits compared to best results of previous work.

122 citations
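The qubit placement entry above describes the objective only in words. The following is a minimal sketch of that kind of objective, not the authors' MIP formulation: it assumes the communication overhead of a two-qubit gate is the Manhattan distance between its qubits on the grid minus one, and it uses exhaustive search as a stand-in for a MIP solver, so it only works for tiny instances. The names `communication_cost` and `best_placement` are hypothetical.

```python
from itertools import permutations


def manhattan(a, b):
    """Manhattan distance between two grid cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(placement, interactions):
    """Sum over two-qubit gates of (distance - 1).

    Assumption: a gate on adjacent qubits costs nothing, and every extra
    grid step between the two qubits costs one unit of communication.
    """
    return sum(manhattan(placement[a], placement[b]) - 1 for a, b in interactions)


def best_placement(qubits, cells, interactions):
    """Brute-force search over all placements (stand-in for a MIP solver)."""
    best = None
    for perm in permutations(cells, len(qubits)):
        placement = dict(zip(qubits, perm))
        cost = communication_cost(placement, interactions)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


# Four qubits interacting in a ring, placed on a 2x2 grid.
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
naive = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
naive_cost = communication_cost(naive, ring)
optimal_cost, optimal_placement = best_placement("abcd", cells, ring)
```

In this toy case the naive placement puts two interacting pairs on diagonals, while the search finds a placement in which every interacting pair is adjacent.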


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This paper proposes to expand the application scope, error tolerance as well as the energy savings of inexact computing systems through neural network architectures, and demonstrates that the proposed inexact neural network accelerator could achieve 43.91%-62.49% savings in energy consumption.
Abstract: In recent years, inexact computing has been increasingly regarded as one of the most promising approaches for reducing energy consumption in many applications that can tolerate a degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy in return for significant resource savings - the energy consumed, the (critical path) delay and the (silicon) area being the resources - this approach has been limited to certain application domains. In this paper, we propose to expand the application scope, error tolerance as well as the energy savings of inexact computing systems through neural network architectures. Such neural networks are fast emerging as popular candidate accelerators for future heterogeneous multi-core platforms, and have flexible error tolerance limits owing to their ability to be trained. Our results based on simulated 65nm technology designs demonstrate that the proposed inexact neural network accelerator could achieve 43.91%-62.49% savings in energy consumption (with corresponding delay and area savings being 18.79% and 31.44% respectively) when compared to existing baseline neural network implementation, at the cost of an accuracy loss (quantified as the Mean Square Error (MSE) which increases from 0.14 to 0.20 on average).

114 citations


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This work modifies the original stochastic gradient descent algorithm by approximating calculations and designing an alternative computing method, and proposes a mixed-signal acceleration architecture for the modified training algorithm by equipping the original memristor-based neural network architecture with the copy crossbar technique, weight update units, sign calculation units and other assistant units.
Abstract: The artificial neural network (ANN) is among the most widely used methods in data processing applications. The memristor-based neural network further demonstrates a power-efficient hardware realization of the ANN. The training phase is the critical operation of a memristor-based neural network. However, the traditional training method for memristor-based neural networks is time consuming and energy inefficient: users have to first work out the parameters of the memristors through digital computing systems and then tune each memristor to the corresponding state. In this work, we introduce a mixed-signal training acceleration framework, which realizes the self-training of memristor-based neural networks. We first modify the original stochastic gradient descent algorithm by approximating calculations and designing an alternative computing method. We then propose a mixed-signal acceleration architecture for the modified training algorithm by equipping the original memristor-based neural network architecture with the copy crossbar technique, weight update units, sign calculation units and other assistant units. The experiment on the MNIST database demonstrates that the proposed mixed-signal acceleration is 3 orders of magnitude faster and 4 orders of magnitude more energy efficient than the CPU implementation counterpart, at the cost of a slight decrease in recognition accuracy (< 5%).

88 citations


Proceedings ArticleDOI
Jungmoon Kim, Minseob Shim, Junwon Jung, Heejun Kim, Chulwoo Kim
20 Feb 2014
TL;DR: A finely controlled zero-current switching (ZCS) scheme together with the accurate MPPT technique enhances the overall efficiency of the converter because of an optimal turn-on time generated by the proposed one-shot pulse generator.
Abstract: This paper presents a boost converter with the maximum power point tracking (MPPT) technique for thermoelectric energy harvesting (EH) applications. The technique realizes variation tolerance by adjusting the switching frequency fSW of the converter. A finely controlled zero-current switching (ZCS) scheme together with the accurate MPPT technique enhances the overall efficiency (η) of the converter because of an optimal turn-on time generated by the proposed one-shot pulse generator. Moreover, the ZCS technique can deal with low and high temperature differences applied to the thermoelectric generator. Experimentally, the converter implemented in a 0.35 µm BCDMOS process had a peak η of 72% at an input voltage VIN of 500 mV while supplying a 5.62 V output.

87 citations


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This work presents an exact approach that enables nearest neighbor-compliance by inserting a minimal number of SWAP gates, and demonstrates the applicability of the approach which enabled a comparison of results obtained by heuristic methods to the actual optimum.
Abstract: Motivated by its promising applications e.g. for database search or factorization, significant progress has been made in the development of automated design methods for quantum circuits. But in order to keep up with recent physical developments in this domain, new technological constraints have to be considered. Limited interaction distance between gate qubits is one of the most common of these constraints. This led to the development of several strategies aiming at making a given quantum circuit nearest neighbor-compliant by inserting SWAP gates into the existing structure. Usually these strategies are of heuristic nature. In this work, we present an exact approach that enables nearest neighbor-compliance by inserting a minimal number of SWAP gates. Experiments demonstrate the applicability of the approach which enabled a comparison of results obtained by heuristic methods to the actual optimum.

78 citations
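To make the idea of an exact minimum in the entry above concrete, here is a small sketch that finds the minimal number of SWAP gates by breadth-first search. It is not the authors' method, which the abstract does not detail. It assumes a linear (1D) nearest-neighbor architecture, a fixed initial qubit order 0..n-1, gates that must run in the given order, and no requirement to restore the original order at the end. The function name `min_swaps` is hypothetical, and the search is only practical for very small circuits.

```python
from collections import deque


def min_swaps(n_qubits, gates):
    """Minimal number of adjacent SWAPs so every gate acts on neighbors.

    gates is a list of (q1, q2) two-qubit gates executed in order.
    A search state is (qubit order on the line, index of next gate).
    """

    def advance(order, k):
        # Executing a gate whose qubits are already adjacent is free,
        # so skip past every such gate before counting more SWAPs.
        pos = {q: i for i, q in enumerate(order)}
        while k < len(gates) and abs(pos[gates[k][0]] - pos[gates[k][1]]) == 1:
            k += 1
        return k

    start_order = tuple(range(n_qubits))
    start = (start_order, advance(start_order, 0))
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (order, k), swaps = queue.popleft()
        if k == len(gates):
            return swaps  # BFS guarantees this is the minimum
        for i in range(n_qubits - 1):
            nxt = list(order)
            nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
            nxt = tuple(nxt)
            state = (nxt, advance(nxt, k))
            if state not in seen:
                seen.add(state)
                queue.append((state, swaps + 1))
    return None  # unreachable for valid input with n_qubits >= 2
```

For example, the circuit `[(0, 3), (0, 1)]` on four qubits needs only two SWAPs (swap positions 0-1 and 2-3), whereas moving a single qubit towards its partner would need three. This is the kind of gap between a heuristic and the optimum that an exact method exposes.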


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This paper studies the reconfiguration of services provided to low criticality tasks in reaction to the overruns of high criticality tasks, and derives tight analysis results under Earliest Deadline First (EDF) scheduling.
Abstract: Complex embedded systems are typically mixed-critical, where heterogeneous guarantees must be provided for functionalities of different criticalities. We study in this paper the reconfiguration of services provided to low criticality tasks in reaction to the overruns of high criticality tasks. We further investigate the quantification of the resetting time of the system services. For both service reconfiguration and resetting, we derive tight analysis results under Earliest Deadline First (EDF) scheduling.

62 citations
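As background for the EDF entry above, the classical single-criticality schedulability test is easy to state in code. This is a sketch of that textbook baseline, not of the paper's mixed-criticality analysis: for independent periodic tasks with deadlines equal to periods on one preemptive processor, EDF meets all deadlines exactly when total utilization does not exceed 1. The function names and the (execution time, period) task format are assumptions made for illustration.

```python
from fractions import Fraction


def edf_utilization(tasks):
    """Total utilization of tasks given as (execution_time, period) integers."""
    return sum(Fraction(c, t) for c, t in tasks)


def edf_schedulable(tasks):
    """Classical EDF test for implicit-deadline periodic tasks: U <= 1."""
    return edf_utilization(tasks) <= 1


light = [(1, 4), (2, 5), (1, 10)]   # U = 1/4 + 2/5 + 1/10 = 3/4
overloaded = [(2, 4), (3, 5)]       # U = 1/2 + 3/5 = 11/10
```

Exact fractions are used so that a task set with utilization exactly 1 is not rejected because of floating-point rounding.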


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This work formulates the co-synthesis problem of task and communication schedules as a Mixed Integer Programming (MIP) model taking into account a number of Ethernet-specific timing parameters such as interframe gap, precision and synchronization error.
Abstract: In this paper, we study time-triggered distributed systems where periodic application tasks are mapped onto different end stations (processing units) communicating over a switched Ethernet network. We address the problem of application level (i.e., both task- and network-level) schedule synthesis and optimization. In this context, most of the recent works [10], [11] either focus on communication schedule or consider a simplified task model. In this work, we formulate the co-synthesis problem of task and communication schedules as a Mixed Integer Programming (MIP) model taking into account a number of Ethernet-specific timing parameters such as interframe gap, precision and synchronization error. Our formulation is able to handle one or multiple timing objectives such as application response time, end-to-end delay and their combinations. We show the applicability of our formulation considering an industrial size case study using a number of different sets of objectives. Further, we show that our formulation scales to systems with reasonably large size.

60 citations


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This paper pioneers the maximum power point tracking (MPPT) of photovoltaic cells that directly supply power to a microprocessor without an energy storage element (a battery or a large-size capacitor) or power converters, yielding a huge reduction in cost, weight and volume, and an extended lifetime.
Abstract: This paper pioneers the maximum power point tracking (MPPT) of photovoltaic (PV) cells that directly supply power to a microprocessor without an energy storage element (a battery or a large-size capacitor) or power converters. The maximum power point tracking is conventionally performed by an MPPT charger that stores energy in the energy storage element, and a voltage regulator (typically a DC-DC converter) produces a proper voltage level for the microprocessor. The energy storage element is an energy buffer and makes it possible to perform MPPT of the PV cells and power management of the microprocessor independently. However, the energy storage element, MPPT charger and DC-DC converter cause a seriously limited lifetime (when a typical battery is adopted), significant energy loss (typically over 20%), increased weight/volume and high cost. The proposed method enables extremely fine-grain dynamic power management (DPM) every few hundred microseconds and performs the MPPT without using an MPPT charger, a DC-DC converter or an energy storage element. We achieve 84.5% energy harvesting efficiency using the proposed setup with a huge reduction in cost, weight and volume, and an extended lifetime, which is not even numerically comparable with conventional MPPT methods.

60 citations


Proceedings ArticleDOI
20 Feb 2014
TL;DR: An MBD method and its associated tool for the purpose of designing and validating various control algorithms for a residential microgrid is demonstrated and various use cases are presented to demonstrate how different levels of control algorithms may be developed, simulated, debugged, and analyzed by using the GridMat toolbox.
Abstract: Cyber-Physical Energy Systems (CPES) are an amalgamation of both power grid technology, and the intelligent communication and co-ordination between the supply and the demand side through distributed embedded computing. Through this combination, CPES are intended to deliver power efficiently, reliably, and economically. The design and development work needed to either implement a new power grid network or upgrade a traditional power grid to a CPES-compliant one is both challenging and time consuming due to the heterogeneous nature of the associated components/subsystems. The Model Based Design (MBD) methodology has been widely seen as a promising solution to address the associated design challenges of creating a CPES. In this paper, we demonstrate an MBD method and its associated tool for the purpose of designing and validating various control algorithms for a residential microgrid. Our presented co-simulation engine GridMat is a MATLAB/Simulink toolbox; its purpose is to co-simulate the power systems modeled in GridLAB-D as well as the control algorithms that are modeled in Simulink. We have presented various use cases to demonstrate how different levels of control algorithms may be developed, simulated, debugged, and analyzed by using our GridMat toolbox for a residential microgrid.

56 citations


Proceedings ArticleDOI
20 Feb 2014
TL;DR: This paper proposes a dynamic control policy that modulates the data center power consumption in response to ISO requests by leveraging server power capping techniques and various server power states, and demonstrates that using this policy, data centers can provide fast reserves in quantities that are substantial proportions of their average energy consumption.
Abstract: To accommodate the increasing presence of volatile and intermittent renewable energy sources in power generation, independent system operators (ISOs) offer opportunities for demand side regulation service (RS) so as to stabilize the grid load. These power market features allow the demand side to earn monetary credits by modulating its power consumption dynamically following an RS signal broadcast by the ISO. This paper studies the capacities and benefits of a major potential demand side, the data center, to provide RS. We propose a dynamic control policy that modulates the data center power consumption in response to ISO requests by leveraging server power capping techniques and various server power states. Results demonstrate that using our policy, data centers can provide fast reserves in quantities that are substantial proportions (around 50%) of their average energy consumption, with no major deterioration in quality of service (QoS). By doing so, data centers decrease their energy costs by around 50%, while providing the ISOs and society in general with cost effective demand side reserves that render massive renewable generation adoption affordable.

52 citations


Proceedings ArticleDOI
20 Feb 2014
TL;DR: Carbon Nanotube PUF is presented, the first PUF design that takes advantage of unique CNFET characteristics and achieves higher reliability against environmental variations and increased resistance against modeling attacks.
Abstract: Physically Unclonable Functions (PUFs) are used to provide identification, authentication and secret key generation based on unique and unpredictable physical characteristics. Carbon Nanotube Field Effect Transistors (CNFETs) were shown to have excellent electrical and unique physical characteristics and are promising candidates to replace silicon transistors in future Very Large Scale Integration (VLSI) designs. We present Carbon Nanotube PUF (CNPUF), the first PUF design that takes advantage of unique CNFET characteristics. CNPUF achieves higher reliability against environmental variations and increased resistance against modeling attacks. Furthermore, CNPUF reduces power and energy by 89.6% and 98%, respectively, in comparison to previous ultra-low power PUF designs. Additionally, CNPUF allows a power-security tradeoff.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: For the first time, an optimal sample preparation algorithm based on a minimum-cost maximum-flow model is presented that can obtain both the optimal cost of sample and buffer usage and the waste amount even for multiple-target concentrations.
Abstract: Sample preparation, which is a front-end process to produce droplets of the desired target concentrations from input reagents, plays a pivotal role in every assay, laboratory, and application in biomedical engineering and life science. The consumption of sample/buffer/waste is usually used to evaluate the effectiveness of a sample preparation process. In this paper, for the first time, we present an optimal sample preparation algorithm based on a minimum-cost maximum-flow model. By using the proposed model, we can obtain both the optimal cost of sample and buffer usage and the waste amount even for multiple-target concentrations. Experiments demonstrate that we can consistently achieve much better results not only in the consumption of sample and buffer but also in the waste amount when compared with all state-of-the-art previous approaches.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: This work provides a very detailed analysis of SOT-MRAM at both circuit- and architecture-level, and presents a detailed evaluation of performance and energy related parameters and compares the novel SOT-MRAM with several other memory technologies.
Abstract: Magnetic Random Access Memory (MRAM) is a very promising emerging memory technology because of its various advantages such as non-volatility, high density and scalability. In particular, Spin Orbit Torque (SOT) MRAM is gaining interest as it comes along with all the benefits of its predecessor Spin Transfer Torque (STT) MRAM, but is supposed to eliminate some of its shortcomings. Especially, the split of read and write paths in SOT-MRAM promises faster access times and lower energy consumption compared to STT-MRAM. In this work, we provide a very detailed analysis of SOT-MRAM at both circuit- and architecture-level. We present a detailed evaluation of performance and energy related parameters and compare the novel SOT-MRAM with several other memory technologies. Our architecture-level analysis shows that with a hybrid combination of SRAM for the L1-cache and SOT-MRAM for the L2-cache the energy consumption can be reduced by 63% on average while the performance can be increased by 1%. In addition, the memory area is 43% lower compared to an SRAM-only configuration.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: A runtime technique to dynamically partition GPU resources between concurrently running applications - at least one of which has a quality-of-service requirement - can satisfy a 100% QoS requirement while also achieving either a 7W power consumption reduction or a 17.57% performance improvement for co-executing best-effort applications.
Abstract: General-purpose computing on GPUs (GPGPU computing) is becoming widely adopted; however, some GPGPU applications fail to fully utilize GPU resources. In these cases, spatial multitasking better exploits the parallelism offered by GPUs by partitioning the GPU resources among simultaneously-running applications. When one or more such applications have quality-of-service (QoS) requirements, enough resources must be allocated for those applications to satisfy their requirements. Remaining resources can be either disabled to reduce power consumption or used to accelerate other applications. However, we observe that the amount of resources for a QoS application to satisfy its performance requirement is dependent in part upon the co-executing applications. In this paper, we propose a runtime technique to dynamically partition GPU resources between concurrently running applications - at least one of which has a QoS requirement. We demonstrate that the proposed technique can satisfy a 100% QoS requirement while also achieving either a 7W power consumption reduction or a 17.57% performance improvement for co-executing best-effort applications.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: An automatic synthesis approach for quantum circuits that implement Clifford Group operations that exploits specific properties of the unitary transformation matrices that are associated with quantum operations is proposed.
Abstract: Quantum circuits established themselves as a promising emerging technology and, hence, attracted considerable attention in the domain of computer-aided design. As a result, many approaches for synthesis of corresponding netlists have been proposed in the last decade. However, as the design of quantum circuits faces serious obstacles caused by phenomena such as superposition, entanglement, and phase shifts, automatic synthesis still represents a significant challenge. In this paper, we propose an automatic synthesis approach for quantum circuits that implement Clifford Group operations. These circuits are essential for many quantum applications and cover core aspects of quantum functionality. The proposed approach exploits specific properties of the unitary transformation matrices that are associated with quantum operations. Furthermore, Quantum Multiple-Valued Decision Diagrams (QMDDs) are employed for an efficient representation of these matrices. Experimental results confirm that this enables a compact realization of the respective quantum functionality.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: Up to 35% drop in the network costs can be gained by adjusting the level of contiguity compared to non-contiguous cases, while the achieved throughput is kept constant in CASqA.
Abstract: In this paper, we propose a run-time mapping algorithm, CASqA, for networked many-core systems. In this algorithm, the level of contiguousness of the allocated processors (α) can be adjusted in a fine-grained fashion. A strictly contiguous allocation (α = 0) decreases the latency and power dissipation of the network and improves the applications' execution time. However, it limits the achievable throughput and increases the turnaround time of the applications. As a result, recent works consider non-contiguous allocation (α = 1) to improve the throughput, traded off against the applications' execution time and network metrics. In contrast, our experiments show that a higher throughput (by 3%) with improved network performance can be achieved when using intermediate α values. More precisely, up to 35% drop in the network costs can be gained by adjusting the level of contiguity compared to non-contiguous cases, while the achieved throughput is kept constant. Moreover, CASqA provides at least 32% energy saving in the network compared to other works.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: This paper proposes a hybrid L1 cache architecture that incorporates both SRAM and STT-RAM and the key novelty of the proposal is the exploitation of the MESI cache coherence protocol to perform dynamic block reallocation between different cache partitions.
Abstract: STT-RAM is an emerging NVRAM technology that promises high density, low energy and a comparable access speed to conventional SRAM. This paper proposes a hybrid L1 cache architecture that incorporates both SRAM and STT-RAM. The key novelty of the proposal is the exploitation of the MESI cache coherence protocol to perform dynamic block reallocation between different cache partitions. Compared to the pure SRAM-based design, our hybrid scheme achieves a 38% energy saving with a mere 0.8% IPC degradation while extending the lifespan of the STT-RAM partition at the same time.

Proceedings ArticleDOI
Jia Zhu, Zhenyu Liu, Dongsheng Wang, Qingrui Han, Yang Song
01 Jan 2014
TL;DR: To alleviate the burden of the intra encoder, the RD-cost is estimated from the source image textures, and two promising CU/PU mode candidates are dynamically selected to execute exhaustive RDO processing.
Abstract: HEVC doubles the coding efficiency with more than 4x coding complexity as compared to H.264/AVC. To alleviate the burden of the intra encoder, we estimate the RD-cost from the source image textures, and dynamically select two promising CU/PU mode candidates to execute exhaustive RDO processing. As integrated in our hardwired encoder, an average of 61.7% of the computational complexity was saved with a 4.53% rate increase. With TSMC 90nm technology, the real-time encoder for HDTV 1080p at 44fps is implemented with 2269k gates at a 357MHz operating frequency.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: The proposed framework analyzes the link dependencies and then determines the ordering of queuing analysis for performance modeling, and can be used to analyze various traffic scenarios for NoC platforms with arbitrary buffer and packet lengths.
Abstract: In this work, we propose a new, accurate, and comprehensive analytical model for Network-on-Chip (NoC) performance analysis. Given the application communication graph, the NoC architecture, and the routing algorithm, the proposed framework analyzes the link dependencies and then determines the ordering of queuing analysis for performance modeling. The channel waiting times in the links are estimated using a generalized G/G/1/K queuing model, which can tackle bursty traffic and dependent arrival times with general service time distributions. The proposed model is general and can be used to analyze various traffic scenarios for NoC platforms with arbitrary buffer and packet lengths. Experimental results on both synthetic and real applications demonstrate the accuracy and scalability of the newly proposed model.
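To give a feel for how a general-arrival, general-service queue yields a waiting-time estimate, the sketch below uses Kingman's approximation for the G/G/1 queue. Note the substitution: this is a simpler, infinite-buffer technique swapped in for illustration, not the finite-buffer G/G/1/K model the abstract describes, and it ignores link dependencies. The function name `kingman_wait` and its parameters are hypothetical.

```python
def kingman_wait(arrival_rate, mean_service, ca2, cs2):
    """Kingman's approximation of the mean queueing delay in a G/G/1 queue.

    arrival_rate: mean packet arrivals per unit time
    mean_service: mean service time per packet
    ca2, cs2:     squared coefficients of variation of the inter-arrival
                  and service times (1 for exponential, 0 for deterministic)
    """
    rho = arrival_rate * mean_service  # channel utilization
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return (rho / (1 - rho)) * ((ca2 + cs2) / 2) * mean_service


# Poisson arrivals with exponential service (M/M/1) at 50% utilization.
mm1_wait = kingman_wait(0.5, 1.0, ca2=1.0, cs2=1.0)
# Same load with deterministic service times (M/D/1): half the waiting time.
md1_wait = kingman_wait(0.5, 1.0, ca2=1.0, cs2=0.0)
```

Burstier traffic corresponds to a larger `ca2`, which scales the estimated waiting time up proportionally.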

Proceedings ArticleDOI
01 Jan 2014
TL;DR: A macro cell design composed of multiple parallel-connected memristors can be successfully used in implementing the weight storage unit and the stochastic neuron - the two fundamental components in neural networks (NNs), providing a feasible solution in memristor-based hardware implementation.
Abstract: The memristor, the fourth basic circuit element, has shown great potential in neuromorphic circuit design for its unique synapse-like feature. However, though a continuous resistance state of the memristor has been expected, obtaining and maintaining an arbitrary intermediate state cannot be well controlled in current memristive systems. In addition, stochastic switching behaviors have been widely observed. To facilitate the investigation of memristor-based hardware implementation, we built a stochastic behavior model of TiO2 memristive devices based on real experimental results. By leveraging the stochastic behavior of memristors, a macro cell design composed of multiple parallel-connected memristors can be successfully used in implementing the weight storage unit and the stochastic neuron - the two fundamental components in neural networks (NNs), providing a feasible solution in memristor-based hardware implementation.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: This work formally verifies the throughput of an AMS signaling system - modelled in SPICE using 22nm BSIM4 transistors, Booleanized with high accuracy using ABCD-NL, and property-checked using ABC.
Abstract: We present ABCD-NL, a technique that approximates non-linear analog circuits using purely Boolean models, to high accuracy. Given an analog/mixed-signal (AMS) system (e.g., a SPICE netlist), ABCD-NL produces a Boolean circuit representation (e.g., an And Inverter Graph, Finite State Machine, or Binary Decision Diagram) that captures the I/O behaviour of the given system, to near SPICE-level accuracy, without making any a priori simplifications. The Boolean models produced by ABCD-NL can be used for high-speed simulation and formal verification of AMS designs, by leveraging existing tools developed for Boolean/hybrid systems analysis (e.g., ABC [1]). We apply ABCD-NL to a number of SPICE-level AMS circuits, including data converters, charge pumps, comparators, non-linear signaling/communications sub-systems, etc. Also, we formally verify the throughput of an AMS signaling system - modelled in SPICE using 22nm BSIM4 transistors, Booleanized with high accuracy using ABCD-NL, and property-checked using ABC.

Proceedings ArticleDOI
Jie Guo, Zhijie Chen, Danghui Wang, Zili Shao, Yi Chen
20 Feb 2014
TL;DR: A Data Pattern Aware (DPA) error protection technique is proposed to extend the lifespan of NAND flash based storage systems (NFSS) by up to 4×, offering a complementary solution to other lifetime enhancement techniques like wear-leveling.
Abstract: Recent research reveals that the bit error rate of a NAND flash cell is highly dependent on the stored data patterns. In this work, we propose the Data Pattern Aware (DPA) error protection technique to extend the lifespan of NAND flash based storage systems (NFSS). DPA manipulates the ratio of 1's and 0's in the stored data to minimize the occurrence of data patterns which are susceptible to bit error noise. Consequently, the NAND flash cell bit error rate is reduced, leading to system endurance extension. Our simulation results show that, with marginal hardware and power overhead, the DPA scheme can increase the NFSS lifetime by up to 4×, offering a complementary solution to other lifetime enhancement techniques like wear-leveling.
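One simple way to manipulate the ratio of 1's and 0's in stored data is inversion coding with a flag bit. The sketch below illustrates that general idea only; the abstract does not say how DPA encodes data or which patterns it avoids. It assumes, purely for illustration, that words with at least as many 1's as 0's are the preferred pattern, and the names `encode_word` and `decode_word` are hypothetical.

```python
def encode_word(word, width=8):
    """Invert the word if it has fewer 1's than 0's.

    Returns (stored_word, flag); flag is 1 when the word was inverted.
    Assumption: 1-heavy words are the less error-prone pattern.
    """
    mask = (1 << width) - 1
    word &= mask
    ones = bin(word).count("1")
    if ones * 2 < width:
        return (~word) & mask, 1
    return word, 0


def decode_word(stored, flag, width=8):
    """Undo encode_word using the stored flag bit."""
    mask = (1 << width) - 1
    return (~stored) & mask if flag else stored & mask


stored, flag = encode_word(0b00000001)
```

The cost of this kind of scheme is one extra flag bit per word, which is the sort of marginal overhead such techniques trade for a more favorable stored bit ratio.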

Proceedings ArticleDOI
20 Feb 2014
TL;DR: A novel shield mechanism utilizing the micro-channel, a technique conventionally used for heat removal, to reduce the substrate loss and increase the quality factor and the inductance of the TSV inductor is proposed.
Abstract: Through-silicon-vias (TSVs) can potentially be used to implement inductors in three-dimensional (3D) integrated systems for minimal footprint and large inductance. However, different from conventional 2D spiral inductors, TSV inductors are fully buried in the lossy substrate, thus suffering from low quality factor. In this paper, we propose a novel shield mechanism utilizing the micro-channel, a technique conventionally used for heat removal, to reduce the substrate loss. This technique increases the quality factor and the inductance of the TSV inductor by up to 21x and 17x respectively. It enables us to implement TSV inductors of up to 38x smaller area and 33% higher quality factor, compared with spiral inductors of the same inductance. To the best of the authors' knowledge, this is the first proposal on improving quality factor of TSV inductors. We hope our study shall point out a new and exciting research direction for 3D IC designers.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: To enable the capacitance extraction of chip-scale large VLSI layout using the floating random walk (FRW) algorithm, two techniques are proposed, including a virtual Gaussian surface sampling technique that makes efficient random sampling on the Gaussian surface for complex nets with vias, and optimizes the sampling scheme to reduce the time of random walk.
Abstract: To enable the capacitance extraction of chip-scale large VLSI layout using the floating random walk (FRW) algorithm, two techniques are proposed. The first one is a virtual Gaussian surface sampling technique. It makes efficient random sampling on the Gaussian surface for complex nets with vias, and optimizes the sampling scheme to reduce the time of random walk. The other one is a parallelized, improved construction approach for an Octree based space management structure. It can be over 5000X faster than the existing approach and provides the same convenience to the FRW procedure. Numerical experiments on large cases with up to half a million conductors validate the proposed techniques, and demonstrate a fast FRW solver for the chip-scale extraction task.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: It is shown that all operations involved in machine learning on neural network can be mapped to a logic-in-memory architecture by non-volatile domain-wall nanowire, called DW-NN.
Abstract: Image processing in conventional logic-memory I/O-integrated systems will incur significant communication congestion at memory I/Os for excessively large image data at exa-scale. This paper explores an in-memory machine learning on neural network architecture by utilizing the newly introduced domain-wall nanowire, called DW-NN. We show that all operations involved in machine learning on neural network can be mapped to a logic-in-memory architecture by non-volatile domain-wall nanowire. Domain-wall nanowire based logic is customized for machine learning within image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that system throughput in DW-NN is improved by 11.6x and the energy efficiency is improved by 92x when compared to a conventional image processing system.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: The status and prospects of spin-based integrated circuits under intense investigation are overviewed and particularly their merits and challenges for practical applications are addressed.
Abstract: Conventional CMOS integrated circuits suffer from severe power and scalability challenges as technology scales into ultra-deep-micron nodes. Alternative approaches beyond charge-only based circuits are therefore needed. In particular, spin-based devices and integrated circuits show promising merits for overcoming these issues by adding the spin degree of freedom of electrons to electronic circuits. Spintronics has now become a hot topic in both academia and industry. This paper overviews the status and prospects of spin-based integrated circuits under intense investigation and addresses in particular their merits and challenges for practical applications.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: An array-level model is developed which is capable of analyzing the read/write noise margin of a 3D-VRAM array in the presence of the sneak leakage current and voltage drop and a system-level design tool is built that is able to explore the design space with specified constraints and find the optimal design points with different targets.
Abstract: Resistive Random Access Memory (ReRAM) is one of the most promising emerging non-volatile memory (NVM) candidates due to its fast read/write speed, excellent scalability and low-power operation. The recently proposed 3D vertical cross-point ReRAM (3D-VRAM) architecture has attracted a lot of attention because it offers a cost-competitive solution as a NAND Flash replacement. In this work, we first develop an array-level model which includes the geometries and properties of all the components in the 3D structure. The model is capable of analyzing the read/write noise margin of a 3D-VRAM array in the presence of the sneak leakage current and voltage drop. Then we build a system-level design tool that is able to explore the design space with specified constraints and find the optimal design points for different targets. We also study the impact of different design parameters on the array size, bit density, and overall cost-per-bit. Compared to the state-of-the-art 3D horizontal ReRAM (3D-HRAM), the 3D-VRAM shows a great cost advantage when stacking more than 16 layers.

Proceedings ArticleDOI
20 Feb 2014
TL;DR: A comprehensive analysis of the computational complexity, power consumption, temperature, and memory access behavior for the next-generation High Efficiency Video Coding (HEVC) standard is provided.
Abstract: This paper provides a comprehensive analysis of the computational complexity, power consumption, temperature, and memory access behavior for the next-generation High Efficiency Video Coding (HEVC) standard. We highlight the associated design challenges and present several low-power algorithmic and architectural techniques for developing power-efficient HEVC-based multimedia systems. We explore the interplay between the algorithms and architectures to provide high power efficiency while leveraging application-specific knowledge and video content characteristics.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: A novel ω-LAT coding scheme is proposed to reduce the capacitive crosstalk and minimize the power consumption overhead in the TSV array. Combined with transition signaling, the LAT coding scheme restricts the number of transitions in every transmission cycle to minimize the crosstalk and power consumption.
Abstract: 3D integration is one of the promising solutions for overcoming the interconnect bottleneck, using vertical through-silicon vias (TSVs) as interconnects. This paper investigates the crosstalk in 3D IC designs, especially the capacitive crosstalk in TSV interconnects. We propose a novel ω-LAT coding scheme to reduce the capacitive crosstalk and minimize the power consumption overhead in the TSV array. Combined with transition signaling, the LAT coding scheme restricts the number of transitions in every transmission cycle to minimize the crosstalk and power consumption. Compared to other 3D crosstalk minimization coding schemes, the proposed coding can provide the same delay reduction with more affordable overhead. The performance and power analysis show that when ω is 4, the proposed LAT coding scheme can achieve 38% interconnect crosstalk delay reduction compared to data transmission without coding. By reducing the value of ω, further reduction can be achieved.
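
The transition-signaling idea mentioned in this abstract can be illustrated with a minimal sketch. It shows only generic transition signaling (a 1 bit in the codeword toggles its wire, a 0 bit leaves it unchanged), not the paper's ω-LAT code itself. The function names, the all-zero initial bus state, and the example codewords are assumptions made for illustration. Under this scheme the number of wires that switch in a cycle equals the number of 1 bits in the codeword, so a code that bounds codeword weight also bounds transitions per cycle.

```python
def ts_encode(codewords, width):
    """Transition signaling: each 1 bit in a codeword toggles its wire.

    Assumes the bus starts in the all-zero state (illustrative choice).
    Returns the sequence of bus states actually driven onto the wires.
    """
    mask = (1 << width) - 1
    state = 0
    bus_states = []
    for word in codewords:
        state ^= word & mask  # toggle exactly the wires whose bit is 1
        bus_states.append(state)
    return bus_states


def ts_decode(bus_states, width):
    """Recover codewords by XOR-ing consecutive bus states."""
    mask = (1 << width) - 1
    previous = 0  # same assumed all-zero initial state as the encoder
    codewords = []
    for state in bus_states:
        codewords.append((state ^ previous) & mask)
        previous = state
    return codewords


def transitions_per_cycle(bus_states):
    """Count how many wires switch in each transmission cycle."""
    previous = 0
    counts = []
    for state in bus_states:
        counts.append(bin(state ^ previous).count("1"))
        previous = state
    return counts


if __name__ == "__main__":
    words = [0b0011, 0b0101, 0b1000]  # hypothetical 4-bit codewords
    bus = ts_encode(words, width=4)
    print([format(s, "04b") for s in bus])
    print(transitions_per_cycle(bus))
    print(ts_decode(bus, width=4) == words)
```

In this sketch the per-cycle transition count always matches the Hamming weight of the transmitted codeword, which is why limiting the weight of the codewords limits switching activity on the bus.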

Proceedings ArticleDOI
20 Feb 2014
TL;DR: A new layout decomposition framework for self-aligned double patterning and complementary EBL is presented, which considers overlay minimization and EBL throughput optimization simultaneously and performs conflict elimination by merge-and-cut technique.
Abstract: Advanced lithography techniques enable higher pattern resolution; however, techniques such as extreme ultraviolet lithography and e-beam lithography (EBL) are not yet ready for high volume production. Recently, complementary lithography has become promising; it allows two different lithography processes to work together to achieve high quality layout patterns without greatly increasing manufacturing cost. In this paper, we present a new layout decomposition framework for self-aligned double patterning and complementary EBL, which considers overlay minimization and EBL throughput optimization simultaneously. We perform conflict elimination with a merge-and-cut technique and formulate it as a matching-based problem. The results show that our approach is fast and effective: all conflicts are solved with minimal overlay error and e-beam utilization.