Author

Jing Li

Bio: Jing Li is an academic researcher from the University of Pennsylvania. The author has contributed to research in the topics of phase-change memory and CMOS, has an h-index of 25, and has co-authored 111 publications receiving 1,823 citations. Previous affiliations of Jing Li include IBM and the University of Wisconsin-Madison.


Papers
Proceedings ArticleDOI
22 Feb 2017
TL;DR: An analytical performance model is proposed and used for an in-depth analysis of the resource requirements of CNN classifier kernels against the resources available on modern FPGAs; a new kernel design is then proposed that effectively addresses the on-chip memory bandwidth limitation and provides an optimal balance between computation, on-chip memory access, and off-chip memory access.
Abstract: OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the resources available on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address this bandwidth limitation and to provide an optimal balance between computation, on-chip memory access, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating-point performance at a 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
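
As a rough illustration of the kind of analytical performance model described, the sketch below computes a roofline-style bound for one convolutional layer from an operation count and a byte count. All hardware parameters (DSP count, clock, bandwidth) and the layer shape are placeholder assumptions, not figures from the paper, and the paper's model additionally accounts for on-chip memory bandwidth, which it identifies as the actual bottleneck.

```python
# Minimal roofline-style sketch for one CNN layer on an OpenCL FPGA kernel.
# All parameters (DSP count, clock, bandwidth) are illustrative assumptions.

def layer_ops(out_c, out_h, out_w, in_c, k):
    """Operations of one convolutional layer (each MAC counted as 2 ops)."""
    return 2 * out_c * out_h * out_w * in_c * k * k

def attainable_gops(ops, bytes_moved, peak_gops, bw_gbs):
    """Attainable throughput = min(compute roof, bandwidth roof)."""
    intensity = ops / bytes_moved                  # ops per byte
    return min(peak_gops, intensity * bw_gbs)      # GB/s * ops/byte = Gop/s

# Example: a VGG-like 3x3 layer, 16-bit fixed-point data (2 bytes per word)
ops = layer_ops(out_c=256, out_h=56, out_w=56, in_c=256, k=3)
weight_bytes = 256 * 256 * 3 * 3 * 2
fmap_bytes = (256 * 56 * 56 + 256 * 56 * 56) * 2
peak = 1500 * 2 * 0.385                            # assumed DSPs * 2 ops * GHz
print(attainable_gops(ops, weight_bytes + fmap_bytes, peak, bw_gbs=34.1))
```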

199 citations

Journal ArticleDOI
TL;DR: This work demonstrates the first fabricated 1 Mb nonvolatile TCAM using 2-transistor/2-resistive-storage (2T-2R) cells to achieve >10× smaller cell size than SRAM-based TCAMs at the same technology node.
Abstract: This work demonstrates the first fabricated 1 Mb nonvolatile TCAM using 2-transistor/2-resistive-storage (2T-2R) cells to achieve a >10× smaller cell size than SRAM-based TCAMs at the same technology node. The test chip was designed and fabricated in IBM 90 nm CMOS technology and mushroom phase-change memory (PCM) technology. The primary challenge for enabling reliable array operation with such an aggressive cell is presented, namely, severely degraded sensing margin due to the significantly lower ON/OFF ratio of resistive memories (~10² for PCM) than that of traditional MOSFETs (>10⁵). To address this challenge, two enabling techniques were developed and implemented in hardware: 1) two-bit encoding and 2) a clocked self-referenced sensing scheme (CSRSS). In addition, the two-bit encoding can also improve algorithmic mapping by effectively compressing TCAM entries. The 1 Mb chip demonstrates reliable low-voltage search operation (VDDmin ~750 mV) and a match delay of 1.9 ns under nominal operating conditions.
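
For intuition about how a 2T-2R cell stores a ternary value and how a mismatch discharges the match line, here is a minimal behavioral sketch. It uses the generic complementary-resistance encoding and assumed LRS/HRS values; it does not reproduce the paper's specific two-bit encoding or the CSRSS circuit.

```python
# Behavioral sketch of 2T-2R TCAM storage and match-line evaluation.
# Resistance values and threshold are illustrative assumptions.

LRS, HRS = 1e4, 1e6   # assumed ON/OFF resistances (ohms), ~10^2 ratio as for PCM

ENCODE = {'0': (LRS, HRS), '1': (HRS, LRS), 'X': (HRS, HRS)}  # (R_true, R_false)

def cell_pulldown(stored, key_bit):
    """Conductance discharging the match line for one cell (small = no pull-down)."""
    r_true, r_false = ENCODE[stored]
    # the search key enables the transistor on one branch; a mismatch selects the LRS branch
    r_selected = r_true if key_bit == '1' else r_false
    return 1.0 / r_selected

def word_matches(stored_word, key, threshold=2e-5):
    """A word matches if the total pull-down conductance stays below a sensing threshold."""
    g_total = sum(cell_pulldown(s, k) for s, k in zip(stored_word, key))
    return g_total < threshold

print(word_matches("10X1", "1011"))   # True: 'X' matches either key bit
print(word_matches("10X1", "1111"))   # False: the second bit mismatches
```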

146 citations

Journal ArticleDOI
TL;DR: This paper analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations and developed an efficient design paradigm, from a circuit and/or architecture perspective, to improve robustness and integration density.
Abstract: Spin-torque transfer magnetic RAM (STT MRAM) is a promising candidate for future embedded applications. It combines the desirable attributes of current memory technologies such as SRAM, DRAM, and flash memories (fast access time, low cost, high density, and non-volatility). It also solves the critical drawbacks of conventional MRAM technology: poor scalability and high write current. However, variations in process parameters can cause a large number of cells to fail, severely affecting the yield of the memory array. In this paper, we analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations. Based on the model, we performed a thorough analysis of the impact of design parameters on parametric failures due to process variations. To achieve high memory yield without incurring expensive technology modification, we developed an efficient design paradigm, from a circuit and/or architecture perspective, to improve robustness and integration density. The proposed technique effectively relaxes or completely decouples the conflicting design requirements for read stability, writability, and cell area. It can be used at an early stage of the design cycle for yield enhancement.
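
A hedged Monte Carlo sketch of this style of parametric-failure analysis: sample variation-affected read, write, and critical switching currents per cell, then count read-disturb and write failures. The distributions and margins below are illustrative assumptions, not the paper's fitted models.

```python
# Toy Monte Carlo estimate of STT-MRAM per-cell failure probability under
# process variation. All means and sigmas are placeholder assumptions.
import random

def sample_cell():
    i_read  = random.gauss(30e-6, 3e-6)    # read current through the MTJ (A)
    i_write = random.gauss(120e-6, 12e-6)  # write current delivered by the driver (A)
    i_crit  = random.gauss(100e-6, 10e-6)  # MTJ critical switching current (A)
    return i_read, i_write, i_crit

def cell_fails():
    i_read, i_write, i_crit = sample_cell()
    read_disturb  = i_read  >= i_crit      # a read accidentally flips the cell
    write_failure = i_write <  i_crit      # the write cannot switch the cell
    return read_disturb or write_failure

trials = 200_000
p_fail = sum(cell_fails() for _ in range(trials)) / trials
print(f"estimated per-cell failure probability ~ {p_fail:.2e}")
```

The two failure conditions pull the design in opposite directions (a larger critical current helps read stability but hurts writability), which is the conflict the proposed circuit/architecture techniques aim to relax.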

131 citations

Proceedings ArticleDOI
08 Jun 2008
TL;DR: This paper analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations and developed an efficient simulation tool to capture the coupled electro/magnetic dynamics of the spintronic device, enabling effective prediction of memory yield.
Abstract: Spin-torque transfer magnetic RAM (STT MRAM) is a promising candidate for a future universal memory. It combines the desirable attributes of current memory technologies such as SRAM, DRAM, and flash memories. It also solves the key drawbacks of conventional MRAM technology: poor scalability and high write current. In this paper, we analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations. Based on the model, we developed an efficient simulation tool to capture the coupled electro/magnetic dynamics of the spintronic device, enabling effective prediction of memory yield. We also developed a statistical optimization methodology to minimize the memory failure probability. The proposed methodology can be used at an early stage of the design cycle to enhance memory yield.
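
To connect a per-cell failure probability to the array-level yield that such a statistical flow optimizes, here is a short worked example with placeholder numbers, shown with and without row redundancy.

```python
# How a per-cell failure probability propagates to array yield.
# The per-cell rate, array size, and spare count are illustrative only.
from math import comb

def yield_no_repair(p_cell, n_cells):
    """Yield when every one of n_cells must be fault-free."""
    return (1.0 - p_cell) ** n_cells

def yield_with_spares(p_cell, n_rows, cells_per_row, spare_rows):
    """Yield when up to `spare_rows` faulty rows can be repaired by redundancy."""
    p_row = 1.0 - (1.0 - p_cell) ** cells_per_row        # a row fails if any of its cells fails
    return sum(comb(n_rows, k) * p_row**k * (1 - p_row)**(n_rows - k)
               for k in range(spare_rows + 1))

p = 1e-7                                                  # assumed per-cell failure rate
print(yield_no_repair(p, 16 * 2**20))                     # 16 Mb array, no redundancy
print(yield_with_spares(p, n_rows=16384, cells_per_row=1024, spare_rows=8))
```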

93 citations

Proceedings Article
11 Jun 2013
TL;DR: This work demonstrates the first fabricated nonvolatile TCAM using 2-transistor/2-resistive-storage cells to achieve >10× smaller cell size than SRAM-based TCAMs at the same technology node.
Abstract: This work demonstrates the first fabricated nonvolatile TCAM using 2-transistor/2-resistive-storage (2T-2R) cells to achieve a >10× smaller cell size than SRAM-based TCAMs at the same technology node. The test chip was designed and fabricated in IBM 90 nm CMOS technology and a mushroom phase-change memory (PCM) process. To ensure reliable search operation with such compact cells, two enabling techniques were developed and implemented in hardware: 1) two-bit encoding, and 2) a clocked self-referenced sensing scheme (CSRSS). The 1 Mb chip demonstrates reliable low-voltage search operation (VDDmin ~750 mV) and a match delay of 1.9 ns under nominal operating conditions.

85 citations


Cited by

Journal ArticleDOI
20 Apr 2010
TL;DR: The physics behind the large resistivity contrast between the amorphous and crystalline states in phase-change materials is presented, and how it is being exploited to create high-density PCM is described.
Abstract: In this paper, recent progress of phase change memory (PCM) is reviewed. The electrical and thermal properties of phase change materials are surveyed with a focus on the scalability of the materials and their impact on device design. Innovations in the device structure, memory cell selector, and strategies for achieving multibit operation and 3-D, multilayer high-density memory arrays are described. The scaling properties of PCM are illustrated with recent experimental results using special device test structures and novel material synthesis. Factors affecting the reliability of PCM are discussed.
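
As a toy illustration of how the amorphous/crystalline resistance contrast supports the multibit operation the review discusses, the snippet below decodes a measured cell resistance into a 2-bit level; the thresholds and level assignment are arbitrary illustrative values, not material data.

```python
# Multi-level-cell read sketch: partition the resistance range into bands
# and map a read-out resistance to 2 bits. Band edges are placeholders.
import bisect

THRESHOLDS = [3e4, 3e5, 3e6]          # decision thresholds (ohms) between four levels
LEVELS = ["11", "10", "01", "00"]     # fully crystalline ... fully amorphous

def decode(resistance_ohms):
    """Map a measured cell resistance to its 2-bit value."""
    return LEVELS[bisect.bisect(THRESHOLDS, resistance_ohms)]

for r in (1e4, 1e5, 1e6, 1e7):
    print(f"{r:.0e} ohm -> {decode(r)}")
```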

1,488 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory; it distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable these morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895× across the evaluated machine learning benchmarks.
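
The crossbar matrix-vector multiplication that PRIME exploits can be sketched in a few lines: with activations applied as row voltages and weights stored as conductances, each column current sums G·V by Kirchhoff's current law. The weight-to-conductance mapping and the differential pair for signed weights below are simplifying assumptions, not PRIME's exact peripheral circuitry.

```python
# Idealized analog matrix-vector multiply on a ReRAM crossbar.
# Conductance range and voltage scaling are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))                  # NN layer weights (4 outputs, 8 inputs)
x = rng.standard_normal(8)                       # input activations

G_MIN, G_MAX = 1e-6, 1e-4                        # assumed device conductance range (S)
scale = (G_MAX - G_MIN) / np.abs(W).max()

# Signed weights mapped onto a differential pair of crossbars
G_pos = np.where(W > 0,  W, 0) * scale + G_MIN
G_neg = np.where(W < 0, -W, 0) * scale + G_MIN

v = x * 0.1                                      # activations encoded as row voltages
i_out = G_pos @ v - G_neg @ v                    # column currents, sensed differentially

print(np.allclose(i_out / (scale * 0.1), W @ x, atol=1e-6))   # recovers W·x
```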

1,197 citations

Journal ArticleDOI
29 Jan 2020 - Nature
TL;DR: The fabrication of high-yield, high-performance and uniform memristor crossbar arrays for the implementation of CNNs and an effective hybrid-training method to adapt to device imperfections and improve the overall system performance are proposed.
Abstract: Memristor-enabled neuromorphic computing systems provide a fast and energy-efficient approach to training neural networks [1-4]. However, convolutional neural networks (CNNs), one of the most important models for image recognition [5], have not yet been fully hardware-implemented using memristor crossbars, which are cross-point arrays with a memristor device at each intersection. Moreover, achieving software-comparable results is highly challenging owing to the poor yield, large variation and other non-ideal characteristics of devices [6-9]. Here we report the fabrication of high-yield, high-performance and uniform memristor crossbar arrays for the implementation of CNNs, which integrate eight 2,048-cell memristor arrays to improve parallel-computing efficiency. In addition, we propose an effective hybrid-training method to adapt to device imperfections and improve the overall system performance. We built a five-layer memristor-based CNN to perform MNIST [10] image recognition, and achieved a high accuracy of more than 96 per cent. In addition to parallel convolutions using different kernels with shared inputs, replication of multiple identical kernels in memristor arrays was demonstrated for processing different inputs in parallel. The memristor-based CNN neuromorphic system has an energy efficiency more than two orders of magnitude greater than that of state-of-the-art graphics-processing units, and is shown to be scalable to larger networks, such as residual neural networks. Our results are expected to enable a viable memristor-based non-von Neumann hardware solution for deep neural networks and edge computing. A fully hardware-based memristor convolutional neural network using a hybrid training method achieves an energy efficiency more than two orders of magnitude greater than that of graphics-processing units.
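
Below is a conceptual sketch of the hybrid-training idea, using a tiny dense model in place of the memristor CNN: the weights "programmed" into the device layer stay fixed (with simulated programming variation) while only the final layer is retrained in software to absorb the imperfections. The model, data, and noise level are illustrative stand-ins, not the paper's setup.

```python
# Toy hybrid training: freeze a noisy "device" layer, retrain only the head.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((512, 16))               # toy inputs
y = (X[:, :4].sum(axis=1) > 0).astype(float)     # toy binary labels

W1_ideal = rng.standard_normal((16, 32)) * 0.3   # weights meant for the device layer
W1_device = W1_ideal * (1 + 0.15 * rng.standard_normal(W1_ideal.shape))  # programming variation

H = np.maximum(X @ W1_device, 0)                 # fixed "on-chip" layer with ReLU

w2 = np.zeros(32)                                # final layer, retrained off-chip
for _ in range(300):                             # plain logistic-regression updates
    p = 1 / (1 + np.exp(-np.clip(H @ w2, -30, 30)))
    w2 -= 0.1 * H.T @ (p - y) / len(y)

pred = (1 / (1 + np.exp(-np.clip(H @ w2, -30, 30)))) > 0.5
print(f"accuracy with fixed noisy layer + retrained head: {(pred == y).mean():.2f}")
```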

1,033 citations