Author

Jing Li

Bio: Jing Li is an academic researcher from the University of Pennsylvania. The author has contributed to research in the topics of phase-change memory and CMOS, has an h-index of 25, and has co-authored 111 publications receiving 1,823 citations. Previous affiliations of Jing Li include IBM and the University of Wisconsin-Madison.


Papers
Proceedings ArticleDOI
22 Feb 2017
TL;DR: An analytical performance model is proposed and used for an in-depth analysis of the resource requirements of CNN classifier kernels against the resources available on modern FPGAs; a new kernel design is then proposed that effectively addresses the on-chip memory bandwidth limitation and provides an optimal balance between computation, on-chip memory access, and off-chip memory access.
Abstract: OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the resources available on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address this bandwidth limitation and to provide an optimal balance between computation, on-chip memory access, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating-point performance at a 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
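
As a rough illustration of the kind of analytical performance model described, the sketch below computes a roofline-style bound for one convolutional layer from an operation count and a byte count. All hardware parameters (DSP count, clock, bandwidth) and the layer shape are placeholder assumptions, not figures from the paper, and the paper's model additionally accounts for on-chip memory bandwidth, which it identifies as the actual bottleneck.

```python
# Minimal roofline-style sketch for one CNN layer on an OpenCL FPGA kernel.
# All parameters (DSP count, clock, bandwidth) are illustrative assumptions.

def layer_ops(out_c, out_h, out_w, in_c, k):
    """Operations of one convolutional layer (each MAC counted as 2 ops)."""
    return 2 * out_c * out_h * out_w * in_c * k * k

def attainable_gops(ops, bytes_moved, peak_gops, bw_gbs):
    """Attainable throughput = min(compute roof, bandwidth roof)."""
    intensity = ops / bytes_moved                  # ops per byte
    return min(peak_gops, intensity * bw_gbs)      # GB/s * ops/byte = Gop/s

# Example: a VGG-like 3x3 layer, 16-bit fixed-point data (2 bytes per word)
ops = layer_ops(out_c=256, out_h=56, out_w=56, in_c=256, k=3)
weight_bytes = 256 * 256 * 3 * 3 * 2
fmap_bytes = (256 * 56 * 56 + 256 * 56 * 56) * 2
peak = 1500 * 2 * 0.385                            # assumed DSPs * 2 ops * GHz
print(attainable_gops(ops, weight_bytes + fmap_bytes, peak, bw_gbs=34.1))
```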

199 citations

Journal ArticleDOI
TL;DR: This work demonstrates the first fabricated 1 Mb nonvolatile TCAM using 2-transistor/2-resistive-storage (2T-2R) cells to achieve >10× smaller cell size than SRAM-based TCAMs at the same technology node.
Abstract: This work demonstrates the first fabricated 1 Mb nonvolatile TCAM using 2-transistor/2-resistive-storage (2T-2R) cells to achieve a >10× smaller cell size than SRAM-based TCAMs at the same technology node. The test chip was designed and fabricated in IBM 90 nm CMOS technology and mushroom phase-change memory (PCM) technology. The primary challenge for enabling reliable array operation with such an aggressive cell is presented, namely, severely degraded sensing margin due to the significantly lower ON/OFF ratio of resistive memories (~10² for PCM) than that of traditional MOSFETs (>10⁵). To address this challenge, two enabling techniques were developed and implemented in hardware: 1) two-bit encoding and 2) a clocked self-referenced sensing scheme (CSRSS). In addition, the two-bit encoding can also improve algorithmic mapping by effectively compressing TCAM entries. The 1 Mb chip demonstrates reliable low-voltage search operation (VDDmin ~750 mV) and a match delay of 1.9 ns under nominal operating conditions.
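
For intuition about how a 2T-2R cell stores a ternary value and how a mismatch discharges the match line, here is a minimal behavioral sketch. It uses the generic complementary-resistance encoding and assumed LRS/HRS values; it does not reproduce the paper's specific two-bit encoding or the CSRSS circuit.

```python
# Behavioral sketch of 2T-2R TCAM storage and match-line evaluation.
# Resistance values and threshold are illustrative assumptions.

LRS, HRS = 1e4, 1e6   # assumed ON/OFF resistances (ohms), ~10^2 ratio as for PCM

ENCODE = {'0': (LRS, HRS), '1': (HRS, LRS), 'X': (HRS, HRS)}  # (R_true, R_false)

def cell_pulldown(stored, key_bit):
    """Conductance discharging the match line for one cell (small = no pull-down)."""
    r_true, r_false = ENCODE[stored]
    # the search key enables the transistor on one branch; a mismatch selects the LRS branch
    r_selected = r_true if key_bit == '1' else r_false
    return 1.0 / r_selected

def word_matches(stored_word, key, threshold=2e-5):
    """A word matches if the total pull-down conductance stays below a sensing threshold."""
    g_total = sum(cell_pulldown(s, k) for s, k in zip(stored_word, key))
    return g_total < threshold

print(word_matches("10X1", "1011"))   # True: 'X' matches either key bit
print(word_matches("10X1", "1111"))   # False: the second bit mismatches
```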

146 citations

Journal ArticleDOI
TL;DR: This paper analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations and developed an efficient design paradigm, from a circuit and/or architecture perspective, to improve robustness and integration density.
Abstract: Spin-torque transfer magnetic RAM (STT MRAM) is a promising candidate for future embedded applications. It combines the desirable attributes of current memory technologies such as SRAM, DRAM, and flash memories (fast access time, low cost, high density, and non-volatility). It also solves the critical drawbacks of conventional MRAM technology: poor scalability and high write current. However, variations in process parameters can cause a large number of cells to fail, severely affecting the yield of the memory array. In this paper, we analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations. Based on the model, we performed a thorough analysis of the impact of design parameters on parametric failures due to process variations. To achieve high memory yield without incurring expensive technology modification, we developed an efficient design paradigm, from a circuit and/or architecture perspective, to improve robustness and integration density. The proposed technique effectively relaxes or completely decouples the conflicting design requirements for read stability, writability, and cell area. It can be used at an early stage of the design cycle for yield enhancement.
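
A hedged Monte Carlo sketch of this style of parametric-failure analysis: sample variation-affected read, write, and critical switching currents per cell, then count read-disturb and write failures. The distributions and margins below are illustrative assumptions, not the paper's fitted models.

```python
# Toy Monte Carlo estimate of STT-MRAM per-cell failure probability under
# process variation. All means and sigmas are placeholder assumptions.
import random

def sample_cell():
    i_read  = random.gauss(30e-6, 3e-6)    # read current through the MTJ (A)
    i_write = random.gauss(120e-6, 12e-6)  # write current delivered by the driver (A)
    i_crit  = random.gauss(100e-6, 10e-6)  # MTJ critical switching current (A)
    return i_read, i_write, i_crit

def cell_fails():
    i_read, i_write, i_crit = sample_cell()
    read_disturb  = i_read  >= i_crit      # a read accidentally flips the cell
    write_failure = i_write <  i_crit      # the write cannot switch the cell
    return read_disturb or write_failure

trials = 200_000
p_fail = sum(cell_fails() for _ in range(trials)) / trials
print(f"estimated per-cell failure probability ~ {p_fail:.2e}")
```

The two failure conditions pull the design in opposite directions (a larger critical current helps read stability but hurts writability), which is the conflict the proposed circuit/architecture techniques aim to relax.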

131 citations

Proceedings ArticleDOI
08 Jun 2008
TL;DR: This paper analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations and developed an efficient simulation tool to capture the coupled electro/magnetic dynamics of the spintronic device, enabling effective prediction of memory yield.
Abstract: Spin-torque transfer magnetic RAM (STT MRAM) is a promising candidate for a future universal memory. It combines the desirable attributes of current memory technologies such as SRAM, DRAM, and flash memories. It also solves the key drawbacks of conventional MRAM technology: poor scalability and high write current. In this paper, we analyzed and modeled the failure probabilities of STT MRAM cells due to parameter variations. Based on the model, we developed an efficient simulation tool to capture the coupled electro/magnetic dynamics of the spintronic device, enabling effective prediction of memory yield. We also developed a statistical optimization methodology to minimize the memory failure probability. The proposed methodology can be used at an early stage of the design cycle to enhance memory yield.
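
To connect a per-cell failure probability to the array-level yield that such a statistical flow optimizes, here is a short worked example with placeholder numbers, shown with and without row redundancy.

```python
# How a per-cell failure probability propagates to array yield.
# The per-cell rate, array size, and spare count are illustrative only.
from math import comb

def yield_no_repair(p_cell, n_cells):
    """Yield when every one of n_cells must be fault-free."""
    return (1.0 - p_cell) ** n_cells

def yield_with_spares(p_cell, n_rows, cells_per_row, spare_rows):
    """Yield when up to `spare_rows` faulty rows can be repaired by redundancy."""
    p_row = 1.0 - (1.0 - p_cell) ** cells_per_row        # a row fails if any of its cells fails
    return sum(comb(n_rows, k) * p_row**k * (1 - p_row)**(n_rows - k)
               for k in range(spare_rows + 1))

p = 1e-7                                                  # assumed per-cell failure rate
print(yield_no_repair(p, 16 * 2**20))                     # 16 Mb array, no redundancy
print(yield_with_spares(p, n_rows=16384, cells_per_row=1024, spare_rows=8))
```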

93 citations

Proceedings Article
11 Jun 2013
TL;DR: This work demonstrates the first fabricated nonvolatile TCAM using 2-transistor/2-resistive-storage cells to achieve >10× smaller cell size than SRAM-based TCAMs at the same technology node.
Abstract: This work demonstrates the first fabricated nonvolatile TCAM using 2-transistor/2-resistive-storage (2T-2R) cells to achieve a >10× smaller cell size than SRAM-based TCAMs at the same technology node. The test chip was designed and fabricated in IBM 90 nm CMOS technology and a mushroom phase-change memory (PCM) process. To ensure reliable search operation with such compact cells, two enabling techniques were developed and implemented in hardware: 1) two-bit encoding, and 2) a clocked self-referenced sensing scheme (CSRSS). The 1 Mb chip demonstrates reliable low-voltage search operation (VDDmin ~750 mV) and a match delay of 1.9 ns under nominal operating conditions.

85 citations


Cited by

Journal ArticleDOI
20 Apr 2010
TL;DR: The physics behind the large resistivity contrast between the amorphous and crystalline states in phase-change materials is presented, and how it is being exploited to create high-density PCM is described.
Abstract: In this paper, recent progress of phase change memory (PCM) is reviewed. The electrical and thermal properties of phase change materials are surveyed with a focus on the scalability of the materials and their impact on device design. Innovations in the device structure, memory cell selector, and strategies for achieving multibit operation and 3-D, multilayer high-density memory arrays are described. The scaling properties of PCM are illustrated with recent experimental results using special device test structures and novel material synthesis. Factors affecting the reliability of PCM are discussed.
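
As a toy illustration of how the amorphous/crystalline resistance contrast supports the multibit operation the review discusses, the snippet below decodes a measured cell resistance into a 2-bit level; the thresholds and level assignment are arbitrary illustrative values, not material data.

```python
# Multi-level-cell read sketch: partition the resistance range into bands
# and map a read-out resistance to 2 bits. Band edges are placeholders.
import bisect

THRESHOLDS = [3e4, 3e5, 3e6]          # decision thresholds (ohms) between four levels
LEVELS = ["11", "10", "01", "00"]     # fully crystalline ... fully amorphous

def decode(resistance_ohms):
    """Map a measured cell resistance to its 2-bit value."""
    return LEVELS[bisect.bisect(THRESHOLDS, resistance_ohms)]

for r in (1e4, 1e5, 1e6, 1e7):
    print(f"{r:.0e} ohm -> {decode(r)}")
```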

1,488 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory; it distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable these morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895× across the evaluated machine learning benchmarks.
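
The crossbar matrix-vector multiplication that PRIME exploits can be sketched in a few lines: with activations applied as row voltages and weights stored as conductances, each column current sums G·V by Kirchhoff's current law. The weight-to-conductance mapping and the differential pair for signed weights below are simplifying assumptions, not PRIME's exact peripheral circuitry.

```python
# Idealized analog matrix-vector multiply on a ReRAM crossbar.
# Conductance range and voltage scaling are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))                  # NN layer weights (4 outputs, 8 inputs)
x = rng.standard_normal(8)                       # input activations

G_MIN, G_MAX = 1e-6, 1e-4                        # assumed device conductance range (S)
scale = (G_MAX - G_MIN) / np.abs(W).max()

# Signed weights mapped onto a differential pair of crossbars
G_pos = np.where(W > 0,  W, 0) * scale + G_MIN
G_neg = np.where(W < 0, -W, 0) * scale + G_MIN

v = x * 0.1                                      # activations encoded as row voltages
i_out = G_pos @ v - G_neg @ v                    # column currents, sensed differentially

print(np.allclose(i_out / (scale * 0.1), W @ x, atol=1e-6))   # recovers W·x
```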

1,197 citations

Journal ArticleDOI
29 Jan 2020 - Nature
TL;DR: The fabrication of high-yield, high-performance and uniform memristor crossbar arrays for the implementation of CNNs and an effective hybrid-training method to adapt to device imperfections and improve the overall system performance are proposed.
Abstract: Memristor-enabled neuromorphic computing systems provide a fast and energy-efficient approach to training neural networks [1-4]. However, convolutional neural networks (CNNs), one of the most important models for image recognition [5], have not yet been fully hardware-implemented using memristor crossbars, which are cross-point arrays with a memristor device at each intersection. Moreover, achieving software-comparable results is highly challenging owing to the poor yield, large variation and other non-ideal characteristics of devices [6-9]. Here we report the fabrication of high-yield, high-performance and uniform memristor crossbar arrays for the implementation of CNNs, which integrate eight 2,048-cell memristor arrays to improve parallel-computing efficiency. In addition, we propose an effective hybrid-training method to adapt to device imperfections and improve the overall system performance. We built a five-layer memristor-based CNN to perform MNIST [10] image recognition, and achieved a high accuracy of more than 96 per cent. In addition to parallel convolutions using different kernels with shared inputs, replication of multiple identical kernels in memristor arrays was demonstrated for processing different inputs in parallel. The memristor-based CNN neuromorphic system has an energy efficiency more than two orders of magnitude greater than that of state-of-the-art graphics-processing units, and is shown to be scalable to larger networks, such as residual neural networks. Our results are expected to enable a viable memristor-based non-von Neumann hardware solution for deep neural networks and edge computing. A fully hardware-based memristor convolutional neural network using a hybrid training method achieves an energy efficiency more than two orders of magnitude greater than that of graphics-processing units.
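
Below is a conceptual sketch of the hybrid-training idea, using a tiny dense model in place of the memristor CNN: the weights "programmed" into the device layer stay fixed (with simulated programming variation) while only the final layer is retrained in software to absorb the imperfections. The model, data, and noise level are illustrative stand-ins, not the paper's setup.

```python
# Toy hybrid training: freeze a noisy "device" layer, retrain only the head.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((512, 16))               # toy inputs
y = (X[:, :4].sum(axis=1) > 0).astype(float)     # toy binary labels

W1_ideal = rng.standard_normal((16, 32)) * 0.3   # weights meant for the device layer
W1_device = W1_ideal * (1 + 0.15 * rng.standard_normal(W1_ideal.shape))  # programming variation

H = np.maximum(X @ W1_device, 0)                 # fixed "on-chip" layer with ReLU

w2 = np.zeros(32)                                # final layer, retrained off-chip
for _ in range(300):                             # plain logistic-regression updates
    p = 1 / (1 + np.exp(-np.clip(H @ w2, -30, 30)))
    w2 -= 0.1 * H.T @ (p - y) / len(y)

pred = (1 / (1 + np.exp(-np.clip(H @ w2, -30, 30)))) > 0.5
print(f"accuracy with fixed noisy layer + retrained head: {(pred == y).mean():.2f}")
```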

1,033 citations