Author

Junfeng Zhao

Bio: Junfeng Zhao is an academic researcher from Huawei. The author has contributed to research in the topics Semiconductor memory and Computing with Memory, has an h-index of 7, and has co-authored 15 publications receiving 162 citations.

Papers
Journal ArticleDOI
TL;DR: It is shown that all operations involved in machine learning on neural networks can be mapped to a logic-in-memory architecture built from nonvolatile domain-wall nanowires, which significantly alleviates the bandwidth congestion issue and improves energy efficiency.
Abstract: Data-oriented applications have introduced increased demands on memory capacity and bandwidth, which raises the need to rethink the architecture of current computing platforms. The logic-in-memory architecture is highly promising as a future logic-memory integration paradigm for high-throughput data-driven applications. From the memory-technology perspective, the domain-wall nanowire (or racetrack), a recently introduced nonvolatile memory device, not only shows potential as a future power-efficient memory but also offers computing capability through its unique spintronic physics. This paper explores a novel distributed in-memory computing architecture in which most logic functions are executed within the memory, which significantly alleviates the bandwidth congestion issue and improves energy efficiency. The proposed architecture is built purely from domain-wall nanowires, i.e., both memory and logic are implemented by domain-wall nanowire devices. As a case study, a neural network-based image resolution enhancement algorithm, called DW-NN, is examined within the proposed architecture. We show that all operations involved in machine learning on neural networks can be mapped to this logic-in-memory architecture, with domain-wall nanowire logic customized for machine learning within the image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that domain-wall memory reduces leakage power by 92% and dynamic power by 16% compared to a DRAM-based main memory, and that domain-wall logic reduces dynamic power by 31% and leakage power by 65% at similar performance compared to CMOS transistor-based logic. System throughput in DW-NN is improved by 11.6x and energy efficiency by 56x compared to a conventional image processing system.
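To make the mapping idea concrete, here is a minimal sketch of how a multiply-accumulate, the workhorse of neural-network inference, decomposes into bitwise AND and population count, the style of primitive an in-memory logic fabric can evaluate in place. This is an illustrative functional model in plain Python, not the paper's actual DW-NN mapping.

```python
# Illustrative sketch: bit-serial multiply-accumulate built only from
# bitwise AND and popcount, the kind of primitive a logic-in-memory
# fabric (e.g., domain-wall nanowire logic) can execute inside the array.
def mac_bitserial(weights, inputs, bits=8):
    """Compute sum(w * x) using only AND and popcount over bit-planes."""
    acc = 0
    for i in range(bits):            # bit-plane of the weights
        for j in range(bits):        # bit-plane of the inputs
            plane = sum(((w >> i) & 1) & ((x >> j) & 1)
                        for w, x in zip(weights, inputs))  # AND + popcount
            acc += plane << (i + j)  # shift-add recombines partial sums
    return acc

assert mac_bitserial([3, 5, 7], [2, 4, 6]) == 3*2 + 5*4 + 7*6
```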

72 citations

Proceedings ArticleDOI
10 Mar 2016
TL;DR: Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design in a CMOS-based ASIC.
Abstract: Emerging resistive random-access memory (RRAM) can provide not only non-volatile storage but also intrinsic logic for matrix-vector multiplication, which is ideal for a low-power, high-throughput data analytics accelerator that computes in memory. However, existing RRAM-based computing devices mainly assume multi-level analog computing, whose result is sensitive to process non-uniformity as well as the overhead of additional A/D conversion and I/O. This paper explores a data analytics accelerator on a binary RRAM crossbar. Accordingly, a distributed in-memory computing architecture is proposed, together with the design of the corresponding components and control protocol. Both the memory array and the logic accelerator can be implemented purely in binary RRAM crossbars, and logic-memory pairs can be distributed via a control-bus protocol. Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design in a CMOS-based ASIC.
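The crossbar primitive behind this accelerator can be modeled in a few lines: with binary conductances and binary inputs, each bit line naturally sums its column's currents into an integer dot product. A minimal sketch, assuming an ideal crossbar with no device non-uniformity:

```python
import numpy as np

# Functional model of a binary RRAM crossbar: cells hold conductance
# 0/1 (HRS/LRS), word lines carry a binary input vector, and each bit
# line sums the currents of its column in a single step.
rng = np.random.default_rng(0)
G = rng.integers(0, 2, size=(8, 4))   # 8x4 crossbar, binary conductances
x = rng.integers(0, 2, size=8)        # binary input on the word lines

bitline_sums = x @ G                  # analog current summation per column
print(bitline_sums)                   # integer dot products, no DAC needed
```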

29 citations

Proceedings ArticleDOI
Yuan Liang1, Hao Yu1, Junfeng Zhao2, Wei Yang2, Yuangang Wang2 
22 Jul 2015
TL;DR: An energy-efficient and low-crosstalk sub-THz (0.1–1 THz) I/O using surface-wave-based modulators and interconnects in CMOS; a high on/off-ratio surface-wave modulator is also proposed to support on-chip THz communication.
Abstract: Free-space EM-wave-based GHz interconnects suffer significant loss and crosstalk and thus cannot be deployed as low-power, dense I/Os for future network-on-chip (NoC) integration of many cores and memory. This paper proposes an energy-efficient and low-crosstalk sub-THz (0.1–1 THz) I/O using surface-wave-based modulators and interconnects in CMOS. By introducing a sub-wavelength periodic corrugation structure onto the transmission line, a surface wave is established that propagates the signal strongly localized on the surface of the top-layer metal wire, resulting in low coupling into the lossy substrate and neighboring metal wires. As such, significant power saving and crosstalk reduction can be observed at high communication bandwidth. In addition, a high on/off-ratio surface-wave modulator is proposed to support on-chip THz communication. Designed in 65 nm CMOS, the proposed surface-wave I/O interface achieves a 25 Gbps data rate and 0.016 pJ/bit/mm energy efficiency at a 140 GHz carrier frequency over 20 mm surface-wave channels. The channels can be placed with 2.4 µm spacing at a −20 dB crosstalk ratio. The surface-wave modulator also achieves a significant reduction of radiation loss with a 23 dB extinction ratio.
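A quick back-of-envelope check of the link budget implied by the quoted figures (0.016 pJ/bit/mm, 20 mm channels, 25 Gbps):

```python
# Arithmetic check using only the figures quoted in the abstract.
energy_per_bit_per_mm = 0.016e-12   # J/bit/mm
channel_length_mm = 20              # mm
data_rate = 25e9                    # bit/s

energy_per_bit = energy_per_bit_per_mm * channel_length_mm  # 0.32 pJ/bit
link_power = energy_per_bit * data_rate                     # 8 mW per channel
print(f"{energy_per_bit * 1e12:.2f} pJ/bit, {link_power * 1e3:.1f} mW")
```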

19 citations

Proceedings ArticleDOI
22 Jul 2015
TL;DR: The numerical experiments demonstrate that the proposed optimized Boolean embedding on the RRAM crossbar exhibits 10x faster speed, 17x better energy efficiency, and three orders of magnitude smaller area with a slight accuracy penalty, when compared to the optimized real-valued embedding on a CMOS ASIC platform.
Abstract: The emerging resistive random-access memory (RRAM) crossbar provides an intrinsic fabric for matrix-vector multiplication, which can be leveraged as power-efficient linear embedding hardware for data analytics such as compressive sensing. As the matrix elements are represented by the resistance of RRAM cells, the limited RRAM programming resolution imposes constraints on the embedding matrix. A random Boolean embedding can be efficiently mapped to the RRAM crossbar but suffers from poor performance. Learning-based embedding matrices can deliver optimized performance but are continuous-valued, which prevents them from being mapped to the RRAM crossbar structure directly. In this paper, we propose an algorithm that finds an optimal Boolean embedding matrix for a given learned real-valued embedding matrix, so that it can be effectively mapped to the RRAM crossbar structure while high performance is preserved. The numerical experiments demonstrate that the proposed optimized Boolean embedding reduces the embedding distortion by 2.7x and the image recovery error by 2.5x compared to the random Boolean embedding, both mapped on the RRAM crossbar. In addition, the optimized Boolean embedding on the RRAM crossbar exhibits 10x faster speed, 17x better energy efficiency, and three orders of magnitude smaller area with a slight accuracy penalty, when compared to the optimized real-valued embedding on a CMOS ASIC platform.
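To make the quantization constraint concrete, here is a hedged illustration, not the paper's algorithm, of fitting a scaled Boolean matrix alpha*B with B in {0,1} to a learned real-valued embedding A, alternating a per-entry threshold step with a closed-form rescale step:

```python
import numpy as np

# Illustrative alternating minimization of ||A - alpha*B||_F over
# alpha > 0 and B in {0,1}; each step is optimal given the other.
def boolean_fit(A, iters=20):
    alpha = A.max() if A.max() > 0 else 1.0
    for _ in range(iters):
        B = (A > alpha / 2).astype(float)      # per-entry optimal for fixed alpha
        if B.sum() > 0:
            alpha = (A * B).sum() / B.sum()    # least-squares optimal for fixed B
    return alpha, B

A = np.abs(np.random.default_rng(1).normal(size=(16, 8)))
alpha, B = boolean_fit(A)
print("relative distortion:",
      np.linalg.norm(A - alpha * B) / np.linalg.norm(A))
```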

11 citations

Proceedings ArticleDOI
12 Mar 2015
TL;DR: 3D die stacking is demonstrated, whereby disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.
Abstract: Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, the increasing power density is the obstacle to continued performance gains. Recent studies show that heterogeneous multi-core is a promising solution for optimizing performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies can be integrated on the same chip, such as CMOS logic and emerging non-volatile memory, enabling a new paradigm of architecture design.

10 citations


Cited by
Journal ArticleDOI
01 Jan 2021
TL;DR: This article defines the main figures of merit (FoMs) of analog RSM hardware and performs a comprehensive analysis covering basic device characteristics, hardware algorithms, the corresponding mapping methods for device arrays, and the architecture and circuit design considerations for neural networks.
Abstract: In this article, we review the existing analog resistive switching memory (RSM) devices and their hardware technologies for in-memory learning, as well as their challenges and prospects. Since the characteristics of the devices are different for in-memory learning and digital memory applications, it is important to have an in-depth understanding across different layers from devices and circuits to architectures and algorithms. First, based on a top-down view from architecture to devices for analog computing, we define the main figures of merit (FoMs) and perform a comprehensive analysis of analog RSM hardware including the basic device characteristics, hardware algorithms, and the corresponding mapping methods for device arrays, as well as the architecture and circuit design considerations for neural networks. Second, we classify the FoMs of analog RSM devices into two levels. Level 1 FoMs are essential for achieving the functionality of a system (e.g., linearity, symmetry, dynamic range, level numbers, fluctuation, variability, and yield). Level 2 FoMs are those that make a functional system more efficient and reliable (e.g., area, operational voltage, energy consumption, speed, endurance, retention, and compatibility with back-end-of-line processing). By constructing a device-to-application simulation framework, we perform an in-depth analysis of how these FoMs influence in-memory learning and give a target list of the device requirements. Lastly, we evaluate the main FoMs of most existing devices with analog characteristics and review optimization methods from programming schemes to materials and device structures. The key challenges and prospects from the device to system level for analog RSM devices are discussed.
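As a loose illustration of two of the Level 1 FoMs named above, the toy sketch below computes a nonlinearity and an asymmetry number from synthetic potentiation/depression conductance traces. The metric definitions here are simplified assumptions, not the article's exact formulations.

```python
import numpy as np

# Toy FoM extraction: nonlinearity = deviation of a conductance-update
# trace from the ideal straight line; asymmetry = mismatch between
# potentiation and depression step sizes (simplified definitions).
pulses = np.arange(64)
g_pot = 1 - np.exp(-pulses / 20)          # synthetic potentiation trace
g_dep = g_pot[-1] * np.exp(-pulses / 12)  # synthetic depression trace

ideal = np.linspace(g_pot[0], g_pot[-1], len(g_pot))
nonlinearity = np.max(np.abs(g_pot - ideal)) / (g_pot[-1] - g_pot[0])
asymmetry = np.mean(np.abs(np.diff(g_pot) + np.diff(g_dep)[::-1]))
print(f"nonlinearity={nonlinearity:.3f}, asymmetry={asymmetry:.4f}")
```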

110 citations

Journal ArticleDOI
24 Mar 2020
TL;DR: This article provides an overview of the major developments of RTM technology from both the physics and computer architecture perspectives over the past decade; by optimizing the memory subsystem, RTM could enable a new era of caches, graphics processing units, and high-capacity memory devices.
Abstract: Racetrack memory (RTM) is a novel spintronic memory-storage technology that has the potential to overcome fundamental constraints of existing memory and storage devices. Its core differentiating feature is the movement of the data, which are composed of magnetic domain walls (DWs), by short current pulses. This enables more data to be stored per unit area than any other current technology. On the one hand, RTM has the potential for mass data storage with unlimited endurance using considerably less energy than today's technologies. On the other hand, RTM promises an ultrafast nonvolatile memory competitive with static random access memory (SRAM) but with a much smaller footprint. During the last decade, the discovery of novel physical mechanisms to operate RTM has led to a major enhancement in the efficiency with which nanoscopic, chiral DWs can be manipulated. New materials and artificially atomically engineered thin-film structures have been found to increase the speed and lower the threshold current with which the data bits can be manipulated. With these recent developments, RTM has attracted the attention of the computer architecture community, which has evaluated the use of RTM at various levels in the memory stack. Recent studies advocate RTM as a promising compromise between power-hungry volatile memories and slow nonvolatile storage. By optimizing the memory subsystem, significant performance improvements can be achieved, enabling a new era of caches, graphics processing units, and high-capacity memory devices. In this article, we provide an overview of the major developments of RTM technology from both the physics and computer architecture perspectives over the past decade. We identify the remaining challenges and give an outlook on its future.
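The architectural trade-off described above, that bits must be shifted along the track to a fixed access port before they can be read, is easy to capture in a toy cost model. The track length and port placement below are assumed for illustration, not taken from a published RTM study:

```python
# Toy RTM access-cost model: data sits on a nanowire track and must be
# shifted under one of a few fixed access ports before it can be sensed,
# so access latency depends on the bit's position.
def shifts_to_read(ports, target):
    """Shift count to align bit `target` with its nearest access port."""
    return min(abs(target - p) for p in ports)

ports = [8, 24, 40, 56]                  # 4 ports on a 64-bit track
worst = max(shifts_to_read(ports, i) for i in range(64))
print("worst-case shifts:", worst)       # 8 with this spacing
```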

85 citations

Journal ArticleDOI
TL;DR: This paper presents practical case studies to demonstrate MRIMA's acceleration of binary-weight and low bit-width convolutional neural networks (CNNs) as well as data encryption, and shows ~77% and 21% lower energy consumption compared to a CMOS ASIC and a recent domain-wall-based design, respectively.
Abstract: In this paper, we propose MRIMA, a novel magnetic RAM (MRAM)-based in-memory accelerator for nonvolatile, flexible, and efficient in-memory computing. MRIMA transforms current spin-transfer torque magnetic random access memory (STT-MRAM) arrays into massively parallel computational units capable of working as both nonvolatile memory and in-memory logic. Instead of integrating complex logic units into cost-sensitive memory, MRIMA exploits hardware-friendly bit-line computing methods to implement complete Boolean logic functions between operands within a memory array in a single clock cycle, overcoming the multicycle logic issue in contemporary processing-in-memory (PIM) platforms. We present practical case studies to demonstrate MRIMA's acceleration of binary-weight and low bit-width convolutional neural networks (CNNs) as well as data encryption. Our device-to-architecture co-simulation results on CNN acceleration demonstrate that MRIMA can obtain 1.7x better energy efficiency and 11.2x speed-up compared to ASICs, and 1.8x better energy efficiency and 2.4x speed-up over the best DRAM-based PIM solutions. As an advanced encryption standard (AES) in-memory encryption engine, MRIMA shows ~77% and 21% lower energy consumption compared to a CMOS ASIC and a recent domain-wall-based design, respectively.
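A minimal functional sketch of the bit-line computing idea, assuming idealized sensing: activating two word lines at once lets each column's sense amplifier resolve a Boolean function of the two stored bits, so a whole-row AND/OR/XOR costs one array access. This models the behavior only, not MRIMA's actual circuit:

```python
import numpy as np

# Functional model of single-cycle bulk bitwise operations between two
# stored rows; in hardware, the sense threshold selects the function.
row_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
row_b = np.array([1, 1, 0, 1, 0, 1, 0, 0])

and_out = row_a & row_b   # e.g., high sense threshold
or_out  = row_a | row_b   # e.g., low sense threshold
xor_out = row_a ^ row_b   # derivable from the two reads above
print(and_out, or_out, xor_out, sep="\n")
```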

76 citations

Journal ArticleDOI
Leibin Ni, Hantao Huang, Zichuan Liu, Rajiv V. Joshi, Hao Yu
TL;DR: Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design in a CMOS-based ASIC.
Abstract: The recently emerging resistive random-access memory (RRAM) can provide not only nonvolatile memory storage but also intrinsic computing for matrix-vector multiplication, which is ideal for a low-power, high-throughput data analytics accelerator performed in memory. However, existing RRAM crossbar-based computing mainly assumes multilevel analog computing, whose result is sensitive to process nonuniformity as well as the additional overhead from A/D conversion and I/O. In this article, we explore a matrix-vector multiplication accelerator on a binary RRAM crossbar with adaptive 1-bit-comparator-based parallel conversion. Moreover, a distributed in-memory computing architecture is developed with the corresponding control protocol. Both the memory array and the logic accelerator are implemented on the binary RRAM crossbar, where logic-memory pairs can be distributed with the control-bus protocol. Experimental results show that, compared to the analog RRAM crossbar, the proposed binary RRAM crossbar achieves significant area savings with better calculation accuracy. Moreover, significant speedup can be achieved for matrix-vector multiplication in neural network-based machine learning, such that both the overall training and testing times are reduced. In addition, large energy savings can also be achieved compared to the traditional CMOS-based out-of-memory computing architecture.
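The 1-bit-comparator readout can be modeled functionally: each bit line's summed current is compared against a reference in parallel, replacing a shared multi-bit ADC with one comparator per column. The threshold rule below is an assumption for illustration, not the paper's adaptive scheme:

```python
import numpy as np

# Functional model of parallel 1-bit-comparator conversion on a
# binary RRAM crossbar: one comparison per column, all at once.
rng = np.random.default_rng(2)
G = rng.integers(0, 2, size=(16, 8))      # binary crossbar
x = rng.integers(0, 2, size=16)           # binary input vector

currents = x @ G                          # per-column analog sums
reference = x.sum() / 2                   # assumed threshold for the sketch
bits = (currents >= reference).astype(int)
print(bits)                               # 1-bit output per bit line
```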

65 citations

Proceedings ArticleDOI
21 Jan 2019
TL;DR: A sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization, together with a crossbar-grained pruning algorithm to remove crossbars with low utilization.
Abstract: With its in-memory processing ability, ReRAM-based computing is becoming increasingly attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping for sparse NNs, and the whole dense matrix must be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision on sparse NNs and propose to perform high-precision composing in the analog domain, with the related peripheral circuits designed accordingly. In our experiments, we discuss how the system performs with different crossbar sizes in order to choose an optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3–5x energy efficiency and more than 2.5–6x speedup compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method has almost no accuracy loss.
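The crossbar-grained pruning step lends itself to a short functional sketch: tile the sparse weight matrix into crossbar-sized blocks, measure each tile's utilization (nonzero fraction), and drop tiles below a threshold so they never occupy a physical crossbar. The tile size and threshold here are assumed, and the paper's clustering-based mapping is not reproduced:

```python
import numpy as np

# Illustrative crossbar-grained pruning by per-tile utilization.
def prune_tiles(W, tile=4, min_util=0.25):
    kept = []
    for r in range(0, W.shape[0], tile):
        for c in range(0, W.shape[1], tile):
            block = W[r:r+tile, c:c+tile]
            util = np.count_nonzero(block) / block.size
            if util >= min_util:
                kept.append(((r, c), util))    # tile gets a physical crossbar
            else:
                W[r:r+tile, c:c+tile] = 0      # tile pruned entirely
    return kept

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 8)) * (rng.random((8, 8)) < 0.2)  # ~80% sparse
print(prune_tiles(W))
```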

64 citations