Author

Junfeng Zhao

Bio: Junfeng Zhao is an academic researcher from Huawei. The author has contributed to research in the topics Semiconductor memory and Computing with Memory, has an h-index of 7, and has co-authored 15 publications receiving 162 citations.

Papers
Journal ArticleDOI
TL;DR: It is shown that all operations involved in machine learning on neural networks can be mapped to a logic-in-memory architecture built from nonvolatile domain-wall nanowires, which significantly alleviates the bandwidth congestion issue and improves energy efficiency.
Abstract: Data-oriented applications have introduced increased demands on memory capacity and bandwidth, which raises the need to rethink the architecture of current computing platforms. The logic-in-memory architecture is highly promising as a future logic-memory integration paradigm for high-throughput data-driven applications. From the memory-technology perspective, the domain-wall nanowire (or racetrack), a recently introduced nonvolatile memory device, not only shows potential as a future power-efficient memory but also offers computing capability through its unique spintronic physics. This paper explores a novel distributed in-memory computing architecture in which most logic functions are executed within the memory, which significantly alleviates the bandwidth congestion issue and improves energy efficiency. The proposed architecture is built purely from domain-wall nanowires, i.e., both memory and logic are implemented by domain-wall nanowire devices. As a case study, a neural network-based image resolution enhancement algorithm, called DW-NN, is examined within the proposed architecture. We show that all operations involved in machine learning on neural networks can be mapped to this logic-in-memory architecture, with domain-wall nanowire logic customized for machine learning within the image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that domain-wall memory reduces leakage power by 92% and dynamic power by 16% compared to a DRAM-based main memory, and that domain-wall logic reduces dynamic power by 31% and leakage power by 65% at similar performance compared to CMOS transistor-based logic. System throughput in DW-NN is improved by 11.6x and energy efficiency by 56x compared to a conventional image processing system.
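To make the mapping idea concrete, here is a minimal sketch of how a multiply-accumulate, the workhorse of neural-network inference, decomposes into bitwise AND and population count, the style of primitive an in-memory logic fabric can evaluate in place. This is an illustrative functional model in plain Python, not the paper's actual DW-NN mapping.

```python
# Illustrative sketch: bit-serial multiply-accumulate built only from
# bitwise AND and popcount, the kind of primitive a logic-in-memory
# fabric (e.g., domain-wall nanowire logic) can execute inside the array.
def mac_bitserial(weights, inputs, bits=8):
    """Compute sum(w * x) using only AND and popcount over bit-planes."""
    acc = 0
    for i in range(bits):            # bit-plane of the weights
        for j in range(bits):        # bit-plane of the inputs
            plane = sum(((w >> i) & 1) & ((x >> j) & 1)
                        for w, x in zip(weights, inputs))  # AND + popcount
            acc += plane << (i + j)  # shift-add recombines partial sums
    return acc

assert mac_bitserial([3, 5, 7], [2, 4, 6]) == 3*2 + 5*4 + 7*6
```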

72 citations

Proceedings ArticleDOI
10 Mar 2016
TL;DR: Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design in a CMOS-based ASIC.
Abstract: Emerging resistive random-access memory (RRAM) can provide not only non-volatile storage but also intrinsic logic for matrix-vector multiplication, which is ideal for a low-power, high-throughput data analytics accelerator that computes in memory. However, existing RRAM-based computing devices mainly assume multi-level analog computing, whose result is sensitive to process non-uniformity as well as the overhead of additional A/D conversion and I/O. This paper explores a data analytics accelerator on a binary RRAM crossbar. Accordingly, a distributed in-memory computing architecture is proposed, together with the design of the corresponding components and control protocol. Both the memory array and the logic accelerator can be implemented purely in binary RRAM crossbars, and logic-memory pairs can be distributed via a control-bus protocol. Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design in a CMOS-based ASIC.
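The crossbar primitive behind this accelerator can be modeled in a few lines: with binary conductances and binary inputs, each bit line naturally sums its column's currents into an integer dot product. A minimal sketch, assuming an ideal crossbar with no device non-uniformity:

```python
import numpy as np

# Functional model of a binary RRAM crossbar: cells hold conductance
# 0/1 (HRS/LRS), word lines carry a binary input vector, and each bit
# line sums the currents of its column in a single step.
rng = np.random.default_rng(0)
G = rng.integers(0, 2, size=(8, 4))   # 8x4 crossbar, binary conductances
x = rng.integers(0, 2, size=8)        # binary input on the word lines

bitline_sums = x @ G                  # analog current summation per column
print(bitline_sums)                   # integer dot products, no DAC needed
```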

29 citations

Proceedings ArticleDOI
Yuan Liang1, Hao Yu1, Junfeng Zhao2, Wei Yang2, Yuangang Wang2 
22 Jul 2015
TL;DR: An energy-efficient and low-crosstalk sub-THz (0.1–1 THz) I/O using surface-wave-based modulators and interconnects in CMOS; a high on/off-ratio surface-wave modulator is also proposed to support on-chip THz communication.
Abstract: Free-space EM-wave-based GHz interconnects suffer significant loss and crosstalk and thus cannot be deployed as low-power, dense I/Os for future network-on-chip (NoC) integration of many cores and memory. This paper proposes an energy-efficient and low-crosstalk sub-THz (0.1–1 THz) I/O using surface-wave-based modulators and interconnects in CMOS. By introducing a sub-wavelength periodic corrugation structure onto the transmission line, a surface wave is established that propagates the signal strongly localized on the surface of the top-layer metal wire, resulting in low coupling into the lossy substrate and neighboring metal wires. As such, significant power saving and crosstalk reduction can be observed at high communication bandwidth. In addition, a high on/off-ratio surface-wave modulator is proposed to support on-chip THz communication. Designed in 65 nm CMOS, the proposed surface-wave I/O interface achieves a 25 Gbps data rate and 0.016 pJ/bit/mm energy efficiency at a 140 GHz carrier frequency over 20 mm surface-wave channels. The channels can be placed with 2.4 µm spacing at a −20 dB crosstalk ratio. The surface-wave modulator also achieves a significant reduction of radiation loss with a 23 dB extinction ratio.
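A quick back-of-envelope check of the link budget implied by the quoted figures (0.016 pJ/bit/mm, 20 mm channels, 25 Gbps):

```python
# Arithmetic check using only the figures quoted in the abstract.
energy_per_bit_per_mm = 0.016e-12   # J/bit/mm
channel_length_mm = 20              # mm
data_rate = 25e9                    # bit/s

energy_per_bit = energy_per_bit_per_mm * channel_length_mm  # 0.32 pJ/bit
link_power = energy_per_bit * data_rate                     # 8 mW per channel
print(f"{energy_per_bit * 1e12:.2f} pJ/bit, {link_power * 1e3:.1f} mW")
```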

19 citations

Proceedings ArticleDOI
22 Jul 2015
TL;DR: The numerical experiments demonstrate that the proposed optimized Boolean embedding on the RRAM crossbar exhibits 10x faster speed, 17x better energy efficiency, and three orders of magnitude smaller area with a slight accuracy penalty, when compared to the optimized real-valued embedding on a CMOS ASIC platform.
Abstract: The emerging resistive random-access memory (RRAM) crossbar provides an intrinsic fabric for matrix-vector multiplication, which can be leveraged as power-efficient linear embedding hardware for data analytics such as compressive sensing. As the matrix elements are represented by the resistance of RRAM cells, the limited RRAM programming resolution imposes constraints on the embedding matrix. A random Boolean embedding can be efficiently mapped to the RRAM crossbar but suffers from poor performance. Learning-based embedding matrices can deliver optimized performance but are continuous-valued, which prevents them from being mapped to the RRAM crossbar structure directly. In this paper, we propose an algorithm that finds an optimal Boolean embedding matrix for a given learned real-valued embedding matrix, so that it can be effectively mapped to the RRAM crossbar structure while high performance is preserved. The numerical experiments demonstrate that the proposed optimized Boolean embedding reduces the embedding distortion by 2.7x and the image recovery error by 2.5x compared to the random Boolean embedding, both mapped on the RRAM crossbar. In addition, the optimized Boolean embedding on the RRAM crossbar exhibits 10x faster speed, 17x better energy efficiency, and three orders of magnitude smaller area with a slight accuracy penalty, when compared to the optimized real-valued embedding on a CMOS ASIC platform.
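To make the quantization constraint concrete, here is a hedged illustration, not the paper's algorithm, of fitting a scaled Boolean matrix alpha*B with B in {0,1} to a learned real-valued embedding A, alternating a per-entry threshold step with a closed-form rescale step:

```python
import numpy as np

# Illustrative alternating minimization of ||A - alpha*B||_F over
# alpha > 0 and B in {0,1}; each step is optimal given the other.
def boolean_fit(A, iters=20):
    alpha = A.max() if A.max() > 0 else 1.0
    for _ in range(iters):
        B = (A > alpha / 2).astype(float)      # per-entry optimal for fixed alpha
        if B.sum() > 0:
            alpha = (A * B).sum() / B.sum()    # least-squares optimal for fixed B
    return alpha, B

A = np.abs(np.random.default_rng(1).normal(size=(16, 8)))
alpha, B = boolean_fit(A)
print("relative distortion:",
      np.linalg.norm(A - alpha * B) / np.linalg.norm(A))
```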

11 citations

Proceedings ArticleDOI
12 Mar 2015
TL;DR: 3D die stacking is demonstrated, whereby disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.
Abstract: Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, the increasing power density is the obstacle to continued performance gains. Recent studies show that heterogeneous multi-core is a promising solution for optimizing performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies can be integrated on the same chip, such as CMOS logic and emerging non-volatile memory, enabling a new paradigm of architecture design.

10 citations


Cited by
Journal ArticleDOI
01 Jan 2021
TL;DR: This article defines the main figures of merit (FoMs) of analog RSM hardware and performs a comprehensive analysis covering basic device characteristics, hardware algorithms, the corresponding mapping methods for device arrays, and the architecture and circuit design considerations for neural networks.
Abstract: In this article, we review the existing analog resistive switching memory (RSM) devices and their hardware technologies for in-memory learning, as well as their challenges and prospects. Since the characteristics of the devices are different for in-memory learning and digital memory applications, it is important to have an in-depth understanding across different layers from devices and circuits to architectures and algorithms. First, based on a top-down view from architecture to devices for analog computing, we define the main figures of merit (FoMs) and perform a comprehensive analysis of analog RSM hardware including the basic device characteristics, hardware algorithms, and the corresponding mapping methods for device arrays, as well as the architecture and circuit design considerations for neural networks. Second, we classify the FoMs of analog RSM devices into two levels. Level 1 FoMs are essential for achieving the functionality of a system (e.g., linearity, symmetry, dynamic range, level numbers, fluctuation, variability, and yield). Level 2 FoMs are those that make a functional system more efficient and reliable (e.g., area, operational voltage, energy consumption, speed, endurance, retention, and compatibility with back-end-of-line processing). By constructing a device-to-application simulation framework, we perform an in-depth analysis of how these FoMs influence in-memory learning and give a target list of the device requirements. Lastly, we evaluate the main FoMs of most existing devices with analog characteristics and review optimization methods from programming schemes to materials and device structures. The key challenges and prospects from the device to system level for analog RSM devices are discussed.
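As a loose illustration of two of the Level 1 FoMs named above, the toy sketch below computes a nonlinearity and an asymmetry number from synthetic potentiation/depression conductance traces. The metric definitions here are simplified assumptions, not the article's exact formulations.

```python
import numpy as np

# Toy FoM extraction: nonlinearity = deviation of a conductance-update
# trace from the ideal straight line; asymmetry = mismatch between
# potentiation and depression step sizes (simplified definitions).
pulses = np.arange(64)
g_pot = 1 - np.exp(-pulses / 20)          # synthetic potentiation trace
g_dep = g_pot[-1] * np.exp(-pulses / 12)  # synthetic depression trace

ideal = np.linspace(g_pot[0], g_pot[-1], len(g_pot))
nonlinearity = np.max(np.abs(g_pot - ideal)) / (g_pot[-1] - g_pot[0])
asymmetry = np.mean(np.abs(np.diff(g_pot) + np.diff(g_dep)[::-1]))
print(f"nonlinearity={nonlinearity:.3f}, asymmetry={asymmetry:.4f}")
```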

110 citations

Journal ArticleDOI
24 Mar 2020
TL;DR: This article provides an overview of the major developments of RTM technology from both the physics and computer architecture perspectives over the past decade; by optimizing the memory subsystem, RTM could enable a new era of caches, graphics processing units, and high-capacity memory devices.
Abstract: Racetrack memory (RTM) is a novel spintronic memory-storage technology that has the potential to overcome fundamental constraints of existing memory and storage devices. Its core differentiating feature is the movement of the data, which are composed of magnetic domain walls (DWs), by short current pulses. This enables more data to be stored per unit area than any other current technology. On the one hand, RTM has the potential for mass data storage with unlimited endurance using considerably less energy than today's technologies. On the other hand, RTM promises an ultrafast nonvolatile memory competitive with static random access memory (SRAM) but with a much smaller footprint. During the last decade, the discovery of novel physical mechanisms to operate RTM has led to a major enhancement in the efficiency with which nanoscopic, chiral DWs can be manipulated. New materials and artificially atomically engineered thin-film structures have been found to increase the speed and lower the threshold current with which the data bits can be manipulated. With these recent developments, RTM has attracted the attention of the computer architecture community, which has evaluated the use of RTM at various levels in the memory stack. Recent studies advocate RTM as a promising compromise between power-hungry volatile memories and slow nonvolatile storage. By optimizing the memory subsystem, significant performance improvements can be achieved, enabling a new era of caches, graphics processing units, and high-capacity memory devices. In this article, we provide an overview of the major developments of RTM technology from both the physics and computer architecture perspectives over the past decade. We identify the remaining challenges and give an outlook on its future.
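The architectural trade-off described above, that bits must be shifted along the track to a fixed access port before they can be read, is easy to capture in a toy cost model. The track length and port placement below are assumed for illustration, not taken from a published RTM study:

```python
# Toy RTM access-cost model: data sits on a nanowire track and must be
# shifted under one of a few fixed access ports before it can be sensed,
# so access latency depends on the bit's position.
def shifts_to_read(ports, target):
    """Shift count to align bit `target` with its nearest access port."""
    return min(abs(target - p) for p in ports)

ports = [8, 24, 40, 56]                  # 4 ports on a 64-bit track
worst = max(shifts_to_read(ports, i) for i in range(64))
print("worst-case shifts:", worst)       # 8 with this spacing
```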

85 citations

Journal ArticleDOI
TL;DR: This paper presents practical case studies to demonstrate MRIMA's acceleration of binary-weight and low bit-width convolutional neural networks (CNNs) as well as data encryption, and shows ~77% and 21% lower energy consumption compared to a CMOS ASIC and a recent domain-wall-based design, respectively.
Abstract: In this paper, we propose MRIMA, a novel magnetic RAM (MRAM)-based in-memory accelerator for nonvolatile, flexible, and efficient in-memory computing. MRIMA transforms current spin-transfer torque magnetic random access memory (STT-MRAM) arrays into massively parallel computational units capable of working as both nonvolatile memory and in-memory logic. Instead of integrating complex logic units into cost-sensitive memory, MRIMA exploits hardware-friendly bit-line computing methods to implement complete Boolean logic functions between operands within a memory array in a single clock cycle, overcoming the multicycle logic issue in contemporary processing-in-memory (PIM) platforms. We present practical case studies to demonstrate MRIMA's acceleration of binary-weight and low bit-width convolutional neural networks (CNNs) as well as data encryption. Our device-to-architecture co-simulation results on CNN acceleration demonstrate that MRIMA can obtain 1.7x better energy efficiency and 11.2x speed-up compared to ASICs, and 1.8x better energy efficiency and 2.4x speed-up over the best DRAM-based PIM solutions. As an advanced encryption standard (AES) in-memory encryption engine, MRIMA shows ~77% and 21% lower energy consumption compared to a CMOS ASIC and a recent domain-wall-based design, respectively.
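A minimal functional sketch of the bit-line computing idea, assuming idealized sensing: activating two word lines at once lets each column's sense amplifier resolve a Boolean function of the two stored bits, so a whole-row AND/OR/XOR costs one array access. This models the behavior only, not MRIMA's actual circuit:

```python
import numpy as np

# Functional model of single-cycle bulk bitwise operations between two
# stored rows; in hardware, the sense threshold selects the function.
row_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
row_b = np.array([1, 1, 0, 1, 0, 1, 0, 0])

and_out = row_a & row_b   # e.g., high sense threshold
or_out  = row_a | row_b   # e.g., low sense threshold
xor_out = row_a ^ row_b   # derivable from the two reads above
print(and_out, or_out, xor_out, sep="\n")
```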

76 citations

Journal ArticleDOI
Leibin Ni, Hantao Huang, Zichuan Liu, Rajiv V. Joshi, Hao Yu
TL;DR: Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area when compared to the same design in a CMOS-based ASIC.
Abstract: The recently emerging resistive random-access memory (RRAM) can provide not only nonvolatile memory storage but also intrinsic computing for matrix-vector multiplication, which is ideal for a low-power, high-throughput data analytics accelerator performed in memory. However, existing RRAM crossbar-based computing mainly assumes multilevel analog computing, whose result is sensitive to process nonuniformity as well as the additional overhead from A/D conversion and I/O. In this article, we explore a matrix-vector multiplication accelerator on a binary RRAM crossbar with adaptive 1-bit-comparator-based parallel conversion. Moreover, a distributed in-memory computing architecture is developed with the corresponding control protocol. Both the memory array and the logic accelerator are implemented on the binary RRAM crossbar, where logic-memory pairs can be distributed with the control-bus protocol. Experimental results show that, compared to the analog RRAM crossbar, the proposed binary RRAM crossbar achieves significant area savings with better calculation accuracy. Moreover, significant speedup can be achieved for matrix-vector multiplication in neural network-based machine learning, such that both the overall training and testing times are reduced. In addition, large energy savings can also be achieved compared to the traditional CMOS-based out-of-memory computing architecture.
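The 1-bit-comparator readout can be modeled functionally: each bit line's summed current is compared against a reference in parallel, replacing a shared multi-bit ADC with one comparator per column. The threshold rule below is an assumption for illustration, not the paper's adaptive scheme:

```python
import numpy as np

# Functional model of parallel 1-bit-comparator conversion on a
# binary RRAM crossbar: one comparison per column, all at once.
rng = np.random.default_rng(2)
G = rng.integers(0, 2, size=(16, 8))      # binary crossbar
x = rng.integers(0, 2, size=16)           # binary input vector

currents = x @ G                          # per-column analog sums
reference = x.sum() / 2                   # assumed threshold for the sketch
bits = (currents >= reference).astype(int)
print(bits)                               # 1-bit output per bit line
```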

65 citations

Proceedings ArticleDOI
21 Jan 2019
TL;DR: A sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization, together with a crossbar-grained pruning algorithm to remove crossbars with low utilization.
Abstract: With its in-memory processing ability, ReRAM-based computing is becoming increasingly attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping for sparse NNs, and the whole dense matrix must be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision on sparse NNs and propose to perform high-precision composing in the analog domain, with the related peripheral circuits designed accordingly. In our experiments, we discuss how the system performs with different crossbar sizes in order to choose an optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3–5x energy efficiency and more than 2.5–6x speedup compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method has almost no accuracy loss.
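The crossbar-grained pruning step lends itself to a short functional sketch: tile the sparse weight matrix into crossbar-sized blocks, measure each tile's utilization (nonzero fraction), and drop tiles below a threshold so they never occupy a physical crossbar. The tile size and threshold here are assumed, and the paper's clustering-based mapping is not reproduced:

```python
import numpy as np

# Illustrative crossbar-grained pruning by per-tile utilization.
def prune_tiles(W, tile=4, min_util=0.25):
    kept = []
    for r in range(0, W.shape[0], tile):
        for c in range(0, W.shape[1], tile):
            block = W[r:r+tile, c:c+tile]
            util = np.count_nonzero(block) / block.size
            if util >= min_util:
                kept.append(((r, c), util))    # tile gets a physical crossbar
            else:
                W[r:r+tile, c:c+tile] = 0      # tile pruned entirely
    return kept

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 8)) * (rng.random((8, 8)) < 0.2)  # ~80% sparse
print(prune_tiles(W))
```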

64 citations