Author

J. Brandon Dwiel

Bio: J. Brandon Dwiel is an academic researcher from North Carolina State University. The author has contributed to research in the topics Computing with Memory and SIMD. The author has an h-index of 2 and has co-authored 2 publications receiving 7 citations.

Papers
Proceedings ArticleDOI
30 Nov 2015
TL;DR: 3D technologies offer significant potential to improve total performance and performance per unit of power, and the next frontier is to create sophisticated logic on logic solutions that promise further increases in performance/power beyond those attributable to memory interfaces alone.
Abstract: 3D technologies offer significant potential to improve total performance and performance per unit of power. After exploiting TSV technologies for cost reduction and increasing memory bandwidth, the next frontier is to create sophisticated logic on logic solutions that promise further increases in performance/power beyond those attributable to memory interfaces alone. These include heterogeneous integration for computing and exploitation of the high amounts of 3D interconnect available to reduce total interconnect power. Challenges include access for prototype quantities and the design of sophisticated static and dynamic thermal management methods and technologies, as well as test.

4 citations

Proceedings ArticleDOI
23 Nov 2015
TL;DR: The concept of Fast Thread Migration using 3DIC technologies is introduced, and the design of a power-optimized SIMD unit in which over half of the power is spent in the FP units is presented.
Abstract: 3DIC technology refers to stacking and interconnecting chips and substrates ("interposers") with Through Silicon Vias (TSVs). Industry is gearing up for widespread introduction of this technology at the 22 nm node. We have been pursuing a range of approaches to enable low-power computing. As well as 3DIC, these include heterogeneous computing, power-optimized SIMD units, optimized memory hierarchies, and MPI with post-silicon customized interconnect. Heterogeneous computing refers to the concept of building a mix of CPUs and memories that in turn enables in-situ tuning of the compute load to the compute resources. We introduce the concept of Fast Thread Migration using 3DIC technologies. We present the design of a power-optimized SIMD unit in which over half of the power is spent in the FP units. A parallel computer is built using an MPI paradigm. Codes are analyzed so that the MPI interconnect can be power-optimized post-silicon. Emerging 3D memories have the potential to be employed as Level 2 and Level 3 caches, and this is explored using the Tezzaron 3D memory. As scaling and power optimization occur, the main memory increasingly dominates power consumption. Possible extensions to Cortical Processing are discussed.

3 citations


Cited by
Proceedings ArticleDOI
28 May 2019
TL;DR: In this paper, a misalignment test structure was fabricated in a Wafer-to-Wafer (W2W) assembly configuration with pitches of 3.42 µm and 1.44 µm, using a very small measurement step (45 nm and 22 nm, respectively) for accurate misalignment measurement.
Abstract: Cu/oxide Hybrid Bonding (HB) technology is currently the ultimate fine-pitch 3D interconnect solution for reaching submicron pitches. It is an attractive technique for addressing the needs of several applications such as smart imagers, high-performance computing, and memory-on-logic folding. But test and characterization of such fine-grained 3D interconnects is still an open issue: Cu-Cu interconnects are prone to many structural defects arising from the fabrication process, such as misalignment, which need to be thoroughly tested to ensure the performance of 3D-ICs. In this work, we focus on testing and characterizing, on-wafer, misalignment defects induced at the bonding step. A misalignment test structure was fabricated in a Wafer-to-Wafer (W2W) assembly configuration with pitches of 3.42 µm and 1.44 µm, using a very small measurement step (45 nm and 22 nm, respectively) for accurate misalignment measurement. Electrical tests were performed on five multi-pitch wafers with 71 measurement points per wafer. The experimental results show that the proposed test structure's measurements align with conventional overlay measurements. Finally, the impact of misalignment defects on resistance and capacitance parameters was demonstrated.

15 citations

Book ChapterDOI
Ravi Mahajan, Bob Sankman
01 Jan 2017
TL;DR: The advantages and limitations of 3D architectures are discussed to provide context for why 3D stacking has become a key area of interest for product architects, why it has generated broad industry attention, and why its adoption has been tenuous.
Abstract: In this chapter, the advantages and limitations of 3D architectures are discussed to provide context for why 3D stacking has become a key area of interest for product architects, why it has generated broad industry attention, and why its adoption has been tenuous. The primary focus of this chapter is on 3D architectures that use Through Silicon Vias (TSVs), while other System In Package (SIP) architectures that do not rely on TSVs are discussed for completeness. The key elements of a TSV-based 3D architecture are described, followed by a description of the three methods of manufacturing wafers with TSVs (i.e., Via-First, Via-Middle, and Via-Last). An analysis of the different assembly process flows for 3D structures, broadly classified as (a) Wafer-to-Wafer (W2W), (b) Die-to-Wafer (D2W), and (c) Die-to-Die (D2D) assembly processes, is covered. Key design, assembly process, test process, and materials considerations for each of these flows are described. The chapter concludes with a discussion of current and anticipated challenges for 3D architectures.

11 citations

Book
07 Dec 2018
TL;DR: A 3D multi-layer CMOS-RRAM accelerator architecture for incremental machine learning is proposed, utilizing an incremental least-squares solver to perform fast learning on the neural network with significant speed-up and energy-efficiency improvement.
Abstract: The Internet of Things (IoT) is the networked interconnection of everyday objects to provide intelligent services and improve economic benefit. The potential of IoT and its ubiquitous computing is staggering, but limited by many technical challenges. One challenge is responding in real time to dynamic ambient change. Machine learning accelerators on IoT edge devices are one potential solution, since a centralized system suffers long back-end processing latency. However, IoT edge devices are resource-constrained and machine learning algorithms are computationally intensive. Therefore, optimized machine learning algorithms, such as compact machine learning with a small memory footprint on IoT devices, are greatly needed. In this thesis, we explore the development of fast and compact machine learning accelerators by developing least-squares, tensor, and distributed solvers. Moreover, applications of such machine learning solvers on IoT devices, such as energy management systems, are also investigated. From the fast-machine-learning perspective, the target is to perform fast learning on the neural network. This thesis proposes a least-squares solver for a single-hidden-layer neural network. Furthermore, this thesis explores CMOS FPGA-based and RRAM-based hardware accelerators. A 3D multi-layer CMOS-RRAM accelerator architecture for incremental machine learning is proposed. By utilizing an incremental least-squares solver, the whole training process can be mapped onto the 3D multi-layer CMOS-RRAM accelerator with significant speed-up and energy-efficiency improvement. Experimental results using the CIFAR-10 benchmark show that the proposed accelerator has 2.05× speed-up, 12.38× energy-saving and 1.28× area-saving compared to a 3D-CMOS-ASIC hardware implementation, and 14.94× speed-up, 447.17× energy-saving and around 164.38× area-saving compared to a CPU software implementation. Compared to a GPU implementation, our work shows 3.07× speed-up and 162.86× energy-saving. In addition, a CMOS-based FPGA realization of a neural network with square-root-free Cholesky factorization is also investigated for training and inference. Experimental results have shown that our proposed accelerator on a Xilinx Virtex-7 has comparable accuracy with an average speed-up of 4.56× and 89.05×,

4 citations
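The thesis abstract above describes training a single-hidden-layer network by solving the output weights in closed form with a square-root-free Cholesky (LDL^T) factorization. A minimal sketch of that idea, assuming an ELM-style network with a fixed random hidden layer; the function names, `tanh` activation, and regularization parameter are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def ldlt_solve(A, b):
    """Solve A x = b for symmetric positive-definite A via a
    square-root-free Cholesky (LDL^T) factorization."""
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        d[j] = A[j, j] - np.sum(L[j, :j] ** 2 * d[:j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j] * d[:j])) / d[j]
    z = np.linalg.solve(L, b)       # forward substitution: L z = b
    w = z / d                       # diagonal solve: D w = z
    return np.linalg.solve(L.T, w)  # back substitution: L^T x = w

def train_slfn(X, y, n_hidden=32, reg=1e-3, seed=0):
    """Fit the output weights of a single-hidden-layer network in
    closed form: random fixed hidden layer, least-squares readout."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)  # hidden-layer activations
    # Regularized normal equations: (H^T H + reg*I) beta = H^T y
    beta = ldlt_solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return lambda Xn: np.tanh(Xn @ W + b) @ beta
```

The LDL^T form avoids square roots entirely, which is one reason it is attractive for fixed-point hardware realizations like the FPGA accelerator the abstract mentions.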

Proceedings ArticleDOI
01 Nov 2016
TL;DR: A 3D multilayer CMOS-RRAM accelerator for incremental least-squares-based learning on a neural network is introduced; results show that such a 3D accelerator can significantly reduce training time with acceptable accuracy.
Abstract: Incremental machine learning is required for future real-time data analytics. This paper introduces a 3D multilayer CMOS-RRAM accelerator for incremental least-squares-based learning on a neural network. Given buffered input data held in an RRAM memory layer, intensive matrix-vector multiplication is first accelerated on a digitized RRAM-crossbar layer. The remaining incremental least-squares operations for feature extraction and classifier training are accelerated on a CMOS ASIC layer, using an incremental Cholesky factorization accelerator designed with parallelism and pipelining in mind. Experimental results have shown that such a 3D accelerator can significantly reduce training time with acceptable accuracy. Compared to a 3D-CMOS-ASIC implementation, it achieves 1.28× smaller area, 2.05× faster runtime and 12.4× energy reduction. Compared to a GPU implementation, our work shows 3.07× speed-up and 162.86× energy-saving.

1 citation
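The paper above offloads matrix-vector multiplication to a digitized RRAM-crossbar layer, where voltages applied to the rows produce column currents that each sum one dot product. A rough software sketch of that behavior, assuming a simple uniform quantization of weights to discrete conductance levels; the level count and quantization scheme are illustrative assumptions, not the paper's:

```python
import numpy as np

def crossbar_matvec(W, x, n_levels=16):
    """Simulate a digitized RRAM-crossbar matrix-vector multiply.

    Weights are mapped to a finite set of conductance levels; applying
    input 'voltages' x to the rows and summing the per-cell 'currents'
    down each column yields one dot product per bitline.
    """
    w_max = np.abs(W).max()
    step = 2 * w_max / (n_levels - 1)
    # Snap each weight to the nearest of n_levels conductance levels.
    G = np.round((W + w_max) / step) * step - w_max
    # Column currents: I_j = sum_i V_i * G_ij (Kirchhoff's current law).
    return x @ G
```

The quantization error in each output is bounded by `sum(|x|) * step / 2`, so with enough conductance levels the analog result tracks the exact product closely, which is consistent with the paper's "acceptable accuracy" observation.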