Author

Jinwook Oh

Other affiliations: KAIST
Bio: Jinwook Oh is an academic researcher from IBM. The author has contributed to research in topics including Cognitive neuroscience of visual object recognition and Network on a chip. The author has an h-index of 14 and has co-authored 46 publications receiving 697 citations. Previous affiliations of Jinwook Oh include KAIST.

Papers
Journal ArticleDOI
22 Dec 2009
TL;DR: In the proposed hardware architecture, three recognition tasks (visual perception, descriptor generation, and object decision) are directly mapped to the neural perception engine, 16 SIMD processors comprising 128 processing elements, and a decision processor, and are executed in a pipeline to maximize the throughput of object recognition.
Abstract: A 201.4 GOPS real-time multi-object recognition processor is presented with a three-stage pipelined architecture. A visual-perception-based multi-object recognition algorithm is applied to give multiple attentions to multiple objects in the input image. For human-like multi-object perception, a neural perception engine is proposed with biologically inspired neural networks and fuzzy logic circuits. In the proposed hardware architecture, three recognition tasks (visual perception, descriptor generation, and object decision) are directly mapped to the neural perception engine, 16 SIMD processors comprising 128 processing elements, and a decision processor, respectively, and are executed in a pipeline to maximize the throughput of object recognition. For efficient task pipelining, the proposed task/power manager balances the execution times of the three stages based on intelligent workload estimations. In addition, a 118.4 GB/s multi-casting network-on-chip is proposed as the communication architecture, incorporating 21 IP blocks in total. For low-power object recognition, workload-aware dynamic power management is performed at the chip level. The 49 mm2 chip is fabricated in a 0.13 µm 8-metal CMOS process and contains 3.7 M gates and 396 KB of on-chip SRAM. It achieves 60 frame/s multi-object recognition of up to 10 different objects for VGA (640 × 480) video input while dissipating 496 mW at 1.2 V. The obtained 8.2 mJ/frame energy efficiency is 3.2 times higher than that of the state-of-the-art recognition processor.
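A quick back-of-the-envelope check of the reported energy efficiency, assuming energy per frame is simply average power divided by frame rate (an assumption; the paper may derive the figure differently):

```python
# Energy-per-frame sanity check for the recognition processor.
# Assumes energy/frame = average power / frame rate, which is an
# approximation of how the 8.2 mJ/frame figure is typically derived.

power_w = 0.496        # 496 mW at 1.2 V
frame_rate = 60        # frames per second (VGA input)

energy_per_frame_mj = power_w / frame_rate * 1e3
print(f"{energy_per_frame_mj:.2f} mJ/frame")  # ~8.27 mJ/frame, consistent with the reported 8.2 mJ/frame
```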

126 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers by employing a dataflow architecture and an on-chip scratchpad hierarchy.
Abstract: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.
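The peak numbers imply a precision-dependent throughput scaling; a minimal sketch of that arithmetic, assuming each MAC counts as two operations (an assumption, since the abstract does not state the op-counting convention):

```python
# Peak-throughput arithmetic for the AI core prototype at 1.5 GHz.
# Assumes 1 MAC = 2 ops; the exact op-counting convention is an assumption.

freq_ghz = 1.5
peak = {"fp16": 1.5e3, "ternary": 12e3, "binary": 24e3}  # GOPS (1 TFLOPS/TOPS = 1e3 GOPS)

for fmt, gops in peak.items():
    ops_per_cycle = gops / freq_ghz       # operations retired per clock
    macs_per_cycle = ops_per_cycle / 2    # under the 2-ops-per-MAC assumption
    print(f"{fmt:8s}: {ops_per_cycle:6.0f} ops/cycle, {macs_per_cycle:5.0f} MACs/cycle, "
          f"{gops / peak['fp16']:.0f}x fp16")
```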

103 citations

Journal ArticleDOI
Seungjin Lee, Jinwook Oh, Jun-Young Park, Joonsoo Kwon, Minsu Kim, Hoi-Jun Yoo
14 Oct 2010
TL;DR: A heterogeneous many-core processor is presented that realizes the UVAM algorithm, which incorporates the familiarity map on top of the saliency map for the search of attentive points, to achieve fast and robust object recognition of cluttered video sequences.
Abstract: Fast and robust object recognition of cluttered scenes presents two main challenges: (1) the large number of features to process requires high computational power, and (2) false matches from background clutter can degrade recognition accuracy. Previously, saliency-based bottom-up visual attention [1,2] increased recognition speed by confining the recognition processing to the salient regions only. But these schemes had an inherent problem: the accuracy of the attention itself. If attention is paid to a false region, which is common when saliency cannot distinguish between clutter and objects, recognition accuracy is degraded. In order to improve the attention accuracy, we previously reported an algorithm, the Unified Visual Attention Model (UVAM) [3], which incorporates a familiarity map on top of the saliency map for the search of attentive points. It can cross-check the accuracy of attention deployment by combining top-down attention, which searches for “meaningful objects”, with bottom-up attention, which simply looks for conspicuous points. This paper presents a heterogeneous many-core (note: we use the term “many-core” instead of “multi-core” to emphasize the large number of cores) processor that realizes the UVAM algorithm to achieve fast and robust object recognition of cluttered video sequences.
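A minimal NumPy sketch of the idea of cross-checking bottom-up and top-down attention; the element-wise product used here to fuse the saliency and familiarity maps is an illustrative assumption, not the UVAM formulation itself:

```python
import numpy as np

# Toy sketch: fuse a bottom-up saliency map with a top-down familiarity map
# and attend to the peak of the combined map. The multiplicative fusion and
# the map sizes are illustrative assumptions, not the actual UVAM equations.

rng = np.random.default_rng(0)
saliency = rng.random((30, 40))      # bottom-up: conspicuous points
familiarity = rng.random((30, 40))   # top-down: "meaningful object" likelihood

combined = saliency * familiarity    # regions must be both salient and familiar
attention_point = np.unravel_index(np.argmax(combined), combined.shape)
print("attend to (row, col):", attention_point)
```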

86 citations

Proceedings ArticleDOI
01 Oct 2016
TL;DR: It is shown that hot loops in the applications can be perforated by an average of 50% with a proportional reduction in execution time, while still producing acceptable quality of results, and that the benefits compound when these techniques are applied concurrently.
Abstract: Approximate computing is gaining traction as a computing paradigm for data analytics and cognitive applications that aim to extract deep insight from vast quantities of data. In this paper, we demonstrate that multiple approximation techniques can be applied to applications in these domains and can be further combined to compound their benefits. In assessing the potential of approximation in these applications, we took the liberty of changing multiple layers of the system stack: architecture, programming model, and algorithms. Across a set of applications spanning the domains of DSP, robotics, and machine learning, we show that hot loops in the applications can be perforated by an average of 50% with a proportional reduction in execution time, while still producing acceptable quality of results. In addition, the width of the data used in the computation can be reduced to 10–16 bits from the currently common 32/64 bits, with potential for significant performance and energy benefits. For parallel applications we reduced execution time by 50% using relaxed synchronization mechanisms. Finally, our results also demonstrate that the benefits compound when these techniques are applied concurrently. Our results across different applications demonstrate that approximate computing is a widely applicable paradigm with potential for compounded benefits from applying multiple techniques across the system stack. In order to exploit these benefits, it is essential to re-think multiple layers of the system stack to embrace approximation from the ground up and to design tightly integrated approximate accelerators. Doing so will enable moving applications into a world in which the architecture, programming model, and even the algorithms used to implement the application are all fundamentally designed for approximate computing.
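A minimal sketch of loop perforation, the first technique mentioned above: executing only a fraction of a hot loop's iterations and accepting an approximate result. The reduction used as the workload here is hypothetical, chosen only to make the 50% perforation rate concrete:

```python
# Loop perforation sketch: execute only a fraction of a hot loop's iterations.
# The mean computation below is a hypothetical workload; real perforation is
# applied by the compiler/runtime to application hot loops.

def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, rate=0.5):
    step = max(1, int(round(1 / rate)))   # rate=0.5 -> execute every 2nd iteration
    sampled = xs[::step]
    return sum(sampled) / len(sampled)

data = [float(i % 97) for i in range(100_000)]
print("exact:     ", mean_exact(data))
print("perforated:", mean_perforated(data, rate=0.5))  # ~50% less work, approximate answer
```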

72 citations

Proceedings ArticleDOI
13 Feb 2021
TL;DR: In this article, a 4-core AI chip in 7nm EUV technology is presented that exploits cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference to achieve leading-edge power-performance.
Abstract: Low-precision computation is the key enabling factor to achieve high compute densities (TOPS/W and TOPS/mm2) in AI hardware accelerators across cloud and edge platforms. However, robust deep learning (DL) model accuracy equivalent to high-precision computation must be maintained. Improvements in bandwidth, architecture, and power management are also required to harness the benefit of reduced precision by feeding and supporting more parallel engines to achieve high sustained utilization and optimize performance within a given product power envelope. In this work, we present a 4-core AI chip in 7nm EUV technology that exploits cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference [1, 2] and aggressive circuit/architecture optimization to achieve leading-edge power-performance. The chip supports fp16 (DLFloat16 [8]) and hybrid-fp8 (hfp8) [1] formats for training and inference of DL models, as well as int4 and int2 formats for highly scaled inference.
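A minimal sketch of the kind of scaled-integer quantization that int4 inference relies on; the symmetric per-tensor scheme shown here is an assumption for illustration, not the chip's actual quantizer:

```python
import numpy as np

# Symmetric per-tensor int4 quantization sketch. int4 covers [-8, 7];
# the scaling scheme here is an illustrative assumption, not the chip's method.

def quantize_int4(x):
    scale = np.max(np.abs(x)) / 7.0                       # map the largest magnitude to +/-7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=8).astype(np.float32)
q, s = quantize_int4(w)
print("weights:   ", np.round(w, 3))
print("int4 codes:", q)
print("recovered: ", np.round(dequantize(q, s), 3))
```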

56 citations


Cited by
01 Jan 2004
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance, and it describes numerous important application areas such as image-based rendering and digital libraries.
Abstract: From the Publisher: The accessible presentation of this book gives both a general view of the entire computer vision enterprise and sufficient detail to build useful applications. Users learn techniques that have proven useful through first-hand experience and a wide range of mathematical methods. A CD-ROM included with every copy of the text contains source code for programming practice, color images, and illustrative movies. Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance. Topics are discussed in substantial and increasing depth. Application surveys describe numerous important application areas such as image-based rendering and digital libraries. Many important algorithms are broken down and illustrated in pseudocode. Appropriate for use by engineers as a comprehensive reference to the computer vision enterprise.

3,627 citations

Proceedings ArticleDOI
24 Feb 2014
TL;DR: This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Abstract: Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (of key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
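The reported figures translate directly into efficiency numbers; a quick sketch of that arithmetic, assuming the 452 GOP/s peak and the quoted power are sustained simultaneously (an assumption made only for a back-of-the-envelope comparison):

```python
# Throughput/efficiency arithmetic from the reported accelerator figures.
# Assumes the 452 GOP/s peak and 485 mW are sustained together, which is an
# approximation for a back-of-the-envelope comparison.

throughput_gops = 452.0
power_w = 0.485
area_mm2 = 3.02

print(f"{throughput_gops / power_w:.0f} GOP/s per W")      # ~932 GOP/s/W
print(f"{throughput_gops / area_mm2:.0f} GOP/s per mm^2")  # ~150 GOP/s/mm^2
```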

1,582 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges of future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895× across the evaluated machine learning benchmarks.
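A minimal NumPy sketch of why a ReRAM crossbar maps naturally to matrix-vector multiplication: with weights stored as cell conductances and inputs applied as row voltages, each column current is a dot product. This is an idealized model that ignores device non-idealities, and the array sizes are illustrative:

```python
import numpy as np

# Idealized ReRAM crossbar model: output column currents are I = G^T @ V,
# i.e. an analog matrix-vector product. Device non-idealities (wire resistance,
# limited conductance levels, noise) are ignored in this sketch.

rng = np.random.default_rng(2)
G = rng.uniform(1e-6, 1e-4, size=(128, 64))   # cell conductances (S), rows x columns
V = rng.uniform(0.0, 0.2, size=128)           # input voltages applied to the 128 rows

I = G.T @ V                                   # 64 column currents, one dot product each
print("column currents (first 4):", I[:4])
```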

1,197 citations

Proceedings ArticleDOI
06 Oct 2014
TL;DR: The design and implementation of a distributed system called Adam, comprised of commodity server machines, that trains large deep neural network models and exhibits world-class performance, scaling, and task accuracy on visual recognition tasks, also showing that task accuracy improves with larger models.
Abstract: Large deep neural network models have recently demonstrated state-of-the-art accuracy on hard visual recognition tasks. Unfortunately, such models are extremely time-consuming to train and require a large number of compute cycles. We describe the design and implementation of a distributed system called Adam, comprised of commodity server machines, that trains such models and exhibits world-class performance, scaling, and task accuracy on visual recognition tasks. Adam achieves high efficiency and scalability through whole-system co-design that optimizes and balances workload computation and communication. We exploit asynchrony throughout the system to improve performance and show that it additionally improves the accuracy of trained models. Adam is significantly more efficient and scalable than was previously thought possible: it used 30x fewer machines to train a large 2-billion-connection model to 2x higher accuracy in comparable time on the ImageNet 22,000-category image classification task than the system that previously held the record for this benchmark. We also show that task accuracy improves with larger models. Our results provide compelling evidence that a distributed systems-driven approach to deep learning using current training algorithms is worth pursuing.
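A minimal sketch of the general idea of asynchronous, lock-free parameter updates that the abstract alludes to; this illustrates asynchrony on a toy quadratic objective and is not the Adam system's implementation (the model, objective, and thread counts are illustrative assumptions):

```python
import threading
import numpy as np

# Sketch of asynchronous parameter updates: workers apply gradient steps to a
# shared parameter vector without barriers or locks. This illustrates the kind
# of asynchrony Adam exploits; it is not the Adam system's implementation.

params = np.zeros(4)                      # shared model parameters
target = np.array([1.0, -2.0, 0.5, 3.0])  # toy optimum
lr = 0.01

def worker(steps):
    for _ in range(steps):
        grad = params - target            # gradient of 0.5 * ||params - target||^2
        params[:] = params - lr * grad    # in-place, unsynchronized update

threads = [threading.Thread(target=worker, args=(2000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("learned parameters:", np.round(params, 3))  # close to target despite races
```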

737 citations

Posted Content
TL;DR: An exhaustive review of the research conducted in neuromorphic computing since the inception of the term is provided to motivate further work by illuminating gaps in the field where new research is needed.
Abstract: Neuromorphic computing has come to refer to a variety of brain-inspired computers, devices, and models that contrast with the pervasive von Neumann computer architecture. This biologically inspired approach has created highly connected synthetic neurons and synapses that can be used to model neuroscience theories as well as solve challenging machine learning problems. The promise of the technology is to create a brain-like ability to learn and adapt, but the technical challenges are significant, starting with an accurate neuroscience model of how the brain works, to finding materials and engineering breakthroughs to build devices to support these models, to creating a programming framework so the systems can learn, to creating applications with brain-like capabilities. In this work, we provide a comprehensive survey of the research and motivations for neuromorphic computing over its history. We begin with a 35-year review of the motivations and drivers of neuromorphic computing, then look at the major research areas of the field, which we define as neuro-inspired models, algorithms and learning approaches, hardware and devices, supporting systems, and finally applications. We conclude with a broad discussion on the major research topics that need to be addressed in the coming years to see the promise of neuromorphic computing fulfilled. The goals of this work are to provide an exhaustive review of the research conducted in neuromorphic computing since the inception of the term, and to motivate further work by illuminating gaps in the field where new research is needed.

570 citations