Author

Larkhoon Leem

Other affiliations: Stanford University
Bio: Larkhoon Leem is an academic researcher from Intel. The author has contributed to research in the topics Spintronics and NAND gate. The author has an h-index of 10 and has co-authored 16 publications receiving 1091 citations. Previous affiliations of Larkhoon Leem include Stanford University.

Papers
Proceedings ArticleDOI
11 Nov 2006
TL;DR: This work has implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrates efficient performance running Sequoia programs on both of these platforms.
Abstract: We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.
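As a rough illustration of the idea in this abstract (not Sequoia code, which is its own language), the hypothetical Python sketch below blocks a matrix multiply recursively so that each level of the task tree targets a smaller assumed "memory level"; the level capacities and the NumPy leaf kernel are assumptions chosen only for illustration.

```python
# Hypothetical illustration of Sequoia's idea: map a recursive task tree
# onto a memory hierarchy by blocking until the working set fits a level.
# The level capacities below are made-up numbers, not real machine values.
import numpy as np

LEVEL_CAPACITY = [2**20, 2**14]  # assumed element budgets: "main memory", "local store"

def matmul_hier(A, B, C, level=0):
    """Multiply A @ B into C, recursively blocking for each memory level."""
    n = A.shape[0]
    working_set = 3 * n * n  # elements touched by this task
    if level == len(LEVEL_CAPACITY) - 1 or working_set <= LEVEL_CAPACITY[level + 1]:
        # Leaf task: the working set fits the innermost level; run the kernel.
        C += A @ B
        return
    h = n // 2
    # Inner tasks: each child works on quadrant blocks, i.e. a smaller
    # working set "copied down" one level of the hierarchy.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_hier(A[i:i+h, k:k+h], B[k:k+h, j:j+h],
                            C[i:i+h, j:j+h], level + 1)

if __name__ == "__main__":
    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = np.zeros((n, n))
    matmul_hier(A, B, C)
    print("max error vs. np.dot:", np.max(np.abs(C - A @ B)))
```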

482 citations

Proceedings ArticleDOI
Larkhoon Leem, Hyungmin Cho, Jason Bau, Quinn Jacobson, Subhasish Mitra
08 Mar 2010
TL;DR: Error Resilient System Architecture (ERSA) is presented: a low-cost, robust system architecture for emerging killer probabilistic applications such as Recognition, Mining and Synthesis (RMS), which may also be adapted for general-purpose applications that are less resilient to errors.
Abstract: There is a growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a low-cost robust system architecture for emerging killer probabilistic applications such as Recognition, Mining and Synthesis (RMS) applications. While resilience of such applications to errors in low-order bits of data is well-known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high error resilience to high-order bit errors and control errors (in addition to low-order bit errors) using a judicious combination of 3 key ideas: (1) asymmetric reliability in many-core architectures, (2) error-resilient algorithms at the core of probabilistic applications, and (3) intelligent software optimizations. Error injection experiments on a multi-core ERSA hardware prototype demonstrate that, even at very high error rates of 20,000 errors/second/core or 2×10⁻⁴ error/cycle/core (with errors injected in architecturally-visible registers), ERSA maintains 90% or better accuracy of output results, together with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding and Bayesian networks. Moreover, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges such as Vccmin problems and erratic bit errors. Using the concept of configurable reliability, ERSA platforms may also be adapted for general-purpose applications that are less resilient to errors (but at higher costs).
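As a hedged illustration of the style of experiment described above, the hypothetical Python sketch below injects random bit flips into K-Means centroid values and applies a simple software plausibility check of the kind an error-resilient algorithm can afford; the injection rate, data set, and the specific check are assumptions for illustration and do not reproduce the ERSA prototype.

```python
# Hypothetical sketch of error injection into an error-tolerant kernel.
# Rates, data, and the sanity check are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def flip_random_bit(x):
    """Flip one random bit in a float64 value and return the result."""
    bits = np.float64(x).view(np.uint64)
    bits ^= np.uint64(1) << np.uint64(rng.integers(0, 64))
    return bits.view(np.float64)

def kmeans(data, k, iters=20, error_prob=0.0, lo=-10.0, hi=10.0):
    centroids = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        # Assignment step.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step, with optional fault injection into centroid words.
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
            for d in range(data.shape[1]):
                if rng.random() < error_prob:
                    corrupted = flip_random_bit(centroids[c, d])
                    # Software sanity check of the kind an error-resilient
                    # algorithm can use: keep only plausible values; discard
                    # non-finite results or values far outside the data range.
                    if np.isfinite(corrupted) and lo <= corrupted <= hi:
                        centroids[c, d] = corrupted
    return centroids

if __name__ == "__main__":
    data = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (-3, 0, 3)])
    clean = kmeans(data, k=3)
    faulty = kmeans(data, k=3, error_prob=0.05)
    print("clean centroids:\n", clean[clean[:, 0].argsort()])
    print("centroids with injected faults + checks:\n", faulty[faulty[:, 0].argsort()])
```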

246 citations

Journal ArticleDOI
TL;DR: Error Resilient System Architecture (ERSA) is presented: a low-cost, robust system architecture for emerging killer probabilistic applications such as Recognition, Mining and Synthesis (RMS), which may also be adapted for general-purpose applications that are less resilient to errors.
Abstract: There is a growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a robust system architecture which targets emerging killer applications such as recognition, mining, and synthesis (RMS) with inherent error resilience, and ensures high degrees of resilience at low cost. Using the concept of configurable reliability, ERSA may also be adapted for general-purpose applications that are less resilient to errors (but at higher costs). While resilience of RMS applications to errors in low-order bits of data is well-known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high error resilience to high-order bit errors and control flow errors (in addition to low-order bit errors) using a judicious combination of the following key ideas: 1) asymmetric reliability in many-core architectures; 2) error-resilient algorithms at the core of probabilistic applications; and 3) intelligent software optimizations. Error injection experiments on a multicore ERSA hardware prototype demonstrate that, even at very high error rates of 20 errors/flip-flop/10⁸ cycles (equivalent to 25000 errors/core/s), ERSA maintains 90% or better accuracy of output results, together with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding, and Bayesian network inference. In addition, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges related to SRAM Vccmin problems and erratic bit errors.
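The two error-rate figures in this journal version (20 errors/flip-flop/10⁸ cycles and roughly 25000 errors/core/s) are linked by the clock rate and the number of error-prone flip-flops per core, neither of which is stated on this page. The conference version's figures above imply a clock near 10⁸ cycles/s; the short Python check below works out, under that assumption, the flip-flop count that would make the two numbers consistent. Both derived values are back-of-the-envelope illustrations, not figures from the paper.

```python
# Back-of-the-envelope link between the two quoted error rates. The clock
# rate is inferred from the conference version's figures; the flip-flop
# count is whatever makes the two numbers consistent. Both are assumptions.
errors_per_ff_per_cycle = 20 / 1e8          # 20 errors/flip-flop/10^8 cycles
clock_hz = 20_000 / 2e-4                    # conference figures imply ~10^8 cycles/s
errors_per_core_per_sec = 25_000            # journal figure

flip_flops_per_core = errors_per_core_per_sec / (errors_per_ff_per_cycle * clock_hz)
print(f"implied error-prone flip-flops per core: {flip_flops_per_core:,.0f}")  # ~1,250
```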

199 citations

Patent
Larkhoon Leem, Xin Guo, Ravi H. Motwani, Rosanna Yee, Scott E. Nelson
27 Sep 2013
TL;DR: In this paper, an apparatus, system, and method are presented for performing an error recovery operation on a read of a block of memory cells in a storage device: each decoding iteration combines the current read values with at least one value saved during the previous iteration to form symbols, from which bit reliability metrics are determined and decoded.
Abstract: Provided are an apparatus, system, and method for performing an error recovery operation with respect to a read of a block of memory cells in a storage device. A current iteration of a decoding operation is performed by applying at least one reference voltage for the current iteration to a block of the memory cells in the storage device to determine current read values in response to applying the reference voltage. A symbol is generated for each of the read memory cells by combining the determined current read value with at least one value saved during the previous iteration. The symbols are used to determine bit reliability metrics for the block of memory cells. The bit reliability metrics are decoded. In response to the decoding failing, an additional iteration of the decoding operation is performed.
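The claim describes an iterative read-retry flow: each pass senses the cells at a new reference voltage, combines the new hard read with information saved from earlier passes into a symbol, maps symbols to bit-reliability metrics, and retries decoding. The Python sketch below is a hypothetical toy of that flow for a single-level-cell-like model; the threshold-voltage model, the symbol-to-reliability mapping, and the stand-in repetition "decoder" are all illustrative assumptions, not the patented method.

```python
# Hypothetical toy of iterative soft-read error recovery on an SLC-like cell.
# Voltage model, reliability mapping, and the stand-in decoder are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def sense(vth, vref):
    """Hard read: a cell reads as 1 if its threshold voltage is below vref."""
    return (vth < vref).astype(np.uint8)

def symbols_to_llr(symbols, n_reads):
    """Map the count of 1-reads across passes to a crude reliability metric."""
    return 2.0 * symbols - n_reads      # >0 leans toward 1, magnitude = confidence

def try_decode(llr, n_data_bits):
    """Stand-in decoder: a 3x repetition code per stored bit."""
    per_bit = llr.reshape(n_data_bits, 3).sum(axis=1)
    decoded = (per_bit > 0).astype(np.uint8)
    confident = np.all(np.abs(per_bit) >= 2)   # fail if any bit is still marginal
    return decoded, confident

# Store each data bit 3 times; model cell threshold voltages with noise.
data = rng.integers(0, 2, size=16)
stored = np.repeat(data, 3)
vth = np.where(stored == 1, 1.0, 3.0) + rng.normal(0, 0.8, size=stored.size)

vrefs = [2.0, 1.6, 2.4, 1.2, 2.8]     # reference voltages tried per iteration
symbols = np.zeros(stored.size)       # accumulated information from previous passes
for it, vref in enumerate(vrefs, start=1):
    symbols += sense(vth, vref)       # combine current read with saved values
    llr = symbols_to_llr(symbols, it)
    decoded, ok = try_decode(llr, data.size)
    print(f"pass {it}: decode {'succeeded' if ok else 'failed'}")
    if ok:
        break
print("bit errors after recovery:", int(np.sum(decoded != data)))
```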

59 citations

Journal ArticleDOI
TL;DR: The magnetic coupled spin-torque device (MCSTD), as discussed by the authors, is a new spintronics device architecture that is free from the difficulties of spin transport and spin detection; it uses the spin-torque transfer technique and magnetic coupling to modulate its energy barrier.
Abstract: The magnetic coupled spin-torque device (MCSTD) is a new spintronics device architecture that is free from the difficulties of spin transport and spin detection. It uses the spin-torque transfer technique and magnetic coupling to modulate its energy barrier. Requirements for a logic device (fast switching speed, inherent gain, and inversion capability) and spin interconnection techniques for the MCSTD are presented. Logic functionalities such as NAND, NOR, and NOT are successfully demonstrated in micromagnetic simulations.
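The NAND/NOR/NOT functionality reported here comes from micromagnetic simulations of coupled nanomagnets, which are beyond a short example, but the logic-level behaviour of a coupled-input magnetic gate can be caricatured with a simple threshold model: each magnet's state is ±1, the output follows the sign of the weighted input couplings plus a bias, and the sign of the coupling determines whether the gate inverts. Everything in the Python sketch below (weights, bias values) is a toy assumption, not the MCSTD device physics.

```python
# Toy threshold model of coupled-magnet logic; magnet states are +1/-1.
# Coupling weights and bias terms are illustrative assumptions only.
def magnet_gate(a, b, coupling, bias):
    """Output magnet settles to the sign of its net (coupling + bias) field."""
    field = coupling * (a + b) + bias
    return 1 if field > 0 else -1

GATES = {
    "NAND": dict(coupling=-1, bias=+1),   # inverting coupling, positive bias
    "NOR":  dict(coupling=-1, bias=-1),   # inverting coupling, negative bias
    "AND":  dict(coupling=+1, bias=-1),
    "OR":   dict(coupling=+1, bias=+1),
}

if __name__ == "__main__":
    to_bit = lambda s: (s + 1) // 2       # map -1/+1 back to 0/1 for printing
    for name, params in GATES.items():
        rows = [(to_bit(a), to_bit(b), to_bit(magnet_gate(a, b, **params)))
                for a in (-1, 1) for b in (-1, 1)]
        print(name, rows)
```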

28 citations


Cited by
01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently: those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and an 1840-node Intel Paragon performs up to 165 times faster than a single Cray C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.
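Of the three decompositions compared above, the spatial one assigns each processor a fixed region of the simulation box, so only atoms near region boundaries need to be exchanged. The Python sketch below is a hypothetical, serial illustration of that assignment and of counting which atoms would need to be communicated as "ghosts" for a given cutoff; the box size, cutoff, and 1-D strip decomposition are assumptions for illustration, not the paper's implementations.

```python
# Serial illustration of spatial decomposition for short-range MD:
# assign atoms to per-processor strips and count ghost atoms to exchange.
# Box size, cutoff, and strip layout are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
box = 10.0          # periodic box edge length (assumed)
cutoff = 1.0        # short-range interaction cutoff (assumed)
nprocs = 4          # processors arranged as strips along x (assumed)
atoms = rng.uniform(0.0, box, size=(2000, 3))

strip_w = box / nprocs
owner = (atoms[:, 0] // strip_w).astype(int)        # which strip owns each atom

for p in range(nprocs):
    mine = atoms[owner == p]
    lo, hi = p * strip_w, (p + 1) * strip_w
    # Ghost atoms: owned atoms within `cutoff` of either strip face must be
    # sent to the neighboring processor so it can compute boundary forces.
    send_left = np.sum(mine[:, 0] - lo < cutoff)
    send_right = np.sum(hi - mine[:, 0] < cutoff)
    print(f"proc {p}: owns {len(mine):4d} atoms, "
          f"ghosts to left {send_left}, to right {send_right}")
```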

29,323 citations

Proceedings ArticleDOI
16 Jun 2013
TL;DR: A systematic model of the tradeoff space fundamental to stencil pipelines is presented, along with a schedule representation that describes concrete points in this space for each stage of an image processing pipeline and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
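The tradeoff the paper models, between computing a whole intermediate stage (good parallelism and reuse, poor locality) and recomputing it per output tile (good locality, redundant work), can be seen even in a toy two-stage blur. The NumPy sketch below is a hypothetical illustration of those two schedules, not Halide code; the tile size and 3-point blur are assumptions.

```python
# Two schedules for a 1-D two-stage blur: "breadth-first" (materialize the
# whole intermediate) vs. "tiled/fused" (recompute the intermediate per tile
# with overlapping edges). Tile size and stencil are illustrative assumptions.
import numpy as np

def blur3(x):
    """3-point box blur; shrinks the array by 2."""
    return (x[:-2] + x[1:-1] + x[2:]) / 3.0

def breadth_first(inp):
    tmp = blur3(inp)          # entire intermediate stored: poor locality,
    return blur3(tmp)         # but each tmp value is computed exactly once

def tiled_fused(inp, tile=64):
    n_out = len(inp) - 4      # two blur stages each shrink the array by 2
    out = np.empty(n_out)
    for start in range(0, n_out, tile):
        stop = min(start + tile, n_out)
        # Each tile recomputes a slightly enlarged slice of the intermediate
        # (2 elements of overlap): redundant work, but the working set stays
        # small and cache-resident.
        tmp = blur3(inp[start:stop + 4])
        out[start:stop] = blur3(tmp)
    return out

if __name__ == "__main__":
    x = np.random.rand(1 << 16)
    a, b = breadth_first(x), tiled_fused(x)
    print("schedules agree:", np.allclose(a, b))
```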

1,074 citations

Proceedings ArticleDOI
27 May 2013
TL;DR: This paper reviews recent progress in the area, including design of approximate arithmetic blocks, pertinent error and quality measures, and algorithm-level techniques for approximate computing.
Abstract: Approximate computing has recently emerged as a promising approach to energy-efficient design of digital systems. Approximate computing relies on the ability of many systems and applications to tolerate some loss of quality or optimality in the computed result. By relaxing the need for fully precise or completely deterministic operations, approximate computing techniques allow substantially improved energy efficiency. This paper reviews recent progress in the area, including design of approximate arithmetic blocks, pertinent error and quality measures, and algorithm-level techniques for approximate computing.
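One concrete instance of the "approximate arithmetic blocks" and "error and quality measures" mentioned above is a lower-part-OR adder: the low k bits are combined with a bitwise OR instead of a carry chain, trading accuracy in the low-order bits for a shorter critical path. The Python sketch below models a simple variant of that adder functionally and reports two common quality metrics over random inputs; the bit widths and the no-carry-into-upper-part simplification are assumptions for illustration.

```python
# Functional model of a simple lower-part-OR approximate adder, plus two
# common error metrics. Bit widths (16-bit operands, 8 approximate low bits)
# and the omission of a carry into the upper part are illustrative assumptions.
import numpy as np

WIDTH, K = 16, 8                 # operand width, number of approximate low bits
LOW_MASK = (1 << K) - 1

def approx_add(a, b):
    """Upper bits added exactly; low K bits merged with OR (no carry chain)."""
    upper = ((a >> K) + (b >> K)) << K
    lower = (a | b) & LOW_MASK
    return upper + lower

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    a = rng.integers(0, 1 << WIDTH, size=100_000)
    b = rng.integers(0, 1 << WIDTH, size=100_000)
    exact = a + b
    approx = approx_add(a, b)
    err = np.abs(exact - approx)
    print(f"error rate:          {np.mean(err != 0):.3f}")   # fraction of wrong sums
    print(f"mean error distance: {err.mean():.1f}")          # average |exact - approx|
    print(f"worst-case error:    {err.max()}")
```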

921 citations

Journal ArticleDOI
TL;DR: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.
Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

920 citations

Journal ArticleDOI
TL;DR: This work proposes a spintronic device that uses spin at every stage of its operation and shows the five essential characteristics for logic applications: concatenability, nonlinearity, feedback elimination, gain and a complete set of Boolean operations.
Abstract: A spintronic device in which the input, output and internal states are all represented by spin, and that shows the five essential characteristics necessary for logic applications, is proposed.

769 citations