Author

Ji Young Park

Bio: Ji Young Park is an academic researcher from Stanford University. The author has contributed to research in topics: Memory management & Memory hierarchy. The author has an h-index of 4 and has co-authored 4 publications receiving 643 citations.

Papers
Proceedings ArticleDOI
11 Nov 2006
TL;DR: This work has implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrates efficient performance running Sequoia programs on both of these platforms.
Abstract: We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.

482 citations
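Sequoia has its own syntax, but the inner/leaf task split at the core of the model can be sketched in ordinary C++. The following is a hypothetical analogue, not Sequoia code: the explicit buffer copies mark where Sequoia's compiler would insert vertical data movement (for example, a DMA into a Cell SPE's local store).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical C++ analogue of Sequoia's task structure (not Sequoia syntax).
// Leaf tasks run entirely within one memory level; inner tasks partition the
// problem and move each piece down the hierarchy before invoking a subtask.
void leaf_saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)      // computes only on "local" data
        y[i] += a * x[i];
}

void inner_saxpy(float a, const float* x, float* y,
                 std::size_t n, std::size_t block) {
    for (std::size_t off = 0; off < n; off += block) {
        std::size_t len = std::min(block, n - off);
        // These copies stand in for the vertical communication Sequoia
        // generates for the target machine's hierarchy.
        std::vector<float> lx(x + off, x + off + len);   // copy-in
        std::vector<float> ly(y + off, y + off + len);
        leaf_saxpy(a, lx.data(), ly.data(), len);
        std::copy(ly.begin(), ly.end(), y + off);        // copy-out
    }
}
```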

Proceedings ArticleDOI
14 Mar 2007
TL;DR: A compiler for machines with an explicitly managed memory hierarchy is presented and it is suggested that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine.
Abstract: We present a compiler for machines with an explicitly managed memory hierarchy and suggest that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine. We evaluate the performance of our compiler using several benchmarks running on a Cell processor.

90 citations
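One concrete instance of "scheduling a hierarchy of bulk operations" is double buffering bulk transfers against bulk compute. The sketch below is illustrative only, with synchronous std::copy standing in for the asynchronous DMAs a Cell compiler would issue; with real asynchronous transfers, the prefetch of block i+1 overlaps the compute on block i.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Double-buffered bulk schedule: two local buffers alternate between
// receiving the next block and feeding the current compute.
void process_blocks(const float* in, float* out, std::size_t n, std::size_t b) {
    std::vector<float> buf[2] = {std::vector<float>(b), std::vector<float>(b)};
    std::size_t nblocks = (n + b - 1) / b;
    auto blk_len = [&](std::size_t i) { return std::min(b, n - i * b); };
    std::copy(in, in + blk_len(0), buf[0].data());       // bulk copy-in, block 0
    for (std::size_t i = 0; i < nblocks; ++i) {
        if (i + 1 < nblocks)                             // prefetch next block;
            std::copy(in + (i + 1) * b,                  // an async DMA here
                      in + (i + 1) * b + blk_len(i + 1), // would overlap with
                      buf[(i + 1) % 2].data());          // the compute below
        const float* cur = buf[i % 2].data();
        for (std::size_t j = 0; j < blk_len(i); ++j)     // bulk compute, block i
            out[i * b + j] = 2.0f * cur[j];
    }
}
```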

Proceedings ArticleDOI
Manman Ren, Ji Young Park, Mike Houston, Alex Aiken, William J. Dally
25 Oct 2008
TL;DR: This paper presents a general framework for automatically tuning applications to machines with software-managed memory hierarchies, and evaluates it by measuring the performance of benchmarks tuned for a range of machines with different memory hierarchy configurations.
Abstract: Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine's particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters which require tuning, and manually searching the resultant large, non-linear space of program parameters is a tedious process of trial-and-error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3s.

45 citations
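At its core, an autotuner of this kind wraps a measure-and-compare loop around candidate parameter settings. Below is a minimal, hypothetical C++ skeleton of that loop; the Params fields are invented for illustration, and the paper's framework searches the large, non-linear parameter space far more cleverly than this exhaustive scan.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical tuning parameters, e.g., blocking factors at two memory levels.
struct Params { std::size_t block_l1, block_l2; };

// Time one run of the kernel under a given parameter setting.
double measure(const std::function<void(const Params&)>& kernel, const Params& p) {
    auto t0 = std::chrono::steady_clock::now();
    kernel(p);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Core contract of the tuner: candidate parameters in, best-measured setting out.
Params tune(const std::function<void(const Params&)>& kernel,
            const std::vector<Params>& candidates) {
    Params best{};
    double best_t = std::numeric_limits<double>::infinity();
    for (const auto& p : candidates) {             // exhaustive over a pruned set;
        double t = measure(kernel, p);             // the paper navigates the full
        if (t < best_t) { best_t = t; best = p; }  // space with a smarter search
    }
    return best;
}
```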

Book
01 Jan 2008
TL;DR: This work presents a platform-independent runtime interface for moving data and computation through parallel machines with multi-level memory hierarchies that can be used as a compiler target and can be implemented easily and efficiently on a variety of platforms.
Abstract: We present a platform-independent runtime interface for moving data and computation through parallel machines with multi-level memory hierarchies. We show that this interface can be used as a compiler target and can be implemented easily and efficiently on a variety of platforms. The interface design allows us to compose multiple runtimes, achieving portability across machines with multiple memory levels. We demonstrate portability of programs across machines with two memory levels with runtime implementations for multi-core/SMP machines, the STI Cell Broadband Engine, a distributed memory cluster, and disk systems. We also demonstrate portability across machines with multiple memory levels by composing runtimes and running on a cluster of SMP nodes, out-of-core algorithms on a Sony PlayStation 3 pulling data from disk, and a cluster of Sony PlayStation 3s. With this uniform interface, we achieve good performance for our applications and maximize bandwidth and computational resources on these system configurations.

38 citations
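The key property of such an interface is that every memory level exposes the same small set of operations, so runtimes can be stacked. Below is a hypothetical C++ rendering of that shape; the method names are invented for illustration and are not the paper's actual interface.

```cpp
#include <cstddef>

// Every memory level implements the same small interface, so runtimes compose:
// a cluster runtime layered over an SMP runtime models a cluster of SMP nodes.
class LevelRuntime {
public:
    virtual ~LevelRuntime() = default;
    virtual void* allocate(std::size_t bytes) = 0;             // space at this level
    virtual void copy_down(void* child, const void* parent,
                           std::size_t bytes) = 0;             // data toward compute
    virtual void copy_up(void* parent, const void* child,
                         std::size_t bytes) = 0;               // results back up
    virtual void launch(void (*task)(void*), void* arg) = 0;   // run work at this level
};
```

Because a compiler targets only this one interface, porting to a new machine means implementing it once more, and composing two implementations yields a machine with one additional memory level.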


Cited by
Proceedings ArticleDOI
16 Jun 2013
TL;DR: This paper presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

1,074 citations
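The algorithm/schedule split is concrete enough to show directly. The C++ snippet below is the classic two-stage blur in Halide (a C++-embedded DSL), closely following the example in the paper; the particular tile and vector sizes are one schedule point among many, not a recommendation.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // Algorithm: what is computed, with no mention of order or placement.
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: one concrete point in the tradeoff space -- tile, vectorize,
    // parallelize, and recompute blur_x per tile to trade work for locality.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    blur_y.compile_jit();
    return 0;
}
```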

Journal ArticleDOI
Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

920 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Adaptive mapping is proposed: a fully automatic technique to map computations to processing elements on a CPU+GPU machine. By judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption compared to static mappings, on average, for a set of important computation benchmarks.
Abstract: Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption compared to static mappings on average for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.

565 citations
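The essence of adaptive mapping can be sketched as: time a small training slice on each device, then split the remaining work in proportion to measured throughput. The C++ below is a hypothetical illustration (Qilin builds per-application performance models and runs the two halves concurrently); run_on_cpu and run_on_gpu are placeholder callbacks.

```cpp
#include <chrono>
#include <cstddef>

// Time one execution of f over the index range [lo, hi).
double time_it(void (*f)(std::size_t, std::size_t), std::size_t lo, std::size_t hi) {
    auto t0 = std::chrono::steady_clock::now();
    f(lo, hi);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

void adaptive_map(void (*run_on_cpu)(std::size_t, std::size_t),
                  void (*run_on_gpu)(std::size_t, std::size_t),
                  std::size_t n) {
    std::size_t probe = n / 20;                      // small training run per device
    double tc = time_it(run_on_cpu, 0, probe);
    double tg = time_it(run_on_gpu, probe, 2 * probe);
    double cpu_share = (1.0 / tc) / (1.0 / tc + 1.0 / tg);  // throughput ratio
    std::size_t split = 2 * probe +
        static_cast<std::size_t>(cpu_share * (n - 2 * probe));
    run_on_cpu(2 * probe, split);                    // remaining work; a real system
    run_on_gpu(split, n);                            // runs these two concurrently
}
```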

Proceedings ArticleDOI
10 Nov 2012
TL;DR: A runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism.
Abstract: Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism. Legion also enables explicit, programmer controlled movement of data through the memory hierarchy and placement of tasks based on locality information via a novel mapping interface. We evaluate our Legion implementation on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation.

500 citations
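A conceptual sketch helps make "logical regions plus tasks" concrete. The C++ below is not the Legion API; it only illustrates how naming data as partitionable regions and making tasks declare which regions they use, and with what privileges, gives a runtime enough information to schedule non-interfering tasks in parallel.

```cpp
#include <cstddef>
#include <vector>

enum class Privilege { ReadOnly, ReadWrite };

// A logical region names a piece of program data and can be partitioned
// into disjoint subregions.
struct LogicalRegion {
    std::size_t first, count;
    std::vector<LogicalRegion> partition(std::size_t parts) const {
        std::vector<LogicalRegion> subs;
        std::size_t chunk = count / parts;
        for (std::size_t i = 0; i < parts; ++i)
            subs.push_back({first + i * chunk,
                            i + 1 == parts ? count - i * chunk : chunk});
        return subs;                                 // disjoint by construction
    }
};

struct RegionRequirement { LogicalRegion region; Privilege priv; };

// A task launch carries its region requirements; a Legion-style scheduler
// compares requirements across pending tasks and runs non-interfering ones
// (e.g., ReadWrite on disjoint subregions) concurrently.
struct TaskLaunch {
    void (*body)(const RegionRequirement&);
    RegionRequirement req;
};
```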

Proceedings ArticleDOI
12 Feb 2011
TL;DR: A novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users, significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning performance.
Abstract: Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.

363 citations
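The virtual warp idea is, at bottom, an indexing scheme. The plain C++ loop nest below models the work assignment only (the paper's kernels are CUDA): a physical warp is split into virtual warps of W lanes, each virtual warp takes one frontier vertex, and its lanes stride over that vertex's edges, so a high-degree vertex no longer leaves most of a 32-lane warp idle. Tuning W trades workload imbalance (W too small) against ALU underutilization (W too large).

```cpp
#include <cstddef>
#include <vector>

// CSR graph: row_offsets[v]..row_offsets[v+1] indexes v's edges in column_indices.
void expand_frontier(const std::vector<std::size_t>& row_offsets,
                     const std::vector<std::size_t>& column_indices,
                     const std::vector<std::size_t>& frontier,
                     std::vector<std::size_t>& next,
                     std::size_t W) {                        // virtual warp size
    for (std::size_t vw = 0; vw < frontier.size(); ++vw) {   // one virtual warp
        std::size_t v = frontier[vw];                        //   per vertex
        for (std::size_t lane = 0; lane < W; ++lane)         // each lane strides
            for (std::size_t e = row_offsets[v] + lane;      // over the vertex's
                 e < row_offsets[v + 1]; e += W)             // edge list
                next.push_back(column_indices[e]);
    }
}
```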