Author

Ji Young Park

Bio: Ji Young Park is an academic researcher from Stanford University. The author has contributed to research in topics: Memory management & Memory hierarchy. The author has an h-index of 4 and has co-authored 4 publications receiving 643 citations.

Papers
Proceedings ArticleDOI
11 Nov 2006
TL;DR: This work has implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrates efficient performance running Sequoia programs on both of these platforms.
Abstract: We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.

482 citations
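Sequoia has its own syntax, but the inner/leaf task split at the core of the model can be sketched in ordinary C++. The following is a hypothetical analogue, not Sequoia code: the explicit buffer copies mark where Sequoia's compiler would insert vertical data movement (for example, a DMA into a Cell SPE's local store).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical C++ analogue of Sequoia's task structure (not Sequoia syntax).
// Leaf tasks run entirely within one memory level; inner tasks partition the
// problem and move each piece down the hierarchy before invoking a subtask.
void leaf_saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)      // computes only on "local" data
        y[i] += a * x[i];
}

void inner_saxpy(float a, const float* x, float* y,
                 std::size_t n, std::size_t block) {
    for (std::size_t off = 0; off < n; off += block) {
        std::size_t len = std::min(block, n - off);
        // These copies stand in for the vertical communication Sequoia
        // generates for the target machine's hierarchy.
        std::vector<float> lx(x + off, x + off + len);   // copy-in
        std::vector<float> ly(y + off, y + off + len);
        leaf_saxpy(a, lx.data(), ly.data(), len);
        std::copy(ly.begin(), ly.end(), y + off);        // copy-out
    }
}
```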

Proceedings ArticleDOI
14 Mar 2007
TL;DR: A compiler for machines with an explicitly managed memory hierarchy is presented and it is suggested that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine.
Abstract: We present a compiler for machines with an explicitly managed memory hierarchy and suggest that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine. We evaluate the performance of our compiler using several benchmarks running on a Cell processor.

90 citations
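One concrete instance of "scheduling a hierarchy of bulk operations" is double buffering bulk transfers against bulk compute. The sketch below is illustrative only, with synchronous std::copy standing in for the asynchronous DMAs a Cell compiler would issue; with real asynchronous transfers, the prefetch of block i+1 overlaps the compute on block i.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Double-buffered bulk schedule: two local buffers alternate between
// receiving the next block and feeding the current compute.
void process_blocks(const float* in, float* out, std::size_t n, std::size_t b) {
    std::vector<float> buf[2] = {std::vector<float>(b), std::vector<float>(b)};
    std::size_t nblocks = (n + b - 1) / b;
    auto blk_len = [&](std::size_t i) { return std::min(b, n - i * b); };
    std::copy(in, in + blk_len(0), buf[0].data());       // bulk copy-in, block 0
    for (std::size_t i = 0; i < nblocks; ++i) {
        if (i + 1 < nblocks)                             // prefetch next block;
            std::copy(in + (i + 1) * b,                  // an async DMA here
                      in + (i + 1) * b + blk_len(i + 1), // would overlap with
                      buf[(i + 1) % 2].data());          // the compute below
        const float* cur = buf[i % 2].data();
        for (std::size_t j = 0; j < blk_len(i); ++j)     // bulk compute, block i
            out[i * b + j] = 2.0f * cur[j];
    }
}
```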

Proceedings ArticleDOI
Manman Ren, Ji Young Park, Mike Houston, Alex Aiken, William J. Dally
25 Oct 2008
TL;DR: This paper presents a general framework for automatically tuning applications to machines with software-managed memory hierarchies, and evaluates it by measuring the performance of benchmarks tuned for a range of machines with different memory hierarchy configurations.
Abstract: Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine's particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters which require tuning, and manually searching the resultant large, non-linear space of program parameters is a tedious process of trial-and-error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3s.

45 citations
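At its core, an autotuner of this kind wraps a measure-and-compare loop around candidate parameter settings. Below is a minimal, hypothetical C++ skeleton of that loop; the Params fields are invented for illustration, and the paper's framework searches the large, non-linear parameter space far more cleverly than this exhaustive scan.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical tuning parameters, e.g., blocking factors at two memory levels.
struct Params { std::size_t block_l1, block_l2; };

// Time one run of the kernel under a given parameter setting.
double measure(const std::function<void(const Params&)>& kernel, const Params& p) {
    auto t0 = std::chrono::steady_clock::now();
    kernel(p);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Core contract of the tuner: candidate parameters in, best-measured setting out.
Params tune(const std::function<void(const Params&)>& kernel,
            const std::vector<Params>& candidates) {
    Params best{};
    double best_t = std::numeric_limits<double>::infinity();
    for (const auto& p : candidates) {             // exhaustive over a pruned set;
        double t = measure(kernel, p);             // the paper navigates the full
        if (t < best_t) { best_t = t; best = p; }  // space with a smarter search
    }
    return best;
}
```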

Book
01 Jan 2008
TL;DR: This work presents a platform-independent runtime interface for moving data and computation through parallel machines with multi-level memory hierarchies that can be used as a compiler target and can be implemented easily and efficiently on a variety of platforms.
Abstract: We present a platform-independent runtime interface for moving data and computation through parallel machines with multi-level memory hierarchies. We show that this interface can be used as a compiler target and can be implemented easily and efficiently on a variety of platforms. The interface design allows us to compose multiple runtimes, achieving portability across machines with multiple memory levels. We demonstrate portability of programs across machines with two memory levels with runtime implementations for multi-core/SMP machines, the STI Cell Broadband Engine, a distributed memory cluster, and disk systems. We also demonstrate portability across machines with multiple memory levels by composing runtimes and running on a cluster of SMP nodes, out-of-core algorithms on a Sony PlayStation 3 pulling data from disk, and a cluster of Sony PlayStation 3s. With this uniform interface, we achieve good performance for our applications and maximize bandwidth and computational resources on these system configurations.

38 citations
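The key property of such an interface is that every memory level exposes the same small set of operations, so runtimes can be stacked. Below is a hypothetical C++ rendering of that shape; the method names are invented for illustration and are not the paper's actual interface.

```cpp
#include <cstddef>

// Every memory level implements the same small interface, so runtimes compose:
// a cluster runtime layered over an SMP runtime models a cluster of SMP nodes.
class LevelRuntime {
public:
    virtual ~LevelRuntime() = default;
    virtual void* allocate(std::size_t bytes) = 0;             // space at this level
    virtual void copy_down(void* child, const void* parent,
                           std::size_t bytes) = 0;             // data toward compute
    virtual void copy_up(void* parent, const void* child,
                         std::size_t bytes) = 0;               // results back up
    virtual void launch(void (*task)(void*), void* arg) = 0;   // run work at this level
};
```

Because a compiler targets only this one interface, porting to a new machine means implementing it once more, and composing two implementations yields a machine with one additional memory level.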


Cited by
Proceedings ArticleDOI
16 Jun 2013
TL;DR: This paper presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

1,074 citations
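The algorithm/schedule split is concrete enough to show directly. The C++ snippet below is the classic two-stage blur in Halide (a C++-embedded DSL), closely following the example in the paper; the particular tile and vector sizes are one schedule point among many, not a recommendation.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // Algorithm: what is computed, with no mention of order or placement.
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: one concrete point in the tradeoff space -- tile, vectorize,
    // parallelize, and recompute blur_x per tile to trade work for locality.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    blur_y.compile_jit();
    return 0;
}
```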

Journal ArticleDOI
Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

920 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Adaptive mapping is proposed: a fully automatic technique to map computations to processing elements on a CPU+GPU machine. By judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption compared to static mappings, on average, for a set of important computation benchmarks.
Abstract: Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption compared to static mappings on average for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.

565 citations
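The essence of adaptive mapping can be sketched as: time a small training slice on each device, then split the remaining work in proportion to measured throughput. The C++ below is a hypothetical illustration (Qilin builds per-application performance models and runs the two halves concurrently); run_on_cpu and run_on_gpu are placeholder callbacks.

```cpp
#include <chrono>
#include <cstddef>

// Time one execution of f over the index range [lo, hi).
double time_it(void (*f)(std::size_t, std::size_t), std::size_t lo, std::size_t hi) {
    auto t0 = std::chrono::steady_clock::now();
    f(lo, hi);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

void adaptive_map(void (*run_on_cpu)(std::size_t, std::size_t),
                  void (*run_on_gpu)(std::size_t, std::size_t),
                  std::size_t n) {
    std::size_t probe = n / 20;                      // small training run per device
    double tc = time_it(run_on_cpu, 0, probe);
    double tg = time_it(run_on_gpu, probe, 2 * probe);
    double cpu_share = (1.0 / tc) / (1.0 / tc + 1.0 / tg);  // throughput ratio
    std::size_t split = 2 * probe +
        static_cast<std::size_t>(cpu_share * (n - 2 * probe));
    run_on_cpu(2 * probe, split);                    // remaining work; a real system
    run_on_gpu(split, n);                            // runs these two concurrently
}
```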

Proceedings ArticleDOI
10 Nov 2012
TL;DR: A runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism.
Abstract: Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism. Legion also enables explicit, programmer controlled movement of data through the memory hierarchy and placement of tasks based on locality information via a novel mapping interface. We evaluate our Legion implementation on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation.

500 citations
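A conceptual sketch helps make "logical regions plus tasks" concrete. The C++ below is not the Legion API; it only illustrates how naming data as partitionable regions and making tasks declare which regions they use, and with what privileges, gives a runtime enough information to schedule non-interfering tasks in parallel.

```cpp
#include <cstddef>
#include <vector>

enum class Privilege { ReadOnly, ReadWrite };

// A logical region names a piece of program data and can be partitioned
// into disjoint subregions.
struct LogicalRegion {
    std::size_t first, count;
    std::vector<LogicalRegion> partition(std::size_t parts) const {
        std::vector<LogicalRegion> subs;
        std::size_t chunk = count / parts;
        for (std::size_t i = 0; i < parts; ++i)
            subs.push_back({first + i * chunk,
                            i + 1 == parts ? count - i * chunk : chunk});
        return subs;                                 // disjoint by construction
    }
};

struct RegionRequirement { LogicalRegion region; Privilege priv; };

// A task launch carries its region requirements; a Legion-style scheduler
// compares requirements across pending tasks and runs non-interfering ones
// (e.g., ReadWrite on disjoint subregions) concurrently.
struct TaskLaunch {
    void (*body)(const RegionRequirement&);
    RegionRequirement req;
};
```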

Proceedings ArticleDOI
12 Feb 2011
TL;DR: A novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users, significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning performance.
Abstract: Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.

363 citations
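The virtual warp idea is, at bottom, an indexing scheme. The plain C++ loop nest below models the work assignment only (the paper's kernels are CUDA): a physical warp is split into virtual warps of W lanes, each virtual warp takes one frontier vertex, and its lanes stride over that vertex's edges, so a high-degree vertex no longer leaves most of a 32-lane warp idle. Tuning W trades workload imbalance (W too small) against ALU underutilization (W too large).

```cpp
#include <cstddef>
#include <vector>

// CSR graph: row_offsets[v]..row_offsets[v+1] indexes v's edges in column_indices.
void expand_frontier(const std::vector<std::size_t>& row_offsets,
                     const std::vector<std::size_t>& column_indices,
                     const std::vector<std::size_t>& frontier,
                     std::vector<std::size_t>& next,
                     std::size_t W) {                        // virtual warp size
    for (std::size_t vw = 0; vw < frontier.size(); ++vw) {   // one virtual warp
        std::size_t v = frontier[vw];                        //   per vertex
        for (std::size_t lane = 0; lane < W; ++lane)         // each lane strides
            for (std::size_t e = row_offsets[v] + lane;      // over the vertex's
                 e < row_offsets[v + 1]; e += W)             // edge list
                next.push_back(column_indices[e]);
    }
}
```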