Author

Kazuhiko Komatsu

Bio: Kazuhiko Komatsu is an academic researcher from Tohoku University. The author has contributed to research in topics: Vector processor & Memory bandwidth. The author has an h-index of 9 and has co-authored 65 publications receiving 413 citations.


Papers
Proceedings ArticleDOI
08 Dec 2009
TL;DR: It is demonstrated that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs, and also that CheCUDA can migrate a process from one PC to another even if the process uses a GPU.
Abstract: In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a subset of the basic CUDA driver API calls to record the status changes in main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data in the video memory to the main memory and then disabling the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only for enhancing the dependability of CUDA applications but also for enabling the dynamic task scheduling of CUDA applications that is required especially on heterogeneous GPU cluster systems. This paper also reports the timing overhead of checkpointing.
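To make the hooking mechanism concrete, here is a minimal sketch of API-call interposition via LD_PRELOAD and dlsym, applied for simplicity to the runtime API's cudaMalloc (CheCUDA itself hooks driver API calls, and its actual bookkeeping is far more extensive; this is illustrative only):

```c
/* Minimal sketch of CUDA API interposition for checkpointing (not
 * CheCUDA's actual code). Build: gcc -shared -fPIC -o hook.so hook.c -ldl
 * Run:   LD_PRELOAD=./hook.so ./cuda_app                              */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef int cudaError_t;              /* 0 == cudaSuccess */

#define MAX_ALLOCS 1024
static struct { void *ptr; size_t size; } alloc_log[MAX_ALLOCS];
static int n_allocs;

/* Exported under the same name as the real runtime call, so the
 * preloaded library intercepts the application's cudaMalloc,
 * records the allocation, then forwards to the real libcudart. */
cudaError_t cudaMalloc(void **devPtr, size_t size) {
    static cudaError_t (*real)(void **, size_t);
    if (!real)
        real = (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
    cudaError_t rc = real(devPtr, size);
    if (rc == 0 && n_allocs < MAX_ALLOCS) {
        /* A checkpoint pass can later walk this log and copy each
         * [ptr, ptr+size) region from video memory to a file. */
        alloc_log[n_allocs].ptr  = *devPtr;
        alloc_log[n_allocs].size = size;
        n_allocs++;
        fprintf(stderr, "[hook] logged %zu-byte device allocation\n", size);
    }
    return rc;
}
```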

91 citations

Proceedings ArticleDOI
11 Nov 2018
TL;DR: The potential of SX-Aurora TSUBASA is clarified through evaluations of practical applications and the effectiveness of the new execution model is examined by using a microbenchmark.
Abstract: A new SX-Aurora TSUBASA vector supercomputer has been released, and it features a new system architecture and a new execution model to achieve high sustained performance, especially for memory-intensive applications. In SX-Aurora TSUBASA, the vector host (VH) of a standard x86 Linux node is attached to the vector engine (VE) of the newly developed vector processor. An application is executed on the VE, and only system calls are offloaded to the VH. This new execution model can avoid redundant data transfers between the VH and VE that can easily become a bottleneck in the conventional execution model. This paper examines the potential of SX-Aurora TSUBASA. First, the basic performance is clarified by evaluating benchmark programs. Then, the effectiveness of the new execution model is examined by using a microbenchmark. Finally, the potential of SX-Aurora TSUBASA is clarified through evaluations of practical applications.
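A minimal illustration of this execution model in standard C (on SX-Aurora TSUBASA it would be built with NEC's VE compiler; the system-call offloading is transparent to the program):

```c
/* Sketch of the VE/VH split described above: the whole program,
 * including the vectorizable loop, runs on the vector engine; only
 * the write() system call behind printf is shipped to the x86
 * vector host, so the array data never crosses to the host. */
#include <stdio.h>

#define N (1 << 20)
static double a[N], b[N], c[N];

int main(void) {
    /* Memory-intensive loop: executes entirely on the VE. */
    for (long i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Only this I/O request is offloaded to the VH. */
    printf("c[0] = %f\n", c[0]);
    return 0;
}
```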

69 citations

Proceedings ArticleDOI
16 May 2011
TL;DR: A new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing, can perform CPR on an OpenCL application program without any modification or recompilation of its code.
Abstract: In this paper, we propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing. CheCL can perform CPR on an OpenCL application program without any modification or recompilation of its code. A conventional checkpointing system fails to checkpoint a process if the process uses OpenCL. Therefore, in CheCL, every API call is forwarded to another process, called an API proxy, which invokes the API function; thus, two processes, an application process and an API proxy, are launched for an OpenCL application. In this case, as the application process is not an OpenCL process but a standard process, it can be safely checkpointed. While CheCL intercepts all API calls, it records the information necessary for restoring OpenCL objects. The application process does not hold any OpenCL handles, but instead holds CheCL handles that keep such information. Those handles are automatically converted to OpenCL handles and then passed to API functions. Upon restart, OpenCL objects are automatically restored based on the recorded information. This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs, including MPI applications, and quantitatively evaluates the runtime overheads. It is also discussed that CheCL can enable process migration of OpenCL applications among distinct nodes, and among different kinds of compute devices such as a CPU and a GPU.
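The handle-indirection idea can be sketched as follows; this is illustrative only, with hypothetical names (checl_mem, checl_create_buffer), not CheCL's actual interface:

```c
/* Sketch of handle indirection for restartable OpenCL (not CheCL's
 * code): the application never stores a raw OpenCL handle, only a
 * small integer indexing a table owned by the interposer. After a
 * restart the table is refilled with freshly recreated cl_mem
 * objects, so application-held values stay valid. */
#include <CL/cl.h>
#include <stddef.h>

#define MAX_OBJS 4096
typedef size_t checl_mem;            /* opaque handle given to the app */
static cl_mem obj_table[MAX_OBJS];   /* real handles, rebuilt on restart */
static size_t obj_sizes[MAX_OBJS];   /* creation args recorded for replay */
static size_t n_objs;                /* bounds checks omitted for brevity */

checl_mem checl_create_buffer(cl_context ctx, cl_mem_flags flags,
                              size_t size, cl_int *err) {
    obj_table[n_objs] = clCreateBuffer(ctx, flags, size, NULL, err);
    obj_sizes[n_objs] = size;        /* enough info to recreate it later */
    return n_objs++;                 /* the app sees only the index */
}

/* Restart-time recovery would loop over obj_sizes, call clCreateBuffer
 * again, and overwrite obj_table[i]; every saved checl_mem keeps
 * working because it is just the index. */
cl_mem checl_resolve(checl_mem h) { return obj_table[h]; }
```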

61 citations

Journal ArticleDOI
TL;DR: Evaluation results clearly indicate that the high sustained memory performance per core enables the modern vector supercomputer to achieve outstanding performance that is unreachable by simply increasing the number of fine-grain scalar processor cores.
Abstract: Achieving a high sustained simulation performance is the most important concern in the HPC community. To this end, many kinds of HPC system architectures have been proposed, and the diversity of HPC systems grows rapidly. Under these circumstances, the vector-parallel supercomputer SX-ACE has been designed to achieve a high sustained performance for memory-intensive applications by providing a high memory bandwidth commensurate with its high computational capability. This paper examines the potential of the modern vector-parallel supercomputer through the performance evaluation of SX-ACE using practical engineering and scientific applications. To improve the sustained simulation performance of practical applications, SX-ACE adopts an advanced memory subsystem with several new architectural features. This paper discusses how these features, such as MSHRs, a large on-chip memory, and novel vector processing mechanisms, are beneficial for achieving a high sustained performance in large-scale engineering and scientific simulations. Evaluation results clearly indicate that the high sustained memory performance per core enables the modern vector supercomputer to achieve outstanding performance that is unreachable by simply increasing the number of fine-grain scalar processor cores. This paper also discusses the performance of the HPCG benchmark to evaluate the potential of supercomputers with balanced memory and computational performance against heterogeneous and cutting-edge scalar parallel systems.
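The class of kernels such an evaluation targets can be characterized by a STREAM-triad-style loop (a generic example, not taken from the paper): each iteration moves 24 bytes for 2 floating-point operations, so sustained performance is bounded by memory bandwidth rather than peak flop/s, which is exactly where a high bytes-per-flop design like SX-ACE, backed by MSHR-based latency hiding, pays off.

```c
/* Generic memory-bound kernel sketch: 2 loads + 1 store (24 B)
 * against 1 multiply + 1 add (2 flops) per iteration, i.e. a
 * bytes-per-flop ratio far above what scalar cache hierarchies
 * sustain, but well matched to a vector memory subsystem. */
#include <stddef.h>

void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```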

36 citations

Book ChapterDOI
01 Jan 2011
TL;DR: From the evaluation results, it is clarified that the proposed mechanism can appropriately select a suboptimal Cooperative Thread Array (CTA) configuration whose performance is comparable to the optimal one.
Abstract: Recently, the Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate various applications. However, to fully exploit a GPU's computing power, a programmer has to carefully adjust several CUDA execution parameters, even for simple stencil processing kernels. Hence, this paper develops an automatic parameter tuning mechanism based on profiling to predict the optimal execution parameters. This paper first discusses the scope of the parameter exploration space, which is determined by the GPU's architectural restrictions. To find the optimal execution parameters, performance models are created by profiling the execution times of a kernel under each promising parameter configuration. The execution parameters are then determined by using those performance models. This paper evaluates the performance improvement due to the proposed mechanism using two benchmark programs. The evaluation results clarify that the proposed mechanism can appropriately select a suboptimal Cooperative Thread Array (CTA) configuration whose performance is comparable to the optimal one.
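The selection step can be sketched as a sweep over architecture-legal CTA shapes. The paper predicts times from profiling-based performance models rather than exhaustively measuring every configuration; run_kernel_ms below is a hypothetical stand-in with a toy cost model so the sketch runs stand-alone:

```c
/* Sketch of profiling-based CTA selection (not the paper's code). */
#include <stdio.h>

/* Hypothetical launcher: in practice this would time the stencil
 * kernel with CUDA events for the given CTA (thread block) shape.
 * Here, a synthetic cost model keeps the example self-contained. */
static double run_kernel_ms(int cta_x, int cta_y) {
    int threads = cta_x * cta_y;
    return 100.0 / threads + 0.01 * cta_x;   /* toy model */
}

int main(void) {
    int best_x = 0, best_y = 0;
    double best_ms = 1e30;
    /* Exploration space bounded by hardware limits, e.g. a maximum
     * thread count per CTA and warp-multiple x-dimensions. */
    for (int x = 32; x <= 512; x *= 2) {
        for (int y = 1; x * y <= 1024; y *= 2) {
            double ms = run_kernel_ms(x, y);
            if (ms < best_ms) { best_ms = ms; best_x = x; best_y = y; }
        }
    }
    printf("selected CTA %dx%d (%.3f ms)\n", best_x, best_y, best_ms);
    return 0;
}
```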

17 citations


Cited by
Proceedings ArticleDOI
16 May 2011
TL;DR: On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on the GPU, which is higher than the SDC ratio measured in CPU programs.
Abstract: The high performance and relatively low cost of GPU-based platforms provide an attractive alternative for general-purpose high performance computing (HPC). However, emerging HPC applications usually have stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on the GPU. This SDC ratio is significantly higher than that measured in CPU programs.
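The methodology can be sketched as follows (illustrative only, not the paper's tool): emulate a transient hardware fault by flipping a single random bit, then classify the run by comparing its output against a fault-free golden copy.

```c
/* Sketch of single-bit fault injection and SDC classification. */
#include <stdlib.h>
#include <string.h>

/* Inject a transient fault: flip one random bit in buf. */
void inject_bit_flip(unsigned char *buf, size_t nbytes) {
    size_t byte = (size_t)rand() % nbytes;
    buf[byte] ^= (unsigned char)(1u << (rand() % 8));
}

/* SDC = the run completed normally but its output silently
 * differs from the golden (fault-free) output. */
int is_sdc(const unsigned char *out, const unsigned char *golden,
           size_t nbytes) {
    return memcmp(out, golden, nbytes) != 0;
}
```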

96 citations

Proceedings ArticleDOI
11 Nov 2018
TL;DR: The potential of SX-Aurora TSUBASA is clarified through evaluations of practical applications and the effectiveness of the new execution model is examined by using a microbenchmark.
Abstract: A new SX-Aurora TSUBASA vector supercomputer has been released, and it features a new system architecture and a new execution model to achieve high sustained performance, especially for memory-intensive applications. In SX-Aurora TSUBASA, the vector host (VH) of a standard x86 Linux node is attached to the vector engine (VE) of the newly developed vector processor. An application is executed on the VE, and only system calls are offloaded to the VH. This new execution model can avoid redundant data transfers between the VH and VE that can easily become a bottleneck in the conventional execution model. This paper examines the potential of SX-Aurora TSUBASA. First, the basic performance is clarified by evaluating benchmark programs. Then, the effectiveness of the new execution model is examined by using a microbenchmark. Finally, the potential of SX-Aurora TSUBASA is clarified through evaluations of practical applications.

69 citations

Proceedings ArticleDOI
16 May 2011
TL;DR: A new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing, can perform CPR on an OpenCL application program without any modification or recompilation of its code.
Abstract: In this paper, we propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing. CheCL can perform CPR on an OpenCL application program without any modification or recompilation of its code. A conventional checkpointing system fails to checkpoint a process if the process uses OpenCL. Therefore, in CheCL, every API call is forwarded to another process, called an API proxy, which invokes the API function; thus, two processes, an application process and an API proxy, are launched for an OpenCL application. In this case, as the application process is not an OpenCL process but a standard process, it can be safely checkpointed. While CheCL intercepts all API calls, it records the information necessary for restoring OpenCL objects. The application process does not hold any OpenCL handles, but instead holds CheCL handles that keep such information. Those handles are automatically converted to OpenCL handles and then passed to API functions. Upon restart, OpenCL objects are automatically restored based on the recorded information. This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs, including MPI applications, and quantitatively evaluates the runtime overheads. It is also discussed that CheCL can enable process migration of OpenCL applications among distinct nodes, and among different kinds of compute devices such as a CPU and a GPU.

61 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: The Locality Descriptor is proposed, a cross-layer abstraction to explicitly express and exploit data locality in GPUs that improves performance by 26.6% on average when exploiting reuse-based locality in the cache hierarchy, and by 53.7% when exploiting NUMA locality in a NUMA memory system.
Abstract: Exploiting data locality in GPUs is critical to making more efficient use of the existing caches and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU programming models are designed to explicitly express parallelism, there is no clear explicit way to express data locality---i.e., reuse-based locality to make efficient use of the caches, or NUMA locality to efficiently utilize a NUMA system. On the one hand, this lack of expressiveness makes it a very challenging task for the programmer to write code to get the best performance out of the memory hierarchy. On the other hand, hardware-only architectural techniques are often suboptimal as they miss key higher-level program semantics that are essential to effectively exploit data locality. In this work, we propose the Locality Descriptor, a cross-layer abstraction to explicitly express and exploit data locality in GPUs. The Locality Descriptor (i) provides the software a flexible and portable interface to optimize for data locality, requiring no knowledge of the underlying memory techniques and resources, and (ii) enables the architecture to leverage key program semantics and effectively coordinate a range of techniques (e.g., CTA scheduling, cache management, memory placement) to exploit locality in a programmer-transparent manner. We demonstrate that the Locality Descriptor improves performance by 26.6% on average (up to 46.6%) when exploiting reuse-based locality in the cache hierarchy, and by 53.7% (up to 2.8X) when exploiting NUMA locality in a NUMA memory system.
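Purely as an illustration of the abstraction's shape, a locality descriptor could look like the following; none of these names or fields are from the paper, which defines its own interface. The idea is that the programmer states which data is reused at what granularity, and the runtime/hardware uses that hint to coordinate CTA scheduling, cache management, and memory placement.

```c
/* Hypothetical sketch of a locality-descriptor API (illustrative
 * only; not the paper's actual abstraction or syntax). */
#include <stddef.h>

typedef enum { REUSE_LOCALITY, NUMA_LOCALITY } locality_kind;

typedef struct {
    void         *base;        /* data structure the hint applies to */
    size_t        tile_bytes;  /* granularity at which reuse occurs */
    locality_kind kind;        /* cache reuse vs. NUMA placement */
    int           priority;    /* relative importance among descriptors */
} locality_descriptor;

/* Hypothetical registration call consumed by the scheduler/driver;
 * a real system would act on it in hardware, transparently. */
void declare_locality(const locality_descriptor *d) { (void)d; }
```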

59 citations

Proceedings ArticleDOI
15 Oct 2016
TL;DR: By dynamically allocating resources and carefully oversubscribing them when necessary, Zorua improves or retains the performance of applications that are already highly tuned to best utilize the hardware resources.
Abstract: This paper introduces a new resource virtualization framework, Zorua, that decouples the programmer-specified resource usage of a GPU application from the actual allocation in the on-chip hardware resources. Zorua enables this decoupling by virtualizing each resource transparently to the programmer. The virtualization provided by Zorua builds on two key concepts — dynamic allocation of the on-chip resources and their oversubscription using a swap space in memory. Zorua provides a holistic GPU resource virtualization strategy, designed to (i) adaptively control the extent of oversubscription, and (ii) coordinate the dynamic management of multiple on-chip resources (i.e., registers, scratchpad memory, and thread slots), to maximize the effectiveness of virtualization. Zorua employs a hardware-software codesign, comprising the compiler, a runtime system, and hardware-based virtualization support. The runtime system leverages information from the compiler regarding the resource requirements of each program phase to (i) dynamically allocate/deallocate the different resources in the physically available on-chip resources or their swap space, and (ii) manage the tradeoff between higher thread-level parallelism due to virtualization versus the latency and capacity overheads of swap space usage. We demonstrate that by providing the illusion of more resources than physically available via controlled and coordinated virtualization, Zorua offers several important benefits: (i) Programming Ease. Zorua eases the burden on the programmer to provide code that is tuned to efficiently utilize the physically available on-chip resources. (ii) Portability. Zorua alleviates the necessity of re-tuning an application's resource usage when porting the application across GPU generations. (iii) Performance. By dynamically allocating resources and carefully oversubscribing them when necessary, Zorua improves or retains the performance of applications that are already highly tuned to best utilize the hardware resources. The holistic virtualization provided by Zorua can also enable other uses, including fine-grained resource sharing among multiple kernels and low-latency preemption of GPU programs.
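The core admission decision Zorua coordinates can be sketched conceptually as follows (this is not Zorua's implementation, which is a hardware-software codesign; it only illustrates the controlled-oversubscription logic described above):

```c
/* Conceptual sketch of controlled resource oversubscription:
 * serve a request from physical on-chip capacity when possible,
 * otherwise spill to a memory-backed swap space up to a cap,
 * trading extra thread-level parallelism against spill latency. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t physical_total, physical_used;
    size_t swap_used, swap_limit;   /* cap keeps oversubscription bounded */
} resource_pool;

/* Returns true if the request is admitted, marking whether it lives
 * in on-chip storage or in the swap space. */
bool allocate(resource_pool *p, size_t need, bool *in_swap) {
    if (p->physical_used + need <= p->physical_total) {
        p->physical_used += need; *in_swap = false; return true;
    }
    if (p->swap_used + need <= p->swap_limit) {
        p->swap_used += need; *in_swap = true; return true;  /* oversubscribe */
    }
    return false;  /* throttle: defer the warp/CTA instead */
}
```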

58 citations