
Showing papers by "Michael Garland published in 2022"


Journal ArticleDOI
TL;DR: It is shown that the BaM infrastructure software running on GPUs can identify and communicate fine-grain accesses at a rate high enough to fully utilize the underlying storage devices. Even with consumer-grade SSDs, a BaM system can deliver application performance competitive with a much more expensive DRAM-only solution, and the reduction in I/O amplification yields significant performance benefits.
Abstract: Accelerators like Graphics Processing Units (GPUs) have been increasingly deployed in modern data centers because of their compute capabilities and memory bandwidth. These accelerators have traditionally relied on the “application host code” and the OS running on the CPU to orchestrate their accesses to the data storage devices. CPU orchestration of storage data accesses works well for classic GPU applications, like dense neural network training, where data access patterns are predefined, regular, dense, and independent of the data values, enabling the CPU to partition the storage data into coarse-grain chunks and coordinate the storage device accesses and data transfers to the accelerators. Unfortunately, such a CPU-centric strategy causes excessive CPU-GPU synchronization overhead and/or I/O traffic amplification, diminishing the effective storage bandwidth for emerging applications with fine-grain data-dependent access patterns like graph and data analytics, recommender systems, and graph neural networks. In this work, we make a case for enabling GPUs to orchestrate high-throughput, fine-grain accesses into NVMe Solid State Drives (SSDs) in a new system architecture called BaM.
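To make the I/O traffic amplification argument concrete, here is a back-of-envelope model, a minimal sketch in which the item size, chunk size, and hit-rate parameter are illustrative assumptions rather than figures from the paper. It compares bytes moved when sparse, data-dependent accesses are served by CPU-managed coarse-grain chunks versus GPU-initiated fine-grain reads:

```python
# Hypothetical back-of-envelope model of I/O traffic amplification.
# All sizes and parameters below are assumptions for illustration only.

ITEM_BYTES = 4096          # one fine-grain record (e.g., a block of a neighbor list)
CHUNK_BYTES = 2 * 1024**2  # coarse-grain chunk a CPU-centric scheme transfers

def traffic(accessed_items: int, items_hit_per_chunk: float) -> tuple[int, int]:
    """Bytes moved under CPU-centric chunking vs. GPU-initiated fine-grain I/O."""
    fine_grain = accessed_items * ITEM_BYTES
    # CPU-centric: every useful item drags a whole chunk across the interconnect,
    # amortized only by how many useful items happen to share a chunk.
    chunks_needed = accessed_items / items_hit_per_chunk
    coarse_grain = int(chunks_needed * CHUNK_BYTES)
    return coarse_grain, fine_grain

coarse, fine = traffic(accessed_items=1_000_000, items_hit_per_chunk=4.0)
print(f"CPU-centric: {coarse/1e9:.1f} GB, GPU-initiated: {fine/1e9:.1f} GB, "
      f"amplification: {coarse/fine:.0f}x")
```

Under these assumed numbers the coarse-grain path moves roughly 128x more data than the fine-grain path; the actual ratio depends entirely on workload sparsity and chunk sizing.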

6 citations


Proceedings ArticleDOI
09 Mar 2022
TL;DR: BaM as presented in this paper is a system architecture for GPU-initiated storage access. It features a fine-grained software cache that coalesces data storage requests while minimizing I/O traffic amplification, targeting applications whose fine-grained, data-dependent access patterns make CPU-initiated storage access unsuitable.
Abstract: Graphics Processing Units (GPUs) have traditionally relied on the host CPU to initiate access to the data storage. This approach is well-suited for GPU applications with known data access patterns that enable partitioning of their dataset to be processed in a pipelined fashion in the GPU. However, emerging applications such as graph and data analytics, recommender systems, or graph neural networks require fine-grained, data-dependent access to storage. CPU initiation of storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU-initiated storage removes these overheads from the storage control path and thus can potentially support these applications at much higher speeds. However, there has been no system architecture and software stack that enables efficient GPU-initiated storage access. This work presents a novel system architecture, BaM, that fills this gap. BaM features a fine-grained software cache to coalesce data storage requests while minimizing I/O traffic amplification. This software cache communicates with the storage system via high-throughput queues that enable the massive number of concurrent threads in modern GPUs to make I/O requests at a high rate to fully utilize the storage devices and the system interconnect. Experimental results show that BaM delivers 1.0x and 1.49x end-to-end speedup for BFS and CC graph analytics benchmarks while reducing hardware costs by up to 21.7x over accessing the graph data from the host memory. Furthermore, BaM speeds up data-analytics workloads by 5.3x over CPU-initiated storage access on the same hardware.
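The request-coalescing behavior of such a software cache can be sketched briefly. The following sequential Python is a hypothetical stand-in for the massively concurrent GPU threads the abstract describes; the class and method names are invented for illustration and do not come from BaM's actual implementation:

```python
# Minimal sketch of request coalescing in a fine-grained software cache.
# Sequential Python stands in for massively parallel GPU threads; all names
# are hypothetical.

import collections

class CoalescingCache:
    def __init__(self):
        self.cache = {}                                 # block_id -> data
        self.inflight = collections.defaultdict(list)   # block_id -> waiting threads
        self.io_requests = 0

    def read(self, thread_id: int, block_id: int) -> None:
        if block_id in self.cache:
            return                                      # hit: no I/O needed
        if self.inflight[block_id]:
            self.inflight[block_id].append(thread_id)   # coalesce with pending request
            return
        self.inflight[block_id].append(thread_id)
        self.io_requests += 1                           # one storage request for all waiters

    def complete(self, block_id: int, data: bytes) -> None:
        self.cache[block_id] = data
        self.inflight.pop(block_id, None)               # wake all coalesced waiters

cache = CoalescingCache()
for tid, blk in enumerate([7, 7, 7, 12, 7, 12]):        # 6 thread reads, 2 distinct blocks
    cache.read(tid, blk)
print(cache.io_requests)                                # -> 2
```

The point of the sketch is the counter: many thread-level reads of the same block collapse into a single storage request, which is how a fine-grained cache keeps I/O traffic amplification low.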

2 citations


Journal ArticleDOI
TL;DR: A novel system named PLANER is introduced that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that tries to meet the latency target while maintaining baseline accuracy.
Abstract: Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert (MoE) layers. In this paper, we explore the introduction of MoE layers to optimize a different metric: inference latency. We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that tries to meet the latency target while maintaining baseline accuracy. We evaluate PLANER on two real-world language modeling tasks using the Transformer-XL network and achieve inference latency reductions of over 2x at iso-accuracy.
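The abstract does not detail PLANER's optimization procedure, but its stated goal can be illustrated with a hypothetical greedy sketch: swap dense feed-forward layers for sparsely-activated top-1 MoE layers until an estimated latency meets the user-defined target. The per-layer latency constants below are assumptions for illustration only, not measurements from the paper:

```python
# Hypothetical sketch of latency-targeted sparsification in the spirit of
# PLANER's stated goal; the actual search procedure may differ entirely.

DENSE_MS = 1.0   # assumed per-layer dense FFN latency
MOE_MS = 0.4     # assumed per-layer top-1 MoE latency (fewer active parameters)

def plan(num_layers: int, target_ms: float) -> list[str]:
    """Greedily convert dense layers to MoE until estimated latency meets target."""
    config = ["dense"] * num_layers
    latency = num_layers * DENSE_MS
    for i in range(num_layers):
        if latency <= target_ms:
            break
        config[i] = "moe"                 # sparsely activate this layer
        latency -= DENSE_MS - MOE_MS
    return config

cfg = plan(num_layers=16, target_ms=10.0)
print(cfg.count("moe"), "layers converted")   # -> 10
```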

1 citation


Journal ArticleDOI
TL;DR: Brown as discussed by the authors challenged utilities and industries to develop 6,000 megawatts (MW) of electricity in California during the 1980s through cogeneration, and also announced that the state would take the lead by beginning immediately to develop 400 MW of cogeneration at state facilities.
Abstract: At the June 3, 1980, meeting of the Governor’s Cogeneration Task Force, Governor Edmund G. Brown Jr. challenged utilities and industries to develop 6,000 megawatts (MW) of electricity in California during the 1980s through cogeneration. The Governor also announced that the state would take the lead by beginning immediately to develop 400 MW of cogeneration at state facilities. As a first step, the Governor requested the Department of General Services, the Office of Appropriate Technology, and the Department of Water Resources to prepare a blueprint for developing this capacity. The Governor called for identification of feasible cogeneration projects that can be implemented without delay; establishment of an overall timetable for additional planning, feasibility studies, design, and construction; and a discussion of potential sources of funds. Since the Governor’s announcement, the California Energy Commission, the Department of General Services, the Office of Appropriate Technology, the University of California, and the state university and colleges, with the cooperation of the Departments of Developmental Services, Mental Health, Corrections, and Health Services, have initiated, continued work on, or completed feasibility studies or engineering design work for state facilities totaling more than 177 MW of cogeneration capacity. The Department of Water Resources has conducted preliminary site investigations at thirteen additional state facilities.