Author

Zachary W. Parchman

Bio: Zachary W. Parchman is an academic researcher from Tennessee Technological University. The author has contributed to research in the topics of non-volatile random-access memory and shared memory. The author has an h-index of 2 and has co-authored 5 publications receiving 13 citations.

Papers
Proceedings ArticleDOI
01 Aug 2017
TL;DR: This work proposes and develops the SHARed data-structure centric Programming abstraction (SharP), which provides a simple, usable, and portable abstraction for hierarchical-heterogeneous memory and a unified programming abstraction for Big-Compute and Big-Data applications.
Abstract: The pre-exascale systems are expected to have a significant amount of hierarchical and heterogeneous on-node memory, and this trend of system architecture in extreme-scale systems is expected to continue into the exascale era. Along with hierarchical-heterogeneous memory, the system typically has a high-performing network and a compute accelerator. This system architecture is not only effective for running traditional High Performance Computing (HPC) applications (Big-Compute), but also for running data-intensive HPC applications and Big-Data applications. As a consequence, there is a growing desire to have a single system serve the needs of both Big-Compute and Big-Data applications. Though the system architecture supports the convergence of Big-Compute and Big-Data, the programming models have yet to evolve to support either hierarchical-heterogeneous memory systems or the convergence. In this work, we propose and develop the SHARed data-structure centric Programming abstraction (SharP) to address both of these goals, i.e., provide (1) a simple, usable, and portable abstraction for hierarchical-heterogeneous memory and (2) a unified programming abstraction for Big-Compute and Big-Data applications. To evaluate SharP, we implement a Stencil benchmark using SharP, port QMCPack, a petascale-capable application, and adapt the Memcached ecosystem, a popular Big-Data framework, to use SharP, and quantify the performance and productivity advantages. Additionally, we demonstrate the simplicity of using SharP on different memories including DRAM, High-bandwidth Memory (HBM), and non-volatile random access memory (NVRAM).
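As a rough illustration of the data-structure centric idea described above, the sketch below allocates named data objects that carry placement metadata (a memory tier) and can be looked up by name rather than by raw pointer. Every identifier and type in it is invented for illustration; this is not the real SharP API, and a real implementation would back each tier with the appropriate allocator (DRAM, HBM, NVRAM) and make the objects visible across processes.

/* Toy data-object registry; all names are hypothetical, not the SharP API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TIER_DRAM, TIER_HBM, TIER_NVRAM } tier_t;  /* assumed memory tiers */

typedef struct {
    char   name[32];   /* data objects are addressed by name, not by pointer */
    tier_t tier;       /* requested placement */
    size_t bytes;
    void  *mem;
} data_obj;

#define MAX_OBJS 16
static data_obj registry[MAX_OBJS];
static int nobjs;

static data_obj *obj_create(const char *name, size_t bytes, tier_t tier) {
    if (nobjs == MAX_OBJS) return NULL;
    data_obj *o = &registry[nobjs++];
    strncpy(o->name, name, sizeof o->name - 1);
    o->tier  = tier;
    o->bytes = bytes;
    o->mem   = malloc(bytes);   /* toy: plain DRAM stands in for every tier */
    return o;
}

static data_obj *obj_lookup(const char *name) {
    for (int i = 0; i < nobjs; i++)
        if (!strcmp(registry[i].name, name)) return &registry[i];
    return NULL;
}

int main(void) {
    obj_create("stencil.grid", 1 << 20, TIER_HBM);    /* hot working set -> HBM */
    obj_create("kv.store",     1 << 24, TIER_NVRAM);  /* bulky, persistent data -> NVRAM */
    data_obj *g = obj_lookup("stencil.grid");
    if (g) printf("%s: %zu bytes requested on tier %d\n", g->name, g->bytes, (int)g->tier);
    return 0;
}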

9 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: SharP Hash's high performance is obtained through the use of high-performing networks and one-sided semantics, and its performance characteristics are demonstrated with a synthetic micro-benchmark and an implementation of a Key Value (KV) store, Memcached.
Abstract: A high-performing distributed hash is critical for achieving performance in many applications and system software using extreme-scale systems. It is also a central part of many Big-Data frameworks including Memcached, file systems, and job schedulers. However, there is a lack of high-performing distributed hash implementations. In this work, we propose, design, and implement SharP Hash, a high-performing, RDMA-based distributed hash for extreme-scale systems. SharP Hash's high performance is obtained through the use of high-performing networks and one-sided semantics. We perform an evaluation of SharP Hash and demonstrate its performance characteristics with a synthetic micro-benchmark and an implementation of a Key Value (KV) store, Memcached.
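A minimal sketch of the one-sided style of access described above is shown below, written against standard OpenSHMEM rather than SharP Hash's own API (which is not given here). Keys are hashed to an owning processing element (PE), and values are written and read with one-sided puts and gets into a symmetric table; the bucket layout, hash function, and absence of collision handling are simplifications for illustration.

/* One-sided KV sketch over OpenSHMEM; build e.g.: oshcc kv.c && oshrun -np 4 ./a.out */
#include <shmem.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 1024
#define VAL_LEN  64

static char table[NBUCKETS][VAL_LEN];   /* symmetric: same address on every PE */

static uint64_t hash64(uint64_t k) {    /* simple mixer, stand-in for a real hash */
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
    return k;
}

static void kv_put(uint64_t key, const char *val) {
    uint64_t h  = hash64(key);
    int owner   = (int)(h % shmem_n_pes());          /* PE that owns this key */
    size_t slot = (size_t)((h / shmem_n_pes()) % NBUCKETS);
    shmem_putmem(table[slot], val, VAL_LEN, owner);   /* one-sided remote write */
}

static void kv_get(uint64_t key, char *out) {
    uint64_t h  = hash64(key);
    int owner   = (int)(h % shmem_n_pes());
    size_t slot = (size_t)((h / shmem_n_pes()) % NBUCKETS);
    shmem_getmem(out, table[slot], VAL_LEN, owner);   /* one-sided remote read */
}

int main(void) {
    shmem_init();
    if (shmem_my_pe() == 0) {
        char v[VAL_LEN] = "hello";
        kv_put(42, v);
    }
    shmem_barrier_all();                /* completes the put and synchronizes PEs */
    char out[VAL_LEN];
    kv_get(42, out);
    printf("PE %d read: %s\n", shmem_my_pe(), out);
    shmem_finalize();
    return 0;
}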

6 citations

Book ChapterDOI
27 Aug 2018
TL;DR: The Unified Memory Allocator (UMA) of the SHARed data-structure centric Programming abstraction (SharP) library is presented, which provides a unified interface for memory allocations across DRAM, HBM, and NVRAM and is extensible to support future memory types.
Abstract: The pre-exascale systems will soon be deployed with a deep, complex memory hierarchy composed of many heterogeneous memories. This presents multiple challenges for users, including how to allocate data objects with locality between memories and devices for the various memories in these systems, which include DRAM, High-bandwidth Memory (HBM), and non-volatile random access memory (NVRAM), and how to perform these allocations while providing portability for their application. Currently, the user can make use of multiple, disjoint libraries to allocate data objects on these memories. However, it is difficult to obtain locality between memories and devices when using libraries that are unaware of each other. This paper presents the Unified Memory Allocator (UMA) of the SHARed data-structure centric Programming abstraction (SharP) library, which provides a unified interface for memory allocations across DRAM, HBM, and NVRAM and is extensible to support future memory types. In addition, the SharP UMA allows for portability between systems by supporting both explicit and implicit, intent-based memory allocations. To demonstrate the ease of use of the SharP UMA, we have extended both Open MPI and OpenSHMEM-X to support SharP. We validate this work by evaluating the performance implications and the intent-based approach with synthetic benchmarks as well as adaptations of the Graph500 benchmark.
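The sketch below illustrates the explicit versus implicit, intent-based allocation contrast described above, with memkind standing in as the backend for DRAM and HBM. The uma_* names and intent values are invented for illustration and are not the SharP UMA interface; NVRAM is omitted for brevity (memkind can back it with a file-backed PMEM kind in the same pattern).

/* Intent-based vs. explicit tier allocation; build e.g.: cc uma.c -lmemkind */
#include <memkind.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { INTENT_DEFAULT, INTENT_BANDWIDTH, INTENT_CAPACITY } intent_t;  /* hypothetical */

/* implicit, intent-based allocation: the allocator picks the tier */
static void *uma_alloc_intent(size_t bytes, intent_t intent) {
    memkind_t kind = (intent == INTENT_BANDWIDTH)
                         ? MEMKIND_HBW_PREFERRED   /* HBM if present, else fall back to DRAM */
                         : MEMKIND_DEFAULT;        /* plain DRAM */
    return memkind_malloc(kind, bytes);
}

/* explicit allocation: the caller names the tier directly */
static void *uma_alloc_explicit(size_t bytes, memkind_t kind) {
    return memkind_malloc(kind, bytes);
}

int main(void) {
    double *hot  = uma_alloc_intent(1 << 20, INTENT_BANDWIDTH);   /* portable: state the intent */
    double *cold = uma_alloc_explicit(1 << 24, MEMKIND_DEFAULT);  /* pinned explicitly to DRAM  */
    printf("hot=%p cold=%p\n", (void *)hot, (void *)cold);
    memkind_free(NULL, hot);    /* NULL kind: memkind detects the kind from the pointer */
    memkind_free(NULL, cold);
    return 0;
}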

1 citation

Proceedings ArticleDOI
01 Mar 2018
TL;DR: This work shows the need for taking a holistic approach towards data-centric abstractions, shows how these approaches were realized in the SharP library, a data-structure centric programming abstraction, and applies these approaches to a variety of applications that demonstrate their usefulness.
Abstract: Extreme-scale applications (i.e., Big-Compute) are becoming increasingly data-intensive, i.e., producing and consuming increasingly large amounts of data. The HPC systems traditionally used for these applications are now used for Big-Data applications such as data analytics, social network analysis, machine learning, and genomics. As a consequence of these trends, the system architecture should be flexible and data-centric. This can already be witnessed in the pre-exascale systems with TBs of on-node hierarchical and heterogeneous memories, PBs of system memory, low-latency, high-throughput networks, and many threaded cores. As such, the pre-exascale systems suit the needs of both Big-Compute and Big-Data applications. Though the system architecture is flexible enough to support both Big-Compute and Big-Data, we argue there is a software gap. Particularly, we need data-centric abstractions to leverage the full potential of the system, i.e., there is a need for native support for data resilience, the ability to express data locality and affinity, mechanisms to reduce data movement, the ability to share data, and abstractions to express a user's data usage and data access patterns. In this paper, we (i) show the need for taking a holistic approach towards data-centric abstractions, (ii) show how these approaches were realized in the SHARed data-structure centric Programming abstraction (SharP) library, and (iii) apply these approaches to a variety of applications that demonstrate their usefulness. Particularly, we apply these approaches to QMCPack and the Graph500 benchmark and demonstrate the advantages of this approach on extreme-scale systems.

1 citation

Proceedings ArticleDOI
31 May 2016
TL;DR: This work presents an application-level library to "checkpoint" and restore data, extensions of NPB benchmarks for fault tolerance based on different strategies, and some preliminary results that show the impact of such fault-tolerance strategies on the application execution.
Abstract: In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerance is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a modification of some of the benchmarks of the NAS Parallel Benchmarks (NPB) to include support for the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to "checkpoint" and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault-tolerance strategies on the application execution.
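The sketch below shows the general shape of the recovery pattern described above: critical data is kept in an in-memory "checkpoint", MPI errors are returned rather than fatal, and on a detected failure the communicator is shrunk and the data rolled back. It requires a ULFM-enabled MPI (the MPIX_* calls); it is an illustration of the ULFM recovery idiom, not the paper's actual library or benchmark modifications.

/* Application-level checkpoint/rollback with ULFM-style recovery (sketch). */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_revoke / MPIX_Comm_shrink; location may vary by MPI */
#include <string.h>

#define N 1024
static double data[N], checkpoint[N];

static void take_checkpoint(void)    { memcpy(checkpoint, data, sizeof data); }
static void restore_checkpoint(void) { memcpy(data, checkpoint, sizeof data); }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);   /* failures are reported, not fatal */

    for (int i = 0; i < N; i++) data[i] = i;
    take_checkpoint();

    for (int iter = 0; iter < 10; iter++) {
        double local = data[0], global;
        int rc = MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        if (rc != MPI_SUCCESS) {
            /* A peer failed or the communicator was revoked (a real code would
             * check MPI_Error_class for MPIX_ERR_PROC_FAILED): rebuild and roll back. */
            MPIX_Comm_revoke(comm);
            MPI_Comm shrunk;
            MPIX_Comm_shrink(comm, &shrunk);
            if (comm != MPI_COMM_WORLD) MPI_Comm_free(&comm);
            comm = shrunk;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            restore_checkpoint();
            iter--;                       /* redo the failed iteration */
            continue;
        }
        data[0] = global;                 /* forward progress */
        if (iter % 5 == 0) take_checkpoint();
    }

    if (comm != MPI_COMM_WORLD) MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}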

Cited by
Journal ArticleDOI
TL;DR: A black-box Fork-Join model is proposed that covers a wide range of Fork-Join structures for the prediction of tail and mean latency, with predictors called ForkTail and ForkMean, respectively; it can be used as a powerful tool to aid the design of tail- and mean-latency-guaranteed job scheduling and resource provisioning, especially at high load, for datacenter applications.
Abstract: The workflows of the predominant datacenter services are underlaid by various Fork-Join structures. Due to the lack of good understanding of the performance of Fork-Join structures in general, today's datacenters often operate under low resource utilization to meet stringent service level objectives (SLOs), e.g., in terms of tail and/or mean latency, for such services. Hence, to achieve high resource utilization, while meeting stringent SLOs, it is of paramount importance to be able to accurately predict the tail and/or mean latency for a broad range of Fork-Join structures of practical interest. In this article, we propose a black-box Fork-Join model that covers a wide range of Fork-Join structures for the prediction of tail and mean latency, with the resulting predictors called ForkTail and ForkMean, respectively. We derive highly computationally efficient, empirical expressions for tail and mean latency as functions of the means and variances of task response times. Our extensive testing results based on model-based and trace-driven simulations, as well as a real-world case study in a cloud environment, demonstrate that the models can consistently predict the tail and mean latency within 20 and 15 percent prediction errors at 80 and 90 percent load levels, respectively, for heavy-tailed workloads, and at any load levels for light-tailed workloads. Moreover, our sensitivity analysis demonstrates that such errors can be well compensated for with no more than 7 percent resource overprovisioning. Consequently, the proposed prediction model can be used as a powerful tool to aid the design of tail- and mean-latency-guaranteed job scheduling and resource provisioning, especially at high load, for datacenter applications.
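The closed-form predictors are not reproduced in this abstract, so the sketch below only illustrates the quantity they predict: a job that forks K tasks completes when its slowest task does, and its tail latency is a high percentile of that maximum. The fan-out, the exponential task-time distribution, and the chosen percentile are assumptions made for the illustration, not the paper's workload model.

/* Monte Carlo estimate of fork-join mean and p99 latency; build e.g.: cc fj.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define K    32        /* fan-out of the fork stage (assumed) */
#define JOBS 100000    /* simulated jobs */

static double exp_sample(double mean) {            /* exponential task response time */
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return -mean * log(u);
}

static int cmp(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    static double lat[JOBS];
    double sum = 0.0;
    srand(1);
    for (int j = 0; j < JOBS; j++) {
        double m = 0.0;
        for (int k = 0; k < K; k++) {               /* the join waits for the slowest task */
            double t = exp_sample(1.0);
            if (t > m) m = t;
        }
        lat[j] = m;
        sum += m;
    }
    qsort(lat, JOBS, sizeof lat[0], cmp);
    printf("mean fork-join latency ~ %.3f\n", sum / JOBS);
    printf("p99 tail latency       ~ %.3f\n", lat[(int)(0.99 * JOBS)]);
    return 0;
}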

19 citations

Journal ArticleDOI
TL;DR: It is shown that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability of finding all the resources it requires inside its rack.
Abstract: The expected halt of traditional technology scaling is motivating increased heterogeneity in high-performance computing (HPC) systems with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today’s rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC’s Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep-learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability of finding all the resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.

13 citations

Journal ArticleDOI
TL;DR: This paper sheds light upon how big data frameworks can be ported to HPC platforms as a preliminary step towards the convergence of the big data and exascale computing ecosystems.
Abstract: The dawn of exascale computing and its convergence with big data analytics has greatly spurred research interest. The reasons are straightforward. Traditionally, high performance computing (HPC) systems have been used for scientific applications involving a majority of compute-intensive tasks. At the same time, the proliferation of big data resulted in the design of data-intensive processing paradigms like the Apache big data stack. Big data generated at a high pace necessitates faster processing mechanisms for getting insights in real time. For this, HPC systems may serve as a panacea for solving big data problems. Though HPC systems have the capability to give promising results for big data, directly integrating them with existing data-intensive frameworks like the Apache big data stack is not straightforward due to the challenges associated with them. This triggers research on seamlessly integrating these two paradigms based on an interoperable framework, programming model, and system architecture. The aim of this paper is to assess the progress made in the HPC world as an effort to augment it with big data analytics support. As an outcome of this, a taxonomy showing the factors to be considered for augmenting HPC systems with big data support has been put forth. This paper sheds light upon how big data frameworks can be ported to HPC platforms as a preliminary step towards the convergence of the big data and exascale computing ecosystems. The focus is given to research issues related to augmenting HPC paradigms with big data frameworks and corresponding approaches to address those issues. This paper also discusses data-intensive as well as compute-intensive processing paradigms, benchmark suites and workloads, and future directions in the domain of integrating HPC with big data analytics.

11 citations

Proceedings ArticleDOI
11 Nov 2018
TL;DR: BESPOKV is presented, an adaptive, extensible, and scale-out KV store framework that decouples the KV store design into the control plane for distributed management and the data plane for the local data store, and transparently enables a scalable and fault-tolerant distributed KV store service.
Abstract: Enterprise KV stores are not well suited for HPC applications, and entail customization and cumbersome end-to-end KV design to meet HPC application needs. To this end, in this paper we present bespoKV, an adaptive, extensible, and scale-out KV store framework. bespoKV decouples the KV store design into the control plane for distributed management and the data plane for the local data store. bespoKV takes as input a single-server KV store, called a datalet, and transparently enables a scalable and fault-tolerant distributed KV store service. The resulting distributed stores are also adaptive to consistency or topology requirement changes and can be easily extended for new types of services. Experiments show that bespoKV-enabled distributed KV stores scale horizontally to a large number of nodes, and perform comparably to, and sometimes better than, state-of-the-art systems.
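As a toy illustration of the control-plane/data-plane decoupling described above, the sketch below wraps a minimal single-server store (standing in for a "datalet") behind a routing function that decides which node owns a key. All names and types are invented; this is not the bespoKV API, and a real deployment would forward requests over the network rather than index a local array of stores.

/* Datalet-style decoupling, toy version: routing (control plane) vs. storage (data plane). */
#include <stdio.h>
#include <string.h>

#define NODES 4
#define SLOTS 16

typedef struct {                 /* data plane: one single-server KV store */
    char keys[SLOTS][32];
    char vals[SLOTS][32];
} datalet;

static void datalet_put(datalet *d, const char *k, const char *v) {
    for (int i = 0; i < SLOTS; i++)
        if (!d->keys[i][0] || !strcmp(d->keys[i], k)) {
            strncpy(d->keys[i], k, 31);
            strncpy(d->vals[i], v, 31);
            return;
        }
}

static const char *datalet_get(datalet *d, const char *k) {
    for (int i = 0; i < SLOTS; i++)
        if (!strcmp(d->keys[i], k)) return d->vals[i];
    return NULL;
}

/* control plane: decide which node owns a key */
static int route(const char *k) {
    unsigned h = 5381;
    while (*k) h = h * 33u + (unsigned char)*k++;
    return (int)(h % NODES);
}

int main(void) {
    static datalet cluster[NODES];               /* stand-in for remote nodes */
    datalet_put(&cluster[route("user:42")], "user:42", "alice");
    const char *v = datalet_get(&cluster[route("user:42")], "user:42");
    printf("%s\n", v ? v : "(miss)");
    return 0;
}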

10 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: SharP Hash's high performance is obtained through the use of high-performing networks and one-sided semantics, and its performance characteristics are demonstrated with a synthetic micro-benchmark and an implementation of a Key Value (KV) store, Memcached.
Abstract: A high-performing distributed hash is critical for achieving performance in many applications and system software using extreme-scale systems. It is also a central part of many Big-Data frameworks including Memcached, file systems, and job schedulers. However, there is a lack of high-performing distributed hash implementations. In this work, we propose, design, and implement SharP Hash, a high-performing, RDMA-based distributed hash for extreme-scale systems. SharP Hash's high performance is obtained through the use of high-performing networks and one-sided semantics. We perform an evaluation of SharP Hash and demonstrate its performance characteristics with a synthetic micro-benchmark and an implementation of a Key Value (KV) store, Memcached.

6 citations