SciSpace - formerly Typeset

Putt Sakdhnagool

Researcher at Purdue University

Publications: 16
Citations: 166

Putt Sakdhnagool is an academic researcher from Purdue University. The author has contributed to research on topics including compilers and speedup. The author has an h-index of 6 and has co-authored 16 publications receiving 137 citations. Previous affiliations of Putt Sakdhnagool include the Thailand National Science and Technology Development Agency.

Papers
Proceedings Article

Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

TL;DR: Pagoda is presented, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. It achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
Proceedings ArticleDOI

Massively parallel 3D image reconstruction

TL;DR: A new algorithm for MBIR is presented, the Non-Uniform Parallel Super-Voxel (NU-PSV) algorithm, that regularizes the data access pattern, enables massive parallelism, and ensures fast convergence.
Book Chapter

Evaluating Performance Portability of OpenACC

TL;DR: This paper evaluates the performance portability of twelve OpenACC programs on NVIDIA CUDA, AMD GCN, and Intel MIC architectures, and studies the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.
Proceedings Article

HeteroDoop: A MapReduce Programming System for Accelerator Clusters

TL;DR: Evaluation results of HeteroDoop on recent hardware indicate that using even a single GPU per node can improve performance by up to 2.6x over CPU-only Hadoop running on a cluster with 20-core CPUs.
Proceedings Article

Scaling large-data computations on multi-GPU accelerators

TL;DR: A mechanism and implementation are presented to automatically pipeline the CPU-GPU memory channel, overlapping GPU computation with memory copies to alleviate data transfer overhead; a novel adaptive runtime tuning mechanism is also proposed to automatically select the pipeline stage size.