General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
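The energy argument behind DVFS can be illustrated with the standard first-order CMOS relation, dynamic power ≈ α·C·V²·f. The sketch below is not the GPUWattch model: the function names and every operating point are hypothetical values chosen for illustration, and static (leakage) power is ignored.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """First-order CMOS dynamic power: P = alpha * C * V^2 * f (watts)."""
    return activity * capacitance * voltage ** 2 * frequency


def dynamic_energy(activity, capacitance, voltage, frequency, cycles):
    """Dynamic energy to execute a fixed number of cycles at one operating point."""
    runtime = cycles / frequency  # seconds
    return dynamic_power(activity, capacitance, voltage, frequency) * runtime


# Hypothetical operating points: nominal (1.0 V, 700 MHz) vs. scaled (0.9 V, 600 MHz).
nominal = dynamic_energy(0.5, 1e-9, 1.0, 700e6, cycles=7e8)
scaled = dynamic_energy(0.5, 1e-9, 0.9, 600e6, cycles=7e8)
savings = 1.0 - scaled / nominal
print(f"dynamic energy saved: {savings:.1%}")
```

For a fixed cycle count the frequency term cancels, so in this simplified picture the saving comes from the lower voltage, paid for with a longer runtime.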

/pdf/gpuwattch-enabling-energy-optimizations-in-gpgpus-2vu027vqw1.pdf

GPUWattch: enabling energy optimizations in GPGPUs

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism. Legion also enables explicit, programmer controlled movement of data through the memory hierarchy and placement of tasks based on locality information via a novel mapping interface. We evaluate our Legion implementation on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation.
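The idea that regions let a runtime identify independent tasks can be sketched as follows. This is not Legion's API: regions are modeled here as plain sets of element indices, and `RegionUse`, `Task`, `interferes`, and `dependences` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionUse:
    elements: frozenset  # hypothetical stand-in for a logical region
    privilege: str       # "read" or "write"


@dataclass(frozen=True)
class Task:
    name: str
    uses: tuple


def interferes(a, b):
    """Two tasks conflict if they share elements and at least one of them writes."""
    for ua in a.uses:
        for ub in b.uses:
            overlap = ua.elements & ub.elements
            if overlap and "write" in (ua.privilege, ub.privilege):
                return True
    return False


def dependences(tasks):
    """Map each task to the earlier tasks (in program order) it must wait for."""
    deps = {}
    for i, task in enumerate(tasks):
        deps[task.name] = [t.name for t in tasks[:i] if interferes(t, task)]
    return deps


left = frozenset(range(0, 50))
right = frozenset(range(50, 100))
tasks = [
    Task("update_left", (RegionUse(left, "write"),)),
    Task("update_right", (RegionUse(right, "write"),)),
    Task("sum_all", (RegionUse(left | right, "read"),)),
]
deps = dependences(tasks)
print(deps)
```

The two update tasks touch disjoint data and have no dependences, so they may run in parallel; the reader of both halves waits for both writers.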

http://theory.stanford.edu/~aiken/publications/papers/sc12.pdf

Legion: expressing locality and independence with logical regions

Aims: Comparative studies suggest that stem cells committed to a cardiac lineage are more effective for improving heart function than those featuring an extra-cardiac phenotype. We have therefore developed a population of human embryonic stem cell (ESC)-derived cardiac progenitor cells.

Methods and results: Undifferentiated human ESCs (I6 line) were amplified and cardiac-committed by exposure to bone morphogenetic protein-2 and a fibroblast growth factor receptor inhibitor. Cells responding to these cardio-instructive cues express the cardiac transcription factor Isl-1 and the stage-specific embryonic antigen SSEA-1, which was then used to purify them by immunomagnetic sorting. The Isl-1+ SSEA-1+ cells were then embedded into a fibrin scaffold which was surgically delivered onto the infarct area in a 68-year-old patient suffering from severe heart failure (New York Heart Association [NYHA] functional Class III; left ventricular ejection fraction [LVEF]: 26%). A coronary artery bypass was performed concomitantly in a non-infarcted area. The implanted cells featured a high degree of purity (99% were SSEA-1+), had lost the expression of Sox-2 and Nanog, taken as markers for pluripotency, and strongly expressed Isl-1. The intraoperative delivery of the patch was expeditious. The post-operative course was likewise uncomplicated. After 3 months, the patient is symptomatically improved (NYHA functional Class I; LVEF: 36%) and a new-onset contractility is echocardiographically evident in the previously akinetic cell/patch-treated, non-revascularized area. There have been no complications such as arrhythmias, tumour formation, or immunosuppression-related adverse events.

Conclusion: This observation demonstrates the feasibility of generating a clinical-grade population of human ESC-derived cardiac progenitors and combining it within a tissue-engineered construct. While any conclusion pertaining to efficacy would be meaningless, the patient's functional outcome nevertheless provides an encouraging hint. Beyond this case, the platform that has been set could be useful for generating different ESC-derived lineage-specific progenies.

/pdf/human-embryonic-stem-cell-derived-cardiac-progenitors-for-3znydslfgs.pdf

Human embryonic stem cell-derived cardiac progenitors for severe heart failure treatment: first clinical case report

Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunity in improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and as a result inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency hiding capability. The third scheme, bank-level parallelism aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide 33% average performance improvement compared to the commonly-employed round-robin warp scheduling policy.

/pdf/owl-cooperative-thread-array-aware-scheduling-techniques-for-37blh4iqko.pdf

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

The computational requirements for training deep neural networks (DNNs) have grown to the point that it is now standard practice to parallelize training. Existing deep learning systems commonly use data or model parallelism, but unfortunately, these strategies often result in suboptimal parallelization performance. 
In this paper, we define a more comprehensive search space of parallelization strategies for DNNs called SOAP, which includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy's performance and is three orders of magnitude faster than prior approaches that have to execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow can increase training throughput by up to 3.8x over state-of-the-art approaches, even when including its search time, and also improves scalability.
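The combination of a randomized search with a fast cost simulator can be sketched as below. This is not FlexFlow's implementation: the Metropolis-style acceptance rule is an assumption made for the example, the operations form a simple linear chain, and `simulate`, `search`, and all costs are hypothetical.

```python
import math
import random


def simulate(strategy, op_costs, comm_cost):
    """Toy cost model: slowest device's load plus a penalty per cross-device edge."""
    loads = {}
    for device, cost in zip(strategy, op_costs):
        loads[device] = loads.get(device, 0.0) + cost
    crossings = sum(1 for a, b in zip(strategy, strategy[1:]) if a != b)
    return max(loads.values()) + comm_cost * crossings


def search(op_costs, num_devices, comm_cost, steps=2000, beta=1.0, seed=0):
    """Randomized search over device assignments, scored by the simulator."""
    rng = random.Random(seed)
    current = [0] * len(op_costs)  # start with every operation on device 0
    current_cost = simulate(current, op_costs, comm_cost)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Propose a neighbor: move one randomly chosen operation to a random device.
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.randrange(num_devices)
        cost = simulate(candidate, op_costs, comm_cost)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the regression grows.
        if cost <= current_cost or rng.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return best, best_cost


op_costs = [4.0, 3.0, 2.0, 5.0, 1.0, 3.0]
baseline = simulate([0] * len(op_costs), op_costs, comm_cost=0.5)
best, best_cost = search(op_costs, num_devices=2, comm_cost=0.5)
print(baseline, best_cost, best)
```

Because each candidate is scored by the simulator rather than executed, the search can afford thousands of proposals; the quality of the result depends entirely on how faithful the cost model is.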

Beyond Data and Model Parallelism for Deep Neural Networks

As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37x on representative synthetic micro-benchmarks, and 1.15x–3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.

https://ppl.stanford.edu/papers/sc11-bauer-slides.pdf

CudaDMA: optimizing GPU memory bandwidth via warp specialization

We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on parallel and distributed machines. We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.

/pdf/regent-a-high-productivity-programming-language-for-hpc-with-2m84md4mxn.pdf

Regent: a high-productivity programming language for HPC with logical regions

Rationale: Multiple progenitors derived from the heart and bone marrow (BM) have been used for cardiac repair. Despite this, not much is known about the molecular identity and relationship among these progenitors. To develop a robust stem cell therapy for the heart, it is critical to understand the molecular identity of the multiple cardiogenic progenitor cells. Objective: This study is the first report of high-throughput transcriptional profiling of cardiogenic progenitor cells carried out on an identical platform. Method and Results: Microarray-based transcriptional profiling was carried out for 3 cardiac (ckit+, Sca1+, and side population) and 2 BM (ckit+ and mesenchymal stem cell) progenitors, obtained from age- and sex-matched wild-type C57BL/6 mice. Analysis indicated that the cardiac-derived ckit+ population was very distinct from Sca1+ and side population cells in the downregulation of genes encoding cell–cell and cell–matrix adhesion proteins, and in the upregulation of developmental genes. Signific...

Dissecting the Molecular Relationship Among Various Cardiogenic Progenitor Cells

We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a light-weight event system capable of operating without central management. We describe an implementation of Realm that relies on a novel generational event data structure for efficiently handling large numbers of events in a distributed address space. Microbenchmark experiments show our implementation of Realm approaches the underlying hardware performance limits. We measure the performance of three real-world applications on the Keeneland supercomputer. Our results demonstrate that Realm confers considerable latency hiding to clients, attaining significant speedups over traditional bulk-synchronous and independently optimized MPI codes.
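The style of programming in which every operation returns an event immediately, and later operations name that event as a precondition, can be sketched in a few lines. This single-threaded toy is not Realm's API or its generational event structure; `Event` and `spawn` are hypothetical names used only to show deferred execution.

```python
class Event:
    """Hypothetical event: triggers once, then runs any deferred callbacks."""

    def __init__(self):
        self.triggered = False
        self._waiters = []

    def on_trigger(self, callback):
        # Run immediately if the event already fired, otherwise defer.
        if self.triggered:
            callback()
        else:
            self._waiters.append(callback)

    def trigger(self):
        self.triggered = True
        waiters, self._waiters = self._waiters, []
        for callback in waiters:
            callback()


def spawn(fn, wait_on=None):
    """Schedule fn to run once wait_on triggers; return fn's completion event."""
    done = Event()

    def run():
        fn()
        done.trigger()

    if wait_on is None:
        run()
    else:
        wait_on.on_trigger(run)
    return done


log = []
start = Event()
copied = spawn(lambda: log.append("copy"), wait_on=start)
computed = spawn(lambda: log.append("compute"), wait_on=copied)
log.append("issued")  # both spawn calls returned before any work ran
start.trigger()
print(log)
```

The caller builds the whole dependence chain without waiting on anything; work runs only when its precondition fires.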

/pdf/realm-an-event-based-low-level-runtime-for-distributed-20bpatrszu.pdf

Michael Bauer

Papers

Legion: expressing locality and independence with logical regions

CudaDMA: optimizing GPU memory bandwidth via warp specialization

Regent: a high-productivity programming language for HPC with logical regions

Dissecting the Molecular Relationship Among Various Cardiogenic Progenitor Cells

Realm: an event-based low-level runtime for distributed memory architectures