
Showing papers on "Massively parallel published in 2014"


Journal ArticleDOI
TL;DR: The main capabilities of cp2k are summarized, and recent applications illustrate the science cp2k has enabled in the field of atomistic simulation.
Abstract: cp2k has become a versatile open-source tool for the simulation of complex systems on the nanometer scale. It allows for sampling and exploring potential energy surfaces that can be computed using a variety of empirical and first principles models. Excellent performance for electronic structure calculations is achieved using novel algorithms implemented for modern and massively parallel hardware. This review briefly summarizes the main capabilities and illustrates with recent applications the science cp2k has enabled in the field of atomistic simulation.

2,114 citations


Journal ArticleDOI
27 Feb 2014
TL;DR: SpiNNaker as discussed by the authors is a massively parallel million-core computer whose interconnect architecture is inspired by the connectivity characteristics of the mammalian brain, and which is suited to the modeling of large-scale spiking neural networks in biological real time.
Abstract: The spiking neural network architecture (SpiNNaker) project aims to deliver a massively parallel million-core computer whose interconnect architecture is inspired by the connectivity characteristics of the mammalian brain, and which is suited to the modeling of large-scale spiking neural networks in biological real time. Specifically, the interconnect allows the transmission of a very large number of very small data packets, each conveying explicitly the source, and implicitly the time, of a single neural action potential or “spike.” In this paper, we review the current state of the project, which has already delivered systems with up to 2500 processors, and present the real-time event-driven programming model that supports flexible access to the resources of the machine and has enabled its use by a wide range of collaborators around the world.

936 citations


Journal ArticleDOI
TL;DR: A novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size is introduced, and first experiences with Bayesian inference at the whole-genome level are reported.
Abstract: Modern sequencing technology now allows biologists to collect the entirety of molecular evidence for reconstructing evolutionary trees. We introduce a novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size. Our software introduces a nonblocking parallelization of Metropolis-coupled chains, modifications for efficient analyses of data sets comprising thousands of partitions, and memory-saving techniques. We report on first experiences with Bayesian inferences at the whole-genome level using the SuperMUC supercomputer and simulated data.

369 citations
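
For context, the swap move behind Metropolis-coupled MCMC is compact enough to sketch. Below is a minimal single-process toy version in Python; the paper's contribution is a nonblocking parallelization of exactly this scheme, which is not shown, and the normal log-posterior merely stands in for a phylogenetic likelihood.

```python
import numpy as np

def log_post(x):
    """Toy log-posterior (standard normal); stands in for a tree likelihood."""
    return -0.5 * x * x

def mc3_step(states, temps, rng):
    """One sweep of Metropolis-coupled MCMC: per-chain moves, then one swap."""
    # Within-chain Metropolis updates, each chain targeting p(x)^beta.
    for i, beta in enumerate(temps):
        prop = states[i] + rng.normal(scale=1.0)
        if np.log(rng.random()) < beta * (log_post(prop) - log_post(states[i])):
            states[i] = prop
    # Attempt to swap a random adjacent pair of chains.
    j = rng.integers(len(temps) - 1)
    log_r = (temps[j] - temps[j + 1]) * (log_post(states[j + 1]) - log_post(states[j]))
    if np.log(rng.random()) < log_r:
        states[j], states[j + 1] = states[j + 1], states[j]
    return states

rng = np.random.default_rng(0)
temps = [1.0, 0.5, 0.25, 0.125]          # cold chain first, then heated chains
states = [0.0] * len(temps)
for _ in range(1000):
    states = mc3_step(states, temps, rng)
print("cold-chain sample:", states[0])   # only the cold chain is the posterior
```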


Journal ArticleDOI
TL;DR: The design and development of the automata processor is presented: a massively parallel non-von Neumann semiconductor architecture that is purpose-built for automata processing and exceeds the capabilities of high-performance FPGA-based implementations of regular expression processors.
Abstract: We present the design and development of the automata processor, a massively parallel non-von Neumann semiconductor architecture that is purpose-built for automata processing. This architecture can directly implement non-deterministic finite automata in hardware and can be used to implement complex regular expressions, as well as other types of automata which cannot be expressed as regular expressions. We demonstrate that this architecture exceeds the capabilities of high-performance FPGA-based implementations of regular expression processors. We report on the development of an XML-based language for describing automata for easy compilation targeted to the hardware. The automata processor can be effectively utilized in a diverse array of applications driven by pattern matching, such as cyber security and computational biology.

234 citations
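
As a software analogue of what the chip does in hardware, here is a minimal state-set simulation of an NFA in Python; the three-state automaton for the pattern ab*c is a hypothetical example. The automata processor's advantage is that it updates all active states for each input symbol in parallel, whereas this loop does so serially.

```python
def nfa_match(transitions, start, accept, text):
    """Unanchored NFA search: 'transitions' maps (state, symbol) -> set of
    successor states. The active-state set plays the role of the hardware's
    state-transition elements, all of which update once per input symbol."""
    active = {start}
    for ch in text:
        active = set().union(*(transitions.get((s, ch), set()) for s in active))
        active.add(start)                # allow a match to start at any position
        if active & accept:
            return True
    return False

# Hypothetical three-state NFA for the pattern "ab*c".
trans = {(0, "a"): {1}, (1, "b"): {1}, (1, "c"): {2}}
print(nfa_match(trans, 0, {2}, "xxabbbcyy"))  # True
print(nfa_match(trans, 0, {2}, "xxabbz"))     # False
```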


Journal ArticleDOI
TL;DR: The Mercury analysis pipeline is developed and deployed on local hardware and in the Amazon Web Services cloud via the DNAnexus platform, providing accurate and reproducible genomic results at scales ranging from individuals to large cohorts.
Abstract: Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.

220 citations


Journal ArticleDOI
TL;DR: The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, and intelligent system administration.
Abstract: On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, efficient software stacks that provide a high-performance file system and an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.

174 citations


Journal ArticleDOI
TL;DR: By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.
Abstract: Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Supercomputing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; there are several technical restrictions that must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model, and cellular automata simulations. These examples are representative of the kinds of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems, and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.

158 citations
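
Of the survey's four example problems, cellular automata illustrate the data-parallel pattern most directly: every cell updates independently from its neighbours. The vectorized NumPy sketch below (a Game-of-Life step) is a CPU stand-in for the one-thread-per-cell GPU kernel the survey discusses.

```python
import numpy as np

def life_step(grid):
    """One Game-of-Life update. Each cell depends only on its 8 neighbours,
    so all cells can be updated independently -- the data-parallel pattern a
    GPU kernel would map one thread per cell onto."""
    # Count live neighbours with periodic boundaries by summing shifted copies.
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(grid.dtype)

rng = np.random.default_rng(1)
grid = (rng.random((64, 64)) < 0.3).astype(np.uint8)
for _ in range(10):
    grid = life_step(grid)
print("live cells after 10 steps:", int(grid.sum()))
```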


Proceedings ArticleDOI
24 Mar 2014
TL;DR: This paper illustrates how the lack of accurate timing properties, which may prevent parallel execution from being applicable to time-critical applications, has been addressed by suitably designing the architecture, implementation, and programming model of the Kalray MPPA-256 single-chip many-core processor.
Abstract: The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. We illustrate how this problem has been addressed by suitably designing the architecture, implementation, and programming model of the Kalray MPPA®-256 single-chip many-core processor. The MPPA®-256 (Multi-Purpose Processing Array) processor integrates 256 processing engine (PE) cores and 32 resource management (RM) cores on a single 28 nm CMOS chip. These VLIW cores are distributed across 16 compute clusters and 4 I/O subsystems, each with a locally shared memory. On-chip communication and synchronization are supported by an explicitly addressed dual network-on-chip (NoC), with one node per compute cluster and 4 nodes per I/O subsystem. Off-chip interfaces include DDR, PCI, and Ethernet, and a direct access to the NoC for low-latency processing of data streams. The key architectural features that support time-critical applications are timing compositional cores, independent memory banks inside the compute clusters, and the data NoC, whose guaranteed services are determined by network calculus. The programming model provides communicators that effectively support distributed computing primitives such as remote writes, barrier synchronizations, active messages, and communication by sampling. POSIX time functions expose synchronous clocks inside compute clusters and mesosynchronous clocks across the MPPA®-256 processor.

158 citations


Patent
28 Aug 2014
TL;DR: In this patent, the inventors provide methods, compositions, and kits for multiplex nucleic acid analysis of single cells, which may be used for massively parallel single-cell sequencing.
Abstract: The disclosure provides for methods, compositions, and kits for multiplex nucleic acid analysis of single cells. The methods, compositions and systems may be used for massively parallel single cell sequencing. The methods, compositions and systems may be used to analyze thousands of cells concurrently. The thousands of cells may comprise a mixed population of cells (e.g., cells of different types or subtypes, different sizes).

129 citations


Proceedings ArticleDOI
01 Jul 2014
TL;DR: In the 50 years since its invention, the acceptance and applicability of the DSMC method have increased significantly: extensive verification and validation efforts have driven its acceptance, while the increase in computer speed has been the main factor behind its greater applicability.
Abstract: In the 50 years since its invention, the acceptance and applicability of the DSMC method have increased significantly. Extensive verification and validation efforts have led to its greater acceptance, whereas the increase in computer speed has been the main factor behind its greater applicability. As the performance of a single processor reaches its limit, massively parallel computing is expected to play an even stronger role in its future development.

124 citations


Proceedings Article
01 May 2014
TL;DR: This work presents the ongoing effort to create a massively parallel Bible corpus, with over 900 translations in more than 830 language varieties, and reports on the current status of the corpus.
Abstract: We present our ongoing effort to create a massively parallel Bible corpus. While an ever-increasing number of Bible translations is available in electronic form on the internet, there is no large-scale parallel Bible corpus that allows language researchers to easily get access to the texts and their parallel structure for a large variety of different languages. We report on the current status of the corpus, with over 900 translations in more than 830 language varieties. All translations are tokenized (e.g., separating punctuation marks) and Unicode normalized. Mainly due to copyright restrictions, only portions of the texts are made publicly available. However, we provide co-occurrence information for each translation in a (sparse) matrix format. All word forms in the translation are given together with their frequency and the verses in which they occur.
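
To make the distribution format concrete, here is a sketch of building such a sparse word-form-by-verse matrix with scipy.sparse; the verse IDs, sample text, and whitespace tokenization are simplified placeholders rather than the corpus's actual conventions.

```python
from collections import Counter
from scipy.sparse import csr_matrix

# Toy verses (hypothetical IDs and text) standing in for one tokenized,
# Unicode-normalized translation.
verses = {
    "01001001": "in the beginning god created the heaven and the earth",
    "01001002": "and the earth was without form and void",
}

vocab, rows, cols, vals = {}, [], [], []
verse_ids = sorted(verses)
for col, vid in enumerate(verse_ids):
    for form, freq in Counter(verses[vid].split()).items():
        rows.append(vocab.setdefault(form, len(vocab)))
        cols.append(col)
        vals.append(freq)

# Sparse word-form x verse matrix: entry (w, v) = frequency of form w in verse v.
m = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(verse_ids)))
print(m.shape, "total tokens:", m.sum())
print("frequency of 'the' per verse:", m[vocab["the"]].toarray())
```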

Journal ArticleDOI
TL;DR: NICOLE as mentioned in this paper is a non-LTE radiative transfer code for spectral lines and Zeeman-induced polarization profiles, spanning a wide range of atmospheric heights, from the photosphere to the chromosphere.
Abstract: With the advent of a new generation of solar telescopes and instrumentation, the interpretation of chromospheric observations (in particular, spectro-polarimetry) requires new, suitable diagnostic tools. This paper describes a new code, NICOLE, that has been designed for Stokes non-LTE radiative transfer, both for synthesis and inversion of spectral lines and Zeeman-induced polarization profiles, spanning a wide range of atmospheric heights, from the photosphere to the chromosphere. The code offers a number of unique features and capabilities and has been built from scratch with a powerful parallelization scheme that makes it suitable for application on massive datasets using large supercomputers. The source code is being publicly released, with the idea of facilitating future branching by other groups to augment its capabilities.

Journal ArticleDOI
TL;DR: A new robust algorithm to automatically generate hierarchical Cartesian meshes with multiple levels of refinement on distributed multicore HPC systems is presented, and the efficiency of the approach is demonstrated by considering human nasal cavity and internal combustion engine flow problems.

Journal ArticleDOI
TL;DR: ls1 mardyn as discussed by the authors is a highly scalable molecular dynamics simulation code, optimized for massively parallel execution on supercomputing architectures, that currently holds the world record for the largest molecular simulation, with over four trillion particles.
Abstract: The molecular dynamics simulation code ls1 mardyn is presented. It is a highly scalable code, optimized for massively parallel execution on supercomputing architectures and currently holds the world record for the largest molecular simulation with over four trillion particles. It enables the application of pair potentials to length and time scales that were previously out of scope for molecular dynamics simulation. With an efficient dynamic load balancing scheme, it delivers high scalability even for challenging heterogeneous configurations. Presently, multicenter rigid potential models based on Lennard-Jones sites, point charges, and higher-order polarities are supported. Due to its modular design, ls1 mardyn can be extended to new physical models, methods, and algorithms, allowing future users to tailor it to suit their respective needs. Possible applications include scenarios with complex geometries, such as fluids at interfaces, as well as nonequilibrium molecular dynamics simulation of heat and mass transfer.
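
As a point of reference for what such codes evaluate at scale, here is the plain O(N²) Lennard-Jones energy/force kernel with a cutoff; production codes such as ls1 mardyn replace the all-pairs loop with linked cells and domain decomposition, which is what makes trillion-particle runs possible.

```python
import numpy as np

def lj_energy_forces(pos, eps=1.0, sigma=1.0, rcut=2.5):
    """Total Lennard-Jones energy and per-particle forces (O(N^2) reference
    version; the cutoff rcut mimics the short-range truncation used in
    large-scale molecular dynamics codes)."""
    n = len(pos)
    energy = 0.0
    forces = np.zeros_like(pos)
    for i in range(n - 1):
        d = pos[i + 1:] - pos[i]                 # vectors from i to all later j
        r2 = (d * d).sum(axis=1)
        mask = r2 < rcut * rcut
        s6 = (sigma * sigma / r2[mask]) ** 3
        energy += (4.0 * eps * (s6 * s6 - s6)).sum()
        # F_j = 24 eps (2 (sigma/r)^12 - (sigma/r)^6) / r^2 * d  (Newton's 3rd law)
        fvec = (24.0 * eps * (2.0 * s6 * s6 - s6) / r2[mask])[:, None] * d[mask]
        forces[i] -= fvec.sum(axis=0)
        forces[i + 1:][mask] += fvec
    return energy, forces

rng = np.random.default_rng(2)
pos = rng.random((100, 3)) * 8.0
e, f = lj_energy_forces(pos)
print("E =", e, " net force ~0:", np.abs(f.sum(axis=0)).max())
```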

Proceedings ArticleDOI
23 Mar 2014
TL;DR: This work presents a high-level memory controller model, specifically designed for full-system exploration of future system architectures, that captures the most important DRAM timing constraints for current and emerging DRAM interfaces, e.g. DDR3, LPDDR3 and WideIO.
Abstract: Compute requirements are increasing rapidly in systems ranging from mobile devices to servers. These often massively parallel architectures place increasing requirements on memory bandwidth and latency. The memory system greatly impacts both system performance and power, and it is key to capture the complex behaviour of the DRAM controller when evaluating CPU and GPU performance. By using full-system simulation, the interactions between the system components are captured. However, traditional DRAM controller models focus on modelling interactions between the controller and the DRAM rather than the interactions with the system. Moreover, the DRAM interactions are modelled on a cycle-by-cycle basis, leading to inflexibility and poor simulation performance. In this work, we present a high-level memory controller model, specifically designed for full-system exploration of future system architectures. Our event-based model is tailored to match a contemporary controller architecture, and captures the most important DRAM timing constraints for current and emerging DRAM interfaces, e.g. DDR3, LPDDR3 and WideIO. We show how our controller leverages the open-source gem5 simulation framework, and compare it to a state-of-the-art DRAM controller simulator. Our results show that our model is 7x faster on average, while maintaining the fidelity of the simulation. To highlight the capabilities of our model, we show that it can be used to evaluate a multi-processor memory system.
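
The event-based idea, advancing time from request to request instead of ticking every cycle, can be illustrated with a toy single-bank model. The timing parameters and policy below are illustrative assumptions for the sketch, not gem5's actual controller model or any real DDR3/LPDDR3 datasheet.

```python
# Event-based (not cycle-by-cycle) sketch of a DRAM bank timing model: each
# request is one event whose completion time follows from a few constraints.
T_RP, T_RCD, T_CL = 15, 15, 15            # precharge / activate / CAS (ns, assumed)

class Bank:
    def __init__(self):
        self.open_row = None
        self.ready_at = 0                  # time the bank can start a new command

    def access(self, row, now):
        t = max(now, self.ready_at)
        if self.open_row != row:           # row miss: precharge (if open) + activate
            if self.open_row is not None:
                t += T_RP
            t += T_RCD
            self.open_row = row
        done = t + T_CL                    # column access returns data
        self.ready_at = done               # serialize accesses to this bank
        return done

bank = Bank()
reqs = [(0, 7), (5, 7), (6, 3)]            # (arrival time, row): hit, hit, miss
for arrival, row in reqs:
    print(f"req row {row} at t={arrival} -> data at t={bank.access(row, arrival)}")
```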

Journal ArticleDOI
TL;DR: The pSIMS design is presented, and example assessments demonstrate its multi-model, multi-scale, and multi-sector versatility as well as the efficiency gains attained.
Abstract: We present a framework for massively parallel climate impact simulations: the parallel System for Integrating Impact Models and Sectors (pSIMS). This framework comprises a) tools for ingesting and converting large amounts of data to a versatile datatype based on a common geospatial grid; b) tools for translating this datatype into custom formats for site-based models; c) a scalable parallel framework for performing large ensemble simulations, using any one of a number of different impacts models, on clusters, supercomputers, distributed grids, or clouds; d) tools and data standards for reformatting outputs to common datatypes for analysis and visualization; and e) methodologies for aggregating these datatypes to arbitrary spatial scales such as administrative and environmental demarcations. By automating many time-consuming and error-prone aspects of large-scale climate impacts studies, pSIMS accelerates computational research, encourages model intercomparison, and enhances reproducibility of simulation results. We present the pSIMS design and use example assessments to demonstrate its multi-model, multi-scale, and multi-sector versatility. Highlights: an open-source framework for efficient massively parallel climate impact simulations; analysis of dozens of crop and tree species with DSSAT, APSIM, and CenW; a multi-model, multi-scale assessment of maize yield in Africa using DSSAT and APSIM; a high-resolution climate impact assessment of New Zealand forest productivity; and the computational scaling behavior of the framework, to assess the efficiency gain attained.

Journal ArticleDOI
TL;DR: It is shown that the classical Jacobi Over-Relaxation method (JOR) should not be used, as its convergence requires a proper value of the relaxation parameter, and that other strategies should be preferred.
Abstract: In this paper, we investigate various numerical strategies to compute the direct space polarization energy and associated forces in the context of the point dipole approximation (including damping) used in polarizable molecular dynamics. We present a careful mathematical analysis of the algorithms that have been implemented in popular production packages and applied to large test systems. We show that the classical Jacobi Over-Relaxation method (JOR) should not be used, as its convergence requires a proper value of the relaxation parameter, whereas other strategies should be preferred. On a single node, Preconditioned Conjugate Gradient methods (PCG) and the Jacobi algorithm coupled with Direct Inversion in the Iterative Subspace (JI/DIIS) provide reliable stability/convergence and are roughly twice as fast as JOR. Moreover, both algorithms are suitable for massively parallel implementations. The lower requirements in terms of interprocess communication make JI/DIIS the method of choice for MPI and hybrid OpenMP/MPI implementations.
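
The polarization equations amount to a symmetric positive-definite linear system T mu = E for the induced dipoles, so the PCG approach the paper recommends can be sketched generically. In the sketch below, a random SPD matrix merely stands in for the (damped) dipole interaction matrix, with its diagonal as a Jacobi preconditioner.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient for A x = b, with A symmetric positive
    definite and M_inv a diagonal preconditioner applied to residuals."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Stand-in for the polarization system T mu = E (random SPD by construction).
rng = np.random.default_rng(3)
G = rng.normal(size=(50, 50))
T = G @ G.T + 50 * np.eye(50)
E = rng.normal(size=50)                    # "external field" right-hand side
mu = pcg(T, E, M_inv=1.0 / np.diag(T))
print("residual:", np.linalg.norm(T @ mu - E))
```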

Journal ArticleDOI
TL;DR: This contribution introduces a massively parallel GPU implementation of the hybrid computational homogenization method for visco-plastic materials using a reduced basis approach in a mixed variational formulation and allows for heterogeneous hardening variables instead of piecewise constant fields.

Journal ArticleDOI
TL;DR: This work introduces a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs) and presents a seamless mapping flow for TCPAs, based on a domain-specific language, and outlines a complete symbolic mapping approach.
Abstract: We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template which can be tailored before runtime to fulfill customers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing in the areas of signal, image, and video processing, as well as other streaming processing applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in such a way that they support self-adaptivity and resource awareness at the hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing, in which an application can dynamically claim, execute, and release resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.

Journal ArticleDOI
14 Jun 2014
TL;DR: STAG is proposed: a high-density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM), which inherently offers unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire.
Abstract: General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge that the bits must be sequentially accessed by performing "shift" operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift-aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).
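
The cost that the head-management policies in technique (iii) target can be illustrated with a toy model: reading bit b on a tape whose head sits at position h costs |h - b| shift operations, so the parking position between accesses matters. The tape length, uniform access stream, and the two policies below are illustrative assumptions, not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(9)
accesses = rng.integers(0, 64, 10000)      # uniform access stream (assumption)

def total_shifts(policy):
    """Count shifts on a 64-bit tape under a head-parking policy."""
    head, total = 32, 0
    for b in accesses:
        total += abs(head - b)             # shifts needed to reach bit b
        # "lazy" leaves the head where it lands; "center" predictively parks it
        # at the midpoint (re-centering assumed to happen off the critical path).
        head = b if policy == "lazy" else 32
    return total

print("lazy:", total_shifts("lazy"), " center:", total_shifts("center"))
```

For uniform accesses, parking at the centre gives an expected distance of about 16 positions versus roughly 21 for the lazy policy, which is the intuition behind predictive head placement.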

Journal ArticleDOI
TL;DR: It is shown how the parallel framework facilitates simulations of such processes and, without any loss of accuracy or precision, gives a significant speedup and allows for the study of much larger systems and much wider temperature ranges than possible with single-walker methods.
Abstract: We investigate a generic, parallel replica-exchange framework for Monte Carlo simulations based on the Wang-Landau method. To demonstrate its advantages and general applicability for massively parallel simulations of complex systems, we apply it to lattice spin models, the self-assembly process in amphiphilic solutions, and the adsorption of molecules on surfaces. While of general current interest, the latter phenomena are challenging to study computationally because of multiple structural transitions occurring over a broad temperature range. We show how the parallel framework facilitates simulations of such processes and, without any loss of accuracy or precision, gives a significant speedup and allows for the study of much larger systems and much wider temperature ranges than possible with single-walker methods.
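
Below is a minimal single-walker Wang-Landau sketch for a toy model whose density of states is known exactly (a state is an N-bit string with E = number of set bits, so g(E) is binomial). The replica-exchange layer the paper adds, with walkers on overlapping energy windows swapping configurations, is omitted here.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
N = 10
state = rng.integers(0, 2, N)
E = int(state.sum())
log_g = np.zeros(N + 1)                    # running estimate of ln g(E)
hist = np.zeros(N + 1)
f = 1.0                                    # ln of the modification factor

while f > 1e-4:
    for _ in range(5000):
        i = rng.integers(N)                # propose flipping one bit
        E_new = E + (1 - 2 * state[i])
        # Accept with min(1, g(E)/g(E_new)) to flatten the energy histogram.
        if np.log(rng.random()) < log_g[E] - log_g[E_new]:
            state[i] ^= 1
            E = E_new
        log_g[E] += f
        hist[E] += 1
    if hist.min() > 0.8 * hist.mean():     # histogram "flat enough"
        f /= 2.0
        hist[:] = 0

exact = gammaln(N + 1) - gammaln(np.arange(N + 1) + 1) - gammaln(N - np.arange(N + 1) + 1)
print("max error in ln g(E):",
      np.abs((log_g - log_g[0]) - (exact - exact[0])).max())
```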

Journal ArticleDOI
TL;DR: Results from the Lorenz '96 simulations suggest that inexact calculations at the small scale could reduce computation and power costs without adversely affecting the quality of the simulations, which would allow higher-resolution models to be run at the same computational cost.
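
The Lorenz '96 system referenced above is compact enough to state directly: dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F on a periodic ring. The sketch below uses single precision as a crude stand-in for inexact hardware; individual chaotic trajectories diverge, and the paper's question is whether the climate-like statistics nonetheless survive.

```python
import numpy as np

def l96_rhs(x, F=8.0):
    """Lorenz '96 tendency: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def integrate(x0, dt=0.01, steps=500, dtype=np.float64):
    x = x0.astype(dtype)
    for _ in range(steps):                 # forward Euler, kept deliberately simple
        x = x + dt * l96_rhs(x).astype(dtype)
    return x

x0 = 8.0 + 0.01 * np.random.default_rng(5).standard_normal(40)
hi = integrate(x0, dtype=np.float64)
lo = integrate(x0, dtype=np.float32)       # crude stand-in for inexact hardware
print("RMS double/single divergence:",
      float(np.sqrt(((hi - lo.astype(np.float64)) ** 2).mean())))
```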

Journal ArticleDOI
02 Dec 2014
TL;DR: The article describes the active flow control application; then summarizes the main features in the implementation of a massively parallel turbulent flow solver, PHASTA; and finally demonstrates the method's strong scalability at extreme scale.
Abstract: Massively parallel computation provides an enormous capacity to perform simulations on a timescale that can change the paradigm of how scientists, engineers, and other practitioners use simulations to address discovery and design. This work considers an active flow control application on a realistic and complex wing design that could be leveraged by a scalable, fully implicit, unstructured flow solver and access to high-performance computing resources. The article describes the active flow control application; then summarizes the main features in the implementation of a massively parallel turbulent flow solver, PHASTA; and finally demonstrates the method's strong scalability at extreme scale. Scaling studies were performed with unstructured meshes of 11 and 92 billion elements on the Argonne Leadership Computing Facility's Blue Gene/Q Mira machine with up to 786,432 cores and 3,145,728 MPI processes.

Journal ArticleDOI
TL;DR: In massively parallel supercomputer environments, the coupler OASIS-MCT is recommended, which resolves memory limitations that may be significant in case of very large computational domains and exchange fields as they occur in these specific test cases and in many applications in terrestrial research.
Abstract: Continental-scale hyper-resolution simulations constitute a grand challenge in characterizing nonlinear feedbacks of states and fluxes of the coupled water, energy, and biogeochemical cycles of terrestrial systems. Tackling this challenge requires advanced coupling and supercomputing technologies for earth system models that are discussed in this study, utilizing the example of the implementation of the newly developed Terrestrial Systems Modeling Platform (TerrSysMP v1.0) on JUQUEEN (IBM Blue Gene/Q) of the Jülich Supercomputing Centre, Germany. The applied coupling strategies rely on the Multiple Program Multiple Data (MPMD) paradigm using the OASIS suite of external couplers, and require memory and load balancing considerations in the exchange of the coupling fields between different component models and the allocation of computational resources, respectively. Using the advanced profiling and tracing tool Scalasca to determine an optimum load balancing leads to a 19% speedup. In massively parallel supercomputer environments, the coupler OASIS-MCT is recommended, which resolves memory limitations that may be significant in case of very large computational domains and exchange fields as they occur in these specific test cases and in many applications in terrestrial research. However, model I/O and initialization in the petascale range still require major attention, as they constitute true big data challenges in light of future exascale computing resources. Based on a factor-two speedup due to compiler optimizations, a refactored coupling interface using OASIS-MCT and an optimum load balancing, the problem size in a weak scaling study can be increased by a factor of 64 from 512 to 32,768 processes while maintaining parallel efficiencies above 80% for the component models.

ReportDOI
01 Nov 2014
TL;DR: The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models that can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code.
Abstract: The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component; we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal.
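
Loop tiling, one of the constructs the report lists, is easy to state concretely: operate on blocks small enough to stay resident in fast memory while they are reused. A blocked matrix-multiply sketch follows; Python is used for consistency with the other examples here, though the abstractions in question target compiled languages and runtimes.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked matrix multiply: work on tile x tile blocks so each block is
    reused while it is still resident in fast memory -- the locality idea the
    report's tiling abstractions are meant to express portably."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(6)
A, B = rng.random((96, 96)), rng.random((96, 96))
print("max |tiled - reference|:", np.abs(matmul_tiled(A, B) - A @ B).max())
```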

Journal ArticleDOI
TL;DR: This paper explores how the big-three computing paradigms (symmetric multiprocessor, graphical processing units or GPUs, and cluster computing) can together be brought to bear on large-data Gaussian process (GP) regression problems via a careful implementation of a newly developed local approximation scheme.
Abstract: We explore how the big-three computing paradigms (symmetric multiprocessor, graphical processing units or GPUs, and cluster computing) can together be brought to bear on large-data Gaussian process (GP) regression problems via a careful implementation of a newly developed local approximation scheme. Our methodological contribution focuses primarily on GPU computation, as this requires the most care and also provides the largest performance boost. However, in our empirical work we study the relative merits of all three paradigms to determine how best to combine them. The paper concludes with two case studies. One is a real-data fluid-dynamics computer experiment which benefits from the local nature of our approximation; the second is a synthetic example designed to find the largest data set for which (accurate) GP emulation can be performed on a commensurate predictive set in under an hour.
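
A simplified sketch of the local-approximation idea: fit a small GP only to inputs near each predictive location. The paper's scheme selects the local design greedily (and offloads the linear algebra to GPUs); plain nearest neighbours and a fixed squared-exponential kernel are used below purely for illustration.

```python
import numpy as np

def gp_predict_local(X, y, xstar, n_local=50, ell=0.3, g=1e-6):
    """Predict at xstar from a GP fit only to the n_local nearest inputs --
    a stand-in for the paper's local approximation, which chooses the local
    design greedily rather than by plain nearest-neighbour distance."""
    d = np.linalg.norm(X - xstar, axis=1)
    idx = np.argsort(d)[:n_local]
    Xl, yl = X[idx], y[idx]

    def k(a, b):                           # squared-exponential kernel
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * ell * ell))

    K = k(Xl, Xl) + g * np.eye(n_local)    # small nugget for stability
    kstar = k(xstar[None, :], Xl)[0]
    alpha = np.linalg.solve(K, yl)
    return kstar @ alpha                   # posterior mean (zero prior mean)

rng = np.random.default_rng(7)
X = rng.random((2000, 2))
y = np.sin(5 * X[:, 0]) * np.cos(3 * X[:, 1])
xs = np.array([0.4, 0.6])
print("local GP mean:", gp_predict_local(X, y, xs),
      " truth:", np.sin(5 * 0.4) * np.cos(3 * 0.6))
```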

Journal ArticleDOI
TL;DR: This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third-generation NVIDIA GPU hardware, also known as 'Kepler', and shows that the simpler approach is the most efficient.

Proceedings ArticleDOI
01 Sep 2014
TL;DR: This set of kernels covers the most common patterns of communication, computation and synchronization encountered in parallel HPC applications and can be used to design an effective parallel computer system without needing to make predictions about the nature of future workloads.
Abstract: We present the Parallel Research Kernels: a collection of kernels supporting research on parallel computer systems. This set of kernels covers the most common patterns of communication, computation and synchronization encountered in parallel HPC applications. By focusing on these kernels instead of specific workloads, one can design an effective parallel computer system without needing to make predictions about the nature of future workloads.

Journal ArticleDOI
TL;DR: EMPIRE is a massively parallel semiempirical (NDDO) molecular-orbital program that scales well both on single multi-core nodes and on large clusters; its design and performance are demonstrated with full SCF calculations on systems of up to 55,000 atoms, such as an adamantane nanocrystal.
Abstract: EMPIRE is a massively parallel semiempirical (NDDO) molecular-orbital program designed to scale well both on single multi-core nodes (using OpenMP) and on large clusters (using a hybrid OpenMP/MPI model). The program design and performance are discussed for single self-consistent-field calculations on up to 55,000 atoms (the adamantane crystal shown in the graphic) and on both single- and multi-node machines using either Windows 7 or Linux. EMPIRE currently carries out the full SCF calculation with no local approximations or other linear-scaling techniques. The single-node version is available free of charge to bona fide academic groups.

Book ChapterDOI
TL;DR: In this paper, the authors present a sequential posterior simulator designed to operate efficiently in parallel computing environments, which makes fewer analytical and programming demands on investigators, and is faster, more reliable, and more complete than conventional posterior simulators.
Abstract: Massively parallel desktop computing capabilities now well within the reach of individual academics modify the environment for posterior simulation in fundamental and potentially quite advantageous ways. But to fully exploit these benefits algorithms that conform to parallel computing environments are needed. This paper presents a sequential posterior simulator designed to operate efficiently in this context. The simulator makes fewer analytical and programming demands on investigators, and is faster, more reliable, and more complete than conventional posterior simulators. The paper extends existing sequential Monte Carlo methods and theory to provide a thorough and practical foundation for sequential posterior simulation that is well suited to massively parallel computing environments. It provides detailed recommendations on implementation, yielding an algorithm that requires only code for simulation from the prior and evaluation of prior and data densities and works well in a variety of applications representative of serious empirical work in economics and finance. The algorithm facilitates Bayesian model comparison by producing marginal likelihood approximations of unprecedented accuracy as an incidental by-product, is robust to pathological posterior distributions, and provides estimates of numerical standard error and relative numerical efficiency intrinsically. The paper concludes with an application that illustrates the potential of these simulators for applied Bayesian inference.
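
A minimal data-tempered sequential posterior simulator for a toy conjugate model is sketched below, showing two properties the abstract highlights: only simulation from the prior and density evaluations are required, and a marginal-likelihood estimate falls out as a by-product. The mutation (MCMC) steps and the parallel particle-group structure of the paper's algorithm are omitted.

```python
import numpy as np

# Toy model: y_t ~ N(theta, 1) with prior theta ~ N(0, 1). Particles are drawn
# from the prior, reweighted observation by observation, and resampled when the
# effective sample size (ESS) drops below n/2.
rng = np.random.default_rng(8)
y = rng.normal(0.7, 1.0, size=100)         # synthetic data from theta = 0.7

n = 5000
theta = rng.normal(0.0, 1.0, n)            # draws from the prior
logw = np.zeros(n)                         # log weights since last resample
log_ml = 0.0                               # accumulates log marginal likelihood
for yt in y:
    logw += -0.5 * (yt - theta) ** 2 - 0.5 * np.log(2 * np.pi)
    w = np.exp(logw - logw.max())
    if (w.sum() ** 2) / (w * w).sum() < n / 2:
        # Fold the mean incremental weight into the marginal likelihood,
        # then resample to equal weights.
        log_ml += logw.max() + np.log(w.mean())
        theta = theta[rng.choice(n, n, p=w / w.sum())]
        logw[:] = 0.0

w = np.exp(logw - logw.max())
log_ml += logw.max() + np.log(w.mean())    # final segment
post_mean = (w * theta).sum() / w.sum()
print("posterior mean:", post_mean, " log marginal likelihood:", log_ml)
```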