
Showing papers on "Massively parallel published in 2014"


Journal ArticleDOI
TL;DR: The main capabilities of cp2k are summarized, and recent applications illustrate the science cp2k has enabled in the field of atomistic simulation.
Abstract: cp2k has become a versatile open-source tool for the simulation of complex systems on the nanometer scale. It allows for sampling and exploring potential energy surfaces that can be computed using a variety of empirical and first principles models. Excellent performance for electronic structure calculations is achieved using novel algorithms implemented for modern and massively parallel hardware. This review briefly summarizes the main capabilities and illustrates with recent applications the science cp2k has enabled in the field of atomistic simulation.

2,114 citations


Journal ArticleDOI
27 Feb 2014
TL;DR: SpiNNaker as discussed by the authors is a massively parallel million-core computer whose interconnect architecture is inspired by the connectivity characteristics of the mammalian brain, and which is suited to the modeling of large-scale spiking neural networks in biological real time.
Abstract: The spiking neural network architecture (SpiNNaker) project aims to deliver a massively parallel million-core computer whose interconnect architecture is inspired by the connectivity characteristics of the mammalian brain, and which is suited to the modeling of large-scale spiking neural networks in biological real time. Specifically, the interconnect allows the transmission of a very large number of very small data packets, each conveying explicitly the source, and implicitly the time, of a single neural action potential or “spike.” In this paper, we review the current state of the project, which has already delivered systems with up to 2500 processors, and present the real-time event-driven programming model that supports flexible access to the resources of the machine and has enabled its use by a wide range of collaborators around the world.

936 citations


Journal ArticleDOI
TL;DR: A novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size is introduced, and first experiences with Bayesian inference at the whole-genome level are reported.
Abstract: Modern sequencing technology now allows biologists to collect the entirety of molecular evidence for reconstructing evolutionary trees. We introduce a novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size. Our software introduces a nonblocking parallelization of Metropolis-coupled chains, modifications for efficient analyses of data sets comprising thousands of partitions, and memory-saving techniques. We report on first experiences with Bayesian inferences at the whole-genome level using the SuperMUC supercomputer and simulated data.

369 citations
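
For context, the swap move behind Metropolis-coupled MCMC is compact enough to sketch. Below is a minimal single-process toy version in Python; the paper's contribution is a nonblocking parallelization of exactly this scheme, which is not shown, and the normal log-posterior merely stands in for a phylogenetic likelihood.

```python
import numpy as np

def log_post(x):
    """Toy log-posterior (standard normal); stands in for a tree likelihood."""
    return -0.5 * x * x

def mc3_step(states, temps, rng):
    """One sweep of Metropolis-coupled MCMC: per-chain moves, then one swap."""
    # Within-chain Metropolis updates, each chain targeting p(x)^beta.
    for i, beta in enumerate(temps):
        prop = states[i] + rng.normal(scale=1.0)
        if np.log(rng.random()) < beta * (log_post(prop) - log_post(states[i])):
            states[i] = prop
    # Attempt to swap a random adjacent pair of chains.
    j = rng.integers(len(temps) - 1)
    log_r = (temps[j] - temps[j + 1]) * (log_post(states[j + 1]) - log_post(states[j]))
    if np.log(rng.random()) < log_r:
        states[j], states[j + 1] = states[j + 1], states[j]
    return states

rng = np.random.default_rng(0)
temps = [1.0, 0.5, 0.25, 0.125]          # cold chain first, then heated chains
states = [0.0] * len(temps)
for _ in range(1000):
    states = mc3_step(states, temps, rng)
print("cold-chain sample:", states[0])   # only the cold chain is the posterior
```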


Journal ArticleDOI
TL;DR: The design and development of the automata processor is presented: a massively parallel non-von Neumann semiconductor architecture that is purpose-built for automata processing and exceeds the capabilities of high-performance FPGA-based implementations of regular expression processors.
Abstract: We present the design and development of the automata processor, a massively parallel non-von Neumann semiconductor architecture that is purpose-built for automata processing. This architecture can directly implement non-deterministic finite automata in hardware and can be used to implement complex regular expressions, as well as other types of automata which cannot be expressed as regular expressions. We demonstrate that this architecture exceeds the capabilities of high-performance FPGA-based implementations of regular expression processors. We report on the development of an XML-based language for describing automata for easy compilation targeted to the hardware. The automata processor can be effectively utilized in a diverse array of applications driven by pattern matching, such as cyber security and computational biology.

234 citations
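
As a software analogue of what the chip does in hardware, here is a minimal state-set simulation of an NFA in Python; the three-state automaton for the pattern ab*c is a hypothetical example. The automata processor's advantage is that it updates all active states for each input symbol in parallel, whereas this loop does so serially.

```python
def nfa_match(transitions, start, accept, text):
    """Unanchored NFA search: 'transitions' maps (state, symbol) -> set of
    successor states. The active-state set plays the role of the hardware's
    state-transition elements, all of which update once per input symbol."""
    active = {start}
    for ch in text:
        active = set().union(*(transitions.get((s, ch), set()) for s in active))
        active.add(start)                # allow a match to start at any position
        if active & accept:
            return True
    return False

# Hypothetical three-state NFA for the pattern "ab*c".
trans = {(0, "a"): {1}, (1, "b"): {1}, (1, "c"): {2}}
print(nfa_match(trans, 0, {2}, "xxabbbcyy"))  # True
print(nfa_match(trans, 0, {2}, "xxabbz"))     # False
```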


Journal ArticleDOI
TL;DR: The Mercury analysis pipeline is developed and deployed on local hardware and in the Amazon Web Services cloud via the DNAnexus platform, providing accurate and reproducible genomic results at scales ranging from individuals to large cohorts.
Abstract: Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.

220 citations


Journal ArticleDOI
TL;DR: The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, and intelligent system administration.
Abstract: On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, efficient software stacks that provide a high-performance file system and an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.

174 citations


Journal ArticleDOI
TL;DR: By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.
Abstract: Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Supercomputing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; there are several technical restrictions that must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model, and cellular automata simulations. These examples are representative of the kinds of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems, and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.

158 citations
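
Of the survey's four example problems, cellular automata illustrate the data-parallel pattern most directly: every cell updates independently from its neighbours. The vectorized NumPy sketch below (a Game-of-Life step) is a CPU stand-in for the one-thread-per-cell GPU kernel the survey discusses.

```python
import numpy as np

def life_step(grid):
    """One Game-of-Life update. Each cell depends only on its 8 neighbours,
    so all cells can be updated independently -- the data-parallel pattern a
    GPU kernel would map one thread per cell onto."""
    # Count live neighbours with periodic boundaries by summing shifted copies.
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(grid.dtype)

rng = np.random.default_rng(1)
grid = (rng.random((64, 64)) < 0.3).astype(np.uint8)
for _ in range(10):
    grid = life_step(grid)
print("live cells after 10 steps:", int(grid.sum()))
```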


Proceedings ArticleDOI
24 Mar 2014
TL;DR: This paper illustrates how the lack of accurate timing properties, which may prevent parallel execution from being applicable to time-critical applications, has been addressed by suitably designing the architecture, implementation, and programming model of the Kalray MPPA-256 single-chip many-core processor.
Abstract: The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. We illustrate how this problem has been addressed by suitably designing the architecture, implementation, and programming model of the Kalray MPPA®-256 single-chip many-core processor. The MPPA®-256 (Multi-Purpose Processing Array) processor integrates 256 processing engine (PE) cores and 32 resource management (RM) cores on a single 28 nm CMOS chip. These VLIW cores are distributed across 16 compute clusters and 4 I/O subsystems, each with a locally shared memory. On-chip communication and synchronization are supported by an explicitly addressed dual network-on-chip (NoC), with one node per compute cluster and 4 nodes per I/O subsystem. Off-chip interfaces include DDR, PCI, and Ethernet, and a direct access to the NoC for low-latency processing of data streams. The key architectural features that support time-critical applications are timing compositional cores, independent memory banks inside the compute clusters, and the data NoC, whose guaranteed services are determined by network calculus. The programming model provides communicators that effectively support distributed computing primitives such as remote writes, barrier synchronizations, active messages, and communication by sampling. POSIX time functions expose synchronous clocks inside compute clusters and mesosynchronous clocks across the MPPA®-256 processor.

158 citations


Patent
28 Aug 2014
TL;DR: In this patent, the inventors provide methods, compositions, and kits for multiplex nucleic acid analysis of single cells, which may be used for massively parallel single-cell sequencing.
Abstract: The disclosure provides for methods, compositions, and kits for multiplex nucleic acid analysis of single cells. The methods, compositions and systems may be used for massively parallel single cell sequencing. The methods, compositions and systems may be used to analyze thousands of cells concurrently. The thousands of cells may comprise a mixed population of cells (e.g., cells of different types or subtypes, different sizes).

129 citations


Proceedings ArticleDOI
01 Jul 2014
TL;DR: In the 50 years since its invention, the acceptance and applicability of the DSMC method have increased significantly: extensive verification and validation efforts have driven its acceptance, while the increase in computer speed has been the main factor behind its greater applicability.
Abstract: In the 50 years since its invention, the acceptance and applicability of the DSMC method have increased significantly. Extensive verification and validation efforts have led to its greater acceptance, whereas the increase in computer speed has been the main factor behind its greater applicability. As the performance of a single processor reaches its limit, massively parallel computing is expected to play an even stronger role in its future development.

124 citations


Proceedings Article
01 May 2014
TL;DR: This work presents the ongoing effort to create a massively parallel Bible corpus, with over 900 translations in more than 830 language varieties, and reports on the current status of the corpus.
Abstract: We present our ongoing effort to create a massively parallel Bible corpus. While an ever-increasing number of Bible translations is available in electronic form on the internet, there is no large-scale parallel Bible corpus that allows language researchers to easily get access to the texts and their parallel structure for a large variety of different languages. We report on the current status of the corpus, with over 900 translations in more than 830 language varieties. All translations are tokenized (e.g., separating punctuation marks) and Unicode normalized. Mainly due to copyright restrictions, only portions of the texts are made publicly available. However, we provide co-occurrence information for each translation in a (sparse) matrix format. All word forms in the translation are given together with their frequency and the verses in which they occur.
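
To make the distribution format concrete, here is a sketch of building such a sparse word-form-by-verse matrix with scipy.sparse; the verse IDs, sample text, and whitespace tokenization are simplified placeholders rather than the corpus's actual conventions.

```python
from collections import Counter
from scipy.sparse import csr_matrix

# Toy verses (hypothetical IDs and text) standing in for one tokenized,
# Unicode-normalized translation.
verses = {
    "01001001": "in the beginning god created the heaven and the earth",
    "01001002": "and the earth was without form and void",
}

vocab, rows, cols, vals = {}, [], [], []
verse_ids = sorted(verses)
for col, vid in enumerate(verse_ids):
    for form, freq in Counter(verses[vid].split()).items():
        rows.append(vocab.setdefault(form, len(vocab)))
        cols.append(col)
        vals.append(freq)

# Sparse word-form x verse matrix: entry (w, v) = frequency of form w in verse v.
m = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(verse_ids)))
print(m.shape, "total tokens:", m.sum())
print("frequency of 'the' per verse:", m[vocab["the"]].toarray())
```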

Journal ArticleDOI
TL;DR: NICOLE as mentioned in this paper is a non-LTE radiative transfer code for spectral lines and Zeeman-induced polarization profiles, spanning a wide range of atmospheric heights, from the photosphere to the chromosphere.
Abstract: With the advent of a new generation of solar telescopes and instrumentation, the interpretation of chromospheric observations (in particular, spectro-polarimetry) requires new, suitable diagnostic tools. This paper describes a new code, NICOLE, that has been designed for Stokes non-LTE radiative transfer, both for synthesis and inversion of spectral lines and Zeeman-induced polarization profiles, spanning a wide range of atmospheric heights, from the photosphere to the chromosphere. The code offers a number of unique features and capabilities and has been built from scratch with a powerful parallelization scheme that makes it suitable for application on massive datasets using large supercomputers. The source code is being publicly released, with the idea of facilitating future branching by other groups to augment its capabilities.

Journal ArticleDOI
TL;DR: A new robust algorithm to automatically generate hierarchical Cartesian meshes with multiple levels of refinement on distributed multicore HPC systems is presented, and the efficiency of the approach is demonstrated by considering human nasal cavity and internal combustion engine flow problems.

Journal ArticleDOI
TL;DR: ls1 mardyn as discussed by the authors is a highly scalable molecular dynamics simulation code, optimized for massively parallel execution on supercomputing architectures, that currently holds the world record for the largest molecular simulation, with over four trillion particles.
Abstract: The molecular dynamics simulation code ls1 mardyn is presented. It is a highly scalable code, optimized for massively parallel execution on supercomputing architectures and currently holds the world record for the largest molecular simulation with over four trillion particles. It enables the application of pair potentials to length and time scales that were previously out of scope for molecular dynamics simulation. With an efficient dynamic load balancing scheme, it delivers high scalability even for challenging heterogeneous configurations. Presently, multicenter rigid potential models based on Lennard-Jones sites, point charges, and higher-order polarities are supported. Due to its modular design, ls1 mardyn can be extended to new physical models, methods, and algorithms, allowing future users to tailor it to suit their respective needs. Possible applications include scenarios with complex geometries, such as fluids at interfaces, as well as nonequilibrium molecular dynamics simulation of heat and mass transfer.
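
As a point of reference for what such codes evaluate at scale, here is the plain O(N²) Lennard-Jones energy/force kernel with a cutoff; production codes such as ls1 mardyn replace the all-pairs loop with linked cells and domain decomposition, which is what makes trillion-particle runs possible.

```python
import numpy as np

def lj_energy_forces(pos, eps=1.0, sigma=1.0, rcut=2.5):
    """Total Lennard-Jones energy and per-particle forces (O(N^2) reference
    version; the cutoff rcut mimics the short-range truncation used in
    large-scale molecular dynamics codes)."""
    n = len(pos)
    energy = 0.0
    forces = np.zeros_like(pos)
    for i in range(n - 1):
        d = pos[i + 1:] - pos[i]                 # vectors from i to all later j
        r2 = (d * d).sum(axis=1)
        mask = r2 < rcut * rcut
        s6 = (sigma * sigma / r2[mask]) ** 3
        energy += (4.0 * eps * (s6 * s6 - s6)).sum()
        # F_j = 24 eps (2 (sigma/r)^12 - (sigma/r)^6) / r^2 * d  (Newton's 3rd law)
        fvec = (24.0 * eps * (2.0 * s6 * s6 - s6) / r2[mask])[:, None] * d[mask]
        forces[i] -= fvec.sum(axis=0)
        forces[i + 1:][mask] += fvec
    return energy, forces

rng = np.random.default_rng(2)
pos = rng.random((100, 3)) * 8.0
e, f = lj_energy_forces(pos)
print("E =", e, " net force ~0:", np.abs(f.sum(axis=0)).max())
```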

Proceedings ArticleDOI
23 Mar 2014
TL;DR: This work presents a high-level memory controller model, specifically designed for full-system exploration of future system architectures, that captures the most important DRAM timing constraints for current and emerging DRAM interfaces, e.g. DDR3, LPDDR3 and WideIO.
Abstract: Compute requirements are increasing rapidly in systems ranging from mobile devices to servers. These often massively parallel architectures place increasing requirements on memory bandwidth and latency. The memory system greatly impacts both system performance and power, and it is key to capture the complex behaviour of the DRAM controller when evaluating CPU and GPU performance. By using full-system simulation, the interactions between the system components are captured. However, traditional DRAM controller models focus on modelling interactions between the controller and the DRAM rather than the interactions with the system. Moreover, the DRAM interactions are modelled on a cycle-by-cycle basis, leading to inflexibility and poor simulation performance. In this work, we present a high-level memory controller model, specifically designed for full-system exploration of future system architectures. Our event-based model is tailored to match a contemporary controller architecture, and captures the most important DRAM timing constraints for current and emerging DRAM interfaces, e.g. DDR3, LPDDR3 and WideIO. We show how our controller leverages the open-source gem5 simulation framework, and compare it to a state-of-the-art DRAM controller simulator. Our results show that our model is 7x faster on average, while maintaining the fidelity of the simulation. To highlight the capabilities of our model, we show that it can be used to evaluate a multi-processor memory system.
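
The event-based idea, advancing time from request to request instead of ticking every cycle, can be illustrated with a toy single-bank model. The timing parameters and policy below are illustrative assumptions for the sketch, not gem5's actual controller model or any real DDR3/LPDDR3 datasheet.

```python
# Event-based (not cycle-by-cycle) sketch of a DRAM bank timing model: each
# request is one event whose completion time follows from a few constraints.
T_RP, T_RCD, T_CL = 15, 15, 15            # precharge / activate / CAS (ns, assumed)

class Bank:
    def __init__(self):
        self.open_row = None
        self.ready_at = 0                  # time the bank can start a new command

    def access(self, row, now):
        t = max(now, self.ready_at)
        if self.open_row != row:           # row miss: precharge (if open) + activate
            if self.open_row is not None:
                t += T_RP
            t += T_RCD
            self.open_row = row
        done = t + T_CL                    # column access returns data
        self.ready_at = done               # serialize accesses to this bank
        return done

bank = Bank()
reqs = [(0, 7), (5, 7), (6, 3)]            # (arrival time, row): hit, hit, miss
for arrival, row in reqs:
    print(f"req row {row} at t={arrival} -> data at t={bank.access(row, arrival)}")
```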

Journal ArticleDOI
TL;DR: The pSIMS design is presented, and example assessments demonstrate its multi-model, multi-scale, and multi-sector versatility as well as the efficiency gains attained.
Abstract: We present a framework for massively parallel climate impact simulations: the parallel System for Integrating Impact Models and Sectors (pSIMS). This framework comprises a) tools for ingesting and converting large amounts of data to a versatile datatype based on a common geospatial grid; b) tools for translating this datatype into custom formats for site-based models; c) a scalable parallel framework for performing large ensemble simulations, using any one of a number of different impacts models, on clusters, supercomputers, distributed grids, or clouds; d) tools and data standards for reformatting outputs to common datatypes for analysis and visualization; and e) methodologies for aggregating these datatypes to arbitrary spatial scales such as administrative and environmental demarcations. By automating many time-consuming and error-prone aspects of large-scale climate impacts studies, pSIMS accelerates computational research, encourages model intercomparison, and enhances reproducibility of simulation results. We present the pSIMS design and use example assessments to demonstrate its multi-model, multi-scale, and multi-sector versatility. Highlights: an open-source framework for efficient massively parallel climate impact simulations; analysis of dozens of crop and tree species with DSSAT, APSIM, and CenW; a multi-model, multi-scale assessment of maize yield in Africa using DSSAT and APSIM; a high-resolution climate impact assessment of New Zealand forest productivity; and the computational scaling behavior of the framework, to assess the efficiency gain attained.

Journal ArticleDOI
TL;DR: It is shown that the classical Jacobi Over-Relaxation method (JOR) should not be used, as its convergence requires a proper value of the relaxation parameter, and that other strategies should be preferred.
Abstract: In this paper, we investigate various numerical strategies to compute the direct space polarization energy and associated forces in the context of the point dipole approximation (including damping) used in polarizable molecular dynamics. We present a careful mathematical analysis of the algorithms that have been implemented in popular production packages and applied to large test systems. We show that the classical Jacobi Over-Relaxation method (JOR) should not be used, as its convergence requires a proper value of the relaxation parameter, whereas other strategies should be preferred. On a single node, Preconditioned Conjugate Gradient methods (PCG) and the Jacobi algorithm coupled with Direct Inversion in the Iterative Subspace (JI/DIIS) provide reliable stability/convergence and are roughly twice as fast as JOR. Moreover, both algorithms are suitable for massively parallel implementations. The lower requirements in terms of interprocess communication make JI/DIIS the method of choice for MPI and hybrid OpenMP/MPI implementations.
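
The polarization equations amount to a symmetric positive-definite linear system T mu = E for the induced dipoles, so the PCG approach the paper recommends can be sketched generically. In the sketch below, a random SPD matrix merely stands in for the (damped) dipole interaction matrix, with its diagonal as a Jacobi preconditioner.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient for A x = b, with A symmetric positive
    definite and M_inv a diagonal preconditioner applied to residuals."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Stand-in for the polarization system T mu = E (random SPD by construction).
rng = np.random.default_rng(3)
G = rng.normal(size=(50, 50))
T = G @ G.T + 50 * np.eye(50)
E = rng.normal(size=50)                    # "external field" right-hand side
mu = pcg(T, E, M_inv=1.0 / np.diag(T))
print("residual:", np.linalg.norm(T @ mu - E))
```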

Journal ArticleDOI
TL;DR: This contribution introduces a massively parallel GPU implementation of the hybrid computational homogenization method for visco-plastic materials using a reduced basis approach in a mixed variational formulation and allows for heterogeneous hardening variables instead of piecewise constant fields.

Journal ArticleDOI
TL;DR: This work introduces a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs) and presents a seamless mapping flow for TCPAs, based on a domain-specific language, and outlines a complete symbolic mapping approach.
Abstract: We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template which can be tailored before runtime to fulfill customers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing in the areas of signal, image, and video processing, as well as other streaming processing applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in such a way that they support self-adaptivity and resource awareness at the hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing, in which an application can dynamically claim, execute, and release resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.

Journal ArticleDOI
14 Jun 2014
TL;DR: STAG is proposed: a high-density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM), which inherently offers unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire.
Abstract: General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge that the bits must be sequentially accessed by performing "shift" operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift-aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).
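
The cost that the head-management policies in technique (iii) target can be illustrated with a toy model: reading bit b on a tape whose head sits at position h costs |h - b| shift operations, so the parking position between accesses matters. The tape length, uniform access stream, and the two policies below are illustrative assumptions, not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(9)
accesses = rng.integers(0, 64, 10000)      # uniform access stream (assumption)

def total_shifts(policy):
    """Count shifts on a 64-bit tape under a head-parking policy."""
    head, total = 32, 0
    for b in accesses:
        total += abs(head - b)             # shifts needed to reach bit b
        # "lazy" leaves the head where it lands; "center" predictively parks it
        # at the midpoint (re-centering assumed to happen off the critical path).
        head = b if policy == "lazy" else 32
    return total

print("lazy:", total_shifts("lazy"), " center:", total_shifts("center"))
```

For uniform accesses, parking at the centre gives an expected distance of about 16 positions versus roughly 21 for the lazy policy, which is the intuition behind predictive head placement.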

Journal ArticleDOI
TL;DR: It is shown how the parallel framework facilitates simulations of such processes and, without any loss of accuracy or precision, gives a significant speedup and allows for the study of much larger systems and much wider temperature ranges than possible with single-walker methods.
Abstract: We investigate a generic, parallel replica-exchange framework for Monte Carlo simulations based on the Wang-Landau method. To demonstrate its advantages and general applicability for massively parallel simulations of complex systems, we apply it to lattice spin models, the self-assembly process in amphiphilic solutions, and the adsorption of molecules on surfaces. While of general current interest, the latter phenomena are challenging to study computationally because of multiple structural transitions occurring over a broad temperature range. We show how the parallel framework facilitates simulations of such processes and, without any loss of accuracy or precision, gives a significant speedup and allows for the study of much larger systems and much wider temperature ranges than possible with single-walker methods.
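
Below is a minimal single-walker Wang-Landau sketch for a toy model whose density of states is known exactly (a state is an N-bit string with E = number of set bits, so g(E) is binomial). The replica-exchange layer the paper adds, with walkers on overlapping energy windows swapping configurations, is omitted here.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
N = 10
state = rng.integers(0, 2, N)
E = int(state.sum())
log_g = np.zeros(N + 1)                    # running estimate of ln g(E)
hist = np.zeros(N + 1)
f = 1.0                                    # ln of the modification factor

while f > 1e-4:
    for _ in range(5000):
        i = rng.integers(N)                # propose flipping one bit
        E_new = E + (1 - 2 * state[i])
        # Accept with min(1, g(E)/g(E_new)) to flatten the energy histogram.
        if np.log(rng.random()) < log_g[E] - log_g[E_new]:
            state[i] ^= 1
            E = E_new
        log_g[E] += f
        hist[E] += 1
    if hist.min() > 0.8 * hist.mean():     # histogram "flat enough"
        f /= 2.0
        hist[:] = 0

exact = gammaln(N + 1) - gammaln(np.arange(N + 1) + 1) - gammaln(N - np.arange(N + 1) + 1)
print("max error in ln g(E):",
      np.abs((log_g - log_g[0]) - (exact - exact[0])).max())
```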

Journal ArticleDOI
TL;DR: Results from the Lorenz '96 simulations suggest that inexact calculations at the small scale could reduce computation and power costs without adversely affecting the quality of the simulations, which would allow higher-resolution models to be run at the same computational cost.
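
The Lorenz '96 system referenced above is compact enough to state directly: dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F on a periodic ring. The sketch below uses single precision as a crude stand-in for inexact hardware; individual chaotic trajectories diverge, and the paper's question is whether the climate-like statistics nonetheless survive.

```python
import numpy as np

def l96_rhs(x, F=8.0):
    """Lorenz '96 tendency: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def integrate(x0, dt=0.01, steps=500, dtype=np.float64):
    x = x0.astype(dtype)
    for _ in range(steps):                 # forward Euler, kept deliberately simple
        x = x + dt * l96_rhs(x).astype(dtype)
    return x

x0 = 8.0 + 0.01 * np.random.default_rng(5).standard_normal(40)
hi = integrate(x0, dtype=np.float64)
lo = integrate(x0, dtype=np.float32)       # crude stand-in for inexact hardware
print("RMS double/single divergence:",
      float(np.sqrt(((hi - lo.astype(np.float64)) ** 2).mean())))
```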

Journal ArticleDOI
02 Dec 2014
TL;DR: The article describes the active flow control application; then summarizes the main features in the implementation of a massively parallel turbulent flow solver, PHASTA; and finally demonstrates the method's strong scalability at extreme scale.
Abstract: Massively parallel computation provides an enormous capacity to perform simulations on a timescale that can change the paradigm of how scientists, engineers, and other practitioners use simulations to address discovery and design. This work considers an active flow control application on a realistic and complex wing design that could be leveraged by a scalable, fully implicit, unstructured flow solver and access to high-performance computing resources. The article describes the active flow control application; then summarizes the main features in the implementation of a massively parallel turbulent flow solver, PHASTA; and finally demonstrates the method's strong scalability at extreme scale. Scaling studies were performed with unstructured meshes of 11 and 92 billion elements on the Argonne Leadership Computing Facility's Blue Gene/Q Mira machine with up to 786,432 cores and 3,145,728 MPI processes.

Journal ArticleDOI
TL;DR: In massively parallel supercomputer environments, the coupler OASIS-MCT is recommended, which resolves memory limitations that may be significant in case of very large computational domains and exchange fields as they occur in these specific test cases and in many applications in terrestrial research.
Abstract: Continental-scale hyper-resolution simulations constitute a grand challenge in characterizing nonlinear feedbacks of states and fluxes of the coupled water, energy, and biogeochemical cycles of terrestrial systems. Tackling this challenge requires advanced coupling and supercomputing technologies for earth system models that are discussed in this study, utilizing the example of the implementation of the newly developed Terrestrial Systems Modeling Platform (TerrSysMP v1.0) on JUQUEEN (IBM Blue Gene/Q) of the Jülich Supercomputing Centre, Germany. The applied coupling strategies rely on the Multiple Program Multiple Data (MPMD) paradigm using the OASIS suite of external couplers, and require memory and load balancing considerations in the exchange of the coupling fields between different component models and the allocation of computational resources, respectively. Using the advanced profiling and tracing tool Scalasca to determine an optimum load balancing leads to a 19% speedup. In massively parallel supercomputer environments, the coupler OASIS-MCT is recommended, which resolves memory limitations that may be significant in case of very large computational domains and exchange fields as they occur in these specific test cases and in many applications in terrestrial research. However, model I/O and initialization in the petascale range still require major attention, as they constitute true big data challenges in light of future exascale computing resources. Based on a factor-two speedup due to compiler optimizations, a refactored coupling interface using OASIS-MCT and an optimum load balancing, the problem size in a weak scaling study can be increased by a factor of 64 from 512 to 32,768 processes while maintaining parallel efficiencies above 80% for the component models.

ReportDOI
01 Nov 2014
TL;DR: The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models that can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code.
Abstract: The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component; we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal.
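
Loop tiling, one of the constructs the report lists, is easy to state concretely: operate on blocks small enough to stay resident in fast memory while they are reused. A blocked matrix-multiply sketch follows; Python is used for consistency with the other examples here, though the abstractions in question target compiled languages and runtimes.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked matrix multiply: work on tile x tile blocks so each block is
    reused while it is still resident in fast memory -- the locality idea the
    report's tiling abstractions are meant to express portably."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(6)
A, B = rng.random((96, 96)), rng.random((96, 96))
print("max |tiled - reference|:", np.abs(matmul_tiled(A, B) - A @ B).max())
```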

Journal ArticleDOI
TL;DR: This paper explores how the big-three computing paradigms (symmetric multiprocessor, graphical processing units or GPUs, and cluster computing) can together be brought to bear on large-data Gaussian process (GP) regression problems via a careful implementation of a newly developed local approximation scheme.
Abstract: We explore how the big-three computing paradigms (symmetric multiprocessor, graphical processing units or GPUs, and cluster computing) can together be brought to bear on large-data Gaussian process (GP) regression problems via a careful implementation of a newly developed local approximation scheme. Our methodological contribution focuses primarily on GPU computation, as this requires the most care and also provides the largest performance boost. However, in our empirical work we study the relative merits of all three paradigms to determine how best to combine them. The paper concludes with two case studies. One is a real-data fluid-dynamics computer experiment which benefits from the local nature of our approximation; the second is a synthetic example designed to find the largest data set for which (accurate) GP emulation can be performed on a commensurate predictive set in under an hour.
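
A simplified sketch of the local-approximation idea: fit a small GP only to inputs near each predictive location. The paper's scheme selects the local design greedily (and offloads the linear algebra to GPUs); plain nearest neighbours and a fixed squared-exponential kernel are used below purely for illustration.

```python
import numpy as np

def gp_predict_local(X, y, xstar, n_local=50, ell=0.3, g=1e-6):
    """Predict at xstar from a GP fit only to the n_local nearest inputs --
    a stand-in for the paper's local approximation, which chooses the local
    design greedily rather than by plain nearest-neighbour distance."""
    d = np.linalg.norm(X - xstar, axis=1)
    idx = np.argsort(d)[:n_local]
    Xl, yl = X[idx], y[idx]

    def k(a, b):                           # squared-exponential kernel
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * ell * ell))

    K = k(Xl, Xl) + g * np.eye(n_local)    # small nugget for stability
    kstar = k(xstar[None, :], Xl)[0]
    alpha = np.linalg.solve(K, yl)
    return kstar @ alpha                   # posterior mean (zero prior mean)

rng = np.random.default_rng(7)
X = rng.random((2000, 2))
y = np.sin(5 * X[:, 0]) * np.cos(3 * X[:, 1])
xs = np.array([0.4, 0.6])
print("local GP mean:", gp_predict_local(X, y, xs),
      " truth:", np.sin(5 * 0.4) * np.cos(3 * 0.6))
```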

Journal ArticleDOI
TL;DR: This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third-generation NVIDIA GPU hardware, also known as 'Kepler', and shows that the simpler approach is the most efficient.

Proceedings ArticleDOI
01 Sep 2014
TL;DR: This set of kernels covers the most common patterns of communication, computation and synchronization encountered in parallel HPC applications and can be used to design an effective parallel computer system without needing to make predictions about the nature of future workloads.
Abstract: We present the Parallel Research Kernels: a collection of kernels supporting research on parallel computer systems. This set of kernels covers the most common patterns of communication, computation and synchronization encountered in parallel HPC applications. By focusing on these kernels instead of specific workloads, one can design an effective parallel computer system without needing to make predictions about the nature of future workloads.

Journal ArticleDOI
TL;DR: EMPIRE is a massively parallel semiempirical (NDDO) molecular-orbital program that scales well both on single multi-core nodes and on large clusters; its design and performance are demonstrated with full SCF calculations on systems of up to 55,000 atoms, such as an adamantane nanocrystal.
Abstract: EMPIRE is a massively parallel semiempirical (NDDO) molecular-orbital program designed to scale well both on single multi-core nodes (using OpenMP) and on large clusters (using a hybrid OpenMP/MPI model). The program design and performance are discussed for single self-consistent-field calculations on up to 55,000 atoms (the adamantane crystal shown in the graphic) and on both single- and multi-node machines using either Windows 7 or Linux. EMPIRE currently carries out the full SCF calculation with no local approximations or other linear-scaling techniques. The single-node version is available free of charge to bona fide academic groups.

Book ChapterDOI
TL;DR: In this paper, the authors present a sequential posterior simulator designed to operate efficiently in parallel computing environments, which makes fewer analytical and programming demands on investigators, and is faster, more reliable, and more complete than conventional posterior simulators.
Abstract: Massively parallel desktop computing capabilities now well within the reach of individual academics modify the environment for posterior simulation in fundamental and potentially quite advantageous ways. But to fully exploit these benefits algorithms that conform to parallel computing environments are needed. This paper presents a sequential posterior simulator designed to operate efficiently in this context. The simulator makes fewer analytical and programming demands on investigators, and is faster, more reliable, and more complete than conventional posterior simulators. The paper extends existing sequential Monte Carlo methods and theory to provide a thorough and practical foundation for sequential posterior simulation that is well suited to massively parallel computing environments. It provides detailed recommendations on implementation, yielding an algorithm that requires only code for simulation from the prior and evaluation of prior and data densities and works well in a variety of applications representative of serious empirical work in economics and finance. The algorithm facilitates Bayesian model comparison by producing marginal likelihood approximations of unprecedented accuracy as an incidental by-product, is robust to pathological posterior distributions, and provides estimates of numerical standard error and relative numerical efficiency intrinsically. The paper concludes with an application that illustrates the potential of these simulators for applied Bayesian inference.
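
A minimal data-tempered sequential posterior simulator for a toy conjugate model is sketched below, showing two properties the abstract highlights: only simulation from the prior and density evaluations are required, and a marginal-likelihood estimate falls out as a by-product. The mutation (MCMC) steps and the parallel particle-group structure of the paper's algorithm are omitted.

```python
import numpy as np

# Toy model: y_t ~ N(theta, 1) with prior theta ~ N(0, 1). Particles are drawn
# from the prior, reweighted observation by observation, and resampled when the
# effective sample size (ESS) drops below n/2.
rng = np.random.default_rng(8)
y = rng.normal(0.7, 1.0, size=100)         # synthetic data from theta = 0.7

n = 5000
theta = rng.normal(0.0, 1.0, n)            # draws from the prior
logw = np.zeros(n)                         # log weights since last resample
log_ml = 0.0                               # accumulates log marginal likelihood
for yt in y:
    logw += -0.5 * (yt - theta) ** 2 - 0.5 * np.log(2 * np.pi)
    w = np.exp(logw - logw.max())
    if (w.sum() ** 2) / (w * w).sum() < n / 2:
        # Fold the mean incremental weight into the marginal likelihood,
        # then resample to equal weights.
        log_ml += logw.max() + np.log(w.mean())
        theta = theta[rng.choice(n, n, p=w / w.sum())]
        logw[:] = 0.0

w = np.exp(logw - logw.max())
log_ml += logw.max() + np.log(w.mean())    # final segment
post_mean = (w * theta).sum() / w.sum()
print("posterior mean:", post_mean, " log marginal likelihood:", log_ml)
```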