Author

Duncan Poole

Bio: Duncan Poole is an academic researcher from Nvidia. The author has contributed to research on topics including the Message Passing Interface and CUDA. The author has an h-index of 8 and has co-authored 9 publications receiving 3,324 citations.

Papers
Journal ArticleDOI
TL;DR: An implementation of explicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs, providing results that are statistically indistinguishable from the traditional CPU version of the software and performance that exceeds that achievable by the CPU version running on conventional CPU-based clusters and supercomputers.
Abstract: We present an implementation of explicit solvent all atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs. First released publicly in April 2010 as part of version 11 of the AMBER MD package and further improved and optimized over the last two years, this implementation supports the three most widely used statistical mechanical ensembles (NVE, NVT, and NPT), uses particle mesh Ewald (PME) for the long-range electrostatics, and runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs), providing results that are statistically indistinguishable from the traditional CPU version of the software and with performance that exceeds that achievable by the CPU version of AMBER software running on all conventional CPU-based clusters and supercomputers. We briefly discuss three different precision models developed specifically for this work (SPDP, SPFP, and DPDP) and highlight the technical details of the approach as it extends beyond previously reported work [Götz et al., J. Chem. Theory Comput. 2012, DOI: 10.1021/ct200909j; Le Grand et al., Comp. Phys. Comm. 2013, DOI: 10.1016/j.cpc.2012.09.022]. We highlight the substantial improvements in performance that are seen over traditional CPU-only machines and provide validation of our implementation and precision models. We also provide evidence supporting our decision to deprecate the previously described fully single precision (SPSP) model from the latest release of the AMBER software package.
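The SPFP precision model mentioned above combines single-precision arithmetic with fixed-point force accumulation. The sketch below illustrates that idea in CUDA; the scale factor, kernel, and placeholder force are illustrative assumptions, not AMBER's actual code. Accumulated values are decoded on the host by reinterpreting them as signed 64-bit integers and dividing by the scale.

```cuda
// Sketch of SPFP-style force accumulation: pairwise contributions are computed in
// single precision and accumulated into 64-bit fixed-point integers with integer
// atomics, so the accumulated sum is exact and independent of summation order.
// The scale factor, the O(N^2) loop, and the placeholder "force" are illustrative.
#include <cuda_runtime.h>

__device__ const float FORCE_SCALE = 1099511627776.0f;   // 2^40 (assumed scale)

__device__ inline void accumulate(unsigned long long *acc, float f)
{
    // Round to a signed 64-bit fixed-point value; two's-complement wrap-around
    // makes the unsigned atomicAdd equivalent to a signed addition.
    long long fp = llrintf(f * FORCE_SCALE);
    atomicAdd(acc, (unsigned long long)fp);
}

__global__ void pair_forces_x(const float4 *pos, unsigned long long *fx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = 0; j < n; ++j) {                 // no cutoff, cell list, or PME here
        if (j == i) continue;
        float dx = pos[j].x - pos[i].x;           // x component only, for brevity
        float f  = dx * 1e-3f;                    // placeholder interaction, not a real potential
        accumulate(&fx[i], f);
    }
    // Host side would reinterpret fx[i] as long long and divide by FORCE_SCALE.
}
```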

2,418 citations

Journal ArticleDOI
TL;DR: An implementation of generalized Born implicit solvent all-atom classical molecular dynamics within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs) and shows performance that is on par with, and in some cases exceeds, that of traditional supercomputers.
Abstract: We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.
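The distinction between the SPSP and SPDP models comes down to where rounding error accumulates. The toy host-side program below (values arbitrary, purely illustrative) shows how summing many small single-precision contributions into a float accumulator loses low-order bits, while a double accumulator preserves them; this is the effect behind the unphysical long-timescale trajectories reported for SPSP.

```cpp
// Toy illustration of the SPSP vs. SPDP accumulation issue: the same stream of
// small single-precision contributions is summed in a float (SPSP-style) and in
// a double (SPDP-style). Once the float accumulator grows large enough, further
// contributions fall below half an ULP and are silently dropped.
#include <cstdio>

int main()
{
    const long n = 100000000;            // number of per-step contributions (arbitrary)
    const float contrib = 1.0e-4f;       // a single small contribution (arbitrary)

    float  acc_spsp = 0.0f;              // single-precision accumulator
    double acc_spdp = 0.0;               // double-precision accumulator

    for (long i = 0; i < n; ++i) {
        acc_spsp += contrib;             // rounding error accumulates, then the sum stalls
        acc_spdp += (double)contrib;     // same inputs, accumulated in double
    }

    printf("SPSP-style sum: %.3f (exact would be %.1f)\n", acc_spsp, n * 1.0e-4);
    printf("SPDP-style sum: %.3f\n", acc_spdp);
    return 0;
}
```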

1,645 citations

Proceedings ArticleDOI
26 Aug 2015
TL;DR: This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing, and evaluates the design by implementing the APIs and protocols and measuring the performance of overhead-critical network primitives fundamental to many parallel programming models and system libraries.
Abstract: This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next generation applications and systems. UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 μs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any network stack (publicly known) on this hardware.
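The small-message latency quoted above is typically measured with a ping-pong between two processes, taking half the averaged round-trip time. The sketch below does this at the MPI level, one of the programming models UCX targets; it is a generic benchmark outline, not the UCX-level measurement code used in the paper.

```cpp
// Generic small-message ping-pong: one-way latency is half the average round trip.
// Run with exactly two ranks, e.g. mpirun -np 2 ./pingpong
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int warmup = 1000, iters = 100000;
    char buf[8] = {0};                          // 8-byte message
    double t0 = 0.0;

    for (int i = 0; i < warmup + iters; ++i) {
        if (i == warmup) t0 = MPI_Wtime();      // start timing after warm-up
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double rtt = (MPI_Wtime() - t0) / iters;            // average round-trip time
        printf("one-way latency: %.3f us\n", rtt / 2 * 1e6);
    }
    MPI_Finalize();
    return 0;
}
```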

129 citations

Journal ArticleDOI
TL;DR: A supercomputer-driven pipeline for in silico drug discovery using enhanced sampling molecular dynamics (MD) and ensemble docking is presented, with planned improvements including the use of quantum mechanical, machine learning, and artificial intelligence methods to cluster MD trajectories and rescore docking poses.
Abstract: We present a supercomputer-driven pipeline for in silico drug discovery using enhanced sampling molecular dynamics (MD) and ensemble docking. Ensemble docking makes use of MD results by docking compound databases into representative protein binding-site conformations, thus taking into account the dynamic properties of the binding sites. We also describe preliminary results obtained for 24 systems involving eight proteins of the proteome of SARS-CoV-2. The MD involves temperature replica exchange enhanced sampling, making use of massively parallel supercomputing to quickly sample the configurational space of protein drug targets. Using the Summit supercomputer at the Oak Ridge National Laboratory, more than 1 ms of enhanced sampling MD can be generated per day. We have ensemble-docked repurposing databases to 10 configurations of each of the 24 SARS-CoV-2 systems using AutoDock Vina. Comparison to experiment demonstrates remarkably high hit rates for the top-scoring tranches of compounds identified by our ensemble approach. We also demonstrate that, using AutoDock-GPU on Summit, it is possible to perform exhaustive docking of one billion compounds in under 24 h. Finally, we discuss preliminary results and planned improvements to the pipeline, including the use of quantum mechanical (QM), machine learning, and artificial intelligence (AI) methods to cluster MD trajectories and rescore docking poses.
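The ensemble-docking ranking step described above amounts to docking each compound against several MD-derived receptor conformations and ranking compounds by their best score over the ensemble. The sketch below shows that aggregation; the input file name and score-table format are hypothetical, not the pipeline's actual output.

```cpp
// Ensemble-docking consensus ranking sketch: for each compound, keep the best
// (most negative) docking score over all receptor conformations, then sort.
// The file "ensemble_scores.txt" and its layout are hypothetical.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    // Assumed input: one line per (compound, receptor conformation) docking run,
    // e.g. "ZINC000123 conf07 -9.4" (hypothetical format).
    std::ifstream in("ensemble_scores.txt");
    std::map<std::string, double> best;          // compound -> best score so far

    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string compound, conf;
        double score;
        if (!(ss >> compound >> conf >> score)) continue;
        auto it = best.find(compound);
        if (it == best.end() || score < it->second)
            best[compound] = score;              // keep the lowest (best) score
    }

    // Sort compounds by best ensemble score and print the top tranche.
    std::vector<std::pair<std::string, double>> ranked(best.begin(), best.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto &a, const auto &b) { return a.second < b.second; });
    for (size_t i = 0; i < ranked.size() && i < 100; ++i)
        std::cout << ranked[i].first << " " << ranked[i].second << "\n";
    return 0;
}
```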

120 citations

Proceedings ArticleDOI
13 Sep 2011
TL;DR: A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System.
Abstract: The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support.
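The simplest of the host-GPU measurement approaches the paper contrasts is bracketing a kernel launch with CUDA events, as sketched below; CUPTI-based tracing, as used by the tools above, instead collects asynchronous activity records with per-kernel and per-copy detail. The kernel here is a stand-in.

```cuda
// Minimal host-driven GPU measurement using CUDA events: bracket a kernel launch
// with events on its stream and read the elapsed GPU time afterwards.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                       // enqueue start marker
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop);                        // enqueue stop marker
    cudaEventSynchronize(stop);                   // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // GPU-side elapsed time
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```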

83 citations


Cited by
01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently, namely those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and a 1840-node Intel Paragon performs up to 165 times faster than a single Cray C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.
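The first of the three algorithms, atom decomposition, assigns each processor a fixed subset of atoms and replicates coordinates so every processor can compute the forces on its own atoms. A minimal MPI sketch of one such step follows; the 1D coordinates and placeholder pair force are illustrative only.

```cpp
// Atom-decomposition sketch: each rank owns a fixed block of atoms, all positions
// are replicated with MPI_Allgather each step, and each rank computes forces only
// for the atoms it owns.
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int natoms = 1024;                    // assumed divisible by nprocs
    const int nlocal = natoms / nprocs;         // fixed subset owned by this rank
    double *x    = (double *)malloc(natoms * sizeof(double)); // all positions (1D toy)
    double *xloc = (double *)malloc(nlocal * sizeof(double)); // owned positions
    double *floc = (double *)calloc(nlocal, sizeof(double));  // forces on owned atoms

    for (int i = 0; i < nlocal; ++i)
        xloc[i] = rank * nlocal + i;            // toy coordinates

    // One "time step": replicate all positions, then compute forces locally.
    MPI_Allgather(xloc, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
    for (int i = 0; i < nlocal; ++i) {
        int gi = rank * nlocal + i;             // global index of owned atom i
        for (int j = 0; j < natoms; ++j) {
            if (j == gi) continue;
            double dx = x[j] - x[gi];
            floc[i] += dx / (dx * dx + 1.0);    // placeholder pair force, not a real potential
        }
    }

    free(x); free(xloc); free(floc);
    MPI_Finalize();
    return 0;
}
```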

29,323 citations

Journal ArticleDOI
TL;DR: An implementation of explicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs, providing results that are statistically indistinguishable from the traditional CPU version of the software and performance that exceeds that achievable by the CPU version running on conventional CPU-based clusters and supercomputers.
Abstract: We present an implementation of explicit solvent all atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs. First released publicly in April 2010 as part of version 11 of the AMBER MD package and further improved and optimized over the last two years, this implementation supports the three most widely used statistical mechanical ensembles (NVE, NVT, and NPT), uses particle mesh Ewald (PME) for the long-range electrostatics, and runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs), providing results that are statistically indistinguishable from the traditional CPU version of the software and with performance that exceeds that achievable by the CPU version of AMBER software running on all conventional CPU-based clusters and supercomputers. We briefly discuss three different precision models developed specifically for this work (SPDP, SPFP, and DPDP) and highlight the technical details of the approach as it extends beyond previously reported work [Götz et al., J. Chem. Theory Comput. 2012, DOI: 10.1021/ct200909j; Le Grand et al., Comp. Phys. Comm. 2013, DOI: 10.1016/j.cpc.2012.09.022]. We highlight the substantial improvements in performance that are seen over traditional CPU-only machines and provide validation of our implementation and precision models. We also provide evidence supporting our decision to deprecate the previously described fully single precision (SPSP) model from the latest release of the AMBER software package.

2,418 citations

Journal ArticleDOI
TL;DR: The most recent developments, since version 9 was released in April 2006, of the Amber and AmberTools MD software packages, referred to here simply as the Amber package, are outlined.
Abstract: Molecular dynamics (MD) allows the study of biological and chemical systems at the atomistic level on timescales from femtoseconds to milliseconds. It complements experiment while also offering a way to follow processes difficult to discern with experimental techniques. Numerous software packages exist for conducting MD simulations, one of the most widely used of which is Amber. Here, we outline the most recent developments, since version 9 was released in April 2006, of the Amber and AmberTools MD software packages, referred to here as simply the Amber package. The latest release represents six years of continued development, since version 9, by multiple research groups and the culmination of over 33 years of work beginning with the first version in 1979. The latest release of the Amber package, version 12 released in April 2012, includes a substantial number of important developments in both the scientific and computer science arenas. We present here a condensed vision of what Amber currently supports and where things are likely to head over the coming years. Figure 1 shows the performance in ns/day of the Amber package version 12 on a single-core AMD FX-8120 8-Core 3.6 GHz CPU, the Cray XT5 system, and a single GTX 680 GPU. © 2012 John Wiley & Sons, Ltd.

1,734 citations

Journal ArticleDOI
TL;DR: OpenMM is a molecular dynamics simulation toolkit with a unique focus on extensibility, which makes it an ideal tool for researchers developing new simulation methods, and also allows those new methods to be immediately available to the larger community.
Abstract: OpenMM is a molecular dynamics simulation toolkit with a unique focus on extensibility. It allows users to easily add new features, including forces with novel functional forms, new integration algorithms, and new simulation protocols. Those features automatically work on all supported hardware types (including both CPUs and GPUs) and perform well on all of them. In many cases they require minimal coding, just a mathematical description of the desired function. They also require no modification to OpenMM itself and can be distributed independently of OpenMM. This makes it an ideal tool for researchers developing new simulation methods, and also allows those new methods to be immediately available to the larger community.
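A minimal sketch of that extensibility using OpenMM's C++ CustomBondForce, where the force is defined only by an energy expression string and OpenMM generates the corresponding CPU and GPU kernels; masses, parameters, and positions below are arbitrary illustrative values.

```cpp
// Defining a new bonded force from a mathematical expression with OpenMM.
// The harmonic form here is a toy example; any algebraic expression in r and
// the declared parameters would work the same way.
#include <vector>
#include "OpenMM.h"

int main()
{
    OpenMM::System system;
    system.addParticle(12.0);                     // two particles (arbitrary masses, amu)
    system.addParticle(12.0);

    // A custom bonded force defined only by its energy expression.
    auto *bond = new OpenMM::CustomBondForce("0.5*k*(r-r0)^2");
    bond->addPerBondParameter("k");               // kJ/mol/nm^2 (arbitrary value below)
    bond->addPerBondParameter("r0");              // nm (arbitrary value below)
    bond->addBond(0, 1, {1000.0, 0.15});
    system.addForce(bond);                        // the System takes ownership

    OpenMM::VerletIntegrator integrator(0.001);   // 1 fs step
    OpenMM::Context context(system, integrator);
    context.setPositions({OpenMM::Vec3(0, 0, 0), OpenMM::Vec3(0.2, 0, 0)});
    integrator.step(100);                         // runs on whichever platform is selected
    return 0;
}
```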

1,364 citations

Journal ArticleDOI
TL;DR: The AMBER lipid force field has been updated to create Lipid14, allowing tensionless simulation of a number of lipid types with the AMBER MD package, and is compatible with theAMBER protein, nucleic acid, carbohydrate, and small molecule force fields.
Abstract: The AMBER lipid force field has been updated to create Lipid14, allowing tensionless simulation of a number of lipid types with the AMBER MD package. The modular nature of this force field allows numerous combinations of head and tail groups to create different lipid types, enabling the easy insertion of new lipid species. The Lennard-Jones and torsion parameters of both the head and tail groups have been revised, and updated partial charges have been calculated. The force field has been validated by simulating bilayers of six different lipid types for a total of 0.5 μs each without applying a surface tension, with favorable comparison to experiment for properties such as area per lipid, volume per lipid, bilayer thickness, NMR order parameters, scattering data, and lipid lateral diffusion. As the derivation of this force field is consistent with the AMBER development philosophy, Lipid14 is compatible with the AMBER protein, nucleic acid, carbohydrate, and small molecule force fields.

973 citations