
Showing papers on "Massively parallel published in 2021"


Journal ArticleDOI
TL;DR: Quantum ESPRESSO is an open-source distribution of computer codes for quantum-mechanical materials modeling, based on density-functional theory, pseudopotentials, and plane waves.
Abstract: Quantum ESPRESSO is an open-source distribution of computer codes for quantum-mechanical materials modeling, based on density-functional theory, pseudopotentials, and plane waves, and renowned for its performance on a wide range of hardware architectures, from laptops to massively parallel computers, as well as for the breadth of its applications. In this paper we present a motivation and brief review of the ongoing effort to port Quantum ESPRESSO onto heterogeneous architectures based on hardware accelerators, which will overcome the energy constraints that are currently hindering the way towards exascale computing.

356 citations


Journal ArticleDOI
TL;DR: This Review surveys the basic principles, recent advances and promising future directions for wave-based metamaterial analogue computing systems, and describes some of the most exciting applications suggested for these computational metamaterials, including image processing, edge detection, equation solving and machine learning.
Abstract: Despite their widespread use for performing advanced computational tasks, digital signal processors suffer from several restrictions, including low speed, high power consumption and complexity, caused by costly analogue-to-digital converters. For this reason, there has recently been a surge of interest in performing wave-based analogue computations that avoid analogue-to-digital conversion and allow massively parallel operation. In particular, novel schemes for wave-based analogue computing have been proposed based on artificially engineered photonic structures, that is, metamaterials. Such computing systems, referred to as computational metamaterials, can be as fast as the speed of light and as small as its wavelength, yet they can impart complex mathematical operations on an incoming wave packet or even provide solutions to integro-differential equations. These much-sought features promise to enable a new generation of ultra-fast, compact and efficient processing and computing hardware based on light-wave propagation. In this Review, we discuss recent advances in the field of computational metamaterials, surveying the state-of-the-art metastructures proposed to perform analogue computation. We further describe some of the most exciting applications suggested for these computing systems, including image processing, edge detection, equation solving and machine learning. Finally, we provide an outlook for the possible directions and the key problems for future research. Metamaterials provide a platform to leverage optical signals for performing specific-purpose computational tasks with ultra-fast speeds. This Review surveys the basic principles, recent advances and promising future directions for wave-based metamaterial analogue computing systems.

175 citations


Journal ArticleDOI
TL;DR: A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.
Abstract: Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target spatial computing architectures, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, due to fundamentally distinct aspects of hardware design, such as programming for deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.

83 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed an algorithm based on classical mechanics, which is obtained by modifying a previously proposed algorithm called simulated bifurcation, to achieve not only high speed by parallel computing but also high solution accuracy for problems with up to one million binary variables.
Abstract: Quickly obtaining optimal solutions of combinatorial optimization problems has tremendous value but is extremely difficult. Thus, various kinds of machines specially designed for combinatorial optimization have recently been proposed and developed. Toward the realization of higher-performance machines, here, we propose an algorithm based on classical mechanics, which is obtained by modifying a previously proposed algorithm called simulated bifurcation. Our proposed algorithm allows us to achieve not only high speed by parallel computing but also high solution accuracy for problems with up to one million binary variables. Benchmarking shows that our machine based on the algorithm achieves high performance compared to recently developed machines, including a quantum annealer using a superconducting circuit, a coherent Ising machine using a laser, and digital processors based on various algorithms. Thus, high-performance combinatorial optimization is realized by massively parallel implementations of the proposed algorithm based on classical mechanics.

56 citations
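For orientation, here is a minimal NumPy sketch of a ballistic simulated-bifurcation style update for an Ising problem with coupling matrix J, in the spirit of the classical-mechanics algorithm described in the abstract above. The pump schedule, step size, and coupling strength c0 are illustrative assumptions, not the paper's tuned settings; every spin is updated simultaneously, which is what makes the scheme amenable to massively parallel hardware.

```python
import numpy as np

def simulated_bifurcation(J, steps=2000, dt=0.5, a0=1.0, c0=0.2, seed=0):
    """Minimal ballistic simulated-bifurcation sketch for an Ising problem.

    J : symmetric (N, N) coupling matrix with zero diagonal.
    Returns a vector of +/-1 spins approximating a low-energy configuration.
    """
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    x = 0.01 * rng.standard_normal(n)   # positions (soft spins)
    y = 0.01 * rng.standard_normal(n)   # momenta
    for t in range(steps):
        a = a0 * t / steps              # pump amplitude ramped from 0 to a0
        y += (-(a0 - a) * x + c0 * (J @ x)) * dt   # all spins updated in parallel
        x += a0 * y * dt
        # Inelastic walls: clamp |x| <= 1 and zero the momentum there.
        hit = np.abs(x) > 1.0
        x[hit] = np.sign(x[hit])
        y[hit] = 0.0
    return np.sign(x)

# Tiny usage example: a frustrated antiferromagnetic triangle (minimum energy -1).
J = np.array([[0., -1., -1.], [-1., 0., -1.], [-1., -1., 0.]])
spins = simulated_bifurcation(J)
print(spins, -0.5 * spins @ J @ spins)
```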


Journal ArticleDOI
26 Feb 2021 - Science
TL;DR: In this paper, the authors demonstrate a method for ultrafast generation of hundreds of random bit streams in parallel with a single laser diode using spatiotemporal interference of many lasing modes in a specially designed cavity.
Abstract: Random numbers are widely used for information security, cryptography, stochastic modeling, and quantum simulations. Key technical challenges for physical random number generation are speed and scalability. We demonstrate a method for ultrafast generation of hundreds of random bit streams in parallel with a single laser diode. Spatiotemporal interference of many lasing modes in a specially designed cavity is introduced as a scheme for greatly accelerated random bit generation. Spontaneous emission, caused by quantum fluctuations, produces stochastic noise that makes the bit streams unpredictable. We achieve a total bit rate of 250 terabits per second with off-line postprocessing, which is more than two orders of magnitude higher than the current postprocessing record. Our approach is robust, compact, and energy-efficient, with potential applications in secure communication and high-performance computation.

48 citations


Book
15 Dec 2021
TL;DR: Develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
Abstract: Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today's computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines. It provides comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA; demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance; and places particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems. Source code, examples, and instructor support materials are available on the book's companion website. Table of Contents: 1. Introduction; 2. Multicore and Parallel Program Design; 3. Shared-memory programming: Threads; 4. Shared-memory programming: OpenMP; 5. Distributed memory programming; 6. GPU Programming; 7. The Thrust Template Library; 8. Load Balancing. Appendices: A. Compiling Qt programs; B. Running MPI Programs: Preparatory and Configuration Steps; C. Time Measurement; D. Boost.MPI; E. Setting up CUDA; F. DLTlib.

45 citations


Posted Content
TL;DR: In this paper, an optical neural network achieves 99% accuracy on handwritten-digit classification using 3.2 detected photons per weight multiplication and 90% accuracy using ~0.64 photons (~$2.4 \times 10^{-19}$ J of optical energy).
Abstract: Deep learning has rapidly become a widespread tool in both scientific and commercial endeavors. Milestones of deep learning exceeding human performance have been achieved for a growing number of tasks over the past several years, across areas as diverse as game-playing, natural-language translation, and medical-image analysis. However, continued progress is increasingly hampered by the high energy costs associated with training and running deep neural networks on electronic processors. Optical neural networks have attracted attention as an alternative physical platform for deep learning, as it has been theoretically predicted that they can fundamentally achieve higher energy efficiency than neural networks deployed on conventional digital computers. Here, we experimentally demonstrate an optical neural network achieving 99% accuracy on handwritten-digit classification using ~3.2 detected photons per weight multiplication and ~90% accuracy using ~0.64 photons (~$2.4 \times 10^{-19}$ J of optical energy) per weight multiplication. This performance was achieved using a custom free-space optical processor that executes matrix-vector multiplications in a massively parallel fashion, with up to ~0.5 million scalar (weight) multiplications performed at the same time. Using commercially available optical components and standard neural-network training methods, we demonstrated that optical neural networks can operate near the standard quantum limit with extremely low optical powers and still achieve high accuracy. Our results provide a proof-of-principle for low-optical-power operation, and with careful system design including the surrounding electronics used for data storage and control, open up a path to realizing optical processors that require only $10^{-16}$ J total energy per scalar multiplication -- which is orders of magnitude more efficient than current digital processors.

43 citations
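A quick back-of-the-envelope check of the quoted photon numbers, assuming visible-wavelength light around 530 nm (the operating wavelength is not stated in this excerpt, so this is only a consistency sketch):

```python
# Rough consistency check: how many photons carry ~2.4e-19 J?
h = 6.626e-34          # Planck constant, J*s
c = 2.998e8            # speed of light, m/s
wavelength = 530e-9    # assumed visible wavelength (m); not given in the excerpt
energy_per_photon = h * c / wavelength      # ~3.7e-19 J per photon
photons = 2.4e-19 / energy_per_photon       # ~0.64 photons per weight multiplication
print(round(photons, 2))
```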


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a scalable massively parallel computing scheme by exploiting a continuous-time data representation and frequency multiplexing in a nanoscale crossbar array, which enables the parallel reading of stored data and the one-shot operation of matrix-matrix multiplications in the cross-bar array.
Abstract: The growth of connected intelligent devices in the Internet of Things has created a pressing need for real-time processing and understanding of large volumes of analogue data. The difficulty in boosting the computing speed renders digital computing unable to meet the demand for processing analogue information that is intrinsically continuous in magnitude and time. By utilizing a continuous data representation in a nanoscale crossbar array, parallel computing can be implemented for the direct processing of analogue information in real time. Here, we propose a scalable massively parallel computing scheme by exploiting a continuous-time data representation and frequency multiplexing in a nanoscale crossbar array. This computing scheme enables the parallel reading of stored data and the one-shot operation of matrix-matrix multiplications in the crossbar array. Furthermore, we achieve the one-shot recognition of 16 letter images based on two physically interconnected crossbar arrays and demonstrate that the processing and modulation of analogue information can be simultaneously performed in a memristive crossbar array.

33 citations


Journal ArticleDOI
TL;DR: A parallel explicit solver exploits the advantages of balanced octree meshes; a recently proposed mass lumping technique is extended to 3D, yielding a well-conditioned diagonal mass matrix that allows the nodal displacements to be computed efficiently without solving a system of linear equations.

32 citations
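The point of the diagonal (lumped) mass matrix mentioned above is that each explicit time step reduces to element-wise divisions instead of a linear solve. A minimal central-difference sketch with an assumed toy stiffness matrix K and lumped mass vector m (illustrative values only, not the paper's octree discretization):

```python
import numpy as np

def explicit_step(u, u_prev, m_lumped, K, f_ext, dt):
    """One central-difference step: M is diagonal, so no linear system is solved."""
    accel = (f_ext - K @ u) / m_lumped          # element-wise division by lumped masses
    u_next = 2.0 * u - u_prev + dt**2 * accel
    return u_next, u

# Toy 1D example: a 3-DOF spring chain with a constant load on the last node.
K = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])
m = np.array([1.0, 1.0, 1.0])                   # lumped (diagonal) mass matrix
u = np.zeros(3); u_prev = np.zeros(3)
f = np.array([0.0, 0.0, 1.0])
for _ in range(100):
    u, u_prev = explicit_step(u, u_prev, m, K, f, dt=0.1)
print(u)   # displacements after 100 explicit steps (undamped, so they oscillate)
```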


Journal ArticleDOI
TL;DR: In this brief review, the recent progress in two niche applications is presented: neural network accelerators and numerical computing units, mainly focusing on the advances in hardware demonstrations.
Abstract: Memristors are now becoming a prominent candidate to serve as the building blocks of non-von Neumann in-memory computing architectures. By mapping analog numerical matrices into memristor crossbar arrays, efficient multiply accumulate operations can be performed in a massively parallel fashion using the physics mechanisms of Ohm’s law and Kirchhoff’s law. In this brief review, we present the recent progress in two niche applications: neural network accelerators and numerical computing units, mainly focusing on the advances in hardware demonstrations. The former one is regarded as soft computing since it can tolerate some degree of device and array imperfections. The acceleration of multilayer perceptrons, convolutional neural networks, generative adversarial networks, and long short-term memory neural networks is described. The latter one is hard computing because the solving of numerical problems requires high-precision devices. Several breakthroughs in memristive equation solvers with improved computation accuracies are highlighted. Besides, other nonvolatile devices with the capability of analog computing are also briefly introduced. Finally, we conclude the review with discussions on the challenges and opportunities for future research toward realizing memristive analog computing machines.

31 citations
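The Ohm's-law/Kirchhoff's-law multiply-accumulate described above can be summarized in a few lines: voltages applied to the input lines of a programmed conductance matrix produce output-line currents that are exactly a matrix-vector product, obtained in one shot. A minimal idealized sketch (illustrative values, no device nonidealities):

```python
import numpy as np

# G[i, j]: conductance (siemens) of the memristor joining input line i to output line j.
G = np.array([[1.0e-6, 0.2e-6],
              [2.0e-6, 1.5e-6],
              [0.5e-6, 1.0e-6]])

# Input vector encoded as voltages on the input lines (volts).
V = np.array([0.1, 0.2, 0.05])

# Ohm's law gives per-cell currents G[i, j] * V[i]; Kirchhoff's current law sums them
# along each output line, so the column currents form a matrix-vector product in one shot.
I = V @ G          # equivalently G.T @ V
print(I)           # analog MAC result read out by the column sense amplifiers
```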


Proceedings ArticleDOI
13 Feb 2021
TL;DR: In this article, the authors present a 65nm QQVGA convolutional imager SoC codenamed SleepSpotter capable of feature extraction and region-of-interest (RoI) detection based on in-sensor current-domain MAC operations.
Abstract: Mixed-signal vision chips are becoming increasingly popular for low-power embedded computer vision applications on smartphones, wearables and IoT nodes, as they meet stringent power and area constraints while maintaining a sufficient level of accuracy for low- to medium-level image processing tasks. On the one hand, in-sensor processing [1, 2] enables massively parallel operation but relies on pixel-level processing elements that degrade the pixel pitch and restrict the convolutional receptive field to neighboring pixels [1], precluding multi-scale operation. On the other hand, near-sensor processing [3–5] can operate at multiple scales by pixel downsampling [3] or binning [4] but entails significant power and area overhead as an analog memory is required to store pixel values awaiting processing. In addition, previous near-sensor processing SoCs are generally application-specific and thus suffer from limited versatility. In this paper, we present a 65nm QQVGA convolutional imager SoC codenamed SleepSpotter capable of versatile feature extraction and region-of-interest (RoI) detection based on in-sensor current-domain MAC operations. It operates at 6 different scales, features programmable filter size (F), stride (S), and ternary filter weights (1.5b). It reaches a minimum energy of 2.5pJ/pixel•frame•filter and a peak efficiency of 3.6TOPS/W, with 29% pixel area overhead for enabling the convolution and without the need for an analog memory.

Journal ArticleDOI
TL;DR: This paper proposes a parallel implementation of the iterative type-2 fuzzy C-mean (IT2FCM) algorithm on a massively parallel SIMD architecture to segment different MRI images, and compares it to another parallel method from the literature.
Abstract: Fuzzy C-mean (FCM) is an algorithm for data segmentation and classification, robust and very popular within the scientific community. It is used in several fields such as computer vision, medical imaging and remote control. The purpose of this paper is to propose a parallel implementation of the iterative type-2 fuzzy C-mean (IT2FCM) algorithm on a massively parallel SIMD architecture to segment different MRI images. IT2FCM is an FCM standard variant; its objective is identical to that of FCM, except that the first has a higher accuracy level than the second. However, it is expensive in terms of processing time. Therefore, it is practically important to reduce its execution time while preserving the quality of the segmentation. This implementation is then compared with the sequential versions in the C language and Python using the Numpy and Numba libraries, and then, we compared it to another parallel method from the literature. The execution time obtained is faster than the sequential versions by about 15× and 4× for the second parallel version. The results achieved are very satisfactory compared to those taken from the literature.
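For context, a compact NumPy sketch of the standard (type-1) fuzzy C-means iteration that IT2FCM builds on; the interval type-2 extension and the SIMD parallelization of the paper are not reproduced here, and the data below are made up for illustration.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0):
    """Standard type-1 FCM: alternate membership and centroid updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # fuzzy memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Euclidean distance from every point to every centre.
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)               # avoid division by zero
        # u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)
    return centers, U

# Example: three noisy 2D clusters.
pts = np.random.default_rng(1).normal(0, 0.1, (150, 2)) \
      + np.repeat([[0, 0], [1, 1], [0, 1]], 50, axis=0)
centers, U = fuzzy_c_means(pts, c=3)
print(np.round(centers, 2))                 # close to the three cluster centres
```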

Posted Content
TL;DR: This work introduces pLUTo (processing-in-memory with lookup table [LUT] operations), a new DRAM substrate that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs).
Abstract: Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM), which enables computation inside the memory chip. However, existing PiM architectures often lack support for complex operations, since supporting these operations increases design complexity, chip area, and power consumption. We introduce pLUTo (processing-in-memory with lookup table [LUT] operations), a new DRAM substrate that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables the efficient execution of complex operations in-memory, which has been a long-standing challenge in the domain of PiM. When running a state-of-the-art binary neural network in a single DRAM subarray, pLUTo outperforms the baseline CPU and GPU implementations by $33\times$ and $8\times$, respectively, while simultaneously achieving energy savings of $110\times$ and $80\times$.
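A plain-software analogue of the LUT-style execution described above, just to illustrate how precomputing a table turns a complex per-element operation into parallel lookups (the actual pLUTo mechanism operates inside DRAM arrays; the sigmoid table here is a made-up example):

```python
import numpy as np

# Precompute a lookup table for an otherwise "complex" operation on 8-bit values,
# here an approximate sigmoid quantized back to 8 bits (purely illustrative).
x = np.arange(256, dtype=np.float32)
lut = np.round(255.0 / (1.0 + np.exp(-(x - 128.0) / 16.0))).astype(np.uint8)

# "Querying" the table for a whole array of inputs is a single gather, which an
# in-memory substrate can perform across many elements in parallel.
inputs = np.random.default_rng(0).integers(0, 256, size=1_000_000, dtype=np.uint8)
outputs = lut[inputs]
print(outputs[:8])
```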

Journal ArticleDOI
TL;DR: In this paper, a review explores the challenges and some of the solutions in transforming software from the terascale to the petascale and now to the upcoming exascale computers, highlighting the early codesign projects to take advantage of massively parallel computers and emerging software standards to enable large scientific challenges to be tackled.
Abstract: Since the advent of the first computers, chemists have been at the forefront of using computers to understand and solve complex chemical problems. As the hardware and software have evolved, so have the theoretical and computational chemistry methods and algorithms. Parallel computers clearly changed the common computing paradigm in the late 1970s and 80s, and the field has again seen a paradigm shift with the advent of graphical processing units. This review explores the challenges and some of the solutions in transforming software from the terascale to the petascale and now to the upcoming exascale computers. While discussing the field in general, NWChem and its redesign, NWChemEx, will be highlighted as one of the early codesign projects to take advantage of massively parallel computers and emerging software standards to enable large scientific challenges to be tackled.

Journal ArticleDOI
TL;DR: In this paper, a multi-layer perceptron architecture is proposed to compute the channel state information (CSI) of all pairwise channels simultaneously via a deep learning approach, which scales with large antenna arrays as opposed to traditional estimation methods like least square (LS) and linear minimum mean square error (LMMSE).
Abstract: Massive multiple-input multiple-output (mMIMO) is a critical component in upcoming 5G wireless deployment as an enabler for high data rate communications. mMIMO is effective when each corresponding antenna pair of the respective transmitter-receiver arrays experiences an independent channel. While increasing the number of antenna elements increases the achievable data rate, at the same time computing the channel state information (CSI) becomes prohibitively expensive. In this article, we propose to use deep learning via a multi-layer perceptron architecture that exceeds the performance of traditional CSI processing methods like least square (LS) and linear minimum mean square error (LMMSE) estimation, thus leading to a beyond fifth generation (B5G) networking paradigm wherein machine learning fully drives networking optimization. By computing the CSI of all pairwise channels simultaneously via our deep learning approach, our method scales with large antenna arrays as opposed to traditional estimation methods. The key insight here is to design the learning architecture such that it is implementable on massively parallel architectures, such as GPUs or FPGAs. We validate our approach by simulating a 32-element array base station and a user equipment with a 4-element array operating in the millimeter-wave frequency band. Results reveal an improvement of up to five and two orders of magnitude in BER with respect to the fastest LS estimation and the optimal LMMSE, respectively, substantially improving the end-to-end system performance and providing higher spatial diversity for lower SNR regions, achieving up to 4 dB gain in received signal power compared to performance obtained through LMMSE estimation.
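For reference, the least-squares baseline that such learned estimators are compared against can be written in a few lines; the pilot design, array sizes, and noise level below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx, n_pilots = 4, 32, 16          # illustrative UE/BS antenna counts and pilot length

H_true = (rng.standard_normal((n_rx, n_tx)) + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
P = (rng.standard_normal((n_tx, n_pilots)) + 1j * rng.standard_normal((n_tx, n_pilots))) / np.sqrt(2)
noise = 0.05 * (rng.standard_normal((n_rx, n_pilots)) + 1j * rng.standard_normal((n_rx, n_pilots)))

Y = H_true @ P + noise                    # received pilots: Y = H P + N

# Least-squares channel estimate: H_ls = Y * pinv(P), one estimate per antenna pair.
H_ls = Y @ np.linalg.pinv(P)
print(np.linalg.norm(H_ls - H_true) / np.linalg.norm(H_true))   # relative estimation error
```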

Journal ArticleDOI
TL;DR: In this paper, the authors present what is, to the best of their knowledge, the first attempt to exploit the supercomputer platform for quantum chemical density matrix renormalization group (QC-DMRG) calculations.
Abstract: We present, to the best of our knowledge, the first attempt to exploit the supercomputer platform for quantum chemical density matrix renormalization group (QC-DMRG) calculations. We have developed the parallel scheme based on the in-house MPI global memory library, which combines operator and symmetry sector parallelisms, and tested its performance on three different molecules, all typical candidates for QC-DMRG calculations. In the case of the largest calculation, which is the nitrogenase FeMo cofactor cluster with the active space comprising 113 electrons in 76 orbitals and bond dimension equal to 6000, our parallel approach scales up to approximately 2000 CPU cores.

Journal ArticleDOI
TL;DR: The GPU acceleration of the open-source code CaNS for very fast massively-parallel simulations of canonical fluid flows is presented and the wall-clock time per time step of the GPU-accelerated implementation is impressively small when compared to its CPU implementation on state-of-the-art many-CPU clusters.
Abstract: This work presents the GPU acceleration of the open-source code CaNS for very fast massively-parallel simulations of canonical fluid flows. The distinct feature of the many-CPU Navier–Stokes solver in CaNS is its fast direct solver for the second-order finite-difference Poisson equation, based on the method of eigenfunction expansions. The solver implements all the boundary conditions valid for this type of problem in a unified framework. Here, we extend the solver for GPU-accelerated clusters using CUDA Fortran. The porting makes extensive use of CUF kernels and has been greatly simplified by the unified memory feature of CUDA Fortran, which handles the data migration between host (CPU) and device (GPU) without defining new arrays in the source code. The overall implementation has been validated against benchmark data for turbulent channel flow and its performance assessed on an NVIDIA DGX-2 system (16 Tesla V100 32 GB GPUs, connected with NVLink via NVSwitch). The wall-clock time per time step of the GPU-accelerated implementation is impressively small when compared to its CPU implementation on state-of-the-art many-CPU clusters, as long as the domain partitioning is sufficiently small that the data resides mostly on the GPUs. The implementation has been made freely available and open source under the terms of an MIT license.
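The core of the fast direct solver mentioned above is the method of eigenfunction expansions: for periodic boundaries the eigenfunctions are Fourier modes, so the second-order finite-difference Poisson equation becomes a pointwise division in spectral space. A minimal 2D periodic sketch (CaNS itself handles more general boundary conditions and runs this on GPUs; grid size and test function below are arbitrary):

```python
import numpy as np

def poisson_fft_2d(f, dx):
    """Solve the 2D periodic Poisson equation lap(u) = f (second-order FD) via FFT."""
    ny, nx = f.shape
    kx = np.fft.fftfreq(nx) * nx
    ky = np.fft.fftfreq(ny) * ny
    # Eigenvalues of the 1D second-order periodic Laplacian: -(4/dx^2) sin^2(pi k / N)
    lam_x = -(4.0 / dx**2) * np.sin(np.pi * kx / nx) ** 2
    lam_y = -(4.0 / dx**2) * np.sin(np.pi * ky / ny) ** 2
    lam = lam_y[:, None] + lam_x[None, :]
    f_hat = np.fft.fft2(f)
    lam[0, 0] = 1.0                    # zero mode: fix the arbitrary additive constant
    u_hat = f_hat / lam
    u_hat[0, 0] = 0.0
    return np.real(np.fft.ifft2(u_hat))

# Quick check against a manufactured solution u = sin(x) sin(y) on [0, 2*pi)^2.
n = 64
dx = 2 * np.pi / n
coords = np.arange(n) * dx
X, Y = np.meshgrid(coords, coords)
u_exact = np.sin(X) * np.sin(Y)
f = -2.0 * np.sin(X) * np.sin(Y)       # continuous Laplacian of u_exact
u = poisson_fft_2d(f, dx)
print(np.max(np.abs(u - u_exact)))     # small, second-order discretization error
```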

Posted Content
TL;DR: The sparse Ising machine (sIM) exploits the sparsity of the resulting problem graphs to achieve ideal parallelism, running up to six orders of magnitude faster than a CPU implementing standard Gibbs sampling.
Abstract: Inspired by the developments in quantum computing, building quantum-inspired classical hardware to solve computationally hard problems has been receiving increasing attention. By introducing systematic sparsification techniques, we propose and demonstrate a massively parallel architecture, termed sIM or the sparse Ising Machine. Exploiting the sparsity of the resultant problem graphs, the sIM achieves ideal parallelism: the key figure of merit, flips per second, scales linearly with the total number of probabilistic bits (p-bits) in the system. This makes sIM up to 6 orders of magnitude faster than a CPU implementing standard Gibbs sampling. When compared to optimized implementations in TPUs and GPUs, the sIM delivers up to ~5-18x measured speedup. In benchmark combinatorial optimization problems such as integer factorization, the sIM can reliably factor semi-primes up to 32 bits, far larger than previous attempts from D-Wave and other probabilistic solvers. Strikingly, the sIM beats competition-winning SAT solvers (by up to ~4-700x in runtime to reach 95% accuracy) in solving hard instances of the 3SAT problem. A surprising observation is that even when the asynchronous sampling is made inexact with simultaneous updates using faster clocks, sIM can find the correct ground state with further speedup. The problem encoding and sparsification techniques we introduce can be readily applied to other Ising Machines (classical and quantum) and the asynchronous architecture we present can be used for scaling the demonstrated 5,000-10,000 p-bits to 1,000,000 or more through CMOS or emerging nanodevices.
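The probabilistic-bit (p-bit) update behind Gibbs sampling of an Ising model, which sIM parallelizes in hardware by exploiting graph sparsity, can be sketched as follows; the coupling graph, temperature, and sweep count are illustrative, and the sequential loop here is exactly what the hardware avoids by updating non-neighboring p-bits at once.

```python
import numpy as np

def pbit_gibbs(J, h, beta=1.0, sweeps=1000, seed=0):
    """Sequential Gibbs sampling with p-bit updates: m_i = sgn(tanh(beta*I_i) - r),
    where I_i = sum_j J_ij m_j + h_i and r is uniform in (-1, 1). The sparser the
    coupling graph J, the more of these updates can safely run in parallel."""
    rng = np.random.default_rng(seed)
    n = len(h)
    m = rng.choice([-1, 1], size=n).astype(float)
    for _ in range(sweeps):
        for i in range(n):
            I = J[i] @ m + h[i]
            m[i] = np.sign(np.tanh(beta * I) - rng.uniform(-1.0, 1.0))
    return m

# Tiny example: two ferromagnetically coupled spins with a bias on the first one.
J = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.array([0.5, 0.0])
print(pbit_gibbs(J, h, beta=2.0))      # typically [+1, +1], the ground state
```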

Proceedings ArticleDOI
14 Nov 2021
TL;DR: In this article, the authors present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback, and discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time.
Abstract: The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite the recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration of using three scales of resolution, we present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback. We expand upon the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a recent, award-winning workflow, and generalize the framework beyond its original design. We discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time. We present innovations to enable several orders of magnitude scaling, including simultaneously coordinating 24,000 jobs, and managing several TBs of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and, with an upcoming open-source release, discuss how the presented framework may be used for new applications.

Journal ArticleDOI
TL;DR: In this paper, a polarization-insensitive metasurface processor was proposed to perform spatial asymmetric filtering of an incident beam, thereby allowing for real-time parallel analog processing.
Abstract: We present a polarization-insensitive metasurface processor to perform spatial asymmetric filtering of an incident beam, thereby allowing for real-time parallel analog processing. To enable massive parallel processing, we introduce a multiple-input multiple-output (MIMO) computational metasurface with asymmetric response that can perform spatial differentiation on two distinct input signals regardless of their polarization. In our scenario, two distinct signals set in $x$ and $y$ directions, parallel and perpendicular to the incident plane, illuminate simultaneously the metasurface processor, and the resulting differentiated signals are separated from each other via appropriate spatial low-pass filters. By leveraging generalized sheet transition conditions and surface susceptibility tensors, we design an asymmetric meta-atom augmented with normal susceptibilities to reach asymmetric response at normal beam illumination. Proof-of-principle simulations are also reported along with the successful realization of signal processing functions. The proposed metasurface overcomes major shortcomings imposed by previous studies, such as large architectures arising from the need for additional subblocks, slow responses, and, most importantly, supporting only a single input with a given polarization. Our results set the path for future developments of material-based analog computing using efficient and easy-to-fabricate MIMO processors for compact, fast, and integrable computing elements without any Fourier lens.
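The operation such computational metasurfaces implement, spatial differentiation of an incoming field profile, corresponds to multiplying the field's angular spectrum by a transfer function proportional to the transverse wavenumber. A numerical analogue with FFTs, just to show the math the metasurface performs physically in free space (the test profile is made up):

```python
import numpy as np

def spatial_derivative(field, dx):
    """First-order spatial differentiation d/dx via the angular spectrum:
    multiply the Fourier transform by i*k_x, then transform back."""
    n = field.shape[-1]
    kx = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    return np.real(np.fft.ifft(1j * kx * np.fft.fft(field)))

# Edge detection on a 1D "image" profile: each smooth edge becomes a peak.
x = np.linspace(-1.0, 1.0, 512, endpoint=False)
profile = 0.5 * (np.tanh((x + 0.5) / 0.05) - np.tanh((x - 0.5) / 0.05))
edge = spatial_derivative(profile, dx=x[1] - x[0])
print(x[np.argmax(edge)], x[np.argmin(edge)])   # approximately -0.5 and +0.5, the edges
```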

Journal ArticleDOI
TL;DR: It is demonstrated that GPUs can be used very efficiently for simulating collisional plasmas, and it is argued that their further use will enable more accurate simulations in a shorter time, increase research productivity, and help advance the science of plasma simulation.

Posted Content
TL;DR: In this article, a new kind of physics-driven learning network comprised of identical self-adjusting variable resistors is presented, where each edge of the network adjusts its own resistance in parallel using only local information, effectively training itself to generate the desired output.
Abstract: The brain is a physical system capable of producing immensely complex collective properties on demand. Unlike a typical man-made computer, which is limited by the bottleneck of one or more central processors, self-adjusting elements of the brain (neurons and synapses) operate entirely in parallel. Here we report laboratory demonstration of a new kind of physics-driven learning network comprised of identical self-adjusting variable resistors. Our system is a realization of coupled learning, a recently-introduced theoretical framework specifying properties that enable adaptive learning in physical networks. The inputs are electrical stimuli and physics performs the output `computations' through the natural tendency to minimize energy dissipation across the system. Thus the outputs are analog physical responses to the input stimuli, rather than digital computations of a central processor. When exposed to training data, each edge of the network adjusts its own resistance in parallel using only local information, effectively training itself to generate the desired output. Our physical learning machine learns a range of tasks, including regression and classification, and switches between them on demand. It operates using well-understood physical principles and is made using standard, commercially available electronic components. As a result our system is massively scalable and amenable to theoretical understanding, combining the simplicity of artificial neural networks with the distributed parallel computation and robustness to damage boasted by biological neuron networks.

Journal ArticleDOI
TL;DR: The proposed techniques to improve the performance of the number theoretic transform implementation include register-based twiddle factor storage and multi-stream asynchronous computation, which leverage the features offered in new GPU architectures.
Abstract: In scientific computing and cryptography, there are many applications that involve large integer multiplication, which is a time-consuming operation. To reduce the computational complexity, the number theoretic transform is widely used, wherein the multiplication can be performed in the frequency domain with reduced complexity. However, the speed performance of large integer multiplication is still not satisfactory if the operand size is very large (e.g., more than 100K bits). In view of this, several researchers have proposed to accelerate the implementation of the number theoretic transform using massively parallel GPU architectures. In this paper, we propose several techniques to improve the performance of the number theoretic transform implementation, which is faster than the state-of-the-art work by Dai et al. The proposed techniques include register-based twiddle factor storage and multi-stream asynchronous computation, which leverage the features offered in new GPU architectures. The proposed number theoretic transform implementation was applied to the CMNT fully homomorphic encryption scheme proposed by Coron et al. With the proposed implementation technique, homomorphic multiplications in CMNT take 0.27 ms on a GTX 1070 desktop GPU and 7.49 ms on a Jetson TX1 embedded system, respectively. This shows that the proposed implementation is suitable for practical applications in server environments as well as embedded systems.
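The basic idea, multiplying large integers or polynomials via a number theoretic transform so that convolution becomes pointwise modular multiplication, looks like this in plain Python. The GPU-specific optimizations of the paper (register-cached twiddle factors, multi-stream execution) are not reflected here, and the NTT-friendly prime below is a common illustrative choice, not the modulus used for CMNT.

```python
MOD = 998244353   # = 119 * 2^23 + 1, with primitive root 3 (supports lengths up to 2^23)
G = 3

def ntt(a, invert=False):
    """In-place iterative radix-2 number theoretic transform over Z_MOD."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Cooley-Tukey butterflies with modular twiddle factors.
    length = 2
    while length <= n:
        w_len = pow(G, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a

def poly_mul(f, g):
    """Multiply two coefficient lists: forward NTTs, pointwise product, inverse NTT."""
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = f + [0] * (n - len(f))
    fb = g + [0] * (n - len(g))
    ntt(fa); ntt(fb)
    prod = [x * y % MOD for x, y in zip(fa, fb)]
    ntt(prod, invert=True)
    return prod[:len(f) + len(g) - 1]

print(poly_mul([1, 2, 3], [4, 5]))   # (1 + 2x + 3x^2)(4 + 5x) -> [4, 13, 22, 15]
```

Large integers are handled the same way: split them into base-B digits, multiply the digit polynomials as above, then propagate carries.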

Proceedings ArticleDOI
09 Jun 2021
TL;DR: MG-Join is a scalable partitioned hash join implementation on multiple GPUs of a single machine that adaptively chooses an efficient route for each data flow to minimize congestion.
Abstract: The recent scale-up of GPU hardware through the integration of multiple GPUs into a single machine and the introduction of higher bandwidth interconnects like NVLink 2.0 has enabled new opportunities for relational query processing on multiple GPUs. However, due to the unique characteristics of GPUs and the interconnects, existing hash join implementations spend up to 66% of their execution time moving the data between the GPUs and achieve lower than 50% utilization of the newer high bandwidth interconnects. This leads to extremely poor scalability of hash join performance on multiple GPUs, which can be slower than the performance on a single GPU. In this paper, we propose MG-Join, a scalable partitioned hash join implementation on multiple GPUs of a single machine. In order to effectively improve the bandwidth utilization, we develop a novel multi-hop routing for cross-GPU communication that adaptively chooses an efficient route for each data flow to minimize congestion. Our experiments on the DGX-1 machine show that MG-Join helps significantly reduce the communication overhead and achieves up to 97% utilization of the bisection bandwidth of the interconnects, resulting in significantly better scalability. Overall, MG-Join outperforms the state-of-the-art hash join implementations by up to 2.5x. MG-Join further helps improve the overall performance of TPC-H queries by up to 4.5x over the multi-GPU version of the open-source commercial GPU database Omnisci.
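A single-machine sketch of the partitioned hash join that MG-Join distributes across GPUs: both inputs are first partitioned by a hash of the join key so that each partition can be built and probed independently, which is what makes the algorithm amenable to massively parallel execution. The multi-hop interconnect routing is the paper's contribution and is not modeled here; relation layouts and data are made up.

```python
from collections import defaultdict

def partitioned_hash_join(R, S, key_r=0, key_s=0, n_partitions=8):
    """R, S: lists of tuples. Returns joined tuples where R[key_r] == S[key_s]."""
    part = lambda k: hash(k) % n_partitions

    # Partition phase: route each tuple to a partition by hashing its join key.
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for r in R:
        r_parts[part(r[key_r])].append(r)
    for s in S:
        s_parts[part(s[key_s])].append(s)

    # Build/probe phase: partitions are independent, so on a multi-GPU system
    # (or a thread pool) they can be processed fully in parallel.
    out = []
    for p in range(n_partitions):
        table = defaultdict(list)
        for r in r_parts[p]:
            table[r[key_r]].append(r)
        for s in s_parts[p]:
            for r in table.get(s[key_s], []):
                out.append(r + s)
    return out

customers = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]
print(partitioned_hash_join(customers, orders))   # rows joined on the shared key
```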

Journal ArticleDOI
TL;DR: It is shown that the RA-PCDM uses substantially less hardware area than a traditional rate adaptation scheme based on low-density parity-check (LDPC) codes, while offering finer rate granularity and shaping gain, which enables high-throughput optical communications to approach channel capacity more closely at a lower cost than RA-LDPC.
Abstract: In this work, we implement a rate-adaptable (RA) prefix-free code distribution matching (PCDM) encoder in a field-programmable gate array (FPGA). The implemented RA-PCDM encoder supports a wide range of information rates from 1.6 to 4.8 bit/symbol with fine granularity of 0.2 bit/symbol in probabilistic constellation shaping (PCS) systems. The shaping performance of the implemented RA-PCDM is within 0.55 dB of the theoretical limit. Massively parallel encoding of RA-PCDM is demonstrated using a sliding window processor, in a universal architecture for all PCDM codebooks. Given that the procedure and complexity of RA-PCDM decoding is almost the same as those of RA-PCDM encoding, it is shown that the RA-PCDM uses substantially less hardware area than a traditional rate adaptation scheme based on low-density parity-check (LDPC) codes, while offering finer rate granularity and shaping gain. The RA-PCDM is therefore a practical PCS technology that enables high-throughput optical communications to approach channel capacity more closely at a lower cost than RA-LDPC.

Journal ArticleDOI
Sang Hyun Sung1, Tae Jin Kim1, Hera Shin1, Hoon Namkung1, Tae Hong Im1, Hee Seung Wang1, Keon Jae Lee1 
TL;DR: Memory-centric neuromorphic computing (MNC) has been proposed for the efficient processing of unstructured data, bypassing the von Neumann bottleneck of current computing architectures.
Abstract: Unstructured data such as visual information, natural language, and human behaviors open up a wide array of opportunities in the field of artificial intelligence (AI). Memory-centric neuromorphic computing (MNC) has been proposed for the efficient processing of unstructured data, bypassing the von Neumann bottleneck of current computing architectures. The development of MNC would provide massively parallel processing of unstructured data, realizing cognitive AI in edge and wearable systems. In this review, recent advances in memory-centric neuromorphic devices are discussed in terms of emerging nonvolatile memories, volatile switches, synaptic plasticity, neuronal models, and memristive neural networks.

Journal ArticleDOI
TL;DR: A highly versatile computational framework for the simulation of cellular blood flow focusing on extreme performance without compromising accuracy or complexity is proposed, suitable for upcoming exascale architectures.
Abstract: We propose a highly versatile computational framework for the simulation of cellular blood flow focusing on extreme performance without compromising accuracy or complexity. The tool couples the lattice Boltzmann solver Palabos for the simulation of blood plasma, a novel finite-element method (FEM) solver for the resolution of deformable blood cells, and an immersed boundary method for the coupling of the two phases. The design of the tool supports hybrid CPU-GPU executions (fluid, fluid-solid interaction on CPUs, deformable bodies on GPUs), and is non-intrusive, as each of the three components can be replaced in a modular way. The FEM-based kernel for solid dynamics outperforms other FEM solvers and its performance is comparable to state-of-the-art mass-spring systems. We perform an exhaustive performance analysis on Piz Daint at the Swiss National Supercomputing Centre and provide case studies focused on platelet transport, implicitly validating the accuracy of our tool. The tests show that this versatile framework combines unprecedented accuracy with massive performance, rendering it suitable for upcoming exascale architectures.

Journal ArticleDOI
TL;DR: In this paper, the authors train a 3 × 3 array of ECRAM devices that learns to discriminate several elementary logic gates (AND, OR, NAND) during parallel in situ (on-line) training, with outer product updates.
Abstract: In-memory computing based on non-volatile resistive memory can significantly improve the energy efficiency of artificial neural networks. However, accurate in situ training has been challenging due to the nonlinear and stochastic switching of the resistive memory elements. One promising analog memory is the electrochemical random-access memory (ECRAM), also known as the redox transistor. Its low write currents and linear switching properties across hundreds of analog states enable accurate and massively parallel updates of a full crossbar array, which yield rapid and energy-efficient training. While simulations predict that ECRAM based neural networks achieve high training accuracy at significantly higher energy efficiency than digital implementations, these predictions have not been experimentally achieved. In this work, we train a 3 × 3 array of ECRAM devices that learns to discriminate several elementary logic gates (AND, OR, NAND). We record the evolution of the network's synaptic weights during parallel in situ (on-line) training, with outer product updates. Due to linear and reproducible device switching characteristics, our crossbar simulations not only accurately simulate the epochs to convergence, but also quantitatively capture the evolution of weights in individual devices. The implementation of the first in situ parallel training together with strong agreement with simulation results provides a significant advance toward developing ECRAM into larger crossbar arrays for artificial neural network accelerators, which could enable orders of magnitude improvements in energy efficiency of deep neural networks.
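The "outer product update" used for in situ training means that every crossbar weight receives its increment simultaneously, from the product of a row signal and a column signal. A small software sketch of the same rule training a single-layer network on the logic gates mentioned above, using a perceptron-style rule with idealized weights; it is not the paper's exact training procedure and contains no ECRAM device physics.

```python
import numpy as np

# Truth tables for the three gates; one output column per gate (a 3x3 weight array,
# matching the scale of the reported hardware demo).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X_b = np.hstack([X, np.ones((4, 1))])                   # append a bias input
Y = np.array([[0, 0, 1],                                # columns: AND, OR, NAND
              [0, 1, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)

W = np.zeros((3, 3))                                    # 3 inputs (2 + bias) x 3 outputs
for epoch in range(20):                                 # perceptron-style training
    for x, t in zip(X_b, Y):
        y_pred = (x @ W > 0).astype(float)
        # Outer-product update: in a crossbar the input vector drives the rows and the
        # error vector drives the columns, so every weight is updated in parallel.
        W += np.outer(x, t - y_pred)

print((X_b @ W > 0).astype(int).T)   # rows reproduce the AND, OR, NAND truth tables
```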

Proceedings ArticleDOI
14 Jun 2021
TL;DR: Sieve as mentioned in this paper proposes three DRAM-based in-situ k-mer matching accelerator designs (one optimized for area, one optimized for throughput, and one that strikes a balance between hardware cost and performance), which leverage a novel data mapping scheme to allow for simultaneous comparisons of millions of DNA base pairs.
Abstract: The rapid influx of biosequence data, coupled with the stagnation of the processing power of modern computing systems, highlights the critical need for exploring high-performance accelerators that can meet the ever-increasing throughput demands of modern bioinformatics applications. This work argues that processing in memory (PIM) is an effective solution to enhance the performance of k-mer matching, a critical bottleneck stage in standard bioinformatics pipelines, that is characterized by random access patterns and low computational intensity. This work proposes three DRAM-based in-situ k-mer matching accelerator designs (one optimized for area, one optimized for throughput, and one that strikes a balance between hardware cost and performance), dubbed Sieve, that leverage a novel data mapping scheme to allow for simultaneous comparisons of millions of DNA base pairs, lightweight matching circuitry for fast pattern matching, and an early termination mechanism that prunes unnecessary DRAM row activation to reduce latency and save energy. Evaluation of Sieve using state-of-the-art workloads with real-world datasets shows that the most aggressive design provides an average of 326x/32x speedup and 74x/48x energy savings over multi-core-CPU/GPU baselines for k-mer matching.
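A plain-software rendering of the k-mer matching kernel that Sieve accelerates in DRAM: the reference k-mers form an index, and each query read is checked k-mer by k-mer (the in-memory design performs such comparisons across millions of base pairs simultaneously). The sequences and k below are toy values.

```python
def kmers(seq, k):
    """Yield all length-k substrings (k-mers) of a DNA sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_kmer_index(reference, k):
    """Set of all k-mers present in the reference: the table queries are matched against."""
    return set(kmers(reference, k))

def match_fraction(read, index, k):
    """Fraction of a read's k-mers that occur in the reference index."""
    total, hits = 0, 0
    for km in kmers(read, k):
        total += 1
        hits += km in index
    return hits / total if total else 0.0

reference = "ACGTACGTGGTCAACGT"
index = build_kmer_index(reference, k=5)
print(match_fraction("TACGTGGTCA", index, k=5))   # 1.0: every 5-mer of this read matches
print(match_fraction("TTTTTTTTTT", index, k=5))   # 0.0: no 5-mer matches
```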

Journal ArticleDOI
TL;DR: This work is based on two key enablers: containers, to isolate Spark's parallel executors and allow for the dynamic and fast allocation of resources, and control theory, to govern resource allocation at runtime and obtain the required precision and speed.
Abstract: Many big-data applications are batch applications that exploit dedicated frameworks to perform massively parallel computations across clusters of machines. The time needed to process the entirety of the inputs represents the application's response time, which can be subject to deadlines. Spark, probably the most famous incarnation of these frameworks today, allocates resources to applications statically at the beginning of the execution and deviations are not managed: to meet the applications’ deadlines, resources must be allocated carefully. This paper proposes an extension to Spark, called dynaSpark, that is able to allocate and redistribute resources to applications dynamically to meet deadlines and cope with the execution of unanticipated applications. This work is based on two key enablers: containers, to isolate Spark's parallel executors and allow for the dynamic and fast allocation of resources, and control-theory to govern resource allocation at runtime and obtain required precision and speed. Our evaluation shows that dynaSpark can (i) allocate resources efficiently to execute single applications with respect to set deadlines and (ii) reduce deadline violations (w.r.t. Spark) when executing multiple concurrent applications.
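A toy illustration of the control-theoretic idea behind dynaSpark, not its actual controller: periodically compare the fraction of work completed with the fraction of the deadline elapsed, and scale the allocated cores in proportion to the error. The gain and bounds below are arbitrary assumptions.

```python
def next_core_allocation(progress, elapsed, deadline, cores, k_p=4.0,
                         min_cores=1, max_cores=64):
    """Proportional-controller sketch: allocate more cores when the job is behind
    schedule and release them when it is ahead, within the cluster's bounds."""
    expected_progress = min(elapsed / deadline, 1.0)   # where the job "should" be by now
    error = expected_progress - progress               # > 0 means the job is running late
    adjusted = cores * (1.0 + k_p * error)
    return max(min_cores, min(max_cores, round(adjusted)))

# A job that is 30% done at 50% of its deadline gets a larger share of the cluster.
print(next_core_allocation(progress=0.30, elapsed=60, deadline=120, cores=8))   # -> 14
```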