
Showing papers on "Multi-core processor published in 2018"


Proceedings Article
09 Apr 2018
TL;DR: The design of AccelNet is presented, including the hardware/software codesign model, performance results on key workloads, and experiences and lessons learned from developing and deploying AccelNet on FPGA-based Azure SmartNICs.
Abstract: Modern cloud architectures rely on each server running its own networking stack to implement policies such as tunneling for virtual networks, security, and load balancing. However, these networking stacks are becoming increasingly complex as features are added and as network speeds increase. Running these stacks on CPU cores takes away processing power from VMs, increasing the cost of running cloud services, and adding latency and variability to network performance. We present Azure Accelerated Networking (AccelNet), our solution for offloading host networking to hardware, using custom Azure SmartNICs based on FPGAs. We define the goals of AccelNet, including programmability comparable to software, and performance and efficiency comparable to hardware. We show that FPGAs are the best current platform for offloading our networking stack as ASICs do not provide sufficient programmability, and embedded CPU cores do not provide scalable performance, especially on single network flows. Azure SmartNICs implementing AccelNet have been deployed on all new Azure servers since late 2015 in a fleet of >1M hosts. The AccelNet service has been available for Azure customers since 2016, providing consistent <15µs VM-VM TCP latencies and 32Gbps throughput, which we believe represents the fastest network available to customers in the public cloud. We present the design of AccelNet, including our hardware/software codesign model, performance results on key workloads, and experiences and lessons learned from developing and deploying AccelNet on FPGA-based Azure SmartNICs.

378 citations


Journal ArticleDOI
TL;DR: This paper presents a detailed analysis of Colaboratory regarding hardware resources, performance, and limitations, and shows that the performance reached using this cloud service is equivalent to that of the dedicated testbeds, given similar resources.
Abstract: Google Colaboratory (also known as Colab) is a cloud service based on Jupyter Notebooks for disseminating machine learning education and research. It provides a runtime fully configured for deep learning and free-of-charge access to a robust GPU. This paper presents a detailed analysis of Colaboratory regarding hardware resources, performance, and limitations. This analysis is performed through the use of Colaboratory for accelerating deep learning for computer vision and other GPU-centric applications. The chosen test-cases are a parallel tree-based combinatorial search and two computer vision applications: object detection/classification and object localization/segmentation. The hardware under the accelerated runtime is compared with a mainstream workstation and a robust Linux server equipped with 20 physical cores. Results show that the performance reached using this cloud service is equivalent to the performance of the dedicated testbeds, given similar resources. Thus, this service can be effectively exploited to accelerate not only deep learning but also other classes of GPU-centric applications. For instance, it is faster to train a CNN on Colaboratory’s accelerated runtime than using 20 physical cores of a Linux server. The performance of the GPU made available by Colaboratory may be enough for several profiles of researchers and students. However, these free-of-charge hardware resources are far from enough to solve demanding real-world problems and are not scalable. The most significant limitation found is the lack of CPU cores. Finally, several strengths and limitations of this cloud service are discussed, which might be useful for helping potential users.
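Since the analysis above centers on what hardware Colaboratory actually exposes, the following is a minimal sketch of how a user could inspect those resources from a notebook cell. It assumes TensorFlow and psutil are importable in the runtime (both ship with Colab at the time of writing, but that is an assumption here, not something the paper states).

# Hedged sketch: inspect CPU cores, RAM, and GPUs visible to a Colab-style runtime.
import multiprocessing
import psutil
import tensorflow as tf

print("Logical CPU cores :", multiprocessing.cpu_count())
print("Total RAM (GiB)   :", round(psutil.virtual_memory().total / 2**30, 1))
print("Visible GPUs      :", tf.config.list_physical_devices("GPU"))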

360 citations


Posted ContentDOI
05 Feb 2018-bioRxiv
TL;DR: This work greatly improves thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture, highlights how bottlenecks are exacerbated by variable-record-length file formats like FASTQ, and suggests changes that enable superior scaling.
Abstract: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.
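To make the FASTQ bottleneck concrete: because records vary in length, worker threads cannot seek to fixed byte offsets, so a common workaround is a single reader that hands whole-record batches to workers. The sketch below illustrates that generic pattern in Python (the tools above are written in C/C++; the file name, batch size, and per-read "work" here are placeholders, not the authors' implementation).

# Hedged sketch: one reader yields batches of complete FASTQ records (4 lines each),
# and worker processes handle the per-read work in parallel.
from itertools import islice
from multiprocessing import Pool

def read_batches(path, records_per_batch=1024):
    with open(path) as fh:
        while True:
            # A FASTQ record is 4 lines: header, sequence, '+', qualities.
            lines = list(islice(fh, 4 * records_per_batch))
            if not lines:
                break
            yield lines

def process_batch(lines):
    # Placeholder for per-read work (e.g. alignment); here: total bases in the batch.
    return sum(len(lines[i + 1].strip()) for i in range(0, len(lines), 4))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        totals = pool.map(process_batch, read_batches("reads.fastq"))  # hypothetical file
    print("total bases:", sum(totals))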

152 citations


Posted Content
TL;DR: eRPC is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics.
Abstract: It is commonly believed that datacenter networking software must sacrifice generality to attain high performance. The popularity of specialized distributed systems designed specifically for niche technologies such as RDMA, lossless networks, FPGAs, and programmable switches testifies to this belief. In this paper, we show that such specialization is not necessary. eRPC is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics. eRPC performs well in three key metrics: message rate for small messages; bandwidth for large messages; and scalability to a large number of nodes and CPU cores. It handles packet loss, congestion, and background request execution. In microbenchmarks, one CPU core can handle up to 10 million small RPCs per second, or send large messages at 75 Gbps. We port a production-grade implementation of Raft state machine replication to eRPC without modifying the core Raft source code. We achieve 5.5 microseconds of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.

140 citations


Journal ArticleDOI
TL;DR: A model of vector parallel ACO for multi-core SIMD CPU architecture is presented and a new fitness proportionate selection approach named Vector-based Roulette Wheel (VRW) in the tour construction stage is proposed, demonstrating the strong potential of CPU-based parallel ACOs.

111 citations


Proceedings ArticleDOI
11 Jul 2018
TL;DR: It is shown that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes.
Abstract: There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the studies that do use the Hyperlink graph rely on distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 13 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We provide a publicly-available benchmark suite containing our implementations.

100 citations


Proceedings ArticleDOI
08 Jul 2018
TL;DR: CARLsim 4, a user-friendly SNN library written in C++ that can simulate large biologically detailed neural networks, is released, improving on the efficiency and scalability of earlier releases and adding new features, such as leaky integrate-and-fire (LIF), 9-parameter Izhikevich, and multi-compartment neuron models, and fourth-order Runge-Kutta integration.
Abstract: Large-scale spiking neural network (SNN) simulations are challenging to implement, due to the memory and computation required to iteratively process the large set of neural state dynamics and updates. To meet these challenges, we have developed CARLsim 4, a user-friendly SNN library written in C++ that can simulate large biologically detailed neural networks. Improving on the efficiency and scalability of earlier releases, the present release allows simulation using multiple GPUs and multiple CPU cores concurrently in a heterogeneous computing cluster. Benchmarking results demonstrate simulation of 8.6 million neurons and 0.48 billion synapses using 4 GPUs and up to 60x speedup for multi-GPU implementations over a single-threaded CPU implementation, making CARLsim 4 well-suited for large-scale SNN models in the presence of real-time constraints. Additionally, the present release adds new features, such as leaky integrate-and-fire (LIF), 9-parameter Izhikevich, and multi-compartment neuron models, and fourth-order Runge-Kutta integration.
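For readers unfamiliar with the LIF model mentioned above, the following is a minimal sketch of its dynamics with a simple Euler step. The parameter values (time constant, thresholds, resistance) are illustrative assumptions, not CARLsim defaults, and the code is a generic NumPy illustration rather than the library's API.

# Hedged sketch of leaky integrate-and-fire dynamics: tau * dv/dt = -(v - v_rest) + R * I.
import numpy as np

def lif_step(v, i_in, dt=1.0, tau=20.0, v_rest=-65.0, v_thresh=-50.0, v_reset=-70.0, r_m=10.0):
    """Advance membrane potentials v (mV) by one time step dt (ms); return (v, spikes)."""
    v = v + dt / tau * (-(v - v_rest) + r_m * i_in)   # leaky integration (Euler step)
    spiked = v >= v_thresh                            # threshold crossing
    v[spiked] = v_reset                               # reset after a spike
    return v, spiked

v = np.full(1000, -65.0)                              # 1000 neurons starting at rest
v, spikes = lif_step(v, i_in=np.random.uniform(0.0, 2.0, 1000))
print("neurons spiking this step:", int(spikes.sum()))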

96 citations


Journal ArticleDOI
26 Oct 2018
TL;DR: This paper focuses on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code.
Abstract: In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: We develop a formal framework to capture computational algorithms, computing platforms, and program transformations of interest, using a unifying mathematical formalism we call operator language (OL). Then we cast the problem of synthesizing highly optimized computational kernels for a given machine as a strongly constrained optimization problem that is solved by search and a multistage rewriting system. Since all rewrite steps are semantics preserving, our approach establishes equivalence between the kernel specification and the synthesized program. This approach is implemented in the SPIRAL system, and we demonstrate it with a selection of computational kernels from the signal and image processing domain, software-defined radio, and robotic vehicle control. Our target platforms range from mobile devices, desktops, and server multicore processors to large-scale high-performance and supercomputing systems, and we demonstrate performance comparable to expertly hand-tuned code across kernels and platforms.

84 citations


Journal ArticleDOI
TL;DR: This paper presents a DTPM algorithm, which uses a practical temperature prediction methodology based on system identification and successfully regulates the maximum temperature and decreases the temperature violations by one order of magnitude while also reducing the total power consumption on average by 7% compared with the default solution.
Abstract: State-of-the-art mobile platforms are powered by heterogeneous system-on-chips that integrate multiple CPU cores, a GPU, and many specialized processors. Competitive performance on these platforms comes at the expense of increased power density due to their small form factor. Consequently, the skin temperature, which can degrade the experience, becomes a limiting factor. Since using a fan is not a viable solution for hand-held devices, there is a strong need for dynamic thermal and power management (DTPM) algorithms that can regulate temperature with minimal performance impact. This paper presents a DTPM algorithm, which uses a practical temperature prediction methodology based on system identification. The proposed algorithm dynamically computes a power budget using the predicted temperature. This budget is used to throttle the frequency and number of cores to avoid temperature violations with minimal impact on the system performance. Our experimental measurements on two different octa-core big.LITTLE processors and common Android applications demonstrate that the proposed technique predicts the temperature with less than 5% error across all benchmarks. Using this prediction, the proposed DTPM algorithm successfully regulates the maximum temperature and decreases the temperature violations by one order of magnitude while also reducing the total power consumption on average by 7% compared with the default solution.
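The control idea above (predict temperature, derive a power budget, throttle accordingly) can be sketched as a small loop. Everything below is an illustrative assumption: the temperature limit, the frequency table, the first-order predictor, and the quadratic power-scaling rule are stand-ins, not the paper's system-identification model or values.

# Hedged sketch of a DTPM-style decision: pick the highest frequency whose predicted
# temperature stays under the limit.
T_LIMIT = 55.0                            # skin-temperature limit (deg C), assumed
FREQS = [600, 1000, 1400, 1800, 2000]     # available frequencies (MHz), assumed

def predict_temp(temp_now, power_now, alpha=0.92, beta=0.35):
    # Toy first-order thermal model: next temperature from current temperature and power.
    return alpha * temp_now + beta * power_now

def pick_frequency(temp_now, power_now):
    # Walk down the frequency list until the predicted temperature fits the budget.
    for f in reversed(FREQS):
        scaled_power = power_now * (f / FREQS[-1]) ** 2   # crude power-vs-frequency scaling
        if predict_temp(temp_now, scaled_power) <= T_LIMIT:
            return f
    return FREQS[0]

print("chosen frequency (MHz):", pick_frequency(temp_now=52.0, power_now=6.0))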

68 citations


Journal ArticleDOI
TL;DR: A computationally efficient parallel implementation for a spectral-spatial classification method based on spatially adaptive Markov random fields (MRFs) that exploits the massively parallel nature of GPUs to achieve significant acceleration factors with regards to the serial and multicore versions of the same classifier on an NVIDIA Tesla K20C platform.
Abstract: Image classification is a very important tool for remotely sensed hyperspectral image processing. Techniques able to exploit the rich spectral information contained in the data, as well as its spatial-contextual information, have shown success in recent years. Due to the high dimensionality of hyperspectral data, spectral-spatial classification techniques are quite demanding from a computational viewpoint. In this paper, we present a computationally efficient parallel implementation for a spectral-spatial classification method based on spatially adaptive Markov random fields (MRFs). The method learns the spectral information from a sparse multinomial logistic regression classifier, and the spatial information is characterized by modeling the potential function associated with a weighted MRF as a spatially adaptive vector total variation function. The parallel implementation has been carried out using commodity graphics processing units (GPUs) and the NVIDIA's Compute Unified Device Architecture. It optimizes the work allocation and input/output transfers between the central processing unit and the GPU, taking full advantages of the computational power of GPUs as well as the high bandwidth and low latency of shared memory. As a result, the algorithm exploits the massively parallel nature of GPUs to achieve significant acceleration factors (higher than 70x) with regards to the serial and multicore versions of the same classifier on an NVIDIA Tesla K20C platform.

63 citations


Journal ArticleDOI
01 Jul 2018
TL;DR: A run-time management approach that first selects thread-to-core mapping based on the performance requirements and resource availability and applies online adaptation by adjusting the voltage-frequency levels to achieve energy optimization, without trading-off application performance is proposed.
Abstract: Heterogeneous multi-core platforms that contain different types of cores, organized as clusters, are emerging, e.g., ARM’s big.LITTLE architecture. These platforms often need to deal with multiple applications, having different performance requirements, executing concurrently. This leads to the generation of varying and mixed workloads (e.g., compute and memory intensive) due to resource sharing. Run-time management is required for adapting to such performance requirements and workload variabilities and to achieve energy efficiency. Moreover, the management becomes challenging when the applications are multi-threaded and the heterogeneity needs to be exploited. The existing run-time management approaches do not efficiently exploit cores situated in different clusters simultaneously (referred to as inter-cluster exploitation) and DVFS potential of cores, which is the aim of this paper. Such exploitation might help to satisfy the performance requirement while achieving energy savings at the same time. Therefore, in this paper, we propose a run-time management approach that first selects thread-to-core mapping based on the performance requirements and resource availability. Then, it applies online adaptation by adjusting the voltage-frequency ( V-f ) levels to achieve energy optimization, without trading-off application performance. For thread-to-core mapping, offline profiled results are used, which contain performance and energy characteristics of applications when executed on the heterogeneous platform by using different types of cores in various possible combinations. For an application, thread-to-core mapping process defines the number of used cores and their type, which are situated in different clusters. The online adaptation process classifies the inherent workload characteristics of concurrently executing applications, incurring a lower overhead than existing learning-based approaches as demonstrated in this paper. The classification of workload is performed using the metric Memory Reads Per Instruction (MRPI). The adaptation process pro-actively selects an appropriate V-f pair for a predicted workload. Subsequently, it monitors the workload prediction error and performance loss, quantified by instructions per second (IPS), and adjusts the chosen V-f to compensate. We validate the proposed run-time management approach on a hardware platform, the Odroid-XU3, with various combinations of multi-threaded applications from PARSEC and SPLASH benchmarks. Results show an average improvement in energy efficiency up to 33 percent compared to existing approaches while meeting the performance requirements.
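Since the adaptation step above hinges on the MRPI metric, here is a minimal sketch of how such a classification-to-frequency mapping could look. The counter values, thresholds, and frequency table are illustrative assumptions; the paper's actual classification boundaries and V-f pairs are not reproduced here.

# Hedged sketch: compute Memory Reads Per Instruction from two counter readings and
# map the resulting workload class to a clock frequency.
def mrpi(mem_reads, instructions):
    return mem_reads / max(instructions, 1)

FREQ_TABLE = {"compute_bound": 2000, "mixed": 1400, "memory_bound": 1000}  # MHz, assumed

def classify(m):
    if m < 0.01:
        return "compute_bound"
    elif m < 0.05:
        return "mixed"
    return "memory_bound"

sample = mrpi(mem_reads=4_200_000, instructions=150_000_000)
print("MRPI =", round(sample, 4), "-> frequency:", FREQ_TABLE[classify(sample)], "MHz")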

Proceedings ArticleDOI
23 Apr 2018
TL;DR: This paper proposes Dynamic Cache Allocation with Partial Sharing (DCAPS), a framework that dynamically monitors and predicts a multi-programmed workload's cache demand, and reallocates LLC given a performance target and is able to optimize for a wide range of performance targets and can scale to a large core count.
Abstract: In a multicore system, effective management of shared last level cache (LLC), such as hardware/software cache partitioning, has attracted significant research attention. Some eminent progress is that Intel introduced Cache Allocation Technology (CAT) to its commodity processors recently. CAT implements way partitioning and provides software interface to control cache allocation. Unfortunately, CAT can only allocate at way level, which does not scale well for a large thread or program count to serve their various performance goals effectively. This paper proposes Dynamic Cache Allocation with Partial Sharing (DCAPS), a framework that dynamically monitors and predicts a multi-programmed workload's cache demand, and reallocates LLC given a performance target. Further, DCAPS explores partial sharing of a cache partition among programs and thus practically achieves cache allocation at a finer granularity. DCAPS consists of three parts: (1) Online Practical Miss Rate Curve (OPMRC), a low-overhead software technique to predict online miss rate curves (MRCs) of individual programs of a workload; (2) a prediction model that estimates the LLC occupancy of each individual program under any CAT allocation scheme; (3) a simulated annealing algorithm that searches for a near-optimal CAT scheme given a specific performance goal. Our experimental results show that DCAPS is able to optimize for a wide range of performance targets and can scale to a large core count.
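For context on the CAT mechanism that DCAPS builds on, the sketch below shows a static, way-level allocation through the Linux resctrl filesystem, which is the software-visible side of CAT. The group name, way mask, and PID are assumptions, the script must run as root on a CAT-capable machine with resctrl mounted, and DCAPS itself implements a dynamic policy on top of CAT rather than this one-shot assignment.

# Hedged sketch: create a resctrl group, restrict it to a subset of L3 ways, and attach a task.
import os

GROUP = "/sys/fs/resctrl/demo_partition"     # hypothetical group name
os.makedirs(GROUP, exist_ok=True)

# Limit this group to the 4 lowest L3 ways on cache domain 0 (mask 0xf); other domains
# are assumed to keep their existing masks.
with open(os.path.join(GROUP, "schemata"), "w") as f:
    f.write("L3:0=f\n")

# Attach a (hypothetical) workload PID to the partition.
with open(os.path.join(GROUP, "tasks"), "w") as f:
    f.write("12345\n")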

Proceedings ArticleDOI
10 Feb 2018
TL;DR: It is concluded that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.
Abstract: We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. As a virtual ISA, it can be used to ship executable programs, in order to achieve both functional portability and performance portability across such systems. At runtime, HPVM enables flexible scheduling policies, both through the graph structure and the ability to compile individual nodes in a program to any of the target devices on a system. We have implemented a prototype HPVM system, defining the HPVM IR as an extension of the LLVM compiler IR, compiler optimizations that operate directly on HPVM graphs, and code generators that translate the virtual ISA to NVIDIA GPUs, Intel's AVX vector units, and to multicore X86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.

Proceedings ArticleDOI
23 Apr 2018
TL;DR: dCat is proposed, a new dynamic cache management technology to provide strong cache isolation with better performance and requires no modifications to applications so that it can be applied to all cloud workloads.
Abstract: In the modern multi-tenant cloud, resource sharing increases utilization but causes performance interference between tenants. More generally, performance isolation is also relevant in any multi-workload scenario involving shared resources. Last level cache (LLC) on processors is shared by all CPU cores in x86, thus the cloud tenants inevitably suffer from the cache flush by their noisy neighbors running on the same socket. Intel Cache Allocation Technology (CAT) provides a mechanism to assign cache ways to cores to enable cache isolation, but its static configuration can result in underutilized cache when a workload cannot benefit from its allocated cache capacity, and/or lead to sub-optimal performance for workloads that do not have enough assigned capacity to fit their working set. In this work, we propose a new dynamic cache management technology (dCat) to provide strong cache isolation with better performance. For each workload, we target a consistent, minimum performance bound irrespective of others on the socket and dependent only on its rightful share of the LLC capacity. In addition, when there is spare capacity on the socket, or when some workloads are not obtaining beneficial performance from their cache allocation, dCat dynamically reallocates cache space to cache-intensive workloads. We have implemented dCat in Linux on top of CAT to dynamically adjust cache mappings. dCat requires no modifications to applications so that it can be applied to all cloud workloads. Based on our evaluation, we see an average of 25% improvement over shared cache and 15.7% over static CAT for selected, memory intensive, SPEC CPU2006 workloads. For typical cloud workloads, with Redis we see 57.6% improvement (over shared LLC) and 26.6% improvement (over static partition) and with ElasticSearch we see 11.9% improvement over both.

Posted Content
TL;DR: This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems and gives the reader an appreciation of the process of creating an optimized hardware design using HLS.
Abstract: This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details are, of necessity, different from parallel programming for multicore processors or GPUs, many of the fundamental concepts are similar. For example, designers must understand memory hierarchy and bandwidth, spatial and temporal locality of reference, parallelism, and tradeoffs between computation and storage. This book is a practical guide for anyone interested in building FPGA systems. In a university environment, it is appropriate for advanced undergraduate and graduate courses. At the same time, it is also useful for practicing system designers and embedded programmers. The book assumes the reader has a working knowledge of C/C++ and includes a significant amount of sample code. In addition, we assume familiarity with basic computer architecture concepts (pipelining, speedup, Amdahl's Law, etc.). A knowledge of the RTL-based FPGA design flow is helpful, although not required.

Proceedings ArticleDOI
21 May 2018
TL;DR: This paper presents a novel parallel implementation for training Gradient Boosting Decision Trees (GBDTs) on Graphics Processing Units (GPUs), and proposes various novel techniques (including Run-length Encoding compression and thread/block workload dynamic allocation, and reusing intermediate training results for efficient gradient computation).
Abstract: In this paper, we present a novel parallel implementation for training Gradient Boosting Decision Trees (GBDTs) on Graphics Processing Units (GPUs). Thanks to the wide use of the open sourced XGBoost library, GBDTs have become very popular in recent years and won many awards in machine learning and data mining competitions. Although GPUs have demonstrated their success in accelerating many machine learning applications, there are a series of key challenges of developing a GPU-based GBDT algorithm, including irregular memory accesses, many small sorting operations and varying data parallel granularities in tree construction. To tackle these challenges on GPUs, we propose various novel techniques (including Run-length Encoding compression and thread/block workload dynamic allocation, and reusing intermediate training results for efficient gradient computation). Our experimental results show that our algorithm named GPU-GBDT is often 10 to 20 times faster than the sequential version of XGBoost, and achieves 1.5 to 2 times speedup over a 40 threaded XGBoost running on a relatively high-end workstation of 20 CPU cores. Moreover, GPU-GBDT outperforms its CPU counterpart by 2 to 3 times in terms of performance-price ratio.
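The run-length encoding mentioned above is easy to illustrate on the CPU side. The sketch below is a generic illustration of the compression idea for a repetitive feature column, not the paper's GPU kernel.

# Simple run-length encoding/decoding sketch: consecutive duplicates become (value, count) pairs.
from itertools import groupby

def rle_encode(values):
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

column = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]
encoded = rle_encode(column)
print(encoded)                          # [(0, 3), (1, 2), (2, 4), (0, 1)]
assert rle_decode(encoded) == column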

Journal ArticleDOI
TL;DR: A new model-based data partitioning algorithm is proposed, which minimizes the execution time of computations in the parallel execution of the application; the correctness of the algorithm and its complexity of $O(m^3 \times p^3)$, where $m$ is the cardinality of the input discrete speed functions, are proved.
Abstract: Modern HPC platforms have become highly heterogeneous owing to tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc. This has resulted in severe resource contention and Non-Uniform Memory Access (NUMA) that have posed serious challenges to model and algorithm developers. Moreover, the accelerators feature limited main memory compared to the multicore CPU host and are connected to it via limited bandwidth PCI-E links thereby requiring support for efficient out-of-card execution. To summarize, the complexities (resource contention, NUMA, accelerator-specific limitations, etc.) have introduced new challenges to optimization of data-parallel applications on these platforms for performance. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we formulate the problem of optimization of data-parallel applications on modern heterogeneous HPC platforms for performance. We then propose a new model-based data partitioning algorithm, which minimizes the execution time of computations in the parallel execution of the application. This algorithm takes as input a set of $p$ discrete speed functions corresponding to $p$ available heterogeneous processors. It does not make any assumptions about the shapes of these functions. We prove the correctness of the algorithm and its complexity of $O(m^3 \times p^3)$, where $m$ is the cardinality of the input discrete speed functions. We experimentally demonstrate the optimality and efficiency of our algorithm using two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous cluster of nodes where each node contains an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor.
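To make the objective concrete: given discrete time functions for each processor, the goal is a distribution of the workload that minimizes the parallel execution time, i.e. the maximum per-processor time. The brute-force search below only illustrates that objective on tiny inputs with invented, non-smooth time functions; it is not the paper's algorithm, which solves the problem far more efficiently.

# Hedged sketch: exhaustively find the workload split that minimizes the maximum per-processor time.
from itertools import product

def best_partition(time_funcs, n):
    """time_funcs maps processor name -> list where entry x is the time to process x units."""
    procs = list(time_funcs)
    best = (float("inf"), None)
    for split in product(range(n + 1), repeat=len(procs)):   # all splits of n units (tiny n only)
        if sum(split) != n:
            continue
        t = max(time_funcs[p][x] for p, x in zip(procs, split))
        best = min(best, (t, split))
    return best

# Illustrative (assumed) non-smooth time functions for a CPU, a GPU, and a Xeon Phi.
times = {
    "cpu": [0, 2, 4, 6, 9, 12, 16, 20, 25],
    "gpu": [0, 5, 5, 6, 6, 7, 7, 8, 8],
    "phi": [0, 4, 6, 7, 10, 11, 14, 18, 22],
}
print(best_partition(times, n=8))   # (parallel time, units per processor)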

Journal ArticleDOI
TL;DR: A novel cellular core network architecture, SoftBox is proposed, combining software-defined networking and network function virtualization to achieve greater flexibility, efficiency, and scalability compared to today’s cellular core.
Abstract: We propose a novel cellular core network architecture, SoftBox, combining software-defined networking and network function virtualization to achieve greater flexibility, efficiency, and scalability compared to today’s cellular core. Aligned with 5G use cases, SoftBox enables the creation of customized, low latency, and signaling-efficient services on a per user equipment (UE) basis. SoftBox consolidates network policies needed for processing each UE’s data and signaling traffic into a light-weight, in-network, and per-UE agent. We design a number of mobility-aware techniques to further optimize: 1) resource usage of agents; 2) forwarding rules and updates needed for steering a UE’s traffic through its agent; 3) migration costs of agents needed to ensure their proximity to mobile UEs; and 4) complexity of distributing the LTE mobility function on agents. Extensive evaluations demonstrate the scalability, performance, and flexibility of the SoftBox design. For example, basic SoftBox has 86%, 51%, and 87% lower signaling overheads, data plane delay, and CPU core usage, respectively, than two open source EPC systems. Moreover, our optimizations efficiently cut different types of data and control plane loads in the basic SoftBox by 51%–98%.

Journal ArticleDOI
TL;DR: The parallel EVPFFT code developed in this work can run massive voxel-based microstructural RVEs, taking advantage of the thousands of logical cores provided by advanced clusters; results achieved on a distributed-memory Cray supercomputer show promising strong and weak scalability, and in some cases even super-linear scalability.

Proceedings ArticleDOI
20 Oct 2018
TL;DR: A new SSD simulation framework, SimpleSSD 2.0, namely Amber, is introduced, which models embedded CPU cores, DRAMs, and various flash technologies (within an SSD), and operates under a full-system simulation environment by enabling data transfer emulation.
Abstract: SSDs have become a major storage component in modern memory hierarchies, and SSD research demands future simulation-based studies that integrate SSD subsystems into a full-system environment. However, several challenges exist in modeling SSDs under full-system simulation; SSDs are built upon their own complete system and architecture, which employs all necessary hardware, such as CPUs, DRAM and an interconnect network. Beyond these hardware components, SSDs also require multiple device controllers, internal caches and software modules that respect a wide spectrum of storage interfaces and protocols. This SSD hardware and software is all necessary to incarnate storage subsystems in a full-system environment, where they can operate in parallel with the host system. In this work, we introduce a new SSD simulation framework, SimpleSSD 2.0, namely Amber, that models embedded CPU cores, DRAMs, and various flash technologies (within an SSD), and operates under a full-system simulation environment by enabling data transfer emulation. Amber also includes a full firmware stack, including DRAM cache logic and flash firmware such as the FTL and HIL, and obeys diverse standard protocols by revising the host DMA engines and system buses of all functional and timing CPU models of a popular full-system simulator (gem5). The proposed simulator can capture the details of dynamic performance and power of embedded cores, DRAMs, firmware and flash under the execution of various OS systems and hardware platforms. Using Amber, we characterize several system-level challenges by simulating different types of full systems, such as mobile devices and general-purpose computers, and offer comprehensive analyses by comparing passive storage and active storage architectures.

Proceedings ArticleDOI
02 Jul 2018
TL;DR: DHL is presented, a novel CPU-FPGA co-design framework for NFV with both high performance and flexibility and a prototype of DHL with Intel DPDK, which greatly reduces the programming efforts to access FPGA, brings significantly higher throughput and lower latency over CPU-only implementation, and minimizes the CPU resources.
Abstract: Network function virtualization (NFV) aims to run software network functions (NFs) on commodity servers. As CPUs are general-purpose hardware, one has to use many CPU cores to handle complex packet processing at line rate. Owing to its performance and programmability, the FPGA has emerged as a promising platform for NFV. However, the programmable logic blocks on an FPGA board are limited and expensive, so implementing entire NFs on the FPGA is resource-demanding. Further, the FPGA needs to be reprogrammed when the NF logic changes, and synthesizing the code can take hours. It is thus inflexible to use the FPGA to implement an entire NFV service chain. We present dynamic hardware library (DHL), a novel CPU-FPGA co-design framework for NFV with both high performance and flexibility. DHL employs the FPGA as an accelerator only for complex packet processing. It abstracts accelerator modules in the FPGA as a hardware function library, and provides a set of transparent APIs for developers. DHL supports running multiple concurrent software NFs with distinct accelerator functions on the same FPGA and provides data isolation among them. We implement a prototype of DHL with Intel DPDK. Experimental results demonstrate that DHL greatly reduces the programming effort needed to access the FPGA, brings significantly higher throughput and lower latency than a CPU-only implementation, and minimizes CPU resource usage.

Journal ArticleDOI
TL;DR: An analytic multi-core processor-performance model is proposed that takes as inputs a parametric, microarchitecture-independent characterization of the target workload and a hardware configuration of the core and the memory hierarchy; it makes automated design-space exploration of exascale systems possible, which is shown using a real-world case study from radio astronomy.
Abstract: Simulators help computer architects optimize system designs. The limited performance of simulators even of moderate size and detail makes the approach infeasible for design-space exploration of future exascale systems. Analytic models, in contrast, offer very fast turn-around times. In this paper we propose an analytic multi-core processor-performance model that takes as inputs a) a parametric microarchitecture-independent characterization of the target workload, and b) a hardware configuration of the core and the memory hierarchy. The processor-performance model considers instruction-level parallelism (ILP) per type, models single instruction, multiple data (SIMD) features, and considers cache and memory-bandwidth contention between cores. We validate our model by comparing its performance estimates with measurements from hardware performance counters on Intel Xeon and ARM Cortex-A15 systems. We estimate multi-core contention with a maximum error of 11.4 percent. The average single-thread error increases from 25 percent for a state-of-the-art simulator to 59 percent for our model, but the correlation is still 0.8, a high relative accuracy, while we achieve a speedup of several orders of magnitude. With a much higher capacity than simulators and more reliable insights than back-of-the-envelope calculations it makes automated design-space exploration of exascale systems possible, which we show using a real-world case study from radio astronomy.
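As a rough intuition for the analytic approach described above, the toy sketch below combines a workload characterization (operations and bytes moved) with a hardware configuration (peak compute rate and memory bandwidth) into a runtime bound. All numbers are assumed, and the paper's model is considerably richer (per-type ILP, SIMD, cache and bandwidth contention between cores).

# Toy roofline-style estimate: the slower of the compute time and the memory time bounds the runtime.
def estimate_runtime(flops, bytes_moved, peak_flops, mem_bw):
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bw
    return max(compute_time, memory_time)   # whichever resource is the bottleneck

t = estimate_runtime(flops=2e12, bytes_moved=8e11,
                     peak_flops=5e11,       # 500 GFLOP/s per socket (assumed)
                     mem_bw=6.8e10)         # 68 GB/s DRAM bandwidth (assumed)
print(f"estimated runtime: {t:.2f} s")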

Proceedings ArticleDOI
21 Feb 2018
TL;DR: The use of the Raspberry Pi single-board computer (SBC) to provide students with hands-on multicore learning experiences and assessment results that indicate such devices are effective at achieving CS2013 PDC learning outcomes are presented.
Abstract: With the requirement that parallel & distributed computing (PDC) topics be covered in the core computer science curriculum, educators are exploring new ways to engage students in this area of computing. In this paper, we discuss the use of the Raspberry Pi single-board computer (SBC) to provide students with hands-on multicore learning experiences. We discuss how the authors use the Raspberry Pi to teach parallel computing, and present assessment results that indicate such devices are effective at achieving CS2013 PDC learning outcomes, as well as motivating further study of parallelism. We believe our results are of significant interest to CS educators looking to integrate parallelism in their classrooms, and support the use of other SBCs for teaching parallel computing.

Proceedings ArticleDOI
01 Sep 2018
TL;DR: This work presents the design and implementation of a hardware dynamic information flow tracking (DIFT) architecture for RISC-V processor cores that protects bare-metal applications against memory corruption attacks such as buffer overflows and format strings, without causing false alarms when running real-world benchmarks.
Abstract: Security for Internet-of-Things devices is an increasingly critical aspect of computer architecture, with implications that spread across a wide range of domains. We present the design and implementation of a hardware dynamic information flow tracking (DIFT) architecture for RISC-V processor cores. Our approach exhibits the following features at the architecture level. First, it supports a robust and software-programmable policy that protects bare-metal applications against memory corruption attacks such as buffer overflows and format strings, without causing false alarms when running real-world benchmarks. Second, it is fast and transparent, having a small impact on application performance and providing fine-grained management of security tags. Third, it consists of a flexible design that can be easily extended to target new sets of attacks. We implemented our architecture on PULPino, an open-source platform that supports the design of different RISC-V cores targeting IoT applications. FPGA-based experimental results show that the overall overhead is low, with no impact on processor performance and a negligible storage increase.

Journal ArticleDOI
TL;DR: A multilevel MPI+OpenMP+OpenCL parallelization approach that provides complete portability across a wide range of hybrid supercomputer architectures and a parallel CFD algorithm for heterogeneous computing of turbulent flows are presented.

Journal ArticleDOI
TL;DR: Efficient algorithms are presented to enhance the ability of GPU-based LU solvers to achieve higher parallelism as well as to exploit the dynamic parallelism feature in the state-of-the-art GPUs.
Abstract: Lower–upper (LU) factorization is widely used in many scientific computations. It is one of the most critical modules in circuit simulators, such as the Simulation Program With Integrated Circuit Emphasis. To exploit the emerging graphics process unit (GPU) computing platforms, several GPU-based sparse LU solvers have been recently proposed. In this paper, efficient algorithms are presented to enhance the ability of GPU-based LU solvers to achieve higher parallelism as well as to exploit the dynamic parallelism feature in the state-of-the-art GPUs. Also, rigorous performance comparisons of the proposed algorithms with GLU as well as KLU, for both the single-precision and double-precision cases, are presented.
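For context on the operation these GPU solvers accelerate, the sketch below shows a sparse LU factorization and solve on the CPU with SciPy's standard interface (SuperLU under the hood). The small matrix is illustrative; it is not related to GLU, KLU, or the paper's benchmarks.

# Reference sketch: factor a sparse system A x = b once, then reuse the LU factors to solve.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

A = sp.csc_matrix(np.array([[4.0, -1.0, 0.0],
                            [-1.0, 4.0, -1.0],
                            [0.0, -1.0, 4.0]]))   # tiny circuit-like matrix (illustrative)
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)                  # sparse LU factorization
x = lu.solve(b)               # triangular solves using the factors
print(x, np.allclose(A @ x, b))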

Journal ArticleDOI
TL;DR: A novel parallel strategy for HW/SW partitioning that leverages the parallel power of a multi-core CPU and a many-core GPU is presented; the solution quality obtained is competitive with existing heuristic methods in reasonable time.
Abstract: In hardware/software (HW/SW) co-design, hardware/software partitioning is an essential step in that it determines which components are implemented in hardware and which in software. Most HW/SW partitioning problems are NP-hard, so heuristic methods have to be used for large problem sizes. This paper presents a parallel genetic algorithm with dispersion correction for HW/SW partitioning on a CPU-GPU system. First, an enhanced genetic algorithm with dispersion correction is presented. Under-constraint individuals are moved toward the feasible region step by step; in this way, intensification is enhanced and the constraints are handled. Second, cost computation and dispersion correction for the individuals are run in parallel. For a given problem size, the overall run-time can be reduced while the diversity of the genetic algorithm is preserved. Third, when many under-constraint individuals must be corrected in an irregular way, the computation process is complicated and its overhead is large. We therefore present a novel parallel strategy that leverages the parallel power of a multi-core CPU together with that of a many-core GPU: the costs of all individuals are computed in parallel on the GPU, while the under-constraint individuals are corrected in parallel on the multi-core CPU. In this way, highly efficient parallel computing is achieved, with dozens of irregular correction steps mapped to the multi-core CPU and thousands of regular cost-computation steps mapped to the many-core GPU. Fourth, at each iteration of the hybrid parallel strategy, the solution vectors of the individuals are transferred to the GPU and their costs are transferred back to the CPU. To further improve efficiency, we propose an asynchronous transfer pattern (stream concurrency pattern) for the CPU-GPU system, in which transfers and computation are overlapped so that the overall run-time is reduced further. Finally, experiments show that the solution quality obtained by our method is competitive with existing heuristic methods in reasonable time, and that combining the multi-core CPU and the many-core GPU efficiently reduces the running time of the proposed method.

Journal ArticleDOI
TL;DR: Mia Wallace is presented, a 65-nm system-on-chip integrating a near-threshold parallel processor cluster tightly coupled with a CNN accelerator that achieves peak energy efficiency of 108 GMAC/s/W at 0.72 V and peak performance of 14 GMAC/s at 1.2 V.
Abstract: Convolutional neural networks (CNNs) have revolutionized computer vision, speech recognition, and other fields requiring strong classification capabilities. These strengths make CNNs appealing in edge node Internet-of-Things (IoT) applications requiring near-sensors processing. Specialized CNN accelerators deliver significant performance per watt and satisfy the tight constraints of deeply embedded devices, but they cannot be used to implement arbitrary CNN topologies or nonconventional sensory algorithms where CNNs are only a part of the processing stack. A higher level of flexibility is desirable for next generation IoT nodes. Here, we present Mia Wallace , a 65-nm system-on-chip integrating a near-threshold parallel processor cluster tightly coupled with a CNN accelerator: it achieves peak energy efficiency of 108 GMAC/s/W at 0.72 V and peak performance of 14 GMAC/s at 1.2 V, leaving 1.2 GMAC/s available for general-purpose parallel processing.

Book
28 Feb 2018
TL;DR: This book covers the scope of parallel programming for modern high performance computing systems and introduces parallelization through important programming paradigms, such as master-slave, geometric Single Program Multiple Data (SPMD) and divide-and-conquer.
Abstract: In view of the growing presence and popularity of multicore and manycore processors, accelerators, and coprocessors, as well as clusters using such computing devices, the development of efficient parallel applications has become a key challenge for exploiting the performance of such systems. This book covers the scope of parallel programming for modern high performance computing systems. It first discusses selected and popular state-of-the-art computing devices and systems available today. These include multicore CPUs, manycore (co)processors such as Intel Xeon Phi, accelerators such as GPUs, and clusters, as well as the programming models supported on these platforms. It next introduces parallelization through important programming paradigms, such as master-slave, geometric Single Program Multiple Data (SPMD) and divide-and-conquer. The practical and useful elements of the most popular and important APIs for programming parallel HPC systems are discussed, including MPI, OpenMP, Pthreads, CUDA, OpenCL, and OpenACC. It also demonstrates, through selected code listings, how these APIs can be used to implement important programming paradigms, and shows how the codes can be compiled and executed in a Linux environment. The book also presents hybrid codes that integrate selected APIs for potentially multi-level parallelization and utilization of heterogeneous resources, and it shows how to use modern elements of these APIs. Selected optimization techniques are also included, such as overlapping communication and computation implemented using various APIs. Features: discusses popular and currently available computing devices and cluster systems; includes typical paradigms used in parallel programs; explores popular APIs for programming parallel applications; provides code templates that can be used for implementation of paradigms; provides hybrid code examples allowing multi-level parallelization; covers the optimization of parallel programs.
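As a minimal illustration of the master-slave paradigm mentioned above, the sketch below uses Python's multiprocessing as a stand-in for the MPI/OpenMP code the book itself presents: a master hands tasks to a pool of workers and gathers the results.

# Hedged sketch of the master-slave (master-worker) paradigm with a process pool.
from multiprocessing import Pool

def work(task):
    # Stand-in for a compute-heavy task handed out by the master.
    return task * task

if __name__ == "__main__":
    tasks = range(16)
    with Pool(processes=4) as pool:          # four worker processes
        results = pool.map(work, tasks)      # master distributes tasks, then gathers results
    print(results)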

Proceedings ArticleDOI
20 Jun 2018
TL;DR: In this paper, image processing algorithms capable of executing in a parallel manner on CPU and GPU platforms are evaluated; all algorithms were tested in TensorFlow, a framework designed for deep learning that can also be used for image processing.
Abstract: Signal, image and Synthetic Aperture Radar imagery algorithms are nowadays used routinely. Due to the huge data volumes and complexity involved, processing them in real time is almost impossible. Image processing algorithms are often inherently parallel in nature, so they fit nicely onto parallel architectures such as multicore Central Processing Units (CPUs) and Graphics Processing Units (GPUs). In this paper, image processing algorithms capable of executing in a parallel manner on CPU and GPU platforms are evaluated. All algorithms were tested in TensorFlow, a framework designed for deep learning that can also be used for image processing. Relative speedups compared to the CPU are given for all algorithms. The TensorFlow GPU implementation can outperform multi-core CPUs for the tested algorithms, with speedups ranging from 3.6 to 15 times.
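To illustrate the kind of CPU-versus-GPU comparison described above, the sketch below runs a simple edge-detection-style convolution in TensorFlow and pins it to whichever device is available. The random batch stands in for real imagery, and the specific filter is an illustrative choice, not one of the paper's benchmarked algorithms.

# Hedged sketch: a depthwise Sobel-like convolution placed on GPU if present, otherwise CPU.
import tensorflow as tf

images = tf.random.uniform((8, 512, 512, 3))        # batch of 8 RGB "images" (random stand-in)
sobel_like = tf.constant([[-1., 0., 1.],
                          [-2., 0., 2.],
                          [-1., 0., 1.]])
kernel = tf.reshape(sobel_like, (3, 3, 1, 1))
kernel = tf.tile(kernel, (1, 1, 3, 1))              # apply the same filter to each channel

device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"
with tf.device(device):
    edges = tf.nn.depthwise_conv2d(images, kernel, strides=[1, 1, 1, 1], padding="SAME")
print(device, edges.shape)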