
Showing papers on "Multi-core processor" published in 2017


Journal ArticleDOI
TL;DR: A novel routing methodology is presented that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures to minimize both memory requirements and latency, while maximizing programming flexibility, through parameter configuration, to support a wide range of event-based neural network architectures.
Abstract: Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large-scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such a scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.

479 citations


Journal ArticleDOI
TL;DR: In this paper, the authors describe the design of an open-source RISC-V processor core specifically designed for near-threshold (NT) operation in tightly coupled multicore clusters and introduce instruction extensions and micro-architectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy.
Abstract: Endpoint devices for the Internet of Things not only need to work under an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and the performance scalability can be gained through parallelism. In this paper, we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multicore clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy. For typical data-intensive sensor processing workloads, the proposed core is, on average, 3.5× faster and 3.2× more energy efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. Single Instruction Multiple Data extensions, such as dot products, and a built-in L0 storage further reduce the shared-memory accesses by 8×, reducing contentions by 3.2×. With four NT-optimized cores, the cluster is operational from 0.6 to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65-nm bulk CMOS technology. In a low-power 28-nm FD-SOI process, a peak efficiency of 193 MOPS/mW (40 MHz and 1 mW) can be achieved.

304 citations


Proceedings ArticleDOI
24 Jun 2017
TL;DR: The DMGC model is introduced, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and it is shown that it provides a way to both classify these algorithms and model their performance.
Abstract: Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.

155 citations
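
The low-precision side of the design space is easiest to see in code. Below is a minimal, illustrative sketch of low-precision SGD in the spirit of the abstract: the model is kept in 8-bit fixed point and updated with stochastic rounding. It is not the paper's implementation; the bit width, scale factor, and the toy least-squares objective are assumptions chosen only for the example.

```python
# Illustrative sketch (not the paper's code): low-precision SGD for linear
# regression, with the model stored in 8-bit fixed point and updates applied
# using stochastic rounding. BITS, SCALE and the toy objective are assumptions.
import numpy as np

BITS = 8
SCALE = 1 / 256.0                      # fixed-point step: value = int8 * SCALE
QMIN, QMAX = -(2 ** (BITS - 1)), 2 ** (BITS - 1) - 1

def stochastic_round(x):
    """Round to an integer probabilistically, preserving the expected value."""
    floor = np.floor(x)
    return floor + (np.random.random(x.shape) < (x - floor))

def quantize(x):
    return np.clip(stochastic_round(x / SCALE), QMIN, QMAX).astype(np.int8)

def low_precision_sgd(X, y, lr=0.05, epochs=20):
    w_q = np.zeros(X.shape[1], dtype=np.int8)        # low-precision model
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w = w_q.astype(np.float64) * SCALE        # dequantize for the gradient
            grad = (X[i] @ w - y[i]) * X[i]           # gradient of 0.5*(x.w - y)^2
            w_q = quantize(w - lr * grad)             # requantize the updated model
    return w_q.astype(np.float64) * SCALE

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    w_true = np.array([0.3, -0.2, 0.1, 0.25])
    y = X @ w_true + 0.01 * rng.normal(size=200)
    print(low_precision_sgd(X, y))                    # should approach w_true
```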


Proceedings ArticleDOI
14 Oct 2017
TL;DR: ZYGOS is presented, a system optimized for μs-scale, in-memory computing on multicore servers that implements a work-conserving scheduler within a specialized operating system designed for high request rates and a large number of network connections.
Abstract: This paper focuses on the efficient scheduling of very fine-grain networked tasks on multicore systems; such tasks are the typical building block of online data-intensive applications. The explicit goal is to deliver high throughput (millions of remote procedure calls per second) for tail latency service-level objectives that are a small multiple of the task size. We present ZYGOS, a system optimized for μs-scale, in-memory computing on multicore servers. It implements a work-conserving scheduler within a specialized operating system designed for high request rates and a large number of network connections. ZYGOS uses a combination of shared-memory data structures, multi-queue NICs, and inter-processor interrupts to rebalance work across cores. For an aggressive service-level objective expressed at the 99th percentile, ZYGOS achieves 75% of the maximum possible load determined by a theoretical, zero-overhead model (centralized queueing with FCFS) for 10μs tasks, and 88% for 25μs tasks. We evaluate ZYGOS with a networked version of Silo, a state-of-the-art in-memory transactional database, running TPC-C. For a service-level objective of 1000μs latency at the 99th percentile, ZYGOS can deliver a 1.63x speedup over Linux (because of its dataplane architecture) and a 1.26x speedup over IX, a state-of-the-art dataplane (because of its work-conserving scheduler).

144 citations
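
The gap the paper targets can be illustrated with a toy queueing simulation (my own sketch, not ZYGOS code): a work-conserving single queue, i.e. the zero-overhead centralized FCFS model referenced in the abstract, versus statically partitioned per-core FIFOs. Core count, load, and the exponential task sizes are assumptions for the example.

```python
# Toy simulation comparing 99th-percentile latency of a partitioned per-core
# FIFO design against a work-conserving centralized-queue design, for Poisson
# arrivals of short tasks on N cores. All parameters are illustrative.
import heapq
import numpy as np

def p99_latency(n_cores=16, n_tasks=100_000, load=0.75, mean_service=10.0,
                work_conserving=True, seed=1):
    """99th-percentile end-to-end latency (in µs) under the chosen policy."""
    rng = np.random.default_rng(seed)
    arrivals = np.cumsum(rng.exponential(mean_service / (load * n_cores), n_tasks))
    service = rng.exponential(mean_service, n_tasks)
    latency = np.empty(n_tasks)
    if work_conserving:
        # Centralized FCFS: each task runs on whichever core frees up first.
        free_at = [0.0] * n_cores
        heapq.heapify(free_at)
        for i in range(n_tasks):
            start = max(arrivals[i], heapq.heappop(free_at))
            heapq.heappush(free_at, start + service[i])
            latency[i] = start + service[i] - arrivals[i]
    else:
        # Partitioned: each task is statically hashed to one core's private FIFO.
        core = rng.integers(0, n_cores, n_tasks)
        free_at = np.zeros(n_cores)
        for i in range(n_tasks):
            start = max(arrivals[i], free_at[core[i]])
            free_at[core[i]] = start + service[i]
            latency[i] = free_at[core[i]] - arrivals[i]
    return np.percentile(latency, 99)

if __name__ == "__main__":
    print("partitioned FIFOs p99:", p99_latency(work_conserving=False))
    print("work-conserving   p99:", p99_latency(work_conserving=True))
```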


Proceedings ArticleDOI
09 May 2017
TL;DR: This work integrates the hardware accelerator into MonetDB, a main-memory column store, and demonstrates a significant improvement in response time and throughput, and provides a novel and efficient implementation of two commonly used SQL operators for strings.
Abstract: Taking advantage of recently released hybrid multicore architectures, such as Intel's Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings: LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.

91 citations


Proceedings ArticleDOI
24 Jul 2017
TL;DR: This paper is the first to examine the design of concurrent data structures for PIM, and shows two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, and (2) novel designs for PIM data structures, using techniques such as combining, partitioning and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.
Abstract: The performance gap between memory and CPU has grown exponentially. To bridge this gap, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located close to memory. Due to its proximity to memory, a memory access from a PIM core is much faster than that from a CPU core. New advances in 3D integration and die-stacked memory make PIM viable in the near future. Prior work has shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications, as well as for pointer-chasing traversals in sequential data structures. However, current server machines have hundreds of cores, and algorithms for concurrent data structures exploit these cores to achieve high throughput and scalability, with significant benefits over sequential data structures. Thus, it is important to examine how PIM performs with respect to modern concurrent data structures and understand how concurrent data structures can be developed to take advantage of PIM. This paper is the first to examine the design of concurrent data structures for PIM. We show two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, such as pointer-chasing data structures and FIFO queues, and (2) novel designs for PIM data structures, using techniques such as combining, partitioning and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.

90 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: Cagra achieves speedups of up to 5× for PageRank, Collaborative Filtering, Label Propagation and Betweenness Centrality over the best published results from state-of-the-art graph frameworks, including GraphMat, Ligra and GridGraph.
Abstract: Large-scale applications implemented in today's high performance graph frameworks heavily underutilize modern hardware systems. While many graph frameworks have made substantial progress in optimizing these applications, we show that it is still possible to achieve up to 5× speedups over the fastest frameworks by greatly improving cache utilization. Previous systems have applied out-of-core processing techniques from the memory/disk boundary to the cache/DRAM boundary. However, we find that blindly applying such techniques is ineffective because the much smaller performance gap between cache and DRAM requires new designs for achieving scalable performance and low overhead. We present Cagra, a cache-optimized in-memory graph framework. Cagra uses a novel technique, CSR Segmenting, to break the vertices into segments that fit in the last-level cache, and partitions the graph into subgraphs based on the segments. Random accesses in each subgraph are limited to one segment at a time, eliminating the much slower random accesses to DRAM. The intermediate updates from each subgraph are written into buffers sequentially and later merged using a low overhead parallel cache-aware merge. Cagra achieves speedups of up to 5× for PageRank, Collaborative Filtering, Label Propagation and Betweenness Centrality over the best published results from state-of-the-art graph frameworks, including GraphMat, Ligra and GridGraph.

86 citations
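
A stripped-down sketch of the CSR Segmenting idea (my own simplification, not Cagra's code): the vertex range is split into cache-sized segments and the edges of each segment are processed together, so random reads of vertex data stay inside one segment; the per-segment partial results are then merged, here with a simple scatter-add. The segment size and random edge list are assumptions for illustration.

```python
# Segmented sparse gather/scatter: random reads of x are confined to one
# cache-sized segment at a time, mimicking (in simplified form) CSR Segmenting.
import numpy as np

def segmented_spmv(num_vertices, edges, x, segment_size=4096):
    """Compute y[dst] = sum over edges of x[src], one source segment at a time."""
    y = np.zeros(num_vertices)
    src, dst = edges[:, 0], edges[:, 1]
    order = np.argsort(src // segment_size, kind="stable")  # group edges by segment
    src, dst = src[order], dst[order]
    seg_ids = src // segment_size
    for seg in np.unique(seg_ids):
        mask = seg_ids == seg
        # Random reads of x now stay inside one segment sized to fit in cache.
        partial = x[src[mask]]
        np.add.at(y, dst[mask], partial)   # per-segment updates, merged into y
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 20_000, 200_000
    edges = rng.integers(0, n, size=(m, 2))
    x = rng.random(n)
    ref = np.zeros(n)
    np.add.at(ref, edges[:, 1], x[edges[:, 0]])
    assert np.allclose(segmented_spmv(n, edges, x), ref)
    print("segmented result matches the direct computation")
```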


Proceedings ArticleDOI
24 Sep 2017
TL;DR: This paper proposes a generalized SDN-controlled NF offload architecture called UNO, which can transparently offload dynamically selected host processors' packet processing functions to sNICs by using multiple switches in the host, while keeping the data center-wide network control and management planes unmodified.
Abstract: Increasingly, smart Network Interface Cards (sNICs) are being used in data centers to offload networking functions (NFs) from host processors, thereby making these processors available for tenant applications. Modern sNICs have fully programmable, energy-efficient multi-core processors on which many packet processing functions, including a full-blown programmable switch, can run. However, having multiple switch instances deployed across the host hypervisor and the attached sNICs makes controlling them difficult and data plane operations more complex. This paper proposes a generalized SDN-controlled NF offload architecture called UNO. It can transparently offload dynamically selected host processors' packet processing functions to sNICs by using multiple switches in the host while keeping the data center-wide network control and management planes unmodified. UNO exposes a single virtual control plane to the SDN controller and hides dynamic NF offload behind a unified virtual management plane. This enables UNO to make optimal use of the host's and sNIC's combined packet processing capabilities with local optimization based on locally observed traffic patterns and resource consumption, and without central controller involvement. Experimental results based on a real UNO prototype in realistic scenarios show promising results: it can save processing worth up to 8 CPU cores, reduce power usage by up to 2x, and reduce the control plane overhead by more than 50%.

82 citations


Journal ArticleDOI
TL;DR: With a new core microarchitecture design, along with an innovative I/O fabric to support several accelerated computing requirements, the Power9 processor meets the diverse computing needs of the cognitive era and provides a platform for accelerated computing.
Abstract: The IBM Power9 processor has an enhanced core and chip architecture that provides superior thread performance and higher throughput. The core and chip architectures are optimized for emerging workloads to support the needs of next-generation computing. Multiple variants of silicon target the scale-out and scale-up markets. With a new core microarchitecture design, along with an innovative I/O fabric to support several accelerated computing requirements, the Power9 processor meets the diverse computing needs of the cognitive era and provides a platform for accelerated computing.

82 citations


Journal ArticleDOI
TL;DR: This paper has extended the StarPU runtime system with an advanced inter-node data management layer that supports the sequential task-based programming model, and shows that this paradigm can also be employed to achieve high performance on modern supercomputers composed of multiple such nodes, with extremely limited changes in the user code.
Abstract: The emergence of accelerators as standard computing resources on supercomputers and the subsequent architectural complexity increase revived the need for high-level parallel programming paradigms. The sequential task-based programming model has been shown to efficiently meet this challenge on a single multicore node possibly enhanced with accelerators, which motivated its support in the OpenMP 4.0 standard. In this paper, we show that this paradigm can also be employed to achieve high performance on modern supercomputers composed of multiple such nodes, with extremely limited changes in the user code. To prove this claim, we have extended the StarPU runtime system with an advanced inter-node data management layer that supports this model by posting communications automatically. We illustrate our discussion with the task-based tile Cholesky algorithm that we implemented on top of this new runtime system layer. We show that it allows for very high productivity while achieving a performance competitive with both the pure Message Passing Interface (MPI)-based ScaLAPACK Cholesky reference implementation and the DPLASMA Cholesky code, which implements another (non-sequential) task-based programming paradigm.

76 citations
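
The paper's case study, the tile Cholesky factorization, is worth seeing as the plain sequential task flow (POTRF/TRSM/SYRK/GEMM per tile) that a runtime such as StarPU would turn into a parallel task graph. The NumPy sketch below is only meant to show that structure; it is not the StarPU implementation, and the matrix and tile sizes are assumptions.

```python
# Tile Cholesky written as a sequence of per-tile tasks; a task-based runtime
# parallelizes these tasks from their data dependencies.
import numpy as np

def tile_cholesky(A, nb):
    """In-place lower-triangular Cholesky of A, processed in nb x nb tiles."""
    n = A.shape[0]
    assert n % nb == 0
    t = n // nb
    T = lambda i, j: A[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]
    for k in range(t):
        T(k, k)[:] = np.linalg.cholesky(T(k, k))                 # POTRF
        for i in range(k + 1, t):
            # TRSM: A(i,k) <- A(i,k) * L(k,k)^-T
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
        for j in range(k + 1, t):
            T(j, j)[:] -= T(j, k) @ T(j, k).T                    # SYRK
            for i in range(j + 1, t):
                T(i, j)[:] -= T(i, k) @ T(j, k).T                # GEMM
    return np.tril(A)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, nb = 256, 64
    M = rng.normal(size=(n, n))
    A = M @ M.T + n * np.eye(n)            # symmetric positive definite
    L = tile_cholesky(A.copy(), nb)
    assert np.allclose(L @ L.T, A)
    print("tile Cholesky verified: L @ L.T == A")
```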


Journal ArticleDOI
TL;DR: A new power budget concept, called Thermal Safe Power (TSP), is presented: an abstraction that provides safe power and power density constraints as a function of the number of simultaneously active cores, and that results in dark silicon estimations which are less pessimistic than estimations using constant power budgets.
Abstract: Chip manufacturers provide the Thermal Design Power (TDP) for a specific chip. The cooling solution is designed to dissipate this power level. But because TDP is not necessarily the maximum power that can be applied, chips are operated with Dynamic Thermal Management (DTM) techniques. To avoid excessive triggering of DTM, system designers usually also use TDP as a power constraint. However, using a single and constant value as a power constraint, e.g., TDP, can result in significant performance losses in homogeneous and heterogeneous manycore systems. Having better power budgeting techniques is a major step towards dealing with the dark silicon problem. This paper presents a new power budget concept, called Thermal Safe Power (TSP), which is an abstraction that provides safe power and power density constraints as a function of the number of simultaneously active cores. Executing cores at any power consumption below TSP ensures that DTM is not triggered. TSP can be computed offline for the worst cases, or online for a particular mapping of cores. TSP can also serve as a fundamental tool for guiding task partitioning and core mapping decisions, especially when core heterogeneity or timing guarantees are involved. Moreover, TSP results in dark silicon estimations which are less pessimistic than estimations using constant power budgets.
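
To make the abstraction concrete, here is a hedged sketch of the kind of computation the "online, per-mapping" case implies: with a linearized steady-state thermal model T = B·P + T_amb, the per-core budget for a given set of active cores is the largest uniform power that keeps every core at or below the critical temperature. The thermal matrix, ambient and critical temperatures below are made-up example values, not the paper's model or data.

```python
# Hedged sketch of a per-mapping power budget in the spirit of TSP.
# B[i, j] = steady-state temperature rise on core i per watt on core j (assumed).
import numpy as np

def power_budget_for_mapping(B, t_amb, t_crit, active):
    """Per-core budget (equal power on every active core) for one mapping."""
    active = np.asarray(active)
    # Temperature rise of each core if every active core draws 1 W.
    rise_per_watt = B[:, active].sum(axis=1)
    # Largest uniform power keeping every core at or below t_crit.
    return (t_crit - t_amb) / rise_per_watt.max()

if __name__ == "__main__":
    n = 16
    idx = np.arange(n)
    # Toy thermal model: strong self-heating plus weaker neighbour coupling.
    B = 0.4 / (1.0 + np.abs(idx[:, None] - idx[None, :]))
    np.fill_diagonal(B, 1.5)
    t_amb, t_crit = 45.0, 80.0
    for k in (2, 4, 8, 16):
        budget = power_budget_for_mapping(B, t_amb, t_crit, list(range(k)))
        print(f"{k:2d} active cores -> {budget:5.2f} W per core, "
              f"{k * budget:6.2f} W total")
```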

Journal ArticleDOI
TL;DR: A set of energy-aware proactive strategies is designed, optimized for throughput and latency QoS requirements, which regulates the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support offered by modern multicore CPUs.

Journal ArticleDOI
TL;DR: This work proposes an FPGA platform using a C-like programming language called Open Computing Language (OpenCL), and an optimization methodology to find the optimal architecture for a given application using the proposed FPGA platform.
Abstract: Stencil computation is widely used in scientific computations, and many accelerators based on multicore CPUs and GPUs have been proposed. Stencil computation has a small operational intensity, so a large external memory bandwidth is usually required for high performance. FPGAs have the potential to solve this problem by utilizing large internal memory efficiently. However, a very large design, testing and debugging time is required to implement an FPGA architecture successfully. To solve this problem, we propose an FPGA platform using a C-like programming language called Open Computing Language (OpenCL). We also propose an optimization methodology to find the optimal architecture for a given application using the proposed FPGA platform. According to the experimental results, we achieved 119–237 Gflop/s of processing power and higher processing speed compared to conventional GPU and multicore CPU implementations.
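
For context, the computation pattern in question is a stencil sweep such as the 2-D five-point Jacobi update below: a handful of neighbouring loads and flops per output point, which is why external memory bandwidth dominates. This NumPy sketch only shows the access pattern and is unrelated to the paper's OpenCL/FPGA implementation; the grid size and coefficients are arbitrary.

```python
# 2-D five-point stencil sweep: each interior point is updated from its four
# neighbours, so operational intensity is low and bandwidth is the bottleneck.
import numpy as np

def jacobi_step(u):
    """One five-point stencil sweep over the interior of a 2-D grid."""
    out = u.copy()
    out[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return out

if __name__ == "__main__":
    u = np.zeros((256, 256))
    u[0, :] = 1.0                     # fixed boundary condition on one edge
    for _ in range(100):
        u = jacobi_step(u)
    print("interior mean after 100 sweeps:", u[1:-1, 1:-1].mean())
```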

Proceedings ArticleDOI
26 Jan 2017
TL;DR: This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler's intermediate representation (IR) with only minor changes to its existing analyses and code transformations.
Abstract: This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler's intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program's control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, we added or modified about 6000 lines of LLVM's 4-million-line codebase. Tapir enables LLVM's existing compiler optimizations for serial code -- including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination -- to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.

Proceedings ArticleDOI
01 May 2017
TL;DR: SlimSell is a vectorizable graph representation that accelerates BFS based on sparse-matrix dense-vector (SpMV) products, reducing the necessary storage (by up to 50%) and thus pressure on the memory subsystem.
Abstract: Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.
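
The SpMV formulation of BFS that SlimSell accelerates can be sketched in a few lines. The version below uses SciPy's CSR format and a Boolean semiring rather than the Sell-C-σ layout and the other semirings from the paper, purely to show the algorithmic structure; the tiny example graph is made up.

```python
# BFS as repeated sparse matrix-vector products over a Boolean semiring.
import numpy as np
import scipy.sparse as sp

def bfs_spmv(adj, source):
    """Return BFS levels (-1 = unreachable) using Boolean-semiring SpMV."""
    n = adj.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    level = 0
    while frontier.any():
        levels[frontier] = level
        # next[v] = OR over u of (adj[v, u] AND frontier[u]),
        # masked by vertices that do not yet have a level.
        nxt = (adj @ frontier) > 0
        frontier = nxt & (levels == -1)
        level += 1
    return levels

if __name__ == "__main__":
    # Small undirected graph: path 0-1-2-3 plus branch 1-4.
    edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
    rows = [u for u, v in edges] + [v for u, v in edges]
    cols = [v for u, v in edges] + [u for u, v in edges]
    adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5))
    print(bfs_spmv(adj, 0))    # expected levels: [0, 1, 2, 3, 2]
```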

Proceedings ArticleDOI
24 Jun 2017
TL;DR: APPROX-NoC is proposed, a hardware data approximation framework with an online data error control mechanism for high performance NoCs that facilitates approximate matching of data patterns, within a controllable value range, to compress them thereby reducing the volume of data movement across the chip.
Abstract: The trend of unsustainable power consumption and large memory bandwidth demands in massively parallel multicore systems, with the advent of the big data era, has brought upon the onset of alternate computation paradigms utilizing heterogeneity, specialization, processor-in-memory and approximation. Approximate Computing is being touted as a viable solution for high performance computation by relaxing the accuracy constraints of applications. This trend has been accentuated by emerging data intensive applications in domains like image/video processing, machine learning and big data analytics that allow inaccurate outputs within an acceptable variance. Leveraging relaxed accuracy for high throughput in Networks-on-Chip (NoCs), which have rapidly become the accepted method for connecting a large number of on-chip components, has not yet been explored. We propose APPROX-NoC, a hardware data approximation framework with an online data error control mechanism for high performance NoCs. APPROX-NoC facilitates approximate matching of data patterns, within a controllable value range, to compress them, thereby reducing the volume of data movement across the chip. Our evaluation shows that APPROX-NoC achieves on average up to 9% latency reduction and 60% throughput improvement compared with state-of-the-art NoC data compression mechanisms, while maintaining low application error. Additionally, with a data intensive graph processing application we achieve a 36.7% latency reduction compared to state-of-the-art compression mechanisms.
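
A loose sketch of the approximate-matching step described in the abstract: each data word is compared against a small table of frequent reference values and replaced by a short index when it falls within a controllable error bound, otherwise it is kept as a literal. The table, bit costs, and error bound here are illustrative assumptions, not APPROX-NoC's actual encoder.

```python
# Approximate value matching for compression: words close enough to a
# dictionary entry are encoded as a small index instead of a full literal.
def approx_compress(words, reference_table, max_rel_error=0.05):
    """Return (encoded stream, bits used); assumed 4-bit indices, 36-bit literals."""
    encoded, bits = [], 0
    for w in words:
        match = None
        for idx, ref in enumerate(reference_table):
            denom = max(abs(w), 1)
            if abs(w - ref) / denom <= max_rel_error:
                match = idx
                break
        if match is not None:
            encoded.append(("ref", match))     # small index instead of full word
            bits += 4
        else:
            encoded.append(("lit", w))         # escape marker + 32-bit literal
            bits += 36
    return encoded, bits

if __name__ == "__main__":
    table = [0, 1, 255, 4096]
    data = [0, 2, 250, 4100, 9999, 1, 0, 260]
    enc, bits = approx_compress(data, table)
    print(enc)
    print(f"{bits} bits vs {32 * len(data)} bits uncompressed")
```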

Journal ArticleDOI
TL;DR: This paper presents a resource management technique that introduces power density as a novel system-level constraint, provides runtime adaptation of the power density constraint according to the characteristics of the executed applications, and reacts to workload changes at runtime.
Abstract: Increasing power densities have led to the dark silicon era, for which heterogeneous multicores with different power and performance characteristics are promising architectures. This paper focuses on maximizing the overall system performance under a critical temperature constraint for heterogeneous tiled multicores, where all cores or accelerators inside a tile share the same voltage and frequency levels. For such architectures, we present a resource management technique that introduces power density as a novel system-level constraint, in order to avoid thermal violations. The proposed technique then assigns applications to tiles by choosing their degree of parallelism and the voltage/frequency levels of each tile, such that the power density constraint is satisfied. Moreover, our technique provides runtime adaptation of the power density constraint according to the characteristics of the executed applications, reacting to workload changes at runtime. Thus, the available thermal headroom is exploited to maximize the overall system performance.

Proceedings Article
12 Jul 2017
TL;DR: It is demonstrated that NUMA-awareness and its attendant pre-processing costs are beneficial only on large machines and for certain algorithms, calling into question the benefits of proposed algorithmic optimizations that rely on extensive preprocessing.
Abstract: Graph processing systems are used in a wide variety of fields, ranging from biology to social networks, and a large number of such systems have been described in the recent literature. We perform a systematic comparison of various techniques proposed to speed up in-memory multicore graph processing. In addition, we take an end-to-end view of execution time, including not only algorithm execution time, but also pre-processing time and the time to load the graph input data from storage. More specifically, we study various data structures to represent the graph in memory, various approaches to pre-processing and various ways to structure the graph computation. We also investigate approaches to improve cache locality, synchronization, and NUMA-awareness. In doing so, we take our inspiration from a number of graph processing systems, and implement the techniques they propose in a single system. We then selectively enable different techniques, allowing us to assess their benefits in isolation and independent of unrelated implementation considerations. Our main observation is that the cost of pre-processing in many circumstances dominates the cost of algorithm execution, calling into question the benefits of proposed algorithmic optimizations that rely on extensive preprocessing. Equally surprising, using radix sort turns out to be the most efficient way of pre-processing the graph input data into adjacency lists, when the graph input data is already in memory or is loaded from fast storage. Furthermore, we adapt a technique developed for out-of-core graph processing, and show that it significantly improves cache locality. Finally, we demonstrate that NUMA-awareness and its attendant pre-processing costs are beneficial only on large machines and for certain algorithms.
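
The pre-processing step the study singles out, turning an in-memory edge list into adjacency lists, amounts to a counting/radix-style sort keyed on the source vertex. The sketch below shows that structure in Python with NumPy; the random edge list is only an example input, and the systems the paper studies do this in parallel.

```python
# Build CSR adjacency lists from an edge list with a counting pass and a
# scatter pass -- essentially one digit of a radix sort on the source vertex.
import numpy as np

def edge_list_to_csr(num_vertices, src, dst):
    """Return (offsets, neighbors) in CSR form from parallel src/dst arrays."""
    degree = np.bincount(src, minlength=num_vertices)
    offsets = np.zeros(num_vertices + 1, dtype=np.int64)
    np.cumsum(degree, out=offsets[1:])
    neighbors = np.empty(len(src), dtype=dst.dtype)
    cursor = offsets[:-1].copy()               # next free slot per vertex
    for s, d in zip(src, dst):                 # scatter pass (the "sort")
        neighbors[cursor[s]] = d
        cursor[s] += 1
    return offsets, neighbors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 1000, 10_000
    src = rng.integers(0, n, m)
    dst = rng.integers(0, n, m)
    offsets, neighbors = edge_list_to_csr(n, src, dst)
    v = 42
    print(f"vertex {v} has {offsets[v + 1] - offsets[v]} out-neighbors")
```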

Proceedings ArticleDOI
24 Jun 2017
TL;DR: Jenga is proposed, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications, and builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime.
Abstract: Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications. We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel memory-centric scheduling mechanisms.
Abstract: Most of today's mixed criticality platforms feature Systems on Chip (SoC) where a multi-core CPU complex (the host) competes with an integrated Graphics Processing Unit (iGPU, the device) for accessing central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access to both clients through often undisclosed or non-priority-driven mechanisms. Such an aspect becomes critical when the iGPU is a high performance massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel paradigms for memory-centric scheduling mechanisms. We analyzed different well-known and commercially available platforms in order to estimate variations in throughput and latencies within various memory access patterns, both at the host and device side.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: This paper prototypes SWAP on a 48-core Cavium ThunderX platform running Linux, and shows average speedups over no cache partitioning that are twice as large as those attained with ThunderX's hardware way partitioning alone.
Abstract: Performance isolation is an important goal in server-class environments. Partitioning the last-level cache of a chip multiprocessor (CMP) across co-running applications has proven useful in this regard. Two popular approaches are (a) hardware support for way partitioning, or (b) operating system support for set partitioning through page coloring. Unfortunately, neither approach by itself is scalable beyond a handful of cores without incurring significant performance overheads. We propose SWAP, a scalable and fine-grained cache management technique that seamlessly combines set and way partitioning. By cooperatively managing cache ways and sets, SWAP ("Set and WAy Partitioning") can successfully provide hundreds of fine-grained cache partitions for the manycore era. SWAP requires no additional hardware beyond way partitioning. In fact, SWAP can be readily implemented in existing commercial servers whose processors support hardware way partitioning. In this paper, we prototype SWAP on a 48-core Cavium ThunderX platform running Linux, and we show average speedups over no cache partitioning that are twice as large as those attained with ThunderX's hardware way partitioning alone.
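
The set-partitioning half of SWAP rests on page-coloring arithmetic, which a short sketch makes concrete: the cache-set index is derived from physical-address bits, and the index bits that lie above the page offset define a page "color"; pages of different colors cannot conflict in the cache. The cache geometry below is an assumed example, not necessarily the ThunderX's configuration.

```python
# Page-coloring arithmetic used by OS-level cache set partitioning.
PAGE_SIZE = 4096            # 4 KiB pages
LINE_SIZE = 64              # cache line size in bytes
CACHE_SIZE = 16 * 2 ** 20   # 16 MiB last-level cache (assumed)
WAYS = 16                   # associativity (assumed)

num_sets = CACHE_SIZE // (LINE_SIZE * WAYS)
sets_per_page = PAGE_SIZE // LINE_SIZE
num_colors = num_sets // sets_per_page       # distinct page colors available

def page_color(physical_addr):
    """Color = set-index bits of the address that lie above the page offset."""
    set_index = (physical_addr // LINE_SIZE) % num_sets
    return set_index // sets_per_page

if __name__ == "__main__":
    print(f"{num_sets} sets, {num_colors} page colors")
    for addr in (0x0000_0000, 0x0000_1000, 0x0010_0000):
        print(f"physical page at {addr:#010x} -> color {page_color(addr)}")
```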

Journal ArticleDOI
TL;DR: IX is presented, a dataplane operating system that provides high I/O performance and high resource efficiency while maintaining the protection and isolation benefits of existing kernels.
Abstract: The conventional wisdom is that aggressive networking requirements, such as high packet rates for small messages and μs-scale tail latency, are best addressed outside the kernel, in a user-level networking stack. We present IX, a dataplane operating system that provides high I/O performance and high resource efficiency while maintaining the protection and isolation benefits of existing kernels. IX uses hardware virtualization to separate management and scheduling functions of the kernel (control plane) from network processing (dataplane). The dataplane architecture builds upon a native, zero-copy API and optimizes for both bandwidth and latency by dedicating hardware threads and networking queues to dataplane instances, processing bounded batches of packets to completion, and eliminating coherence traffic and multicore synchronization. The control plane dynamically adjusts core allocations and voltage/frequency settings to meet service-level objectives. We demonstrate that IX outperforms Linux and a user-space network stack significantly in both throughput and end-to-end latency. Moreover, IX improves the throughput of a widely deployed key-value store by up to 6.4× and reduces tail latency by more than 2×. With three varying load patterns, the control plane saves 46%-54% of processor energy, and it allows background jobs to run at 35%-47% of their standalone throughput.

Journal ArticleDOI
TL;DR: This paper proposes new model-based methods and algorithms for minimization of time and energy of computations for the most general shapes of performance and energy profiles of data parallel applications observed on modern homogeneous multicore clusters.
Abstract: Modern homogeneous parallel platforms are composed of tightly integrated multicore CPUs. This tight integration has resulted in the cores contending for various shared on-chip resources such as Last Level Cache (LLC) and interconnect, leading to resource contention and non-uniform memory access (NUMA). Due to these newly introduced complexities, the performance and energy profiles of real-life scientific applications on these platforms are not smooth and may deviate significantly from the shapes that allowed traditional and state-of-the-art load balancing algorithms to minimize their computation time. In this paper, we propose new model-based methods and algorithms for minimization of time and energy of computations for the most general shapes of performance and energy profiles of data parallel applications observed on modern homogeneous multicore clusters. We formulate the performance and energy optimization problems and present efficient algorithms of complexity O(p^2) solving these problems, where p is the number of processors. It is important to note that the globally optimal solutions found by these algorithms may not load-balance the application. We experimentally study the efficiency and scalability of our algorithms for two data parallel applications, matrix multiplication and fast Fourier transform, on a modern multicore CPU and clusters of such CPUs. We also demonstrate the optimality of solutions determined by our algorithms.

Journal ArticleDOI
TL;DR: PARSEC3.0 is introduced, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain, and that integrates splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently.
Abstract: Benchmarks play a very important role in accelerating the development and research of CMPs. As one of them, the PARSEC suite continues to be updated and revised over and over again so that it can offer better support for researchers. The former versions of PARSEC have enough workloads to evaluate the properties of CMPs with respect to CPU, cache and memory, but they lack applications based on a network stack to assess the performance of CMPs with respect to networking. In this work, we introduce PARSEC3.0, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain. We explore the input sets of splash-2 and expand them to multiple scales, a.k.a. splash-2x. We integrate splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently. Finally, we evaluate the u-TCP/IP stack and the new network workloads, and analyze the characteristics of splash-2 and splash-2x.

Journal ArticleDOI
TL;DR: This paper builds on and adapts highly optimized dense matrix operations from the high performance computing field to RLNC on heterogeneous multicore IoT nodes, demonstrating higher RLNC encoding and decoding throughputs than existing approaches and indicating that the utilization of more cores decreases energy consumption.
Abstract: Random linear network coding (RLNC) has the potential to improve the performance of current and future Internet of Things (IoT) communication systems, but is computationally demanding due to matrix multiplications and inversions. Some single-core RLNC implementations achieve already sufficient coding speeds for contemporary multimedia streaming formats. However, advances in multimedia streaming formats and IoT applications will require the exploitation of heterogeneous multicore architectures, which are becoming common for a wide range of IoT nodes, including smartphones. In this paper, we introduce and evaluate efficient RLNC computing strategies for IoT node architectures, including the emerging heterogeneous big.LITTLE multicore architectures with multiple big (fast) cores and multiple LITTLE (slow) cores. In contrast to existing RLNC implementation strategies, we build on and adapt highly optimized dense matrix operations from the high performance computing field to RLNC on heterogeneous multicore IoT nodes. Our approach includes the optimization of RLNC matrix operations through optimized operations on matrix blocks with single instruction multiple data instructions. We schedule block operations on the heterogeneous cores through a directed acyclic graph that avoids artificial synchronization points while ensuring the data dependencies. We examine priority scheduling according to the number of outgoing dependencies of a task and data locality of cached blocks. Our extensive measurements with several heterogeneous big.LITTLE multicore IoT node and smartphone processor boards demonstrate higher RLNC encoding and decoding throughputs than existing approaches. Moreover, our measurements indicate that the utilization of more cores decreases energy consumption, which is an important goal for IoT nodes.
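
Since RLNC is at heart matrix algebra, a minimal sketch helps: encoding is a random matrix multiply over a finite field and decoding is Gaussian elimination. The version below works over GF(2) for brevity, whereas practical coders (and the blocked SIMD kernels the paper optimizes) typically use GF(2^8); the generation size, packet size, and amount of redundancy are assumptions for the example.

```python
# RLNC over GF(2): encode = random 0/1 matrix multiply mod 2,
# decode = Gauss-Jordan elimination on the augmented coefficient matrix.
import numpy as np

def rlnc_encode(packets, num_coded, rng):
    """Produce num_coded random linear combinations (over GF(2)) of the packets."""
    coeffs = rng.integers(0, 2, size=(num_coded, len(packets)), dtype=np.uint8)
    coded = (coeffs @ packets) % 2           # GF(2) matrix multiply
    return coeffs, coded

def rlnc_decode(coeffs, coded):
    """Recover the original packets by Gauss-Jordan elimination over GF(2)."""
    A = np.concatenate([coeffs, coded], axis=1).astype(np.uint8)
    g = coeffs.shape[1]
    row = 0
    for col in range(g):
        pivot = next((r for r in range(row, len(A)) if A[r, col]), None)
        if pivot is None:
            raise ValueError("not enough innovative packets yet")
        A[[row, pivot]] = A[[pivot, row]]
        for r in range(len(A)):
            if r != row and A[r, col]:
                A[r] ^= A[row]               # XOR is GF(2) row elimination
        row += 1
    return A[:g, g:]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    packets = rng.integers(0, 2, size=(8, 64), dtype=np.uint8)  # generation of 8 packets
    while True:                    # keep drawing coded packets until decodable
        coeffs, coded = rlnc_encode(packets, num_coded=12, rng=rng)
        try:
            recovered = rlnc_decode(coeffs, coded)
            break
        except ValueError:
            continue
    assert np.array_equal(recovered, packets)
    print("decoded packets match the originals")
```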

Proceedings ArticleDOI
12 Nov 2017
TL;DR: A comprehensive evaluation for a wide spectrum of scientific kernels with a large amount of representative inputs on two Intel OPMs, guided by general optimization models, demonstrates OPM's effectiveness for easing programmers' tuning efforts to reach ideal throughput for both compute-bound and memory-bound applications.
Abstract: High-bandwidth On-Package Memory (OPM) innovates the conventional memory hierarchy by augmenting a new on-package layer between classic on-chip cache and off-chip DRAM. Due to its relative location and capacity, OPM is often used as a new type of LLC. Despite the adaptation in modern processors, the performance and power impact of OPM on HPC applications, especially scientific kernels, is still unknown. In this paper, we fill this gap by conducting a comprehensive evaluation for a wide spectrum of scientific kernels with a large amount of representative inputs, including dense, sparse and medium, on two Intel OPMs: eDRAM on multicore Broadwell and MCDRAM on manycore Knights Landing. Guided by our general optimization models, we demonstrate OPM's effectiveness for easing programmers' tuning efforts to reach ideal throughput for both compute-bound and memory-bound applications.

Journal ArticleDOI
27 Sep 2017
TL;DR: Two contention-aware scheduling strategies that produce a time-triggered schedule of the application's tasks are presented, as well as a heuristic solution that generates schedules very close to those of the ILP (5% longer on average), with a much lower time complexity.
Abstract: Multi-core systems are increasingly interesting candidates for executing parallel real-time applications, in the avionics, space or automotive industries, as they provide both computing capabilities and power efficiency. However, ensuring that timing constraints are met on such platforms is challenging, because some hardware resources are shared between cores. Assuming worst-case contentions when analyzing the schedulability of applications may result in systems mistakenly declared unschedulable, although the worst-case level of contention can never occur in practice. In this paper, we present two contention-aware scheduling strategies that produce a time-triggered schedule of the application's tasks. Based on knowledge of the application's structure, our scheduling strategies precisely estimate the effective contentions, in order to minimize the overall makespan of the schedule. An Integer Linear Programming (ILP) solution of the scheduling problem is presented, as well as a heuristic solution that generates schedules very close to those of the ILP (5% longer on average), with a much lower time complexity. Our heuristic improves the overall makespan of the resulting schedules by 19% compared to a worst-case contention baseline.

Journal ArticleDOI
TL;DR: GHOST is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems and implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism.
Abstract: While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring “standard” as well as “accelerated” resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.

01 Jan 2017
TL;DR: The research reported in this thesis addresses several challenges of improving the efficiency and effectiveness of parallel processing of analytical database queries on modern multi- and many-core systems, using an open-source column-oriented analytical database management system, MonetDB, for validation.
Abstract: The research reported in this thesis addresses several challenges of improving the efficiency and effectiveness of parallel processing of analytical database queries on modern multi- and many-core systems, using an open-source column-oriented analytical database management system, MonetDB, for validation. In contrast to the existing work, we also broaden the research from focusing on individual operators and algorithms to considering the entire system and process holistically. Resource-efficient parallel query execution requires a detailed insight into the parameters affecting query execution. We design and develop new visual analysis techniques and tools that help to identify and rank performance bottlenecks of parallel query execution on multi-core systems. We design and develop a novel learning-based adaptive technique for multi-core parallel plan generation using query execution feedback. This technique proves to be particularly efficient with concurrent workloads, a scenario which is very common in practice but has been largely uncharted in database query parallelization research. We also introduce a simple technique where a multi-socket system is treated as a distributed shared-nothing database system, in which remote memory accesses can be constrained, thereby giving controlled query execution performance. Many-core system architectures imitate GPU-style parallel execution; however, data transfer on the PCIe bus, which connects the Xeon Phi co-processor to the host, is a bottleneck. We analyze the effect of streaming execution of selected queries, to utilize the PCIe bandwidth optimally. The lessons, experiences and insights gained in this thesis are valuable for emerging analytical database systems in the context of multi- and many-core systems.

Journal ArticleDOI
TL;DR: A novel GPU-accelerated batch-QR solver is proposed, which packages a massive number of QR tasks to formulate a new larger-scale problem and thereby achieves a higher level of parallelism and better coalesced memory accesses, laying a critical foundation for many other power system applications that need to deal with massive subtasks.
Abstract: The graphics processing unit (GPU) has been applied successfully in many scientific computing realms due to its superior performance in floating-point calculation and memory bandwidth, and has great potential in power system applications. The N-1 static security analysis (SSA) appears to be a candidate application in which massive numbers of alternating current power flow (ACPF) problems need to be solved. However, when applying existing GPU-accelerated algorithms to the N-1 SSA problem, the degree of parallelism is limited because existing research has been devoted to accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of GPU-accelerated sequential-ACPF solutions is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages a massive number of QR tasks to formulate a new larger-scale problem and thereby achieves a higher level of parallelism and better coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch-Jacobian-matrix generation and contingency screening method is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show that a dramatic computation time reduction is achieved compared with all reported existing GPU-accelerated methods. In comparison to the UMFPACK-library-based single-CPU counterpart using an Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using an NVIDIA K20C achieves up to 57.6 times speedup. It can even achieve four times speedup when compared to one of the fastest multi-core CPU parallel computing solutions using the KLU library. The proposed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte-Carlo simulation and probabilistic power flow.
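
To illustrate the batching idea itself (many small, independent QR factorizations handled as one larger stacked problem), here is a CPU-side NumPy sketch; the paper's solver is a hand-tuned CUDA kernel, and the random well-conditioned systems below are only a stand-in workload. The stacked form of numpy.linalg.qr assumes NumPy 1.22 or newer.

```python
# Batched QR solve: factor a whole stack of small dense systems at once,
# then back-substitute each one.
import numpy as np

def solve_batch_qr(A_batch, b_batch):
    """Solve A_i x_i = b_i for a batch of small dense systems via QR."""
    Q, R = np.linalg.qr(A_batch)                    # one stacked (batched) factorization
    rhs = np.einsum("bij,bi->bj", Q, b_batch)       # Q_i^T b_i for every system
    return np.array([np.linalg.solve(r, y) for r, y in zip(R, rhs)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n = 512, 30                              # e.g. 512 contingency cases
    A = rng.normal(size=(batch, n, n)) + 3.0 * np.eye(n)
    x_true = rng.normal(size=(batch, n))
    b = np.einsum("bij,bj->bi", A, x_true)
    x = solve_batch_qr(A, b)
    print("max abs error:", np.abs(x - x_true).max())
```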