
Showing papers on "Degree of parallelism published in 2018"


Journal ArticleDOI
TL;DR: The state-of-the-art research on GPU-based graph processing is summarized, the existing challenges are analyzed in detail, and the research opportunities for the future are explored.
Abstract: In the big data era, much real-world data can be naturally represented as graphs. Consequently, many application domains can be modeled as graph processing. Graph processing, especially the processing of large-scale graphs with vertices and edges in the order of billions or even hundreds of billions, has attracted much attention in both industry and academia. It still remains a great challenge to process such large-scale graphs, and researchers have been seeking new possible solutions. Because of the massive degree of parallelism and the high memory access bandwidth of GPUs, utilizing GPUs to accelerate graph processing proves to be a promising solution. This article surveys the key issues of graph processing on GPUs, including data layout, memory access pattern, workload mapping, and specific GPU programming. In this article, we summarize the state-of-the-art research on GPU-based graph processing, analyze the existing challenges in detail, and explore the research opportunities for the future.

129 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work revisits the iconic algorithm of Chiba and Nishizeki and develops the most efficient parallel algorithm for listing all k-cliques in graphs containing up to tens of millions of edges, which is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism.
Abstract: Motivated by recent studies in the data mining community which require to efficiently list all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for such a problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges as well as all $10$-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs.
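
The core recursion behind Chiba–Nishizeki-style k-clique listing is compact enough to sketch. Below is a minimal, single-machine Python illustration (not the authors' parallel implementation): edges are oriented along a degree-based ordering and candidate sets are intersected recursively; as in the paper, parallelism would come from running the independent top-level calls concurrently.

```python
def list_k_cliques(adj, k):
    """Yield all k-cliques of an undirected graph given as {vertex: set(neighbours)}.

    Sketch of the Chiba–Nishizeki-style recursion: orient edges from lower to
    higher rank (here: by degree, then id) and recursively intersect
    out-neighbourhoods. The per-vertex top-level calls are independent, which
    is where a parallel implementation splits the work.
    """
    order = {v: i for i, v in enumerate(sorted(adj, key=lambda v: (len(adj[v]), v)))}
    out = {v: {u for u in adj[v] if order[u] > order[v]} for v in adj}

    def expand(clique, candidates, depth):
        if depth == k:
            yield tuple(clique)
            return
        for v in list(candidates):
            yield from expand(clique + [v], candidates & out[v], depth + 1)

    for v in adj:                      # top-level loop: embarrassingly parallel
        yield from expand([v], out[v], 1)

# Example: triangles (3-cliques) of a small graph
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(sorted(list_k_cliques(adj, 3)))   # the two triangles {1,2,3} and {1,3,4}
```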

109 citations


Journal ArticleDOI
TL;DR: A dynamic partial-parallel data layout (DPPDL) is proposed for green video surveillance storage, which dynamically allocates the storage space with an appropriate degree of partial parallelism according to performance requirement.
Abstract: Video surveillance requires storing massive amounts of video data, which results in rapidly increasing storage energy consumption. With the popularization of video surveillance, green storage for video surveillance is very attractive. The existing energy-saving methods for massive storage mostly concentrate on data centers, mainly with random access, whereas the storage of video surveillance has inherent workload characteristics and access patterns, which can be fully exploited to save more energy. A dynamic partial-parallel data layout (DPPDL) is proposed for green video surveillance storage. It adopts a dynamic partial-parallel strategy, which dynamically allocates the storage space with an appropriate degree of partial parallelism according to the performance requirement. Partial parallelism benefits energy conservation by scheduling only a subset of the disks to work; a dynamic degree of parallelism can provide appropriate performance for workloads of various intensities. DPPDL is evaluated with a simulated video surveillance system consisting of 60–300 cameras with $1920 \times 1080$ pixels. The experiments show that DPPDL is the most energy efficient, while tolerating single disk failure and providing more than 20% performance margin. On average, it saves 7%, 19%, 31%, 36%, 56%, and 59% more energy than CacheRAID, Semi-RAID, Hibernator, MAID, eRAID5, and PARAID, respectively.
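
The core decision in a partial-parallel layout can be sketched in a few lines. The sketch below is illustrative only (the parameter names and the 20% margin are assumptions, not the paper's model): it picks the smallest number of active disks whose aggregate bandwidth covers the surveillance workload, so the remaining disks can stay in standby.

```python
import math

def partial_parallelism(cameras, mb_per_s_per_camera, per_disk_write_mb_per_s,
                        margin=0.2, total_disks=12):
    """Illustrative sketch of DPPDL's core decision (not the paper's exact model):
    choose the smallest degree of partial parallelism (number of active disks)
    whose aggregate write bandwidth covers the workload plus a safety margin."""
    required = cameras * mb_per_s_per_camera * (1 + margin)        # MB/s to absorb
    dop = min(total_disks, max(1, math.ceil(required / per_disk_write_mb_per_s)))
    return dop

# e.g. 300 cameras writing 0.5 MB/s each, disks sustaining 80 MB/s sequential writes
print(partial_parallelism(cameras=300, mb_per_s_per_camera=0.5,
                          per_disk_write_mb_per_s=80))   # 3 active disks suffice
```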

46 citations


Journal ArticleDOI
TL;DR: This work proposes a light-weight asynchronous processing framework called Frog with a preprocessing/hybrid coloring model based on the Pareto principle about coloring algorithms, and finds that a majority of vertices are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency.
Abstract: GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, the coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weight asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on the Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of real-world graph coloring cases. We find that a majority of vertices (about 80 percent) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk, and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph, and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP, and CC). On all the tested applications and datasets, Frog is able to significantly outperform existing GPU-based graph processing systems except Gunrock and MapGraph. MapGraph gets better performance than Frog when running BFS on RoadNet-CA. The comparison between Gunrock and Frog is inconclusive. Frog can outperform Gunrock more than 1.04X when running PageRank and SSSP, while the advantage of Frog is not obvious when running BFS and CC on some datasets especially for RoadNet-CA.
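
The hybrid-coloring idea can be illustrated with a small single-machine sketch (Frog itself targets GPUs; CPU threads and the 5% cutoff below are stand-in assumptions): colour the graph greedily, process each large colour class as one conflict-free parallel pass, and lump the long tail of tiny classes into a final hybrid pass.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def greedy_coloring(adj):
    """Assign each vertex the smallest colour not used by its neighbours."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def hybrid_passes(adj, small_class_cutoff=0.05):
    """Frog-style grouping (sketch): each large colour class is one conflict-free
    parallel pass (no two vertices in a class share an edge); the tail of tiny
    classes is merged into one final hybrid pass, which on a real GPU would need
    locking or atomics."""
    color = greedy_coloring(adj)
    classes = defaultdict(list)
    for v, c in color.items():
        classes[c].append(v)
    n = len(adj)
    big = [vs for vs in classes.values() if len(vs) >= small_class_cutoff * n]
    tail = [v for vs in classes.values() if len(vs) < small_class_cutoff * n for v in vs]
    return big + ([tail] if tail else [])

def run(adj, update):
    """Apply a per-vertex update pass by pass, with full parallelism inside a pass."""
    for group in hybrid_passes(adj):
        with ThreadPoolExecutor() as pool:
            list(pool.map(update, group))
```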

38 citations


Journal ArticleDOI
TL;DR: Experimental results conclusively demonstrate that the proposed E-Stream provides better system response time and application fairness compared to the existing Storm framework.
Abstract: Online scheduling plays a key role for big data streaming applications in a big data stream computing environment, as the arrival rate of a high-velocity continuous data stream might fluctuate over time. In this paper, an elastic online scheduling framework for big data streaming applications (E-Stream) is proposed, exhibiting the following features. (1) Profile mathematical relationships between system response time, multiple-application fairness, and the online features of the high-velocity continuous stream. (2) Scale out or scale in a data stream graph by quantifying the computation and communication cost and the vertex semantics for the arrival rate of the data stream, and adjust the degree of parallelism of the vertices in the graph; subgraphs are further constructed to minimize data dependencies among them. (3) Elastically schedule a graph by a priority-based earliest-finish-time-first online scheduling strategy, and schedule multiple graphs by a max–min fairness strategy. (4) Evaluate the low system response time and acceptable application fairness objectives in a real-world big data stream computing environment. Experimental results conclusively demonstrate that the proposed E-Stream provides better system response time and application fairness compared to the existing Storm framework.
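
The scale-out/scale-in decision for a single operator can be sketched as a utilization controller. The policy below (target utilization, hysteresis band, one-step scale-in) is an assumed illustration, not E-Stream's formulae.

```python
import math

def adjust_parallelism(current, arrival_rate, service_rate_per_replica,
                       target_util=0.7, band=0.15):
    """Sketch of elastic DOP adjustment for one stream operator (an assumed
    policy, not E-Stream's exact formulae): keep per-replica utilization near
    a target, scaling out when overloaded and scaling in when under-loaded."""
    util = arrival_rate / (current * service_rate_per_replica)
    if util > target_util + band:            # overloaded: jump to the needed DOP
        return math.ceil(arrival_rate / (target_util * service_rate_per_replica))
    if util < target_util - band and current > 1:
        return current - 1                   # under-loaded: shed one replica
    return current                           # inside the hysteresis band: no change

print(adjust_parallelism(current=2, arrival_rate=1800, service_rate_per_replica=500))  # 6
```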

33 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this article, causal video understanding models are proposed to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles by using operation pipelining and multi-rate clocks.
Abstract: We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. The models are still very deep, with dozens of such operations being performed but in a pipelined fashion that enables depth-parallel computation. We illustrate the proposed principles by applying them to existing image architectures and analyse their behaviour on two video tasks: action recognition and human keypoint localisation. The results show that a significant degree of parallelism, and implicitly speedup, can be achieved with little loss in performance.
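
The pipelining idea can be made concrete with a tiny sketch (illustrative, not the paper's architectures): at timestep t, stage i consumes what stage i-1 produced at t-1, so every stage can run in parallel on a different frame and each timestep costs only one layer of new computation per stage, with outputs delayed by the pipeline depth.

```python
def pipelined_run(layers, frames):
    """Depth-parallel pipeline sketch: `layers` is a list of functions; at each
    timestep, stage i processes the value stage i-1 emitted one step earlier.
    Per timestep, each stage does one unit of work, and the output for a frame
    appears len(layers) steps after it enters."""
    depth = len(layers)
    stages = [None] * depth            # activation currently held by each stage
    outputs = []
    for frame in list(frames) + [None] * depth:        # extra steps drain the pipe
        if stages[-1] is not None:
            outputs.append(stages[-1])
        for i in range(depth - 1, 0, -1):              # shift back-to-front
            stages[i] = layers[i](stages[i - 1]) if stages[i - 1] is not None else None
        stages[0] = layers[0](frame) if frame is not None else None
    return outputs

# toy example: three "layers", each adds 1; outputs lag the input by 3 steps
print(pipelined_run([lambda x: x + 1] * 3, range(5)))  # [3, 4, 5, 6, 7]
```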

31 citations


Journal ArticleDOI
TL;DR: An efficient GPU-based parallel EMT simulator is designed that significantly accelerates EMT simulations compared with a CPU-based program, and code automation tools improve computational efficiency by substantially reducing addressing and memory access.
Abstract: Electromagnetic transients (EMT) simulation is the most accurate and intensive computation for power systems. Past research has shown the potential of accelerating such simulations using graphics processing units (GPUs). In this paper, an efficient GPU-based parallel EMT simulator is designed. Thread-oriented model transformations are first proposed for the electrical and control systems. Following the transformations, the electrical system is represented by connected networks of massive primitive electrical elements, the computations of which can be constructed as massive fused multiply-add operations and solutions to a linear equation. The control systems are represented by a layered directed acyclic graph with primitive control elements that can be dealt with using single-instruction-multiple-threads groups. Finally, code automation tools are designed to form the GPU kernels. Compared with past work, the proposed model transformations improve the degree of parallelism. Most importantly, the code automation tools improve computational efficiency by substantially reducing addressing and memory access, and render the implementation of the algorithm more general and convenient. Test systems of different sizes were created by connecting multiple IEEE 33-bus distribution systems and adding distributed generators. Simulations were performed on NVIDIA’s K20X and P100 cards. The results indicate that the proposed method significantly accelerates EMT simulations compared with a CPU-based program. Real-time performance was also achieved under certain conditions.
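
The "massive fused multiply-add" view of primitive elements can be illustrated with the standard trapezoidal companion model (a generic textbook sketch, not the paper's kernels): an inductor becomes a conductance G = dt/(2L) in parallel with a history current updated each step by one FMA per element, which maps naturally onto one GPU thread per element.

```python
import numpy as np

# Illustrative, vectorized history-current update for many inductors at once.
n, dt = 100_000, 50e-6
L = np.random.uniform(1e-3, 1e-2, n)      # element parameters
G = dt / (2.0 * L)                        # companion-model conductances
v = np.random.randn(n)                    # branch voltages from the network solve
i = G * v                                 # branch currents (history assumed zero here)
I_hist_next = i + G * v                   # the per-element FMA performed every time step
```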

31 citations


Book ChapterDOI
01 Jan 2018
TL;DR: In this paper, a possible theoretical explanation for the somewhat surprising empirical success of deep networks is provided.
Abstract: In the past, the most widely used neural networks were 3-layer ones. These networks were preferred, since one of the main advantages of the biological neural networks—which motivated the use of neural networks in computing—is their parallelism, and 3-layer networks provide the largest degree of parallelism. Recently, however, it was empirically shown that, in spite of this argument, multi-layer (“deep”) neural networks lead to much more efficient machine learning. In this paper, we provide a possible theoretical explanation for the somewhat surprising empirical success of deep networks.

29 citations


Journal ArticleDOI
TL;DR: A new distributed MapReduce prototype generation method called CHI-PG is introduced that provides linear O(N) time complexity, ensures constant accuracy regardless of the degree of parallelism, and is shown to be a candidate solution to the time and memory constraints of k-Nearest Neighbors when tackling large-scale datasets.

19 citations


Journal ArticleDOI
TL;DR: This paper defines a resource allocation model, a parallelism degree model, and an allocation fitness model on the basis of a theoretical analysis of the Spark architecture, and proposes an easy-to-apply strategy embedded in the evaluation model.
Abstract: With the emergence of the big data era, most current performance optimization strategies are mainly used in distributed computing frameworks with disks as the underlying storage. They may solve the problems of traditional disk-based distribution, but they are hard to transplant and are not well suited to performance optimization for an in-memory computing framework, on account of the different underlying storage and computation architecture. In this paper, we first give the definitions of the resource allocation model, the parallelism degree model, and the allocation fitness model on the basis of a theoretical analysis of the Spark architecture. Second, based on the presented models, we propose an easy-to-apply strategy embedded in the evaluation model. The optimization strategy assigns subsequent tasks to a worker with a lower load that satisfies the requirements, while a worker with a higher load may not be assigned tasks. Experiments consisting of four different jobs are conducted to verify the effectiveness of the presented strategy.
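
The load-aware placement described above can be sketched as a simple greedy assignment. The field names (free_mem, free_cores, load) and the load proxy are illustrative assumptions, not the paper's fitness model.

```python
def assign_tasks(tasks, workers):
    """Sketch of load-aware placement: each task goes to the least-loaded worker
    that still has enough free memory and cores; heavily loaded workers simply
    receive no further tasks."""
    placement = {}
    for task in tasks:
        candidates = [w for w in workers
                      if w["free_mem"] >= task["mem"] and w["free_cores"] >= task["cores"]]
        if not candidates:
            raise RuntimeError("no worker satisfies the task's requirements")
        w = min(candidates, key=lambda w: w["load"])
        w["free_mem"] -= task["mem"]
        w["free_cores"] -= task["cores"]
        w["load"] += task["cores"]              # simple proxy for the added load
        placement[task["id"]] = w["id"]
    return placement

workers = [{"id": "w1", "load": 4, "free_mem": 8, "free_cores": 4},
           {"id": "w2", "load": 1, "free_mem": 16, "free_cores": 8}]
tasks = [{"id": t, "mem": 2, "cores": 1} for t in range(3)]
print(assign_tasks(tasks, workers))   # all three tasks land on the lightly loaded w2
```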

16 citations


Journal ArticleDOI
Jie Yang1, Yongxing Yang1, Zhe Chen1, Liyuan Liu1, Jian Liu1, Nanjian Wu1 
TL;DR: The proposed heterogeneous parallel processor introduces a new degree of parallelism, namely patch parallelism, for parallel local-feature extraction and feature detection, and can flexibly perform state-of-the-art computer vision as well as various image processing algorithms at high speed.
Abstract: This paper proposes a heterogeneous parallel processor for high-speed vision chip. It contains four levels of processors with different parallelisms and complexities: processing element (PE) array processor, patch processing unit (PPU) array processor, self-organizing map (SOM) neural network processor, and dual-core microprocessor unit (MPU). The fine-grained PE array processor, middle-grained PPU array processor, and SOM neural network processor carry out image processing in pixel-parallel, patch-parallel, and distributed-parallel fashions, respectively. The MPU controls the overall system and executes some serial algorithms. The processor can improve the total system performance from low-level to high-level image processing significantly. A prototype is implemented with $64 \times 64$ PE array, $8 \times 8$ PPU array, $16 \times 24$ SOM network, and a dual-core MPU. The proposed heterogeneous parallel processor introduces a new degree of parallelism, namely, patch parallel, which is for parallel local-feature extraction and feature detection. It can flexibly perform the state-of-the-art computer vision as well as various image processing algorithms at high speed. Various complicated applications, including feature extraction, face detection, and high-speed tracking, are demonstrated.
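
A software analogue of the chip's patch-parallel level is easy to sketch (illustrative only; the PPU array does this in hardware): the image is split into an 8x8 grid of patches and one local feature is extracted per patch, with every patch independent of the others.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def patch_features(image, grid=(8, 8), feature=np.mean):
    """Split the image into a grid of patches and extract one local feature per
    patch; patches are independent, so the grid maps onto an array of patch
    processors working in parallel."""
    h, w = image.shape
    ph, pw = h // grid[0], w // grid[1]
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid[0]) for c in range(grid[1])]
    with ThreadPoolExecutor() as pool:
        feats = list(pool.map(feature, patches))
    return np.array(feats).reshape(grid)

print(patch_features(np.random.rand(64, 64)).shape)   # (8, 8) feature map
```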

Journal ArticleDOI
TL;DR: An efficient Lagrangian Relaxation based Parallel Routing Optimization Algorithm (LR-PROA) is developed to speed up the routing optimization process in large networks by utilizing the massive parallel computation capability of GPU.

Proceedings ArticleDOI
28 Apr 2018
TL;DR: The proposed 8-bit fixed-point parallel multiply-accumulate (MAC) unit architecture aims to provide a fully customized MAC unit for Convolutional Neural Networks (CNNs) instead of depending on the conventional DSP blocks and embedded memory units of the FPGA silicon fabric.
Abstract: Deep neural network algorithms have proven their enormous capabilities in a wide range of artificial intelligence applications, especially in printed/handwritten text recognition, multimedia processing, robotics, and many other high-end technological trends. The most challenging aspect nowadays is overcoming the extreme computational processing demands of applying such algorithms, especially in real-time systems. Recently, the Field Programmable Gate Array (FPGA) has been considered one of the optimal hardware accelerator platforms for accelerating deep neural network architectures due to its large adaptability and the high degree of parallelism it offers. In this paper, the proposed 8-bit fixed-point parallel multiply-accumulate (MAC) unit architecture aims to provide a fully customized MAC unit for Convolutional Neural Networks (CNNs) instead of depending on the conventional DSP blocks and embedded memory units of the FPGA silicon fabric. The proposed 8-bit fixed-point parallel MAC unit architecture is designed using the VHDL language and can perform computations at up to 4.17 Giga Operations per Second (GOPS) using high-density FPGAs.
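
A bit-accurate software model of one 8-bit fixed-point MAC step is shown below (the accumulator width and saturation behaviour are assumptions for illustration, not the paper's RTL): signed 8-bit operands are multiplied and added into a wider saturating accumulator, which is the operation the proposed unit replicates in parallel on the FPGA fabric.

```python
def sat(x, bits):
    """Two's-complement saturation to a signed `bits`-wide value."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def mac8(acc, a, b, acc_bits=24):
    """One 8-bit fixed-point multiply-accumulate step (illustrative model):
    the 16-bit product of two signed 8-bit operands is added into a wider
    accumulator, saturating at `acc_bits` bits."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    return sat(acc + a * b, acc_bits)

# dot product of two int8 vectors, as a CNN convolution kernel would compute it
weights = [12, -45, 90, 127]
inputs = [-3, 77, 5, -128]
acc = 0
for w, x in zip(weights, inputs):
    acc = mac8(acc, w, x)
print(acc)   # -19307, which fits comfortably in the 24-bit accumulator
```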

Proceedings ArticleDOI
Xiulin Li1, Shijun Liu1, Li Pan1, Yuliang Shi1, Xiangxu Meng1 
02 Jul 2018
TL;DR: A novel tandem queuing network with a parallel multi-station multi-server system as an analytical model for service clouds serving composite service application jobs containing parallelizable tasks is described.
Abstract: Performance analysis is important for service clouds serving composite service application jobs that contain parallelizable tasks, since optimizing the degree of parallelism (DOP) and the resource allocation scheme can improve performance considerably. In this paper, we describe a novel tandem queuing network with a parallel multi-station multi-server system as an analytical model for service clouds serving composite service application jobs. We design a partition method (termed the 'pleasing partition') that helps us propose an analytical model for parallelizable services, which are the vital fraction of composite services. From this model, we obtain the complete probability distribution of the response time, the waiting time, and other important performance metrics. Thus, using this model, cloud operators can determine proper job configurations and resource allocation schemes to achieve a specific QoS (Quality of Service). Extensive simulations are conducted to validate that our analytical model has high accuracy in predicting performance metrics of composite service application jobs.
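
The paper's tandem parallel multi-station model is richer than can be reproduced here; as a stand-in, the sketch below uses the classical Erlang-C (M/M/c) formula for a single station to show how mean response time reacts to the number of parallel servers, which is the DOP knob the analytical model optimizes.

```python
from math import factorial

def mm_c_response_time(lam, mu, c):
    """Mean response time of an M/M/c queue via the Erlang-C formula; used here
    only as a stand-in for one station of a tandem parallel model."""
    rho = lam / (c * mu)
    if rho >= 1:
        return float("inf")               # unstable: arrivals exceed capacity
    a = lam / mu                          # offered load
    erlang_c = (a**c / (factorial(c) * (1 - rho))) / (
        sum(a**k / factorial(k) for k in range(c)) + a**c / (factorial(c) * (1 - rho)))
    wq = erlang_c / (c * mu - lam)        # mean waiting time in queue
    return wq + 1 / mu                    # plus the service time itself

# response time shrinks, with diminishing returns, as the DOP (c) grows
for c in (4, 6, 8):
    print(c, round(mm_c_response_time(lam=6.0, mu=2.0, c=c), 3))
```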

Proceedings ArticleDOI
01 Dec 2018
TL;DR: A case study was conducted to evaluate two unsupervised machine learning algorithms for this purpose; results showed that the distributed versions can achieve the same accuracy and provide a performance improvement by orders of magnitude when compared to their centralized versions.
Abstract: Anomaly detection is a valuable feature for detecting and diagnosing faults in large-scale, distributed systems. These systems usually provide tens of millions of lines of logs that can be exploited for this purpose. However, centralized implementations of traditional machine learning algorithms fall short to analyze this data in a scalable manner. One way to address this challenge is to employ distributed systems to analyze the immense amount of logs generated by other distributed systems. We conducted a case study to evaluate two unsupervised machine learning algorithms for this purpose on a benchmark dataset. In particular, we evaluated distributed implementations of PCA and K-means algorithms. We compared the accuracy and performance of these algorithms both with respect to each other and with respect to their centralized implementations. Results showed that the distributed versions can achieve the same accuracy and provide a performance improvement by orders of magnitude when compared to their centralized versions. The performance of PCA turns out to be better than K-means, although we observed that the difference between the two tends to decrease as the degree of parallelism increases.
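
A minimal sketch of the two distributed algorithms is given below, assuming Spark MLlib as the distributed implementation (the abstract does not name the framework) and using hypothetical feature columns derived from log event counts.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("log-anomaly-sketch").getOrCreate()
df = spark.read.parquet("hdfs:///logs/event_count_matrix")   # hypothetical path

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
features = assembler.transform(df)

# distributed PCA: poor reconstruction in the reduced space flags anomalous log blocks
pca_model = PCA(k=10, inputCol="features", outputCol="pca").fit(features)
projected = pca_model.transform(features)

# distributed K-means: points far from their cluster centre flag anomalous log blocks
kmeans_model = KMeans(k=8, featuresCol="features", predictionCol="cluster").fit(features)
clustered = kmeans_model.transform(features)
```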

Journal ArticleDOI
TL;DR: This study introduces a novel approach, called parallelism reservation, which is inspired by the rich internal parallelism of NVMe SSDs, to reserve sufficient degrees of parallelism for read, write, and garbage collection operations, making sure that an NVMe SSD delivers stable read and write throughput and reclaims free space at a constant rate.
Abstract: Non-Volatile Memory Express (NVMe) is a specification for next-generation solid-state disks (SSDs). Benefiting from the massive internal parallelism and the high-speed PCIe bus, NVMe SSDs achieve extremely high data transfer rates, and they are an ideal solution for shared storage in virtualization environments. Providing virtual machines with Service Level Objective (SLO) compliance on NVMe SSDs is a challenging task, because garbage collection activities inside NVMe SSDs globally affect the I/O performance of all virtual machines. In this study, we introduce a novel approach, called parallelism reservation, which is inspired by the rich internal parallelism of NVMe SSDs. The degree of parallelism stands for how many flash chips are concurrently active. Our basic idea is to reserve sufficient degrees of parallelism for read, write, and garbage collection operations, making sure that an NVMe SSD delivers stable read and write throughput and reclaims free space at a constant rate. The stable read and write throughput are proportionally distributed among virtual machines for SLO compliance. Our experimental results show that our parallelism reservation approach delivers satisfactory throughput and highly predictable responses to virtual machines.
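
The reservation idea can be sketched in a few lines (the shares and weights below are illustrative, not the paper's policy): the SSD's flash chips are divided into fixed groups for reads, writes, and garbage collection, and the write share is split among virtual machines in proportion to their SLO weights.

```python
def reserve_parallelism(total_chips, shares, vm_slo_weights):
    """Sketch of parallelism reservation: dedicate fixed numbers of flash chips
    to reads, writes, and GC so each activity sees stable bandwidth, then split
    the write chips among VMs proportionally to their SLO weights."""
    reserved = {k: int(total_chips * f) for k, f in shares.items()}
    reserved["gc"] += total_chips - sum(reserved.values())   # give any remainder to GC
    total_w = sum(vm_slo_weights.values())
    per_vm = {vm: reserved["write"] * w / total_w for vm, w in vm_slo_weights.items()}
    return reserved, per_vm

reserved, per_vm = reserve_parallelism(
    total_chips=32,
    shares={"read": 0.5, "write": 0.3, "gc": 0.2},
    vm_slo_weights={"vm1": 2, "vm2": 1, "vm3": 1})
print(reserved)   # {'read': 16, 'write': 9, 'gc': 7}
print(per_vm)     # write-chip share per VM, proportional to its SLO weight
```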

Proceedings ArticleDOI
01 Aug 2018
TL;DR: FPGA is used to accelerate the Spark tasks developed with Python, and in this way, the main computing load is performed on FPGA instead of CPU.
Abstract: Apache Spark is an efficient distributed computing framework for big data processing. It supports in-memory computation on RDDs (Resilient Distributed Datasets) and provides reusability, fault tolerance, and real-time stream processing. However, the tasks in the Spark framework are only performed on CPUs. The low degree of parallelism and power inefficiency of CPUs may restrict the performance and scalability of the cluster. In order to improve the performance and power dissipation of the data center, heterogeneous accelerators such as FPGAs, GPUs, and MICs (Many Integrated Core) exhibit more efficient performance than general-purpose processors in big data processing. In this work, we propose a framework to integrate an FPGA accelerator into a Spark cluster. We use the FPGA to accelerate Spark tasks developed with Python, and in this way, the main computing load is performed on the FPGA instead of the CPU. We illustrate the performance of the FPGA-based Spark framework with a case study of 2D-FFT algorithm acceleration. The results show that the FPGA-based Spark implementation achieves a 1.79x speedup over the CPU implementation.
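
A CPU-only sketch of how the 2D-FFT case study could be expressed as Python Spark tasks is shown below; numpy.fft.fft2 stands in for the FPGA kernel the paper offloads to, and the data layout (one image per record, eight partitions) is an assumption for illustration.

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fft2d-sketch").getOrCreate()
sc = spark.sparkContext

images = sc.parallelize([np.random.rand(512, 512) for _ in range(64)], numSlices=8)

def fft_partition(records):
    # In the paper's framework this body would hand the batch to the FPGA
    # accelerator; here it simply runs the transform on the CPU.
    for img in records:
        yield np.fft.fft2(img)

spectra = images.mapPartitions(fft_partition)
print(spectra.map(lambda s: float(np.abs(s).max())).reduce(max))
```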

Journal ArticleDOI
TL;DR: pSAIS, a parallel variant of SAIS designed for a multicore machine treated as a shared-memory parallel model, has a high degree of parallelism and achieves the best average time and space performance among all the parallel algorithms in comparison.
Abstract: Sorting the suffixes of an input string is a fundamental task in many applications such as data compression, genome alignment, and full-text search. The induced sorting (IS) method has been successfully applied to design a number of state-of-the-art suffix sorting algorithms. In particular, the SAIS algorithm designed by the IS method is not only linear in time but also fast in practice. However, the parallelization of SAIS remains a challenge because the IS process in the algorithm is inherently sequential and is the performance bottleneck of the whole algorithm. This article presents our attempt to design a parallel variant of SAIS, called pSAIS, on a multicore machine treated as a shared-memory parallel model. By a combined use of multithreading and pipelining, the inducing process is accelerated by fully utilizing the machine's parallel computing power. An experimental study is conducted to evaluate the performance of pSAIS against the other existing parallel suffix sorting algorithms, on a set of realistic inputs with varying sizes and alphabets. The experiment results show that our program for pSAIS has a high degree of parallelism and achieves the best average time and space performance among all the parallel algorithms in comparison. While pSAIS is designed for quickly building big suffix arrays on a multicore machine, our study may give some hints for extending the induced sorting method to GPUs for constructing small suffix arrays even faster.

Proceedings ArticleDOI
14 May 2018
TL;DR: This paper analyzes the relationship between synchronization cost and event efficiency, and first looks at how these two characteristics are coupled via the computation of Global Virtual Time (GVT), then introduces dynamic load balancing, and shows how this can achieve higher efficiency with less synchronization cost.
Abstract: Parallel Discrete Event Simulations (PDES) running at large scales involve the coordination of billions of very fine grain events distributed across a large number of processes. At such large scales optimistic synchronization protocols, such as TimeWarp, allow for a high degree of parallelism between processes, but with the additional complexity of managing event rollback and cancellation. This can become especially problematic in models that exhibit imbalance resulting in low event efficiency, which increases the total amount of work required to run a simulation to completion. Managing this complexity becomes key to achieving a high degree of performance across a wide range of models. In this paper, we address this issue by analyzing the relationship between synchronization cost and event efficiency. We first look at how these two characteristics are coupled via the computation of Global Virtual Time (GVT). We then introduce dynamic load balancing, and show how, when combined with low overhead GVT computation, we can achieve higher efficiency with less synchronization cost. In doing so, we achieve up to 2x better performance on a variety of benchmarks and models of practical importance.
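
The coupling discussed above follows from the definition of GVT: a lower bound on every timestamp that can still be affected by a rollback, i.e. the minimum over each process's local virtual time and the timestamps of messages still in flight. A minimal sketch, with illustrative data structures:

```python
def compute_gvt(local_virtual_times, in_flight_timestamps):
    """GVT lower-bounds every timestamp that can still roll back: the minimum
    over each process's local virtual time and the timestamps of all messages
    sent but not yet processed. Events strictly below GVT are safe to commit
    and their saved state can be reclaimed (fossil collection)."""
    return min(min(local_virtual_times),
               min(in_flight_timestamps, default=float("inf")))

# three optimistic processes, one straggler message still in the network
print(compute_gvt([105.0, 98.5, 130.2], [97.0]))   # 97.0
```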

Proceedings ArticleDOI
21 May 2018
TL;DR: This paper presents several implementations of Distributed Control, a data-driven unordered approach with work prioritization, and demonstrates that customizable scheduling policies result in the most efficient implementation, outperforming the well-known Δ-stepping Single-Source Shortest Paths and Jones-Plassmann vertex-coloring algorithms.
Abstract: In this paper we explore scheduling and runtime system support for unordered distributed graph computations that rely on optimistic (speculative) execution. Performance of such algorithms is impacted by two competing trends: the higher degree of parallelism enabled by optimistic execution in turn requires substantial runtime support. To address the potentially high overhead and scheduling complexity introduced by the runtime, we investigate customizable scheduling policies that augment the scheduler of the underlying runtime to adapt it to a specific graph application. We present several implementations of Distributed Control (DC), a data-driven unordered approach with work prioritization and demonstrate that customizable scheduling policies result in the most efficient implementation, outperforming the well-known Δ-stepping Single-Source Shortest Paths (SSSP) and Jones-Plassmann vertex-coloring algorithms. We apply two scheduling techniques, flow control and adaptive frequency of network progress, which allow application-level control over the balance of domain work and the runtime work. Experimental results show the benefit of such application-aware scheduling for irregular distributed graph algorithms.
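
A single-node analogue of data-driven, work-prioritized relaxation (the core of the Distributed Control approach; the distributed, unordered execution across ranks is not reproduced here) can be sketched with a priority buffer from which stale work items are simply discarded.

```python
import heapq

def prioritized_relax_sssp(adj, source):
    """Single-node sketch of prioritized, data-driven SSSP relaxation: pending
    (distance, vertex) work items sit in a priority buffer, stale items are
    dropped when popped, and no global ordering is ever enforced."""
    dist = {source: 0.0}
    buffer = [(0.0, source)]
    while buffer:
        d, v = heapq.heappop(buffer)
        if d > dist.get(v, float("inf")):
            continue                       # stale work item: drop it
        for u, w in adj[v]:
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(buffer, (nd, u))
    return dist

adj = {0: [(1, 2.0), (2, 5.0)], 1: [(2, 1.0)], 2: []}
print(prioritized_relax_sssp(adj, 0))   # {0: 0.0, 1: 2.0, 2: 3.0}
```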

Journal ArticleDOI
TL;DR: In this article, the authors present an efficient implementation of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where they achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700.
Abstract: QR factorization is a ubiquitous operation in many engineering and scientific applications. In this paper, we present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where we achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit a higher degree of parallelism, where parallelism is quantified as the number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on the theoretical analysis of classical HT, an opportunity to re-arrange computations in the classical HT is identified that results in Modified HT (MHT), which is shown to exhibit 1.33x higher parallelism than classical HT. Experiments on off-the-shelf multicore and GPGPUs for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in the optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for DLA and show that MHT achieves 1.3x better performance than a native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that, due to the re-arrangement of computations in MHT, the custom realization of MHT is capable of achieving a 12 percent better performance improvement over multicore and GPGPUs than the performance improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs, which is counter-intuitive.
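
For reference, the classical Householder Transform the paper re-arranges is compact; below is a plain numpy sketch of classical HT-based QR (not the Modified HT or the custom data-path realization).

```python
import numpy as np

def householder_qr(A):
    """Classical Householder-Transform QR (reference sketch): each step builds a
    reflector that zeroes the sub-diagonal of one column and applies it to the
    trailing submatrix and to the accumulated Q."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])               # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H_k to R
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q = H_1 ... H_k
    return Q, R

A = np.random.rand(6, 4)
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))   # True True
```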

Book ChapterDOI
27 Aug 2018
TL;DR: This work proposes an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints and experimentally evaluated the implemented strategy and demonstrated its effectiveness on a real-world application.
Abstract: Stream processing applications have become a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers often face a trade-off between coding productivity and performance when introducing parallelism. SPar was created to balance this trade-off for application programmers by using the C++11 attributes' annotation mechanism. In SPar and other programming frameworks for stream processing applications, the manual definition of the number of replicas to be used for the stream operators is a challenge. In addition, low latency is required by several stream processing applications. We noted that explicit latency requirements are poorly considered in state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints. We experimentally evaluated our strategy on a real-world application and showed that it can provide higher abstraction levels while automatically managing the latency.
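
A control loop in the spirit of the strategy described can be sketched as follows. This is an assumed additive-increase policy, not SPar's actual algorithm, and `measure_latency` / `apply_replicas` are hypothetical caller-supplied callbacks.

```python
import time

def adaptive_replicas(measure_latency, apply_replicas, latency_sla_ms,
                      min_r=1, max_r=16, period_s=5.0):
    """Adaptive replica controller sketch: add a replica while the observed
    latency violates the SLA, and remove one when there is ample headroom."""
    replicas = min_r
    apply_replicas(replicas)
    while True:                                     # runs for the application's lifetime
        time.sleep(period_s)
        latency = measure_latency()                 # e.g. recent p95 latency in ms
        if latency > latency_sla_ms and replicas < max_r:
            replicas += 1                           # scale out to cut latency
        elif latency < 0.5 * latency_sla_ms and replicas > min_r:
            replicas -= 1                           # scale in to save resources
        apply_replicas(replicas)
```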

Journal ArticleDOI
TL;DR: A model is proposed that embeds service policies into formulae to calculate composite service performance and predicts the optimal DOP at which the composite service attains the best performance.
Abstract: With the increasing volume of data to be analysed, one of the challenges in Service Oriented Architecture (SOA) is to make web services efficient in processing large-scale data. Parallel execution and cloud technologies are the keys to speeding up service invocation. In SOA, service providers typically employ policies to limit parallel execution of the services based on arbitrary decisions. In order to attain optimal performance improvement, users need to adapt to the services' policies. A composite service is a combination of several atomic services provided by various providers. To use parallel execution for greater composite service efficiency, the degree of parallelism (DOP) of the composite service needs to be optimized by considering the policies of all atomic services. We propose a model that embeds service policies into formulae to calculate composite service performance. From the calculation, we predict the optimal DOP for the composite service, at which it attains the best performance. Extensive experiments are conducted on real-world translation services. We use several measures such as mean prediction error (MPE), mean absolute deviation (MAD), and tracking signal (TS) to evaluate our model. The analysis results show that our proposed model has good prediction accuracy in identifying optimal DOPs for composite services.
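
The kind of calculation such a model performs can be sketched with a deliberately simplified cost formula (an assumption for illustration, not the paper's formulae): each atomic service only honours parallelism up to its policy limit, so the composite completion time levels off, and the best DOP is the smallest one reaching that minimum.

```python
def composite_time(dop, services):
    """Predicted time of a sequential composition at a given DOP (simplified
    model): each atomic service caps usable parallelism at its policy limit."""
    return sum(s["work"] / min(dop, s["policy_limit"]) + s["overhead"] for s in services)

def best_dop(services, max_dop=32):
    """Return the smallest DOP that attains the minimum predicted time."""
    times = {d: composite_time(d, services) for d in range(1, max_dop + 1)}
    return min(times, key=lambda d: (round(times[d], 9), d))

services = [{"work": 120.0, "policy_limit": 8, "overhead": 1.0},
            {"work": 60.0, "policy_limit": 4, "overhead": 0.5}]
print(best_dop(services))   # 8: beyond that, every provider's policy caps the gain
```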

Proceedings ArticleDOI
26 Jun 2018
TL;DR: Simulation results show that the scheduling strategy that takes into account the degree of parallelism of a job performs better than the other method that does not take into account individual job characteristics.
Abstract: Effective scheduling techniques are very important in distributed systems, as they directly affect the system performance and the utilization of resources. Particularly important is the scheduling of jobs in the case of complex workloads. This paper concentrates on the study of workloads in a distributed system, which consist of parallel jobs of gang type, as well as single-task computationally intensive jobs. Simulation is employed to evaluate the performance of two scheduling techniques for different cases of system load and service time variability. The simulation results show that the scheduling strategy that takes into account the degree of parallelism of a job performs better than the other method that does not take into account individual job characteristics.

Journal ArticleDOI
TL;DR: A false history filtering technique, used by a parallel hardware accelerator to avoid excessive hardware resource cost, is proposed; it detects unnecessary string comparison operations that generate meaningless or unused results.

Journal ArticleDOI
27 Apr 2018
TL;DR: The design and implementation of C-Stream are described; C-Stream is an elastic stream processing engine that varies the degree of parallelism to resolve bottlenecks by both dynamically changing the number of threads used to execute an application and adjusting the number of replicas of data-parallel operators.
Abstract: Stream processing is a computational paradigm for on-the-fly processing of live data. This paradigm lends itself to implementations that can provide high throughput and low latency by taking advantage of various forms of parallelism that are naturally captured by the stream processing model of computation, such as pipeline, task, and data parallelism. In this article, we describe the design and implementation of C-Stream, which is an elastic stream processing engine. C-Stream encompasses three unique properties. First, in contrast to the widely adopted event-based interface for developing streaming operators, C-Stream provides an interface wherein each operator has its own driver loop and relies on data availability application programming interfaces (APIs) to decide when to perform its computations. This self-control-based model significantly simplifies the development of operators that require multiport synchronization. Second, C-Stream contains a dynamic scheduler that manages the multithreaded execution of the operators. The scheduler, which is customizable via plug-ins, enables the execution of the operators as co-routines, using any number of threads. The base scheduler implements back-pressure, provides data availability APIs, and manages preemption and termination handling. Last, C-Stream varies the degree of parallelism to resolve bottlenecks by both dynamically changing the number of threads used to execute an application and adjusting the number of replicas of data-parallel operators. We provide an experimental evaluation of C-Stream. The results show that C-Stream is scalable, highly customizable, and can resolve bottlenecks by dynamically adjusting the level of data parallelism used.
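
The self-control driver-loop model described above can be illustrated with plain Python threads and bounded queues (illustrative only; C-Stream itself is a C++ engine with co-routine scheduling): each operator owns its loop and blocks on data availability instead of being invoked per event.

```python
import queue
import threading

SENTINEL = object()

def operator(in_q, out_q, fn):
    """Driver-loop style operator sketch: the operator decides when to run by
    waiting on data availability, rather than reacting to pushed events."""
    while True:
        item = in_q.get()              # "wait until data is available"
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put(fn(item))

src, mid, sink = queue.Queue(maxsize=8), queue.Queue(maxsize=8), queue.Queue()
threading.Thread(target=operator, args=(src, mid, lambda x: x * 2), daemon=True).start()
threading.Thread(target=operator, args=(mid, sink, lambda x: x + 1), daemon=True).start()

for x in range(5):
    src.put(x)                         # bounded queues give natural back-pressure
src.put(SENTINEL)
results = []
while (r := sink.get()) is not SENTINEL:
    results.append(r)
print(results)                         # [1, 3, 5, 7, 9]
```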

Proceedings ArticleDOI
10 Jul 2018
TL;DR: Experimental results show that C-Storm offers a significant 4.7x speedup over a commonly used sequential baseline and that a higher degree of parallelism leads to better performance.
Abstract: This paper presents a novel parallel platform, C-Storm (Copula-based Storm), for the computationally complex problem of fusing heterogeneous data streams for inference. C-Storm is designed by marrying copula-based dependence modeling, for highly accurate inference, with Storm, a highly-regarded parallel computing platform for fast stream data processing. C-Storm has the following desirable features: 1) C-Storm offers fast inference responses. 2) C-Storm provides high inference accuracy. 3) C-Storm is a general-purpose inference platform that can support data fusion applications. 4) C-Storm is easy to use, and its users do not need deep knowledge of Storm or copula theory. We implemented C-Storm based on Apache Storm 1.0.2 and conducted extensive experiments using a typical data fusion application. Experimental results show that C-Storm offers a significant 4.7x speedup over a commonly used sequential baseline and that a higher degree of parallelism leads to better performance.

01 Jan 2018
TL;DR: An FPGA-based odd-even merge sorter is presented that features a throughput of 27.18 GB/s when merging 4 streams and, thanks to its high degree of parallelism, maintains stable throughput as the number of input streams increases.
Abstract: As database systems have shifted from disk-based to in-memory and the scale of databases in big data analysis has increased significantly, workloads analyzing huge datasets are growing. Adopting FPGAs as hardware accelerators improves flexibility, parallelism, and power efficiency versus CPU-only systems. The accelerators are also required to keep up with the high memory bandwidth provided by advanced memory technologies and new interconnect interfaces. Sorting is the most fundamental database operation. In multiple-pass merge sorting, the final pass of the merge operation requires significant throughput performance to keep up with the high memory bandwidth. We study the state-of-the-art hardware-based sorters and present an analysis of our own design. In this thesis, we present an FPGA-based odd-even merge sorter which features a throughput of 27.18 GB/s when merging 4 streams. Our design also presents stable throughput performance when the number of input streams is increased, due to its high degree of parallelism. Thanks to such a generic design, the odd-even merge sorter does not suffer a throughput drop for skewed data distributions and presents constant performance over various kinds of input distributions.
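
A small software model of the merge network helps explain the stable, data-independent throughput (the thesis implements this in hardware): Batcher's odd-even merge recursively merges the even- and odd-indexed sub-sequences and finishes with one layer of compare-exchange operations, and all comparators at a level can operate in parallel.

```python
def odd_even_merge(x):
    """Batcher odd-even merge (software model of the hardware network): `x` is a
    list whose two halves are each sorted and whose length is a power of two.
    The comparators at each level are data-independent, which is the source of
    the high degree of parallelism when laid out as a hardware network."""
    n = len(x)
    if n <= 2:
        return sorted(x)
    evens = odd_even_merge(x[0::2])        # even-indexed elements of both halves
    odds = odd_even_merge(x[1::2])         # odd-indexed elements of both halves
    merged = [None] * n
    merged[0::2], merged[1::2] = evens, odds
    for i in range(1, n - 1, 2):           # final compare-exchange layer
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

print(odd_even_merge([1, 4, 6, 7, 2, 3, 5, 8]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```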

Posted Content
TL;DR: A modularized graph processing framework is proposed that focuses on the whole execution procedure, whose stages exhibit extremely different degrees of parallelism, and a novel conversion dispatcher is designed to switch the processing module at the corresponding exchange points.
Abstract: Highly parallel frameworks have proved to be well suited to graph processing, and there is a variety of work on optimizing their implementation on FPGAs, which are pipeline-parallel devices. The key to exploiting the parallel performance of FPGAs is to process graph data in a pipelined model and to take advantage of on-chip memory to realize the necessary locality. This paper proposes a modularized graph processing framework that focuses on the whole execution procedure, whose stages exhibit extremely different degrees of parallelism. The framework makes three contributions. First, the combination of vertex-centric and edge-centric processing can be adjusted during execution to accommodate both top-down and bottom-up algorithms. Second, to suit the pipeline parallelism and finite on-chip memory of the accelerator, a novel edge-block (a block consisting of edges and their vertices) optimizes the way on-chip memory is used to group edges and stream the edges of a block, realizing a streaming pattern suited to pipeline-parallel processing. Third, based on an analysis of the block structure of natural graphs and of the execution characteristics observed during graph processing, we design a novel conversion dispatcher that switches the processing module at the corresponding exchange points.
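
The switch between top-down and bottom-up processing can be illustrated with the classic direction-optimizing BFS heuristic (a software sketch; the framework above performs this switch across FPGA modules, and the frontier-size threshold here is an assumed heuristic, not taken from the paper).

```python
def direction_optimizing_bfs(adj, source, alpha=0.3):
    """Software sketch of a conversion dispatcher switching between a top-down
    (push) and a bottom-up (pull) traversal module: push while the frontier is
    small, pull once it covers more than `alpha` of the vertices."""
    n = len(adj)
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        if len(frontier) < alpha * n:                 # top-down: expand the frontier
            nxt = {u for v in frontier for u in adj[v] if u not in level}
        else:                                         # bottom-up: unvisited pull from frontier
            nxt = {u for u in adj if u not in level
                   and any(w in frontier for w in adj[u])}
        for u in nxt:
            level[u] = depth
        frontier = nxt
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(direction_optimizing_bfs(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```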