
Showing papers on "Degree of parallelism published in 2018"


Journal ArticleDOI
TL;DR: The state-of-the-art research on GPU-based graph processing is summarized, the existing challenges are analyzed in detail, and the research opportunities for the future are explored.
Abstract: In the big data era, much real-world data can be naturally represented as graphs. Consequently, many application domains can be modeled as graph processing. Graph processing, especially the processing of large-scale graphs with vertices and edges in the order of billions or even hundreds of billions, has attracted much attention in both industry and academia. It still remains a great challenge to process such large-scale graphs, and researchers have been seeking new possible solutions. Because of the massive degree of parallelism and the high memory access bandwidth of GPUs, utilizing GPUs to accelerate graph processing proves to be a promising solution. This article surveys the key issues of graph processing on GPUs, including data layout, memory access pattern, workload mapping, and specific GPU programming. In this article, we summarize the state-of-the-art research on GPU-based graph processing, analyze the existing challenges in detail, and explore the research opportunities for the future.

129 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work revisits the iconic algorithm of Chiba and Nishizeki and develops the most efficient parallel algorithm for listing all k-cliques in graphs containing up to tens of millions of edges, which is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism.
Abstract: Motivated by recent studies in the data mining community which require to efficiently list all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for such a problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges as well as all $10$-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs.
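
The core recursion behind Chiba–Nishizeki-style k-clique listing is compact enough to sketch. Below is a minimal, single-machine Python illustration (not the authors' parallel implementation): edges are oriented along a degree-based ordering and candidate sets are intersected recursively; as in the paper, parallelism would come from running the independent top-level calls concurrently.

```python
def list_k_cliques(adj, k):
    """Yield all k-cliques of an undirected graph given as {vertex: set(neighbours)}.

    Sketch of the Chiba–Nishizeki-style recursion: orient edges from lower to
    higher rank (here: by degree, then id) and recursively intersect
    out-neighbourhoods. The per-vertex top-level calls are independent, which
    is where a parallel implementation splits the work.
    """
    order = {v: i for i, v in enumerate(sorted(adj, key=lambda v: (len(adj[v]), v)))}
    out = {v: {u for u in adj[v] if order[u] > order[v]} for v in adj}

    def expand(clique, candidates, depth):
        if depth == k:
            yield tuple(clique)
            return
        for v in list(candidates):
            yield from expand(clique + [v], candidates & out[v], depth + 1)

    for v in adj:                      # top-level loop: embarrassingly parallel
        yield from expand([v], out[v], 1)

# Example: triangles (3-cliques) of a small graph
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(sorted(list_k_cliques(adj, 3)))   # the two triangles {1,2,3} and {1,3,4}
```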

109 citations


Journal ArticleDOI
TL;DR: A dynamic partial-parallel data layout (DPPDL) is proposed for green video surveillance storage, which dynamically allocates the storage space with an appropriate degree of partial parallelism according to performance requirement.
Abstract: Video surveillance requires storing massive amounts of video data, which results in rapidly increasing storage energy consumption. With the popularization of video surveillance, green storage for video surveillance is very attractive. The existing energy-saving methods for massive storage mostly concentrate on data centers, mainly with random access, whereas the storage of video surveillance has inherent workload characteristics and access patterns, which can be fully exploited to save more energy. A dynamic partial-parallel data layout (DPPDL) is proposed for green video surveillance storage. It adopts a dynamic partial-parallel strategy, which dynamically allocates the storage space with an appropriate degree of partial parallelism according to the performance requirement. Partial parallelism benefits energy conservation by scheduling only a subset of the disks to work; a dynamic degree of parallelism can provide appropriate performance for workloads of various intensities. DPPDL is evaluated with a simulated video surveillance system consisting of 60–300 cameras with $1920 \times 1080$ pixels. The experiments show that DPPDL is the most energy efficient, while tolerating single disk failure and providing more than 20% performance margin. On average, it saves 7%, 19%, 31%, 36%, 56%, and 59% more energy than CacheRAID, Semi-RAID, Hibernator, MAID, eRAID5, and PARAID, respectively.
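
The core decision in a partial-parallel layout can be sketched in a few lines. The sketch below is illustrative only (the parameter names and the 20% margin are assumptions, not the paper's model): it picks the smallest number of active disks whose aggregate bandwidth covers the surveillance workload, so the remaining disks can stay in standby.

```python
import math

def partial_parallelism(cameras, mb_per_s_per_camera, per_disk_write_mb_per_s,
                        margin=0.2, total_disks=12):
    """Illustrative sketch of DPPDL's core decision (not the paper's exact model):
    choose the smallest degree of partial parallelism (number of active disks)
    whose aggregate write bandwidth covers the workload plus a safety margin."""
    required = cameras * mb_per_s_per_camera * (1 + margin)        # MB/s to absorb
    dop = min(total_disks, max(1, math.ceil(required / per_disk_write_mb_per_s)))
    return dop

# e.g. 300 cameras writing 0.5 MB/s each, disks sustaining 80 MB/s sequential writes
print(partial_parallelism(cameras=300, mb_per_s_per_camera=0.5,
                          per_disk_write_mb_per_s=80))   # 3 active disks suffice
```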

46 citations


Journal ArticleDOI
TL;DR: This work proposes a light-weight asynchronous processing framework called Frog with a preprocessing/hybrid coloring model based on the Pareto principle about coloring algorithms, and finds that a majority of vertices are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency.
Abstract: GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, the coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weight asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on the Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of real-world graph coloring cases. We find that a majority of vertices (about 80 percent) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk, and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph, and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP, and CC). On all the tested applications and datasets, Frog is able to significantly outperform existing GPU-based graph processing systems except Gunrock and MapGraph. MapGraph gets better performance than Frog when running BFS on RoadNet-CA. The comparison between Gunrock and Frog is inconclusive. Frog can outperform Gunrock more than 1.04X when running PageRank and SSSP, while the advantage of Frog is not obvious when running BFS and CC on some datasets especially for RoadNet-CA.
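
The hybrid-coloring idea can be illustrated with a small single-machine sketch (Frog itself targets GPUs; CPU threads and the 5% cutoff below are stand-in assumptions): colour the graph greedily, process each large colour class as one conflict-free parallel pass, and lump the long tail of tiny classes into a final hybrid pass.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def greedy_coloring(adj):
    """Assign each vertex the smallest colour not used by its neighbours."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def hybrid_passes(adj, small_class_cutoff=0.05):
    """Frog-style grouping (sketch): each large colour class is one conflict-free
    parallel pass (no two vertices in a class share an edge); the tail of tiny
    classes is merged into one final hybrid pass, which on a real GPU would need
    locking or atomics."""
    color = greedy_coloring(adj)
    classes = defaultdict(list)
    for v, c in color.items():
        classes[c].append(v)
    n = len(adj)
    big = [vs for vs in classes.values() if len(vs) >= small_class_cutoff * n]
    tail = [v for vs in classes.values() if len(vs) < small_class_cutoff * n for v in vs]
    return big + ([tail] if tail else [])

def run(adj, update):
    """Apply a per-vertex update pass by pass, with full parallelism inside a pass."""
    for group in hybrid_passes(adj):
        with ThreadPoolExecutor() as pool:
            list(pool.map(update, group))
```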

38 citations


Journal ArticleDOI
TL;DR: Experimental results conclusively demonstrate that the proposed E-Stream provides better system response time and application fairness compared to the existing Storm framework.
Abstract: Online scheduling plays a key role for big data streaming applications in a big data stream computing environment, as the arrival rate of a high-velocity continuous data stream might fluctuate over time. In this paper, an elastic online scheduling framework for big data streaming applications (E-Stream) is proposed, exhibiting the following features. (1) Profile mathematical relationships between system response time, multiple-application fairness, and the online features of the high-velocity continuous stream. (2) Scale out or scale in a data stream graph by quantifying the computation and communication cost and the vertex semantics for the arrival rate of the data stream, and adjust the degree of parallelism of the vertices in the graph; subgraphs are further constructed to minimize data dependencies among them. (3) Elastically schedule a graph by a priority-based earliest-finish-time-first online scheduling strategy, and schedule multiple graphs by a max–min fairness strategy. (4) Evaluate the low system response time and acceptable application fairness objectives in a real-world big data stream computing environment. Experimental results conclusively demonstrate that the proposed E-Stream provides better system response time and application fairness compared to the existing Storm framework.
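
The scale-out/scale-in decision for a single operator can be sketched as a utilization controller. The policy below (target utilization, hysteresis band, one-step scale-in) is an assumed illustration, not E-Stream's formulae.

```python
import math

def adjust_parallelism(current, arrival_rate, service_rate_per_replica,
                       target_util=0.7, band=0.15):
    """Sketch of elastic DOP adjustment for one stream operator (an assumed
    policy, not E-Stream's exact formulae): keep per-replica utilization near
    a target, scaling out when overloaded and scaling in when under-loaded."""
    util = arrival_rate / (current * service_rate_per_replica)
    if util > target_util + band:            # overloaded: jump to the needed DOP
        return math.ceil(arrival_rate / (target_util * service_rate_per_replica))
    if util < target_util - band and current > 1:
        return current - 1                   # under-loaded: shed one replica
    return current                           # inside the hysteresis band: no change

print(adjust_parallelism(current=2, arrival_rate=1800, service_rate_per_replica=500))  # 6
```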

33 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this article, causal video understanding models are proposed to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles by using operation pipelining and multi-rate clocks.
Abstract: We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. The models are still very deep, with dozens of such operations being performed but in a pipelined fashion that enables depth-parallel computation. We illustrate the proposed principles by applying them to existing image architectures and analyse their behaviour on two video tasks: action recognition and human keypoint localisation. The results show that a significant degree of parallelism, and implicitly speedup, can be achieved with little loss in performance.
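
The pipelining idea can be made concrete with a tiny sketch (illustrative, not the paper's architectures): at timestep t, stage i consumes what stage i-1 produced at t-1, so every stage can run in parallel on a different frame and each timestep costs only one layer of new computation per stage, with outputs delayed by the pipeline depth.

```python
def pipelined_run(layers, frames):
    """Depth-parallel pipeline sketch: `layers` is a list of functions; at each
    timestep, stage i processes the value stage i-1 emitted one step earlier.
    Per timestep, each stage does one unit of work, and the output for a frame
    appears len(layers) steps after it enters."""
    depth = len(layers)
    stages = [None] * depth            # activation currently held by each stage
    outputs = []
    for frame in list(frames) + [None] * depth:        # extra steps drain the pipe
        if stages[-1] is not None:
            outputs.append(stages[-1])
        for i in range(depth - 1, 0, -1):              # shift back-to-front
            stages[i] = layers[i](stages[i - 1]) if stages[i - 1] is not None else None
        stages[0] = layers[0](frame) if frame is not None else None
    return outputs

# toy example: three "layers", each adds 1; outputs lag the input by 3 steps
print(pipelined_run([lambda x: x + 1] * 3, range(5)))  # [3, 4, 5, 6, 7]
```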

31 citations


Journal ArticleDOI
TL;DR: An efficient GPU-based parallel EMT simulator is designed that significantly accelerates EMT simulations compared with a CPU-based program, and code automation tools improve computational efficiency by substantially reducing addressing and memory access.
Abstract: Electromagnetic transients (EMT) simulation is the most accurate and intensive computation for power systems. Past research has shown the potential of accelerating such simulations using graphics processing units (GPUs). In this paper, an efficient GPU-based parallel EMT simulator is designed. Thread-oriented model transformations are first proposed for the electrical and control systems. Following the transformations, the electrical system is represented by connected networks of massive primitive electrical elements, the computations of which can be constructed as massive fused multiply-add operations and solutions to a linear equation. The control systems are represented by a layered directed acyclic graph with primitive control elements that can be dealt with using single-instruction-multiple-threads groups. Finally, code automation tools are designed to form the GPU kernels. Compared with past work, the proposed model transformations improve the degree of parallelism. Most importantly, the code automation tools improve computational efficiency by substantially reducing addressing and memory access, and render the implementation of the algorithm more general and convenient. Test systems of different sizes were created by connecting multiple IEEE 33-bus distribution systems and adding distributed generators. Simulations were performed on NVIDIA’s K20X and P100 cards. The results indicate that the proposed method significantly accelerates EMT simulations compared with a CPU-based program. Real-time performance was also achieved under certain conditions.
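
The "massive fused multiply-add" view of primitive elements can be illustrated with the standard trapezoidal companion model (a generic textbook sketch, not the paper's kernels): an inductor becomes a conductance G = dt/(2L) in parallel with a history current updated each step by one FMA per element, which maps naturally onto one GPU thread per element.

```python
import numpy as np

# Illustrative, vectorized history-current update for many inductors at once.
n, dt = 100_000, 50e-6
L = np.random.uniform(1e-3, 1e-2, n)      # element parameters
G = dt / (2.0 * L)                        # companion-model conductances
v = np.random.randn(n)                    # branch voltages from the network solve
i = G * v                                 # branch currents (history assumed zero here)
I_hist_next = i + G * v                   # the per-element FMA performed every time step
```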

31 citations


Book ChapterDOI
01 Jan 2018
TL;DR: In this paper, a possible theoretical explanation for the somewhat surprising empirical success of deep networks is provided.
Abstract: In the past, the most widely used neural networks were 3-layer ones. These networks were preferred, since one of the main advantages of the biological neural networks—which motivated the use of neural networks in computing—is their parallelism, and 3-layer networks provide the largest degree of parallelism. Recently, however, it was empirically shown that, in spite of this argument, multi-layer (“deep”) neural networks lead to much more efficient machine learning. In this paper, we provide a possible theoretical explanation for the somewhat surprising empirical success of deep networks.

29 citations


Journal ArticleDOI
TL;DR: A new distributed MapReduce prototype generation method called CHI-PG is introduced that provides linear O(N) time complexity, ensures constant accuracy regardless of the degree of parallelism, and is shown to be a candidate solution to the time and memory constraints of k-Nearest Neighbors when tackling large-scale datasets.

19 citations


Journal ArticleDOI
TL;DR: This paper defines a resource allocation model, a parallelism degree model, and an allocation fitness model on the basis of a theoretical analysis of the Spark architecture, and proposes an easy-to-apply strategy embedded in the evaluation model.
Abstract: With the emergence of the big data era, most current performance optimization strategies are mainly used in distributed computing frameworks with disks as the underlying storage. They may solve the problems of traditional disk-based distribution, but they are hard to transplant and are not well suited to performance optimization for an in-memory computing framework, on account of the different underlying storage and computation architecture. In this paper, we first give the definitions of the resource allocation model, the parallelism degree model, and the allocation fitness model on the basis of a theoretical analysis of the Spark architecture. Second, based on the presented models, we propose an easy-to-apply strategy embedded in the evaluation model. The optimization strategy assigns subsequent tasks to a worker with a lower load that satisfies the requirements, while a worker with a higher load may not be assigned tasks. Experiments consisting of four different jobs are conducted to verify the effectiveness of the presented strategy.
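
The load-aware placement described above can be sketched as a simple greedy assignment. The field names (free_mem, free_cores, load) and the load proxy are illustrative assumptions, not the paper's fitness model.

```python
def assign_tasks(tasks, workers):
    """Sketch of load-aware placement: each task goes to the least-loaded worker
    that still has enough free memory and cores; heavily loaded workers simply
    receive no further tasks."""
    placement = {}
    for task in tasks:
        candidates = [w for w in workers
                      if w["free_mem"] >= task["mem"] and w["free_cores"] >= task["cores"]]
        if not candidates:
            raise RuntimeError("no worker satisfies the task's requirements")
        w = min(candidates, key=lambda w: w["load"])
        w["free_mem"] -= task["mem"]
        w["free_cores"] -= task["cores"]
        w["load"] += task["cores"]              # simple proxy for the added load
        placement[task["id"]] = w["id"]
    return placement

workers = [{"id": "w1", "load": 4, "free_mem": 8, "free_cores": 4},
           {"id": "w2", "load": 1, "free_mem": 16, "free_cores": 8}]
tasks = [{"id": t, "mem": 2, "cores": 1} for t in range(3)]
print(assign_tasks(tasks, workers))   # all three tasks land on the lightly loaded w2
```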

16 citations


Journal ArticleDOI
Jie Yang1, Yongxing Yang1, Zhe Chen1, Liyuan Liu1, Jian Liu1, Nanjian Wu1 
TL;DR: The proposed heterogeneous parallel processor introduces a new degree of parallelism, namely patch parallelism, for parallel local-feature extraction and feature detection, and can flexibly perform state-of-the-art computer vision as well as various image processing algorithms at high speed.
Abstract: This paper proposes a heterogeneous parallel processor for high-speed vision chip. It contains four levels of processors with different parallelisms and complexities: processing element (PE) array processor, patch processing unit (PPU) array processor, self-organizing map (SOM) neural network processor, and dual-core microprocessor unit (MPU). The fine-grained PE array processor, middle-grained PPU array processor, and SOM neural network processor carry out image processing in pixel-parallel, patch-parallel, and distributed-parallel fashions, respectively. The MPU controls the overall system and executes some serial algorithms. The processor can improve the total system performance from low-level to high-level image processing significantly. A prototype is implemented with $64 \times 64$ PE array, $8 \times 8$ PPU array, $16 \times 24$ SOM network, and a dual-core MPU. The proposed heterogeneous parallel processor introduces a new degree of parallelism, namely, patch parallel, which is for parallel local-feature extraction and feature detection. It can flexibly perform the state-of-the-art computer vision as well as various image processing algorithms at high speed. Various complicated applications, including feature extraction, face detection, and high-speed tracking, are demonstrated.
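
A software analogue of the chip's patch-parallel level is easy to sketch (illustrative only; the PPU array does this in hardware): the image is split into an 8x8 grid of patches and one local feature is extracted per patch, with every patch independent of the others.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def patch_features(image, grid=(8, 8), feature=np.mean):
    """Split the image into a grid of patches and extract one local feature per
    patch; patches are independent, so the grid maps onto an array of patch
    processors working in parallel."""
    h, w = image.shape
    ph, pw = h // grid[0], w // grid[1]
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid[0]) for c in range(grid[1])]
    with ThreadPoolExecutor() as pool:
        feats = list(pool.map(feature, patches))
    return np.array(feats).reshape(grid)

print(patch_features(np.random.rand(64, 64)).shape)   # (8, 8) feature map
```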

Journal ArticleDOI
TL;DR: An efficient Lagrangian Relaxation based Parallel Routing Optimization Algorithm (LR-PROA) is developed to speed up the routing optimization process in large networks by utilizing the massive parallel computation capability of GPU.

Proceedings ArticleDOI
28 Apr 2018
TL;DR: The proposed 8-bit fixed-point parallel multiply-accumulate (MAC) unit architecture aims to provide a fully customized MAC unit for Convolutional Neural Networks (CNNs) instead of depending on the conventional DSP blocks and embedded memory units of the FPGA silicon fabric.
Abstract: Deep neural network algorithms have proven their enormous capabilities in a wide range of artificial intelligence applications, especially in printed/handwritten text recognition, multimedia processing, robotics, and many other high-end technological trends. The most challenging aspect nowadays is overcoming the extreme computational processing demands of applying such algorithms, especially in real-time systems. Recently, the Field Programmable Gate Array (FPGA) has been considered one of the optimal hardware accelerator platforms for accelerating deep neural network architectures due to its large adaptability and the high degree of parallelism it offers. In this paper, the proposed 8-bit fixed-point parallel multiply-accumulate (MAC) unit architecture aims to provide a fully customized MAC unit for Convolutional Neural Networks (CNNs) instead of depending on the conventional DSP blocks and embedded memory units of the FPGA silicon fabric. The proposed 8-bit fixed-point parallel MAC unit architecture is designed using the VHDL language and can perform computations at up to 4.17 Giga Operations per Second (GOPS) using high-density FPGAs.
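
A bit-accurate software model of one 8-bit fixed-point MAC step is shown below (the accumulator width and saturation behaviour are assumptions for illustration, not the paper's RTL): signed 8-bit operands are multiplied and added into a wider saturating accumulator, which is the operation the proposed unit replicates in parallel on the FPGA fabric.

```python
def sat(x, bits):
    """Two's-complement saturation to a signed `bits`-wide value."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def mac8(acc, a, b, acc_bits=24):
    """One 8-bit fixed-point multiply-accumulate step (illustrative model):
    the 16-bit product of two signed 8-bit operands is added into a wider
    accumulator, saturating at `acc_bits` bits."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    return sat(acc + a * b, acc_bits)

# dot product of two int8 vectors, as a CNN convolution kernel would compute it
weights = [12, -45, 90, 127]
inputs = [-3, 77, 5, -128]
acc = 0
for w, x in zip(weights, inputs):
    acc = mac8(acc, w, x)
print(acc)   # -19307, which fits comfortably in the 24-bit accumulator
```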

Proceedings ArticleDOI
Xiulin Li1, Shijun Liu1, Li Pan1, Yuliang Shi1, Xiangxu Meng1 
02 Jul 2018
TL;DR: A novel tandem queuing network with a parallel multi-station multi-server system as an analytical model for service clouds serving composite service application jobs containing parallelizable tasks is described.
Abstract: Performance analysis is important for service clouds serving composite service application jobs that contain parallelizable tasks, since optimizing the degree of parallelism (DOP) and the resource allocation scheme can improve performance considerably. In this paper, we describe a novel tandem queuing network with a parallel multi-station multi-server system as an analytical model for service clouds serving composite service application jobs. We design a partition method (termed the 'pleasing partition') that helps us propose an analytical model for parallelizable services, which are the vital fraction of composite services. From this model, we obtain the complete probability distribution of the response time, the waiting time, and other important performance metrics. Thus, using this model, cloud operators can determine proper job configurations and resource allocation schemes to achieve a specific QoS (Quality of Service). Extensive simulations are conducted to validate that our analytical model has high accuracy in predicting performance metrics of composite service application jobs.
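
The paper's tandem parallel multi-station model is richer than can be reproduced here; as a stand-in, the sketch below uses the classical Erlang-C (M/M/c) formula for a single station to show how mean response time reacts to the number of parallel servers, which is the DOP knob the analytical model optimizes.

```python
from math import factorial

def mm_c_response_time(lam, mu, c):
    """Mean response time of an M/M/c queue via the Erlang-C formula; used here
    only as a stand-in for one station of a tandem parallel model."""
    rho = lam / (c * mu)
    if rho >= 1:
        return float("inf")               # unstable: arrivals exceed capacity
    a = lam / mu                          # offered load
    erlang_c = (a**c / (factorial(c) * (1 - rho))) / (
        sum(a**k / factorial(k) for k in range(c)) + a**c / (factorial(c) * (1 - rho)))
    wq = erlang_c / (c * mu - lam)        # mean waiting time in queue
    return wq + 1 / mu                    # plus the service time itself

# response time shrinks, with diminishing returns, as the DOP (c) grows
for c in (4, 6, 8):
    print(c, round(mm_c_response_time(lam=6.0, mu=2.0, c=c), 3))
```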

Proceedings ArticleDOI
01 Dec 2018
TL;DR: A case study was conducted to evaluate two unsupervised machine learning algorithms for this purpose; results showed that the distributed versions can achieve the same accuracy and provide a performance improvement by orders of magnitude when compared to their centralized versions.
Abstract: Anomaly detection is a valuable feature for detecting and diagnosing faults in large-scale, distributed systems. These systems usually provide tens of millions of lines of logs that can be exploited for this purpose. However, centralized implementations of traditional machine learning algorithms fall short to analyze this data in a scalable manner. One way to address this challenge is to employ distributed systems to analyze the immense amount of logs generated by other distributed systems. We conducted a case study to evaluate two unsupervised machine learning algorithms for this purpose on a benchmark dataset. In particular, we evaluated distributed implementations of PCA and K-means algorithms. We compared the accuracy and performance of these algorithms both with respect to each other and with respect to their centralized implementations. Results showed that the distributed versions can achieve the same accuracy and provide a performance improvement by orders of magnitude when compared to their centralized versions. The performance of PCA turns out to be better than K-means, although we observed that the difference between the two tends to decrease as the degree of parallelism increases.
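
A minimal sketch of the two distributed algorithms is given below, assuming Spark MLlib as the distributed implementation (the abstract does not name the framework) and using hypothetical feature columns derived from log event counts.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("log-anomaly-sketch").getOrCreate()
df = spark.read.parquet("hdfs:///logs/event_count_matrix")   # hypothetical path

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
features = assembler.transform(df)

# distributed PCA: poor reconstruction in the reduced space flags anomalous log blocks
pca_model = PCA(k=10, inputCol="features", outputCol="pca").fit(features)
projected = pca_model.transform(features)

# distributed K-means: points far from their cluster centre flag anomalous log blocks
kmeans_model = KMeans(k=8, featuresCol="features", predictionCol="cluster").fit(features)
clustered = kmeans_model.transform(features)
```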

Journal ArticleDOI
TL;DR: This study introduces a novel approach, called parallelism reservation, which is inspired by the rich internal parallelism of NVMe SSDs, to reserve sufficient degrees of parallelism for read, write, and garbage collection operations, making sure that an NVMe SSD delivers stable read and write throughput and reclaims free space at a constant rate.
Abstract: Non-Volatile Memory Express (NVMe) is a specification for next-generation solid-state disks (SSDs). Benefiting from the massive internal parallelism and the high-speed PCIe bus, NVMe SSDs achieve extremely high data transfer rates, and they are an ideal solution for shared storage in virtualization environments. Providing virtual machines with Service Level Objective (SLO) compliance on NVMe SSDs is a challenging task, because garbage collection activities inside NVMe SSDs globally affect the I/O performance of all virtual machines. In this study, we introduce a novel approach, called parallelism reservation, which is inspired by the rich internal parallelism of NVMe SSDs. The degree of parallelism stands for how many flash chips are concurrently active. Our basic idea is to reserve sufficient degrees of parallelism for read, write, and garbage collection operations, making sure that an NVMe SSD delivers stable read and write throughput and reclaims free space at a constant rate. The stable read and write throughput are proportionally distributed among virtual machines for SLO compliance. Our experimental results show that our parallelism reservation approach delivers satisfactory throughput and highly predictable responses to virtual machines.
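
The reservation idea can be sketched in a few lines (the shares and weights below are illustrative, not the paper's policy): the SSD's flash chips are divided into fixed groups for reads, writes, and garbage collection, and the write share is split among virtual machines in proportion to their SLO weights.

```python
def reserve_parallelism(total_chips, shares, vm_slo_weights):
    """Sketch of parallelism reservation: dedicate fixed numbers of flash chips
    to reads, writes, and GC so each activity sees stable bandwidth, then split
    the write chips among VMs proportionally to their SLO weights."""
    reserved = {k: int(total_chips * f) for k, f in shares.items()}
    reserved["gc"] += total_chips - sum(reserved.values())   # give any remainder to GC
    total_w = sum(vm_slo_weights.values())
    per_vm = {vm: reserved["write"] * w / total_w for vm, w in vm_slo_weights.items()}
    return reserved, per_vm

reserved, per_vm = reserve_parallelism(
    total_chips=32,
    shares={"read": 0.5, "write": 0.3, "gc": 0.2},
    vm_slo_weights={"vm1": 2, "vm2": 1, "vm3": 1})
print(reserved)   # {'read': 16, 'write': 9, 'gc': 7}
print(per_vm)     # write-chip share per VM, proportional to its SLO weight
```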

Proceedings ArticleDOI
01 Aug 2018
TL;DR: FPGA is used to accelerate the Spark tasks developed with Python, and in this way, the main computing load is performed on FPGA instead of CPU.
Abstract: Apache Spark is an efficient distributed computing framework for big data processing. It supports in-memory computation on RDDs (Resilient Distributed Datasets) and provides reusability, fault tolerance, and real-time stream processing. However, the tasks in the Spark framework are only performed on CPUs. The low degree of parallelism and power inefficiency of CPUs may restrict the performance and scalability of the cluster. In order to improve the performance and power dissipation of the data center, heterogeneous accelerators such as FPGAs, GPUs, and MICs (Many Integrated Core) exhibit more efficient performance than general-purpose processors in big data processing. In this work, we propose a framework to integrate an FPGA accelerator into a Spark cluster. We use the FPGA to accelerate Spark tasks developed with Python, and in this way, the main computing load is performed on the FPGA instead of the CPU. We illustrate the performance of the FPGA-based Spark framework with a case study of 2D-FFT algorithm acceleration. The results show that the FPGA-based Spark implementation achieves a 1.79x speedup over the CPU implementation.
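
A CPU-only sketch of how the 2D-FFT case study could be expressed as Python Spark tasks is shown below; numpy.fft.fft2 stands in for the FPGA kernel the paper offloads to, and the data layout (one image per record, eight partitions) is an assumption for illustration.

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fft2d-sketch").getOrCreate()
sc = spark.sparkContext

images = sc.parallelize([np.random.rand(512, 512) for _ in range(64)], numSlices=8)

def fft_partition(records):
    # In the paper's framework this body would hand the batch to the FPGA
    # accelerator; here it simply runs the transform on the CPU.
    for img in records:
        yield np.fft.fft2(img)

spectra = images.mapPartitions(fft_partition)
print(spectra.map(lambda s: float(np.abs(s).max())).reduce(max))
```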

Journal ArticleDOI
TL;DR: pSAIS, a parallel variant of SAIS designed for a multicore machine treated as a shared-memory parallel model, has a high degree of parallelism and achieves the best average time and space performance among all the parallel algorithms in comparison.
Abstract: Sorting the suffixes of an input string is a fundamental task in many applications such as data compression, genome alignment, and full-text search. The induced sorting (IS) method has been successfully applied to design a number of state-of-the-art suffix sorting algorithms. In particular, the SAIS algorithm designed by the IS method is not only linear in time but also fast in practice. However, the parallelization of SAIS remains a challenge because the IS process in the algorithm is inherently sequential and is the performance bottleneck of the whole algorithm. This article presents our attempt to design a parallel variant of SAIS, called pSAIS, on a multicore machine treated as a shared-memory parallel model. By a combined use of multithreading and pipelining, the inducing process is accelerated by fully utilizing the machine's parallel computing power. An experimental study is conducted to evaluate the performance of pSAIS against the other existing parallel suffix sorting algorithms, on a set of realistic inputs with varying sizes and alphabets. The experiment results show that our program for pSAIS has a high degree of parallelism and achieves the best average time and space performance among all the parallel algorithms in comparison. While pSAIS is designed for quickly building big suffix arrays on a multicore machine, our study may give some hints for extending the induced sorting method to GPUs for constructing small suffix arrays even faster.

Proceedings ArticleDOI
14 May 2018
TL;DR: This paper analyzes the relationship between synchronization cost and event efficiency, and first looks at how these two characteristics are coupled via the computation of Global Virtual Time (GVT), then introduces dynamic load balancing, and shows how this can achieve higher efficiency with less synchronization cost.
Abstract: Parallel Discrete Event Simulations (PDES) running at large scales involve the coordination of billions of very fine grain events distributed across a large number of processes. At such large scales optimistic synchronization protocols, such as TimeWarp, allow for a high degree of parallelism between processes, but with the additional complexity of managing event rollback and cancellation. This can become especially problematic in models that exhibit imbalance resulting in low event efficiency, which increases the total amount of work required to run a simulation to completion. Managing this complexity becomes key to achieving a high degree of performance across a wide range of models. In this paper, we address this issue by analyzing the relationship between synchronization cost and event efficiency. We first look at how these two characteristics are coupled via the computation of Global Virtual Time (GVT). We then introduce dynamic load balancing, and show how, when combined with low overhead GVT computation, we can achieve higher efficiency with less synchronization cost. In doing so, we achieve up to 2x better performance on a variety of benchmarks and models of practical importance.
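
The coupling discussed above follows from the definition of GVT: a lower bound on every timestamp that can still be affected by a rollback, i.e. the minimum over each process's local virtual time and the timestamps of messages still in flight. A minimal sketch, with illustrative data structures:

```python
def compute_gvt(local_virtual_times, in_flight_timestamps):
    """GVT lower-bounds every timestamp that can still roll back: the minimum
    over each process's local virtual time and the timestamps of all messages
    sent but not yet processed. Events strictly below GVT are safe to commit
    and their saved state can be reclaimed (fossil collection)."""
    return min(min(local_virtual_times),
               min(in_flight_timestamps, default=float("inf")))

# three optimistic processes, one straggler message still in the network
print(compute_gvt([105.0, 98.5, 130.2], [97.0]))   # 97.0
```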

Proceedings ArticleDOI
21 May 2018
TL;DR: This paper presents several implementations of Distributed Control, a data-driven unordered approach with work prioritization, and demonstrates that customizable scheduling policies result in the most efficient implementation, outperforming the well-known Δ-stepping Single-Source Shortest Paths and Jones-Plassmann vertex-coloring algorithms.
Abstract: In this paper we explore scheduling and runtime system support for unordered distributed graph computations that rely on optimistic (speculative) execution. Performance of such algorithms is impacted by two competing trends: the higher degree of parallelism enabled by optimistic execution in turn requires substantial runtime support. To address the potentially high overhead and scheduling complexity introduced by the runtime, we investigate customizable scheduling policies that augment the scheduler of the underlying runtime to adapt it to a specific graph application. We present several implementations of Distributed Control (DC), a data-driven unordered approach with work prioritization and demonstrate that customizable scheduling policies result in the most efficient implementation, outperforming the well-known Δ-stepping Single-Source Shortest Paths (SSSP) and Jones-Plassmann vertex-coloring algorithms. We apply two scheduling techniques, flow control and adaptive frequency of network progress, which allow application-level control over the balance of domain work and the runtime work. Experimental results show the benefit of such application-aware scheduling for irregular distributed graph algorithms.
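
A single-node analogue of data-driven, work-prioritized relaxation (the core of the Distributed Control approach; the distributed, unordered execution across ranks is not reproduced here) can be sketched with a priority buffer from which stale work items are simply discarded.

```python
import heapq

def prioritized_relax_sssp(adj, source):
    """Single-node sketch of prioritized, data-driven SSSP relaxation: pending
    (distance, vertex) work items sit in a priority buffer, stale items are
    dropped when popped, and no global ordering is ever enforced."""
    dist = {source: 0.0}
    buffer = [(0.0, source)]
    while buffer:
        d, v = heapq.heappop(buffer)
        if d > dist.get(v, float("inf")):
            continue                       # stale work item: drop it
        for u, w in adj[v]:
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(buffer, (nd, u))
    return dist

adj = {0: [(1, 2.0), (2, 5.0)], 1: [(2, 1.0)], 2: []}
print(prioritized_relax_sssp(adj, 0))   # {0: 0.0, 1: 2.0, 2: 3.0}
```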

Journal ArticleDOI
TL;DR: In this article, the authors present an efficient implementation of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where they achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700.
Abstract: QR factorization is a ubiquitous operation in many engineering and scientific applications. In this paper, we present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where we achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit a higher degree of parallelism, where parallelism is quantified as the number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on the theoretical analysis of classical HT, an opportunity to re-arrange computations in the classical HT is identified that results in Modified HT (MHT), which is shown to exhibit 1.33x higher parallelism than classical HT. Experiments on off-the-shelf multicore and GPGPUs for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in the optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for DLA and show that MHT achieves 1.3x better performance than a native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that, due to the re-arrangement of computations in MHT, the custom realization of MHT is capable of achieving a 12 percent better performance improvement over multicore and GPGPUs than the performance improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs, which is counter-intuitive.
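
For reference, the classical Householder Transform the paper re-arranges is compact; below is a plain numpy sketch of classical HT-based QR (not the Modified HT or the custom data-path realization).

```python
import numpy as np

def householder_qr(A):
    """Classical Householder-Transform QR (reference sketch): each step builds a
    reflector that zeroes the sub-diagonal of one column and applies it to the
    trailing submatrix and to the accumulated Q."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])               # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H_k to R
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q = H_1 ... H_k
    return Q, R

A = np.random.rand(6, 4)
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))   # True True
```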

Book ChapterDOI
27 Aug 2018
TL;DR: This work proposes an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints and experimentally evaluated the implemented strategy and demonstrated its effectiveness on a real-world application.
Abstract: Stream processing applications have become a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers often face a trade-off between coding productivity and performance when introducing parallelism. SPar was created to balance this trade-off for application programmers by using the C++11 attributes' annotation mechanism. In SPar and other programming frameworks for stream processing applications, the manual definition of the number of replicas to be used for the stream operators is a challenge. In addition, low latency is required by several stream processing applications. We noted that explicit latency requirements are poorly considered in state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints. We experimentally evaluated our strategy on a real-world application and showed that it can provide higher abstraction levels while automatically managing the latency.
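
A control loop in the spirit of the strategy described can be sketched as follows. This is an assumed additive-increase policy, not SPar's actual algorithm, and `measure_latency` / `apply_replicas` are hypothetical caller-supplied callbacks.

```python
import time

def adaptive_replicas(measure_latency, apply_replicas, latency_sla_ms,
                      min_r=1, max_r=16, period_s=5.0):
    """Adaptive replica controller sketch: add a replica while the observed
    latency violates the SLA, and remove one when there is ample headroom."""
    replicas = min_r
    apply_replicas(replicas)
    while True:                                     # runs for the application's lifetime
        time.sleep(period_s)
        latency = measure_latency()                 # e.g. recent p95 latency in ms
        if latency > latency_sla_ms and replicas < max_r:
            replicas += 1                           # scale out to cut latency
        elif latency < 0.5 * latency_sla_ms and replicas > min_r:
            replicas -= 1                           # scale in to save resources
        apply_replicas(replicas)
```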

Journal ArticleDOI
TL;DR: A model is proposed that embeds service policies into formulae to calculate composite service performance and predicts the optimal DOP at which the composite service attains the best performance.
Abstract: With the increasing volume of data to be analysed, one of the challenges in Service Oriented Architecture (SOA) is to make web services efficient in processing large-scale data. Parallel execution and cloud technologies are the keys to speeding up service invocation. In SOA, service providers typically employ policies to limit parallel execution of the services based on arbitrary decisions. In order to attain optimal performance improvement, users need to adapt to the services' policies. A composite service is a combination of several atomic services provided by various providers. To use parallel execution for greater composite service efficiency, the degree of parallelism (DOP) of the composite service needs to be optimized by considering the policies of all atomic services. We propose a model that embeds service policies into formulae to calculate composite service performance. From the calculation, we predict the optimal DOP for the composite service, at which it attains the best performance. Extensive experiments are conducted on real-world translation services. We use several measures such as mean prediction error (MPE), mean absolute deviation (MAD), and tracking signal (TS) to evaluate our model. The analysis results show that our proposed model has good prediction accuracy in identifying optimal DOPs for composite services.
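
The kind of calculation such a model performs can be sketched with a deliberately simplified cost formula (an assumption for illustration, not the paper's formulae): each atomic service only honours parallelism up to its policy limit, so the composite completion time levels off, and the best DOP is the smallest one reaching that minimum.

```python
def composite_time(dop, services):
    """Predicted time of a sequential composition at a given DOP (simplified
    model): each atomic service caps usable parallelism at its policy limit."""
    return sum(s["work"] / min(dop, s["policy_limit"]) + s["overhead"] for s in services)

def best_dop(services, max_dop=32):
    """Return the smallest DOP that attains the minimum predicted time."""
    times = {d: composite_time(d, services) for d in range(1, max_dop + 1)}
    return min(times, key=lambda d: (round(times[d], 9), d))

services = [{"work": 120.0, "policy_limit": 8, "overhead": 1.0},
            {"work": 60.0, "policy_limit": 4, "overhead": 0.5}]
print(best_dop(services))   # 8: beyond that, every provider's policy caps the gain
```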

Proceedings ArticleDOI
26 Jun 2018
TL;DR: Simulation results show that the scheduling strategy that takes into account the degree of parallelism of a job performs better than the other method that does not take into account individual job characteristics.
Abstract: Effective scheduling techniques are very important in distributed systems, as they directly affect the system performance and the utilization of resources. Particularly important is the scheduling of jobs in the case of complex workloads. This paper concentrates on the study of workloads in a distributed system, which consist of parallel jobs of gang type, as well as single-task computationally intensive jobs. Simulation is employed to evaluate the performance of two scheduling techniques for different cases of system load and service time variability. The simulation results show that the scheduling strategy that takes into account the degree of parallelism of a job performs better than the other method that does not take into account individual job characteristics.

Journal ArticleDOI
TL;DR: A false history filtering technique, used by a parallel hardware accelerator to avoid excessive hardware resource cost, is proposed; it detects unnecessary string comparison operations that generate meaningless or unused results.

Journal ArticleDOI
27 Apr 2018
TL;DR: The design and implementation of C-Stream are described; C-Stream is an elastic stream processing engine that varies the degree of parallelism to resolve bottlenecks by both dynamically changing the number of threads used to execute an application and adjusting the number of replicas of data-parallel operators.
Abstract: Stream processing is a computational paradigm for on-the-fly processing of live data. This paradigm lends itself to implementations that can provide high throughput and low latency by taking advantage of various forms of parallelism that are naturally captured by the stream processing model of computation, such as pipeline, task, and data parallelism. In this article, we describe the design and implementation of C-Stream, which is an elastic stream processing engine. C-Stream encompasses three unique properties. First, in contrast to the widely adopted event-based interface for developing streaming operators, C-Stream provides an interface wherein each operator has its own driver loop and relies on data availability application programming interfaces (APIs) to decide when to perform its computations. This self-control-based model significantly simplifies the development of operators that require multiport synchronization. Second, C-Stream contains a dynamic scheduler that manages the multithreaded execution of the operators. The scheduler, which is customizable via plug-ins, enables the execution of the operators as co-routines, using any number of threads. The base scheduler implements back-pressure, provides data availability APIs, and manages preemption and termination handling. Last, C-Stream varies the degree of parallelism to resolve bottlenecks by both dynamically changing the number of threads used to execute an application and adjusting the number of replicas of data-parallel operators. We provide an experimental evaluation of C-Stream. The results show that C-Stream is scalable, highly customizable, and can resolve bottlenecks by dynamically adjusting the level of data parallelism used.
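
The self-control driver-loop model described above can be illustrated with plain Python threads and bounded queues (illustrative only; C-Stream itself is a C++ engine with co-routine scheduling): each operator owns its loop and blocks on data availability instead of being invoked per event.

```python
import queue
import threading

SENTINEL = object()

def operator(in_q, out_q, fn):
    """Driver-loop style operator sketch: the operator decides when to run by
    waiting on data availability, rather than reacting to pushed events."""
    while True:
        item = in_q.get()              # "wait until data is available"
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put(fn(item))

src, mid, sink = queue.Queue(maxsize=8), queue.Queue(maxsize=8), queue.Queue()
threading.Thread(target=operator, args=(src, mid, lambda x: x * 2), daemon=True).start()
threading.Thread(target=operator, args=(mid, sink, lambda x: x + 1), daemon=True).start()

for x in range(5):
    src.put(x)                         # bounded queues give natural back-pressure
src.put(SENTINEL)
results = []
while (r := sink.get()) is not SENTINEL:
    results.append(r)
print(results)                         # [1, 3, 5, 7, 9]
```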

Proceedings ArticleDOI
10 Jul 2018
TL;DR: Experimental results show that C-Storm offers a significant 4.7x speedup over a commonly used sequential baseline and that a higher degree of parallelism leads to better performance.
Abstract: This paper presents a novel parallel platform, C-Storm (Copula-based Storm), for the computationally complex problem of fusing heterogeneous data streams for inference. C-Storm is designed by marrying copula-based dependence modeling, for highly accurate inference, with Storm, a highly-regarded parallel computing platform for fast stream data processing. C-Storm has the following desirable features: 1) C-Storm offers fast inference responses. 2) C-Storm provides high inference accuracy. 3) C-Storm is a general-purpose inference platform that can support data fusion applications. 4) C-Storm is easy to use, and its users do not need deep knowledge of Storm or copula theory. We implemented C-Storm based on Apache Storm 1.0.2 and conducted extensive experiments using a typical data fusion application. Experimental results show that C-Storm offers a significant 4.7x speedup over a commonly used sequential baseline and that a higher degree of parallelism leads to better performance.

01 Jan 2018
TL;DR: An FPGA-based odd-even merge sorter is presented that features a throughput of 27.18 GB/s when merging 4 streams and, thanks to its high degree of parallelism, maintains stable throughput as the number of input streams increases.
Abstract: As database systems have shifted from disk-based to in-memory and the scale of databases in big data analysis has increased significantly, workloads analyzing huge datasets are growing. Adopting FPGAs as hardware accelerators improves flexibility, parallelism, and power efficiency versus CPU-only systems. The accelerators are also required to keep up with the high memory bandwidth provided by advanced memory technologies and new interconnect interfaces. Sorting is the most fundamental database operation. In multiple-pass merge sorting, the final pass of the merge operation requires significant throughput performance to keep up with the high memory bandwidth. We study the state-of-the-art hardware-based sorters and present an analysis of our own design. In this thesis, we present an FPGA-based odd-even merge sorter which features a throughput of 27.18 GB/s when merging 4 streams. Our design also presents stable throughput performance when the number of input streams is increased, due to its high degree of parallelism. Thanks to such a generic design, the odd-even merge sorter does not suffer a throughput drop for skewed data distributions and presents constant performance over various kinds of input distributions.
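
A small software model of the merge network helps explain the stable, data-independent throughput (the thesis implements this in hardware): Batcher's odd-even merge recursively merges the even- and odd-indexed sub-sequences and finishes with one layer of compare-exchange operations, and all comparators at a level can operate in parallel.

```python
def odd_even_merge(x):
    """Batcher odd-even merge (software model of the hardware network): `x` is a
    list whose two halves are each sorted and whose length is a power of two.
    The comparators at each level are data-independent, which is the source of
    the high degree of parallelism when laid out as a hardware network."""
    n = len(x)
    if n <= 2:
        return sorted(x)
    evens = odd_even_merge(x[0::2])        # even-indexed elements of both halves
    odds = odd_even_merge(x[1::2])         # odd-indexed elements of both halves
    merged = [None] * n
    merged[0::2], merged[1::2] = evens, odds
    for i in range(1, n - 1, 2):           # final compare-exchange layer
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

print(odd_even_merge([1, 4, 6, 7, 2, 3, 5, 8]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```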

Posted Content
TL;DR: A modularized graph processing framework is proposed that focuses on the whole execution procedure, whose stages exhibit extremely different degrees of parallelism, and a novel conversion dispatcher is designed to switch the processing module at the corresponding exchange points.
Abstract: Highly parallel frameworks have proved to be well suited to graph processing, and there is a variety of work on optimizing their implementation on FPGAs, which are pipeline-parallel devices. The key to exploiting the parallel performance of FPGAs is to process graph data in a pipelined model and to take advantage of on-chip memory to realize the necessary locality. This paper proposes a modularized graph processing framework that focuses on the whole execution procedure, whose stages exhibit extremely different degrees of parallelism. The framework makes three contributions. First, the combination of vertex-centric and edge-centric processing can be adjusted during execution to accommodate both top-down and bottom-up algorithms. Second, to suit the pipeline parallelism and finite on-chip memory of the accelerator, a novel edge-block (a block consisting of edges and their vertices) optimizes the way on-chip memory is used to group edges and stream the edges of a block, realizing a streaming pattern suited to pipeline-parallel processing. Third, based on an analysis of the block structure of natural graphs and of the execution characteristics observed during graph processing, we design a novel conversion dispatcher that switches the processing module at the corresponding exchange points.
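
The switch between top-down and bottom-up processing can be illustrated with the classic direction-optimizing BFS heuristic (a software sketch; the framework above performs this switch across FPGA modules, and the frontier-size threshold here is an assumed heuristic, not taken from the paper).

```python
def direction_optimizing_bfs(adj, source, alpha=0.3):
    """Software sketch of a conversion dispatcher switching between a top-down
    (push) and a bottom-up (pull) traversal module: push while the frontier is
    small, pull once it covers more than `alpha` of the vertices."""
    n = len(adj)
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        if len(frontier) < alpha * n:                 # top-down: expand the frontier
            nxt = {u for v in frontier for u in adj[v] if u not in level}
        else:                                         # bottom-up: unvisited pull from frontier
            nxt = {u for u in adj if u not in level
                   and any(w in frontier for w in adj[u])}
        for u in nxt:
            level[u] = depth
        frontier = nxt
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(direction_optimizing_bfs(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```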