
Showing papers on "Degree of parallelism published in 2017"


Proceedings ArticleDOI
14 Oct 2017
TL;DR: PRA as mentioned in this paper uses serial-parallel shift-and-add multiplication while skipping the zero bits of the serial input, eliminating most of the ineffectual computations on-the-fly.
Abstract: Deep Neural Networks expose a high degree of parallelism, making them amenable to highly data-parallel architectures. However, data-parallel architectures often accept inefficiency in individual computations for the sake of overall efficiency. We show that, on average, the activation values of convolutional layers during inference in modern Deep Convolutional Neural Networks (CNNs) contain 92% zero bits. Processing these zero bits entails ineffectual computations that could be skipped. We propose Pragmatic (PRA), a massively data-parallel architecture that eliminates most of these ineffectual computations on-the-fly, improving performance and energy efficiency compared to state-of-the-art high-performance accelerators [5]. The idea behind PRA is deceptively simple: use serial-parallel shift-and-add multiplication while skipping the zero bits of the serial input. However, a straightforward implementation based on shift-and-add multiplication yields unacceptable area, power and memory access overheads compared to a conventional bit-parallel design. PRA incorporates a set of design decisions to yield a practical, area- and energy-efficient design. Measurements demonstrate that for convolutional layers, PRA is 4.31x faster than DaDianNao [5] (DaDN) using a 16-bit fixed-point representation. While PRA requires 1.68x more area than DaDN, the performance gains yield a 1.70x increase in energy efficiency in a 65nm technology. With 8-bit quantized activations, PRA is 2.25x faster and 1.31x more energy efficient than an 8-bit version of DaDN. CCS CONCEPTS: • Computing methodologies → Machine learning; Neural networks; • Computer systems organization → Single instruction, multiple data; • Hardware → Arithmetic and datapath circuits.
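To make the zero-bit-skipping idea concrete in software terms, here is a minimal Python sketch (illustrative only, not the paper's hardware design): the weight is shifted and accumulated once per set bit of the activation, so zero bits contribute no work.

```python
def shift_add_multiply(activation: int, weight: int) -> int:
    """Serial-parallel multiply: only the set bits of the serial input
    (the activation) contribute shift-and-add terms; zero bits are skipped."""
    product = 0
    shift = 0
    while activation:
        if activation & 1:           # effectual bit: add weight << bit position
            product += weight << shift
        activation >>= 1             # zero bits fall through with no work
        shift += 1
    return product

# activation 0b1000100 has only two set bits, so only two add-and-shift steps
assert shift_add_multiply(0b1000100, 3) == 0b1000100 * 3
```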

196 citations


Proceedings ArticleDOI
14 Jun 2017
TL;DR: This paper presents the design and implementation of three key features of Futhark that seek a suitable middle ground with imperative approaches and presents a flattening transformation aimed at enhancing the degree of parallelism.
Abstract: Futhark is a purely functional data-parallel array language that offers a machine-neutral programming model and an optimising compiler that generates OpenCL code for GPUs. This paper presents the design and implementation of three key features of Futhark that seek a suitable middle ground with imperative approaches. First, in order to express efficient code inside the parallel constructs, we introduce a simple type system for in-place updates that ensures referential transparency and supports equational reasoning. Second, we furnish Futhark with parallel operators capable of expressing efficient strength-reduced code, along with their fusion rules. Third, we present a flattening transformation aimed at enhancing the degree of parallelism that (i) builds on loop interchange and distribution but uses higher-order reasoning rather than array-dependence analysis, and (ii) still allows further locality-of-reference optimisations. Finally, an evaluation on 16 benchmarks demonstrates the impact of the language and compiler features and shows application-level performance competitive with hand-written GPU code.

111 citations


Journal ArticleDOI
TL;DR: The parallel neuromorphic processor architectures for spiking neural networks on FPGA address several critical issues pertaining to efficient parallelization of the update of membrane potentials, on-chip storage of synaptic weights and integration of approximate arithmetic units.

88 citations


Journal ArticleDOI
TL;DR: This paper presents a resource management technique that introduces power density as a novel system-level constraint, provides runtime adaptation of this constraint according to the characteristics of the executed applications, and reacts to workload changes at runtime.
Abstract: Increasing power densities have led to the dark silicon era, for which heterogeneous multicores with different power and performance characteristics are promising architectures. This paper focuses on maximizing the overall system performance under a critical temperature constraint for heterogeneous tiled multicores, where all cores or accelerators inside a tile share the same voltage and frequency levels. For such architectures, we present a resource management technique that introduces power density as a novel system level constraint, in order to avoid thermal violations. The proposed technique then assigns applications to tiles by choosing their degree of parallelism and the voltage/frequency levels of each tile, such that the power density constraint is satisfied. Moreover, our technique provides runtime adaptation of the power density constraint according to the characteristics of the executed applications, and reacting to workload changes at runtime. Thus, the available thermal headroom is exploited to maximize the overall system performance.
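As a rough illustration of the selection step such a resource manager performs, the sketch below picks, per tile, the degree of parallelism and voltage/frequency level with the best performance whose power density stays under the (runtime-adapted) constraint. The configuration table and all numbers are hypothetical, not from the paper.

```python
# Hypothetical per-tile configurations: (parallelism, v/f level) -> (power W, perf score)
CONFIGS = {
    (2, "low"):  (1.2, 3.0),
    (4, "low"):  (2.0, 5.5),
    (4, "high"): (3.6, 8.0),
    (8, "high"): (6.4, 12.0),
}
TILE_AREA_MM2 = 4.0

def pick_config(power_density_limit_w_per_mm2: float):
    """Return the (parallelism, v/f) pair maximizing performance while
    keeping the tile's power density below the given constraint."""
    feasible = {
        cfg: perf
        for cfg, (power, perf) in CONFIGS.items()
        if power / TILE_AREA_MM2 <= power_density_limit_w_per_mm2
    }
    return max(feasible, key=feasible.get) if feasible else None

print(pick_config(2.0))   # loose constraint: (8, 'high') is allowed
print(pick_config(0.6))   # tight constraint: falls back to (4, 'low')
```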

62 citations


Proceedings ArticleDOI
27 Mar 2017
TL;DR: This work presents an implementation of the pure version of the Smith-Waterman algorithm, including the key architectural optimizations to achieve the highest possible performance for a given platform, and leverages the Berkeley roofline model to track the performance and guide the optimizations.
Abstract: Smith-Waterman is a dynamic programming algorithm that plays a key role in the modern genomics pipeline as it is guaranteed to find the optimal local alignment between two strings of data. The state of the art presents many hardware acceleration solutions that have been implemented in order to exploit the high degree of parallelism available in this algorithm. The majority of these implementations use heuristics to increase the performance of the system at the expense of the accuracy of the result. In this work, we present an implementation of the pure version of the algorithm. We include the key architectural optimizations to achieve the highest possible performance for a given platform and leverage the Berkeley roofline model to track the performance and guide the optimizations. To achieve scalability, our custom design comprises systolic arrays, data compression features and shift registers, while a custom port mapping strategy aims to maximize performance. Our designs are built leveraging an OpenCL-based design entry, namely Xilinx SDAccel, in conjunction with Xilinx Virtex 7 and Kintex UltraScale platforms. Our final design achieves a performance of 42.47 GCUPS (giga cell updates per second) with an energy efficiency of 1.6988 GCUPS/W. This represents an improvement of 1.72x in performance and energy efficiency over previously published FPGA implementations, and an 8.49x improvement in energy efficiency over comparable GPU implementations.
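For reference, the pure (heuristic-free) recurrence that the systolic design accelerates can be written in a few lines; this Python sketch uses a linear gap penalty and illustrative scoring parameters, and the comment notes where the anti-diagonal parallelism exploited in hardware comes from.

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Pure Smith-Waterman local alignment score with a linear gap penalty.
    Cells on the same anti-diagonal of H are independent, which is what a
    systolic array computes in parallel."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```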

45 citations


Journal ArticleDOI
TL;DR: A novel GPU-accelerated batch-QR solver, which packages a massive number of QR tasks into a new larger-scale problem and thereby achieves a higher level of parallelism and better coalesced memory accesses, and lays a critical foundation for many other power system applications that need to deal with massive numbers of subtasks.
Abstract: Graphics processing unit (GPU) has been applied successfully in many scientific computing realms due to its superior performance on floating-point calculation and memory bandwidth, and has great potential in power system applications. The N-1 static security analysis (SSA) appears to be a candidate application in which massive alternating current power flow (ACPF) problems need to be solved. However, when applying existing GPU-accelerated algorithms to solve the N-1 SSA problem, the degree of parallelism is limited because existing research has been devoted to accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of GPU-accelerated sequential-ACPF solutions is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages a massive number of QR tasks into a new larger-scale problem and thereby achieves a higher level of parallelism and better coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch-Jacobian-matrix generation and contingency screening stage is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show that a dramatic computation time reduction is achieved compared with all reported existing GPU-accelerated methods. In comparison to the UMFPACK-library-based single-CPU counterpart using an Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using an NVIDIA K20C achieves up to 57.6 times speedup. It even achieves a four times speedup when compared to one of the fastest multi-core CPU parallel computing solutions using the KLU library. The proposed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte-Carlo simulation and probabilistic power flow.
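The batching idea can be previewed in plain NumPy, independent of the paper's GPU kernels: recent NumPy versions (>= 1.22, an assumption about the environment) broadcast QR over a leading batch dimension, so many small factorizations become one larger call, mirroring the "package massive QR tasks into a larger problem" strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 8, 8))      # a batch of 512 small dense systems
b = rng.standard_normal((512, 8))

# One call factors the whole stack instead of looping in Python; this is the
# same "many small tasks packaged into one larger problem" idea.
Q, R = np.linalg.qr(A)

# Solve A_i x_i = b_i for every i via R_i x_i = Q_i^T b_i
qtb = np.einsum("bji,bj->bi", Q, b)       # batched Q^T b
x = np.linalg.solve(R, qtb)               # batched triangular-style solve
assert np.allclose(np.einsum("bij,bj->bi", A, x), b)
```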

44 citations


Journal ArticleDOI
TL;DR: The findings prove that the choice between explicit and implicit time integration relies mainly on the convergence of explicit solvers and the efficiency of preconditioners on the GPU.
Abstract: A Computational Fluid Dynamics (CFD) code for steady simulations solves a set of non-linear partial differential equations using an iterative time-stepping process, which can follow an explicit or an implicit scheme. On the CPU, the difference between both time-stepping methods with respect to stability and performance has been well covered in the literature. However, it has not been extended to consider modern high-performance computing systems such as Graphics Processing Units (GPU). In this work, we first present an implementation of the two time-stepping methods on the GPU, highlighting the different challenges of the programming approach. Then we introduce a classification of basic CFD operations, based on the degree of parallelism they expose, and study the potential of GPU acceleration for every class. The classification provides local speedups of basic operations, which are finally used to compare the performance of both methods on the GPU. The target of this work is to enable an informed decision on the most efficient combination of hardware and method when facing a new application. Our findings prove that the choice between explicit and implicit time integration relies mainly on the convergence of explicit solvers and the efficiency of preconditioners on the GPU.
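The two time-stepping families being compared can be illustrated on the scalar model problem du/dt = -k*u: the explicit update needs no solve and parallelizes trivially but is stability-limited in step size, while the implicit update is unconditionally stable at the cost of a solve per step (trivial here, a large preconditioned system in CFD). A minimal sketch:

```python
def explicit_euler(u, k, dt):
    # u_{n+1} = u_n + dt * f(u_n): no solve, fully parallel over cells,
    # but stable only for dt < 2/k on this model problem
    return u + dt * (-k * u)

def implicit_euler(u, k, dt):
    # u_{n+1} = u_n + dt * f(u_{n+1}): unconditionally stable, but each
    # step requires a solve (scalar here, a large sparse system in CFD)
    return u / (1.0 + k * dt)

u_exp = u_imp = 1.0
for _ in range(10):
    u_exp = explicit_euler(u_exp, k=5.0, dt=0.05)
    u_imp = implicit_euler(u_imp, k=5.0, dt=0.05)
print(u_exp, u_imp)   # both decay toward 0; behavior diverges for large dt
```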

26 citations


Proceedings ArticleDOI
24 Jul 2017
TL;DR: It is concluded that a careful evaluation, based on accuracy and latency requirements, must be performed, and that exact matrix inversion is in fact viable in many more cases than the current literature claims.
Abstract: Approximate matrix inversion based on Neumann series has seen a recent increase in interest motivated by massive MIMO systems. There, the matrices are in many cases diagonally dominant, and, hence, a reasonable approximation can be obtained within a few iterations of a Neumann series. In this work, we clarify that the complexity of exact methods is about the same as when three terms are used in the Neumann series, so in this case the complexity is not lower, as often claimed. The second common argument for Neumann series approximation, higher parallelism, is indeed correct. However, in most current practical use cases, such a high degree of parallelism is not required to obtain a low-latency realization. Hence, we conclude that a careful evaluation, based on accuracy and latency requirements, must be performed, and that exact matrix inversion is in fact viable in many more cases than the current literature claims.
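A compact way to see the approximation under discussion: truncate the Neumann series for A^{-1}, using the diagonal D of A as the splitting. The sketch below (illustrative sizes, not the paper's setup) shows the error shrinking as terms are added for a diagonally dominant Gram-type matrix.

```python
import numpy as np

def neumann_inverse(A: np.ndarray, terms: int = 3) -> np.ndarray:
    """Approximate A^{-1} as sum_{k < terms} (I - D^{-1} A)^k D^{-1},
    where D = diag(A); the series converges when A is diagonally dominant."""
    D_inv = np.diag(1.0 / np.diag(A))
    E = np.eye(A.shape[0]) - D_inv @ A
    approx, term = np.zeros_like(A), D_inv
    for _ in range(terms):
        approx += term
        term = E @ term
    return approx

# Gram-like, diagonally dominant matrix of the kind seen in massive MIMO detection
rng = np.random.default_rng(1)
H = rng.standard_normal((128, 8))
A = H.T @ H + 16 * np.eye(8)
for k in (1, 2, 3, 4):
    err = np.linalg.norm(neumann_inverse(A, k) - np.linalg.inv(A))
    print(k, err)   # error shrinks roughly geometrically as terms are added
```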

23 citations


Proceedings ArticleDOI
23 Apr 2017
TL;DR: Pandia is a system for modeling the performance of in-memory parallel workloads that accounts for contention at multiple resources such as processor functional units and memory channels, and can be used to optimize the performance and identify opportunities for reducing resource consumption.
Abstract: Pandia is a system for modeling the performance of in-memory parallel workloads. It generates a description of a workload from a series of profiling runs, and combines this with a description of the machine's hardware to model the workload's performance over different thread counts and different placements of those threads. The approach is "comprehensive" in that it accounts for contention at multiple resources such as processor functional units and memory channels. The points of contention for a workload can shift between resources as the degree of parallelism and thread placement changes. Pandia accounts for these changes and provides a close correspondence between predicted performance and actual performance. Testing a set of 22 benchmarks on 2-socket Intel machines fitted with chips ranging from Sandy Bridge to Haswell, we see median differences of 1.05% to 0% between the fastest predicted placement and the fastest measured placement, and median errors of 8% to 4% across all placements. Pandia can be used to optimize the performance of a given workload, for instance by identifying whether or not multiple processor sockets should be used, and whether or not the workload benefits from using multiple threads per core. In addition, Pandia can be used to identify opportunities for reducing resource consumption where additional resources are not matched by additional performance, for instance by limiting a workload to a small number of cores when its scaling is poor.

21 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on-the-fly and shows the gains due to reconfiguration and shed some light on the behavior of parallel and reconfigurable SMR.
Abstract: State Machine Replication (SMR) is a well-known technique to implement fault-tolerant systems. In SMR, servers are replicated and client requests are deterministically executed in the same order by all replicas. To improve performance in multi-processor systems, some approaches have proposed to parallelize the execution of non-conflicting requests. Such approaches perform remarkably well in workloads dominated by non-conflicting requests. Conflicting requests introduce expensive synchronization and result in considerable performance loss. Current approaches to parallel SMR define the degree of parallelism statically. However, it is often difficult to predict the best degree of parallelism for a workload and workloads experience variations that change their best degree of parallelism. This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on-the-fly. Experiments show the gains due to reconfiguration and shed some light on the behavior of parallel and reconfigurable SMR.

20 citations


Proceedings ArticleDOI
19 May 2017
TL;DR: PGX.D/Async is presented, a scalable distributed pattern matching engine for property graphs that is able to handle very large datasets and implements pattern matching operations with asynchronous depth-first traversal, allowing for a high degree of parallelism and precise control over memory consumption.
Abstract: Graph querying and pattern matching is becoming an important feature of graph processing as it allows data analysts to easily collect and understand information about their graphs in a way similar to SQL for databases. One of the key challenges in graph pattern matching is to process increasingly large graphs that often do not fit in the memory of a single machine. In this paper, we present PGX.D/Async, a scalable distributed pattern matching engine for property graphs that is able to handle very large datasets. PGX.D/Async implements pattern matching operations with asynchronous depth-first traversal, allowing for a high degree of parallelism and precise control over memory consumption. In PGX.D/Async, developers can query graphs with PGQL, an SQL-like query language for property graphs. Essentially, PGX.D/Async provides an intuitive, distributed, in-memory pattern matching engine for very large graphs.

Journal ArticleDOI
TL;DR: A new parallel solver for the volumetric integral equations of electrodynamics is presented, based on the Galerkin method, which ensures convergent numerical solution and exhibits perfect scalability on different hardware platforms.
Abstract: A new parallel solver for the volumetric integral equations (IE) of electrodynamics is presented. The solver is based on the Galerkin method, which ensures convergent numerical solution. The main features include: (i) memory usage eight times lower compared with analogous IE-based algorithms, without additional restrictions on the background media; (ii) accurate and stable method to compute matrix coefficients corresponding to the IE; and (iii) high degree of parallelism. The solver’s computational efficiency is demonstrated on a problem of magnetotelluric sounding of media with large conductivity contrast, revealing good agreement with results obtained using the second-order finite-element method. Due to the effective approach to parallelization and distributed data storage, the program exhibits perfect scalability on different hardware platforms.

Proceedings ArticleDOI
09 May 2017
TL;DR: PACMAN as mentioned in this paper is a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging, which leverages a combination of static and dynamic analyses to parallelize the log recovery.
Abstract: Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volumes of on-line transactions on modern multi-core machines. But existing durability schemes, namely tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery without introducing any costly logging overhead to the execution of concurrent transactions. We propose PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging. PACMAN leverages a combination of static and dynamic analyses to parallelize the log recovery: at compile time, PACMAN decomposes stored procedures by carefully analyzing dependencies within and across programs; at recovery time, PACMAN exploits the availability of the runtime parameter values to attain an execution schedule with a high degree of parallelism. As such, recovery performance is remarkably increased. We evaluated PACMAN in a fully-fledged main-memory DBMS running on a 40-core machine. Compared to several state-of-the-art database recovery mechanisms, PACMAN can significantly reduce recovery time without compromising the efficiency of transaction processing.

Patent
10 May 2017
TL;DR: In this paper, the authors proposed a self-adaptive rate control method for stream data processing, based on a common data-receiving message queue and a big data distributed computing framework, which adjusts the degree of parallelism to preserve the real-time behaviour and stability of a mass data processing system.
Abstract: The invention belongs to the technical field of computer applications and relates to a self-adaptive rate control method for stream data processing. Based on a common data-receiving message queue and a big data distributed computing framework, the method adjusts the degree of parallelism of data processing through a pre-fragmentation mode according to the state of the current computing cluster, and dynamically adjusts the amount of data currently processed by the cluster according to a self-adaptive real-time rate control method, so that the stability of the computing cluster is ensured and the delay of stream data processing is reduced. As big data gradually penetrates the industries, the application range of real-time processing of massive data is steadily expanding, and the real-time behaviour and stability of a mass data processing system are therefore very important. Without increasing the amount of cluster hardware or the complexity of task programming, the stability and processing efficiency of the computing cluster are enhanced to a certain extent.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: WASP is introduced, a workload-aware task scheduler and partitioner, which jointly optimizes both parameters at runtime and improves performance and reduces the cluster operating cost on cloud by up to 40%, over the baseline following Spark Tuning Guidelines.
Abstract: Recently, in-memory big data processing frameworks have emerged, such as Apache Spark and Ignite, to accelerate workloads requiring frequent data reuse. With effective in-memory caching these frameworks eliminate most of the I/O operations, which would otherwise be necessary for communication between producer and consumer tasks. However, this performance benefit is nullified if the memory footprint exceeds the available memory size, due to excessive spill and garbage collection (GC) operations. To fit the working set in memory, two system parameters play an important role: the number of data partitions (N_partitions) specifying task granularity, and the number of tasks per executor (N_threads) specifying the degree of parallelism in execution. Existing approaches to optimizing these parameters either do not take into account workload characteristics, or optimize only one of the parameters in isolation, thus yielding suboptimal performance. This paper introduces WASP, a workload-aware task scheduler and partitioner, which jointly optimizes both parameters at runtime. To find an optimal setting, WASP first analyzes the DAG structure of a given workload, and uses an analytical model to predict optimal settings of N_partitions and N_threads for all stages based on their computation types. Taking this as input, the WASP scheduler employs a hill climbing algorithm to find an optimal N_threads for each stage, thus maximizing concurrency while minimizing data spills and GCs. We prototype WASP on Spark and evaluate it using six workloads on three different parallel platforms. WASP improves performance by up to 3.22x and reduces the cluster operating cost on cloud by up to 40% over the baseline following Spark Tuning Guidelines, and provides robust performance for both shuffle-heavy and shuffle-light workloads.
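The per-stage search can be pictured as plain hill climbing over the thread count; the cost function below is only a stand-in for WASP's analytical model of spills and GC (the function shape, names and numbers are assumptions, not from the paper).

```python
def hill_climb_threads(cost, start=8, lo=1, hi=64):
    """Greedy hill climbing over N_threads: move to a strictly cheaper
    neighbor (N-1 or N+1) until neither direction improves the stage cost."""
    n = start
    improved = True
    while improved:
        improved = False
        for m in (n - 1, n + 1):
            if lo <= m <= hi and cost(m) < cost(n):
                n, improved = m, True
                break
    return n

# Stand-in cost: more threads help until memory pressure (spill/GC) dominates
stage_cost = lambda n: 100.0 / n + 0.08 * n ** 2
print(hill_climb_threads(stage_cost))   # settles near the model's optimum
```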

Proceedings ArticleDOI
05 Jun 2017
TL;DR: This paper makes the preliminary attempt to develop the dataflow insight into a specialized graph accelerator and believes that this work would open a wide range of opportunities to improve the performance of computation and memory access for large-scale graph processing.
Abstract: Existing graph processing frameworks greatly improve the performance of the memory subsystem, but they are still subject to the underlying modern processor, resulting in potential inefficiencies for graph processing in the form of low instruction-level parallelism and high branch misprediction. These inefficiencies, according to our comprehensive micro-architectural study, mainly arise out of a wealth of dependencies, the serial semantics of instruction streams, and complex conditional instructions in graph processing. In this paper, we propose that a fundamental shift of approach is necessary to break through the inefficiencies of the underlying processor via the dataflow paradigm. We verify that applying the dataflow approach to graph processing is extremely appealing for two reasons. First, as the execution and retirement of instructions only depend on the availability of input data in the dataflow model, a high degree of parallelism can be provided to relax the heavy dependencies and serial semantics. Second, dataflow makes it possible to reduce the cost of branch misprediction by simultaneously executing all branches of a conditional instruction. Consequently, we make a preliminary attempt to develop the dataflow insight into a specialized graph accelerator. We believe that our work opens a wide range of opportunities to improve the performance of computation and memory access for large-scale graph processing.

Journal ArticleDOI
TL;DR: A grid-based parallel algorithm for particle insertion, named WI-USHER, is proposed to improve the efficiency of the particle insertion operation when restricting the size of the region to be inserted or with higher number density.
Abstract: The hybrid atomistic-continuum coupling method based on domain decomposition serves as an important tool for the micro-fluid simulation. There exists a certain degree of parallelism load imbalance when directly using the USHER algorithm in the domain decomposition–based hybrid atomistic-continuum coupling method. In this article, we propose a grid-based parallel algorithm for particle insertion, named WI-USHER, to improve the efficiency of the particle insertion operation when restricting the size of the region to be inserted or with higher number density. The WI-USHER algorithm slices the region to be inserted into finer grids with proper spacing scale, marks parts of finer grids in black according to three exclusive rules, that is, Single Particle Occupation (SPO), Single Particle Coverage (SPC), and Multi-Particles Coverage (MPC), and finds the target insertion point in the remained white grids. We use two test cases to show the superiority of our WI-USHER algorithm over the USHER algorithm. The WI-USH...

Journal ArticleDOI
TL;DR: This work proposes a new approach to analyzing degree of parallelism for concurrent workflow processes with shared resources and demonstrates the application and evaluates the effectiveness in a real-world business scenario.
Abstract: Degree of parallelism is an important factor in workflow process management, because it is useful to accurately estimate the server costs and schedule servers in workflow processes. However, existing methods that are developed to compute degree of parallelism neglect to consider activities with uncertain execution time. In addition, these methods are limited in dealing with the situation where activities in multiple concurrent workflow processes use shared resources. To address the limitations, we propose a new approach to analyzing degree of parallelism for concurrent workflow processes with shared resources. Superior to the existing methods, our approach can compute degree of parallelism for multiple concurrent workflow processes that have activities with uncertain execution time and shared resources. Expectation degree of parallelism is useful to estimate the server costs of the workflow processes, and maximum degree of parallelism can guide managers to allocate servers or virtual machines based on the business requirement. We demonstrate the application of the approach and evaluate its effectiveness in a real-world business scenario.
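One way to make "expectation degree of parallelism" concrete: if activity start times are known and uncertain durations are given as ranges, a Monte Carlo estimate averages, over sampled schedules, the peak number of simultaneously running activities. This is an illustrative sketch, not the paper's analytical method, and the activity data are hypothetical.

```python
import random

def max_parallelism(intervals):
    """Peak number of simultaneously running activities (sweep over events)."""
    events = [(s, 1) for s, e in intervals] + [(e, -1) for s, e in intervals]
    running = peak = 0
    for _, delta in sorted(events, key=lambda ev: (ev[0], ev[1])):
        running += delta
        peak = max(peak, running)
    return peak

def expected_parallelism(activities, samples=10_000):
    """activities: list of (start, (min_duration, max_duration)) with uncertain durations."""
    total = 0
    for _ in range(samples):
        intervals = [(s, s + random.uniform(lo, hi)) for s, (lo, hi) in activities]
        total += max_parallelism(intervals)
    return total / samples

acts = [(0, (2, 4)), (1, (1, 5)), (3, (2, 3)), (3.5, (1, 2))]
print(expected_parallelism(acts))   # lies between the minimum and maximum possible peak
```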

Proceedings ArticleDOI
19 Apr 2017
TL;DR: In this article, the authors introduce Posterior Snapshot Isolation (PostSI), an SI mechanism that allows transactions to determine their timestamps autonomously, without relying on centralized coordination.
Abstract: Snapshot Isolation (SI) is a widely adopted concurrency control mechanism in database systems, which utilizes timestamps to resolve conflicts between transactions. However, centralized allocation of timestamps is a potential bottleneck for parallel transaction management. This bottleneck is becoming increasingly visible with the rapidly growing degree of parallelism of today's computing platforms. This paper introduces Posterior Snapshot Isolation (PostSI), an SI mechanism that allows transactions to determine their timestamps autonomously, without relying on centralized coordination. As such, PostSI can scale well, rendering it suitable for various multi-core and MPP platforms. Extensive experiments are conducted to demonstrate its advantage over existing approaches.

Proceedings ArticleDOI
01 Nov 2017
TL;DR: Two extensions to ERS are proposed to support dynamic modelling using Event-B: support for data-dependent workflows in which data values can change during execution, and exception handling support.
Abstract: Event-B is a state-based formal method for modelling and verifying the consistency of discrete systems. Event refinement structures (ERS) augment Event-B with hierarchical diagrams, providing explicit support for workflows and refinement relationships. Despite the variety of ERS combinators, ERS still lacks the flexibility to model dynamic workflows that support dynamic changes in the degree of concurrency. Specifically in the cases where the degree of parallelism is data dependent and data values can change during execution. In this paper, we propose two types of extensions in ERS to support dynamic modelling using Event-B. The first extension is supporting data-dependent workflows where data changes are possible. The second extension improves ERS by providing exception handling support. Semantics are given to an ERS diagram by generating an Event-B model from it. We demonstrate the Event-B encodings of the proposed ERS extensions by modelling a concurrent emergency dispatch case study.

DOI
01 Jan 2017
TL;DR: The objective of this Thesis is to provide a complete framework to analyze, evaluate and refactor DDF applications expressed using the RVC-CAL language, which relies on a systematic design space exploration (DSE) examining different design alternatives to optimize the chosen objective function while satisfying the constraints.
Abstract: The limitations of clock frequency and power dissipation of deep sub-micron CMOS technology have led to the development of massively parallel computing platforms. They consist of dozens or hundreds of processing units and offer a high degree of parallelism. Taking advantage of that parallelism and transforming it into high program performances requires the usage of appropriate parallel programming models and paradigms. Currently, a common practice is to develop parallel applications using methods evolving directly from sequential programming models. However, they lack the abstractions to properly express the concurrency of the processes. An alternative approach is to implement dataflow applications, where the algorithms are described in terms of streams and operators thus their parallelism is directly exposed. Since algorithms are described in an abstract way, they can be easily ported to different types of platforms. Several dataflow models of computation (MoCs) have been formalized so far. They differ in terms of their expressiveness (ability to handle dynamic behavior) and complexity of analysis. So far, most of the research efforts have focused on the simpler cases of static dataflow MoCs, where many analyses are possible at compile-time and several optimization problems are greatly simplified. At the same time, for the most expressive and the most difficult to analyze dynamic dataflow (DDF), there is still a dearth of tools supporting a systematic and automated analysis minimizing the programming efforts of the designer. The objective of this Thesis is to provide a complete framework to analyze, evaluate and refactor DDF applications expressed using the RVC-CAL language. The methodology relies on a systematic design space exploration (DSE) examining different design alternatives in order to optimize the chosen objective function while satisfying the constraints. The research contributions start from a rigorous DSE problem formulation. This provides a basis for the definition of a complete and novel analysis methodology enabling systematic performance improvements of DDF applications. Different stages of the methodology include exploration heuristics, performance estimation and identification of refactoring directions. All of the stages are implemented as appropriate software tools. The contributions are substantiated by several experiments performed with complex dynamic applications on different types of physical platforms.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper validate with experimental results the moldable parallel computing model where the dynamic energy consumption of a moldable job increases with the number of parallel tasks, and proposes a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamicEnergy consumption while meeting performance requirements from a global perspective.
Abstract: Big data workflows comprised of moldable parallel MapReduce programs running on a large number of processors have become a main consumer of energy at data centers. The degree of parallelism of each moldable job in such workflows has a significant impact on the energy efficiency of parallel computing systems, which remains largely unexplored. In this paper, we validate with experimental results the moldable parallel computing model where the dynamic energy consumption of a moldable job increases with the number of parallel tasks. Based on our validation, we construct rigorous cost models and formulate a dynamic scheduling problem of deadline-constrained MapReduce workflows to minimize energy consumption in Hadoop systems. We propose a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamic energy consumption while meeting performance requirements from a global perspective, and also design the corresponding system modules for algorithm implementation in Hadoop architecture. The performance superiority of the proposed algorithm in terms of dynamic energy saving and deadline violation is illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms, and the core module of adaptive task partitioning is further validated through real-life workflow implementation and experimental results using the Oozie workflow engine in Hadoop/YARN systems.
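A toy version of the trade-off the cost model captures: if runtime follows an Amdahl-style curve while dynamic energy grows with the number of parallel tasks (the relationship the paper validates), then the energy-minimal feasible choice is the smallest degree of parallelism that still meets the deadline. The model parameters below are hypothetical.

```python
def pick_parallelism(work, deadline, max_tasks=64,
                     seq_fraction=0.1, energy_per_task=1.0):
    """Return (tasks, energy, runtime) for the minimum-energy degree of
    parallelism whose (Amdahl-style) runtime still meets the deadline."""
    best = None
    for n in range(1, max_tasks + 1):
        runtime = work * (seq_fraction + (1 - seq_fraction) / n)
        energy = n * energy_per_task * runtime   # dynamic energy grows with n
        if runtime <= deadline and (best is None or energy < best[1]):
            best = (n, energy, runtime)
    return best

print(pick_parallelism(work=100.0, deadline=30.0))   # smallest n meeting the deadline
```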

Patent
28 Nov 2017
TL;DR: In this paper, the authors proposed a heterogeneous computing platform consisting of a host and a plurality of programmable devices, wherein the host is connected with each of the programmable devices, and each device is used for processing its share of the distributed computing data in parallel.
Abstract: The invention discloses a heterogeneous computing platform. The platform comprises a host and a plurality of programmable devices, wherein the host is connected with each of the programmable devices; the host is used for initializing the programmable devices, carrying out parallel scheduling on each programmable device, sending computing data to each programmable device and obtaining a computing result; and each programmable device is used for processing the distributed computing data in parallel. The plurality of programmable devices of the heterogeneous computing platform can carry out computation at the same time, and the operation speed of the whole heterogeneous computing platform is equal to the sum of operation speeds of the programmable devices; compared with the heterogeneous computing platform with only one programmable device in the prior art, the whole operation speed and degree of parallelism of the heterogeneous computing platform are improved, so that the computing efficiency is improved, and the requirements, for the operation speed of the heterogeneous computing platform, of more and more complicated algorithms and data with larger and larger scales can be better satisfied. The invention furthermore provides an acceleration method on the basis of the heterogeneous computing platform.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: A design methodology is presented for a dataflow accelerator implementing CNNs on FPGAs that ensures scalability, achieving a higher degree of parallelism as the size of the CNN increases, and an efficient exploitation of the available resources.
Abstract: In the past few years we have experienced an extremely rapid growth of modern applications based on deep learning algorithms such as the Convolutional Neural Network (CNN), and consequently an intensification of academic and industrial research focused on optimizing their implementation. Among the different alternatives that have been explored, FPGAs seem to be one of the most attractive, as they are able to deliver high performance and energy efficiency, thanks to their inherent parallelism and direct hardware execution, while retaining extreme flexibility due to their reconfigurability. In this paper we present a design methodology of a dataflow accelerator for the implementation of CNNs on FPGAs that ensures scalability, achieving a higher degree of parallelism as the size of the CNN increases, and an efficient exploitation of the available resources. Furthermore, we analyze the resource consumption of the layers of the CNN as well as latency in relation to the implementation's hyperparameters. Finally, we show that the proposed design implements a high-level pipeline between the different network layers, and as a result, we can improve the latency to process an image by feeding the CNN with batches of multiple images.

Patent
11 Aug 2017
TL;DR: In this article, a Storm task expansion scheduling algorithm based on data stream prediction is proposed, which belongs to the field of data exchange networks and is used to reduce inter-node network communication to the largest degree and guarantee the load balance of a cluster.
Abstract: The invention relates to a Storm task expansion scheduling algorithm based on data stream prediction, and belongs to the field of data exchange networks. Through a monitoring module, the real-time operation data of a Topology task submitted by a user is obtained, the degree of parallelism of a connected component in the Topology that satisfies the component load is solved, and the degrees of parallelism of all components in the Topology are then solved through iteration. A time-series model is used to predict the data volume that the Topology needs to process, the optimal degree of parallelism of the start-up component (spout) in the Topology under that situation is solved, the optimal degree of parallelism of each component in the Topology under the predicted condition is obtained, and scheduling is carried out. During scheduling, an on-line scheduling algorithm is used to reduce inter-node network communication as much as possible and guarantee the load balance of the cluster. The algorithm overcomes the deficiency that the relevance among the components in the Topology is not fully considered, makes up for the deficiency that the optimal degree of parallelism of each component in a user-submitted Topology cannot be solved quickly and efficiently, and has the advantages that changes are predicted in advance, throughput is improved and processing delay is lowered.

Book ChapterDOI
23 Oct 2017
TL;DR: DBCA is proposed, which defines the concept of dependency correlation to measure the similarity between tasks in terms of data dependencies, and shows that it significantly reduced the execution time of the whole workflow.
Abstract: Scientific workflow applications consist of many fine-grained computational tasks with dependencies, whose runtimes vary widely. When executing these fine-grained tasks in a cloud computing environment, significant scheduling overheads are generated. Task clustering is a key technique for reducing scheduling overhead and optimizing workflow execution time. Unfortunately, attempts at task clustering often cause runtime and dependency imbalance. However, the existing task clustering strategies mainly focus on how to avoid runtime imbalance, and rarely deal with the data dependencies between tasks. Without considering data dependencies, task clustering leads to a poor degree of parallelism during task execution because of the data locality it introduces. In order to address the problem of dependency imbalance, we propose the Dependency Balance Clustering Algorithm (DBCA), which defines the concept of dependency correlation to measure the similarity between tasks in terms of their data dependencies. Tasks with high dependency correlation are clustered together so as to avoid dependency imbalance. We conducted experiments on the WorkflowSim platform and compared our method with an existing task clustering method. The results show that it significantly reduces the execution time of the whole workflow.
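The paper's exact definition of dependency correlation is not reproduced here, so the sketch below uses Jaccard similarity of each task's parent set as a stand-in measure and clusters the most similar pair first; the task names and dependencies are made up for illustration.

```python
def dependency_correlation(parents_a: set, parents_b: set) -> float:
    """Stand-in for DBCA's dependency correlation: Jaccard similarity of the
    sets of tasks each task depends on (1.0 means identical dependencies)."""
    if not parents_a and not parents_b:
        return 1.0
    return len(parents_a & parents_b) / len(parents_a | parents_b)

# task -> set of tasks it depends on (a tiny workflow fragment)
deps = {"t1": {"a", "b"}, "t2": {"a", "b"}, "t3": {"c"}, "t4": {"b", "c"}}

# Greedily cluster the most dependency-similar pair first
pairs = sorted(
    ((dependency_correlation(deps[x], deps[y]), x, y)
     for x in deps for y in deps if x < y),
    reverse=True)
print(pairs[0])   # t1 and t2 share all parents, so they are clustered together
```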

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper adopts a three-layer wavelet neural network structure with real-valued coding and combines it with a genetic algorithm, which gives the whole control system good local characteristics and learning and classification ability, and allows it to be optimized quickly.
Abstract: The wavelet neural network has a slow convergence rate and weak global search capability and easily gets trapped in local minima, while the genetic algorithm offers a high degree of parallelism, randomness, adaptive search and global optimization. The wavelet neural network is transformed to obtain a discretized wavelet neural network. In this paper, a three-layer wavelet neural network structure with real-valued coding is adopted and combined with the genetic algorithm, which gives the whole control system good local characteristics and learning and classification ability, and allows it to be optimized quickly. The simulation, comparing the obtained curves for the two inverted pendulums, shows that the design is feasible.

Journal ArticleDOI
TL;DR: This paper proposes an implementation-oriented construction scheme for NQC-LDPC codes to avoid memory-access conflict in the partly parallel decoder and proposes a Modified Overlapped Message-Passing (MOMP) algorithm that doubles the hardware utilization efficiency and supports a higher degree of parallelism than that used in the Overlapping Message Passing technique proposed in previous works.

Proceedings ArticleDOI
05 Jun 2017
TL;DR: This paper proposes an approximate analytical model that can guide cloud operators to determine a proper setting, such as the number of servers, the buffer size and the degree of parallelism, for achieving specific performance levels.
Abstract: Performance analysis is crucial to the successful development of the cloud computing paradigm. It is especially important for a cloud computing center serving parallelizable application jobs, since determining a proper degree of parallelism can reduce the mean service response time and thus noticeably improve the performance of cloud computing. In this paper, taking a cloud-based rendering service platform as an example application, we propose an approximate analytical model for cloud computing centers serving parallelizable jobs using M/M/c/r queuing systems, by modeling the rendering service platform as a multi-station multi-server system. We solve the proposed analytical model to obtain a complete probability distribution of response time, the blocking probability and other important performance metrics for given cloud system settings. Thus this model can guide cloud operators to determine a proper setting, such as the number of servers, the buffer size and the degree of parallelism, for achieving specific performance levels. Through extensive simulations based on both synthetic data and real-world workload traces, we show that our proposed analytical model can provide approximate performance prediction results for cloud computing centers serving parallelizable jobs, even when job arrivals follow different distributions.
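As a concrete instance of the quantities such a model yields, the standard single-station M/M/c/K solution below computes the blocking probability and mean response time from the arrival rate, per-server rate, number of servers and total capacity (K plays the role of r, taken here to include the servers); the paper's multi-station extension and its parameter values are not reproduced, so the numbers are illustrative.

```python
from math import factorial

def mmck_metrics(lam, mu, c, K):
    """M/M/c/K queue: returns (blocking probability, mean response time).
    K is the total capacity (servers + buffer), so arrivals are lost when n == K."""
    a = lam / mu
    unnorm = [a ** n / factorial(n) if n <= c
              else a ** n / (factorial(c) * c ** (n - c))
              for n in range(K + 1)]
    p0 = 1.0 / sum(unnorm)
    p = [p0 * x for x in unnorm]
    blocking = p[K]
    L = sum(n * pn for n, pn in enumerate(p))      # mean number in system
    W = L / (lam * (1.0 - blocking))               # Little's law on admitted jobs
    return blocking, W

# e.g. 8 servers, buffer space for 8 more jobs, offered load a = 6
print(mmck_metrics(lam=6.0, mu=1.0, c=8, K=16))
```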

Posted Content
05 Apr 2017
TL;DR: Posterior Snapshot Isolation is introduced, an SI mechanism that allows transactions to determine their timestamps autonomously, without relying on centralized coordination, and can scale well, rendering it suitable for various multi-core and MPP platforms.
Abstract: Snapshot Isolation (SI) is a widely adopted concurrency control mechanism in database systems, which utilizes timestamps to resolve conflicts between transactions. However, centralized allocation of timestamps is a potential bottleneck for parallel transaction management. This bottleneck is becoming increasingly visible with the rapidly growing degree of parallelism of today's computing platforms. This paper introduces Posterior Snapshot Isolation (PostSI), an SI mechanism that allows transactions to determine their timestamps autonomously, without relying on centralized coordination. As such, PostSI can scale well, rendering it suitable for various multi-core and MPP platforms. Extensive experiments are conducted to demonstrate its advantage over existing approaches.