
Showing papers on "Degree of parallelism" published in 2014


Journal ArticleDOI
TL;DR: This paper analyzes the ME structure in HEVC and proposes a parallel framework to decouple ME for different partitions on many-core processors and achieves more than 30 and 40 times speedup for 1920 × 1080 and 2560 × 1600 video sequences, respectively.
Abstract: High Efficiency Video Coding (HEVC) provides superior coding efficiency than previous video coding standards at the cost of increasing encoding complexity. The complexity increase of motion estimation (ME) procedure is rather significant, especially when considering the complicated partitioning structure of HEVC. To fully exploit the coding efficiency brought by HEVC requires a huge amount of computations. In this paper, we analyze the ME structure in HEVC and propose a parallel framework to decouple ME for different partitions on many-core processors. Based on local parallel method (LPM), we first use the directed acyclic graph (DAG)-based order to parallelize coding tree units (CTUs) and adopt improved LPM (ILPM) within each CTU (DAGILPM), which exploits the CTU-level and prediction unit (PU)-level parallelism. Then, we find that there exist completely independent PUs (CIPUs) and partially independent PUs (PIPUs). When the degree of parallelism (DP) is smaller than the maximum DP of DAGILPM, we process the CIPUs and PIPUs, which further increases the DP. The data dependencies and coding efficiency stay the same as LPM. Experiments show that on a 64-core system, compared with serial execution, our proposed scheme achieves more than 30 and 40 times speedup for 1920 × 1080 and 2560 × 1600 video sequences, respectively.
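
As a rough illustration of DAG-ordered CTU-level parallelism, the sketch below assumes the common wavefront dependency in which each CTU waits only on its left and top neighbours, so all CTUs on one anti-diagonal can be processed concurrently. This is a simplified stand-in for the paper's DAGILPM scheme, and motion_estimate_ctu is a hypothetical placeholder for the per-CTU work.

```python
# Sketch only: wavefront-style DAG ordering of CTUs. Each anti-diagonal forms
# one batch of mutually independent CTUs that can be dispatched in parallel.
from concurrent.futures import ThreadPoolExecutor

def motion_estimate_ctu(x, y):
    # Hypothetical placeholder for the per-CTU motion-estimation work.
    return (x, y)

def process_ctus_wavefront(ctu_cols, ctu_rows, workers=64):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for diag in range(ctu_cols + ctu_rows - 1):
            batch = [(x, diag - x) for x in range(ctu_cols)
                     if 0 <= diag - x < ctu_rows]
            # All CTUs in `batch` are independent under left/top dependencies.
            list(pool.map(lambda xy: motion_estimate_ctu(*xy), batch))
```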

366 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: The morsel-driven query execution framework is presented, where scheduling becomes a fine-grained run-time task that is NUMA-aware and the degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload.
Abstract: With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for plan-driven parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA). In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (morsels) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.
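
A minimal sketch of the morsel idea, not HyPer's actual NUMA-aware dispatcher: workers repeatedly pull small input fragments from a shared dispatcher, so the number of active workers, and hence the degree of parallelism, can change between morsels instead of being fixed in the plan.

```python
# Illustrative morsel-style dispatch: each worker pulls one small fragment at
# a time and runs the whole operator pipeline on it until no morsels remain.
import queue
import threading

def run_pipeline(rows, process_morsel, morsel_size=10_000, workers=8):
    morsels = queue.Queue()
    for i in range(0, len(rows), morsel_size):
        morsels.put(rows[i:i + morsel_size])

    def worker():
        while True:
            try:
                morsel = morsels.get_nowait()
            except queue.Empty:
                return                  # no work left for this thread
            process_morsel(morsel)      # entire pipeline up to the next breaker

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```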

243 citations


Book ChapterDOI
13 Sep 2014
TL;DR: A new black-box complexity model for search algorithms evaluating λ search points in parallel that captures the inertia caused by offspring populations in evolutionary algorithms and the total computational effort in parallel metaheuristics is proposed.
Abstract: We propose a new black-box complexity model for search algorithms evaluating λ search points in parallel. The parallel unbiased black-box complexity gives lower bounds on the number of function evaluations every parallel unbiased black-box algorithm needs to optimise a given problem. It captures the inertia caused by offspring populations in evolutionary algorithms and the total computational effort in parallel metaheuristics. Our model applies to all unary variation operators such as mutation or local search. We present lower bounds for the LeadingOnes function and a general lower bound for all functions with a unique optimum that depends on the problem size and the degree of parallelism, λ. The latter is tight for OneMax; we prove that a (1+λ) EA with adaptive mutation rates is an optimal parallel unbiased black-box algorithm.
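
For concreteness, a textbook (1+λ) EA on OneMax is sketched below: each generation costs λ fitness evaluations that can run in parallel, which is exactly the quantity the parallel black-box bounds count. The fixed 1/n mutation rate is the standard choice, not the adaptive scheme the paper proves optimal.

```python
# Sketch of a (1+lambda) EA on OneMax with standard bit mutation at rate 1/n.
import random

def one_max(x):
    return sum(x)

def one_plus_lambda_ea(n=100, lam=8, generations=2000):
    parent = [random.randint(0, 1) for _ in range(n)]
    for _ in range(generations):
        # lambda offspring = one round of (parallelisable) function evaluations
        offspring = [[bit ^ (random.random() < 1.0 / n) for bit in parent]
                     for _ in range(lam)]
        best = max(offspring, key=one_max)
        if one_max(best) >= one_max(parent):
            parent = best
        if one_max(parent) == n:
            break
    return parent
```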

85 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: The Mean Workload and Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.
Abstract: Graphics Processing Units (GPUs) offer high computational power but impose a high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, the number of registers per thread, and processor occupancy, all of which influence the sensitivity of the parallel application. An analysis of how the overall GPU radiation sensitivity depends on the code DOP is provided and the most reliable configuration is experimentally identified. Finally, modifying the parallelism management affects not only the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete the computation. The Mean Workload Between Failures and Mean Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.
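
The abstract does not spell the two metrics out; one natural formalization, an assumption on my part in the spirit of mean-time-between-failures figures rather than a quote from the paper, is:

```latex
% Assumed formalization (not quoted from the paper): t_exec is the run time of
% one execution and W_exec the useful data processed per execution.
\mathrm{MEBF} = \frac{\mathrm{MTBF}}{t_{\mathrm{exec}}}, \qquad
\mathrm{MWBF} = \mathrm{MEBF} \cdot W_{\mathrm{exec}}
```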

72 citations


Proceedings Article
08 Dec 2014
TL;DR: This paper proposes a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two-loop recursion and greatly improves computation efficiency with a great degree of parallelism.
Abstract: L-BFGS has been applied as an effective parameter estimation method for various machine learning algorithms since 1980s. With an increasing demand to deal with massive instances and variables, it is important to scale up and parallelize L-BFGS effectively in a distributed system. In this paper, we study the problem of parallelizing the L-BFGS algorithm in large clusters of tens of thousands of shared-nothing commodity machines. First, we show that a naive implementation of L-BFGS using Map-Reduce requires either a significant amount of memory or a large number of map-reduce steps with negative performance impact. Second, we propose a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two loop recursion and greatly improves computation efficiency with a great degree of parallelism. The algorithm scales very well and enables a variety of machine learning algorithms to handle a massive number of variables over large datasets. We prove the mathematical equivalence of the new Vector-free L-BFGS and demonstrate its excellent performance and scalability using real-world machine learning problems with billions of variables in production clusters.
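
For context, the classical two-loop recursion whose vector operations the paper reformulates is sketched below; this is the standard textbook algorithm, not the Vector-free variant. Each call performs O(m) dot products over full-length vectors, which is what becomes expensive once the vectors are sharded across machines.

```python
# Classical L-BFGS two-loop recursion (textbook form); the np.dot calls over
# full-length vectors are the operations Vector-free L-BFGS reformulates.
import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    """s_list/y_list hold the m most recent parameter/gradient differences,
    newest last; returns the approximate inverse-Hessian times grad."""
    q = grad.copy()
    stack = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        stack.append((alpha, rho, s, y))
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    r = gamma * q
    for alpha, rho, s, y in reversed(stack):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r  # the search direction is typically -r
```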

56 citations


Journal ArticleDOI
TL;DR: A multi-GPU version of GPUSPH, a CUDA implementation of fluid-dynamics models based on the smoothed particle hydrodynamics (SPH) numerical method, is presented and the Karp-Flatt metric is used to formally estimate the overall efficiency of the parallelization.
Abstract: We present a multi-GPU version of GPUSPH, a CUDA implementation of fluid-dynamics models based on the smoothed particle hydrodynamics (SPH) numerical method. The SPH is a well-known Lagrangian model for the simulation of free-surface fluid flows; it exposes a high degree of parallelism and has already been successfully ported to GPU. We extend the GPU-based simulator to run simulations on multiple GPUs simultaneously, to obtain a gain in speed and overcome the memory limitations of using a single device. The computational domain is spatially split with minimal overlapping and shared volume slices are updated at every iteration of the simulation. Data transfers are asynchronous with computations, thus completely covering the overhead introduced by slice exchange. A simple yet effective load balancing policy preserves the performance in case of unbalanced simulations due to asymmetric fluid topologies. The obtained speedup factor (up to 4.5x for 6 GPUs) closely follows the expected one (5x for 6 GPUs) and it is possible to run simulations with a higher number of particles than would fit on a single device. We use the Karp-Flatt metric to formally estimate the overall efficiency of the parallelization.
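
For reference, the Karp-Flatt metric mentioned above estimates the experimentally determined serial fraction e from the measured speedup ψ on p processors; plugging in the reported 4.5x speedup on 6 GPUs gives e ≈ 0.07.

```latex
e = \frac{\frac{1}{\psi} - \frac{1}{p}}{1 - \frac{1}{p}}
```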

54 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: This work presents a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs and compares the performance of the automatically selected mappings to hand-optimized implementations on multiple benchmarks and shows that the average performance gap on 7 out of 8 benchmarks is 24%.
Abstract: Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.

46 citations


Journal ArticleDOI
TL;DR: A new SISO-decoder architecture is proposed that leads to significant throughput gains and better hardware efficiency compared to existing architectures for the full range of code rates.
Abstract: Turbo decoders for modern wireless communication systems have to support high throughput over a wide range of code rates. In order to support the peak throughputs specified by modern standards, parallel turbo-decoding has become a necessity, rendering the corresponding VLSI implementation a highly challenging task. In this paper, we explore the implementation trade-offs of parallel turbo decoders based on sliding-window soft-input soft-output (SISO) maximum a-posteriori (MAP) component decoders. We first introduce a new approach that allows for a systematic throughput comparison between different SISO-decoder architectures, taking their individual trade-offs in terms of window length, error-rate performance and throughput into account. A corresponding analysis of existing architectures clearly shows that the latency of the sliding-window SISO decoders causes diminishing throughput gains with increasing degree of parallelism. In order to alleviate this parallel turbo-decoder predicament, we propose a new SISO-decoder architecture that leads to significant throughput gains and better hardware efficiency compared to existing architectures for the full range of code rates.

39 citations


Journal ArticleDOI
TL;DR: A detailed guide is presented, with emphasis on the design and outline of each module of an FCS-MPC implementation in an FPGA XC3S3500E to control a power cell of a cascaded half bridge (CHB) converter, in order to reduce development times for researchers in the area of power converters.
Abstract: In the last decades, finite control set-model predictive control (FCS-MPC) has been extensively investigated. This control strategy uses the system model to determine the optimal input that minimizes a predefined cost function. This procedure is performed by evaluating the admissible inputs of the converter model. In order to achieve a proper performance of the control strategy, a small sampling time must be used. In practice, the computational effort and processing time are critical factors that depend directly on the number of admissible states of the topology. With the objective of reducing the execution time needed by the algorithm, it is possible to take advantage of the high degree of parallelism that can be obtained from a field-programmable gate array (FPGA). This paper deals with the implementation of FCS-MPC in an FPGA XC3S3500E to control a power cell of a cascaded half bridge (CHB) converter. The design is divided into subparts, called modules, which can be projected to control multiple cells using only one chip. In addition, with the aim of restricting the use of hardware to the available resources, the document presents a series of considerations related to the algorithm and its implementation. This paper presents a detailed guide with emphasis on the design and outline of each module in order to reduce development times for researchers in the area of power converters.
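
To make the exhaustive evaluation concrete, a generic FCS-MPC selection step is sketched below; this illustrates the general strategy, not the paper's FPGA modules, and predict and cost stand in for the converter model and the predefined cost function.

```python
# Generic FCS-MPC step: evaluate every admissible converter state with the
# prediction model and apply the one that minimizes the cost function. On an
# FPGA these per-state evaluations can be carried out in parallel.
def fcs_mpc_step(x_now, reference, admissible_states, predict, cost):
    best_state, best_cost = None, float("inf")
    for u in admissible_states:
        x_next = predict(x_now, u)      # one-step model prediction
        c = cost(x_next, reference)     # predefined cost function
        if c < best_cost:
            best_state, best_cost = u, c
    return best_state                   # switching state applied this period
```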

36 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: This work is using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems, and uses task superscalar concepts to enable the developer to write serial code while providing parallel execution.
Abstract: Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.

35 citations


Patent
04 Jun 2014
TL;DR: In this paper, a system and method for processing multi-core parallel pipeline signals of a 4G broadband communication system based on a GPP is presented, where a large volume of data and computing tasks are divided into reasonable granularities by a scheduler and distributed to pipelines of all levels to be processed respectively.
Abstract: The invention discloses a system and method for processing multi-core parallel pipeline signals of a 4G broadband communication system based on a GPP. In order to meet the strict real-time requirements of the 4G communication system, the cloud computing idea is adopted: the GPP serves as the computing resource, communication data are processed in a GPP-based pipeline processing mode, and a large volume of data and computing tasks are divided into reasonable granularities by a scheduler and distributed to pipelines of all levels to be processed respectively. Under limited hardware performance, the pipeline mode can more easily meet the real-time requirement; meanwhile, a time margin is introduced so that the system can tolerate large delay variations. Computing resources are fully utilized through reasonable dispatching. In total, three kinds of pipelines are designed: one is suitable for processing large data volumes with high reliability, another is suitable for processing small data volumes with high speed and flexibility, and the third is a composite pipeline, based on the two application scenarios, with a high degree of parallelism and markedly improved performance.

Journal ArticleDOI
TL;DR: The generalized multiprocessor periodic resource model (GMPR) is proposed, which is strictly superior to the MPR model without requiring an overly detailed description, and a method to compute the interface from the application specification is derived.
Abstract: Composition is a practice of key importance in software engineering. When real-time applications are composed, it is necessary that their timing properties (such as meeting the deadlines) are guaranteed. The composition is performed by establishing an interface between the application and the physical platform. Such an interface typically contains information about the amount of computing capacity needed by the application. For multiprocessor platforms, the interface should also present information about the degree of parallelism. Several interface proposals have recently been put forward in various research works. However, those interfaces are either too complex to be handled or too pessimistic. In this paper we propose the generalized multiprocessor periodic resource model (GMPR), which is strictly superior to the MPR model without requiring an overly detailed description. We then derive a method to compute the interface from the application specification. This method has been implemented in Matlab routines that are publicly available.

Book ChapterDOI
25 Aug 2014
TL;DR: A performance assessment of a massively parallel and portable Lattice Boltzmann code, based on the Open Computing Language (OpenCL) and the Message Passing Interface (MPI), and techniques to move data between accelerators minimizing overheads of communication latencies are presented.
Abstract: High performance computing increasingly relies on heterogeneous systems, based on multi-core CPUs, tightly coupled to accelerators: GPUs or many core systems. Programming heterogeneous systems raises new issues: reaching high sustained performances means that one must exploit parallelism at several levels; at the same time the lack of a standard programming environment has an impact on code portability. This paper presents a performance assessment of a massively parallel and portable Lattice Boltzmann code, based on the Open Computing Language (OpenCL) and the Message Passing Interface (MPI). Exactly the same code runs on standard clusters of multi-core CPUs, as well as on hybrid clusters including accelerators. We consider a state-of-the-art Lattice Boltzmann model that accurately reproduces the thermo-hydrodynamics of a fluid in 2 dimensions. This algorithm has a regular structure suitable for accelerator architectures with a large degree of parallelism, but it is not straightforward to obtain a large fraction of the theoretically available performance. In this work we focus on portability of code across several heterogeneous architectures preserving performances and also on techniques to move data between accelerators minimizing overheads of communication latencies. We describe the organization of the code and present and analyze performance and scalability results on a cluster of nodes based on NVIDIA K20 GPUs and Intel Xeon-Phi accelerators.

Journal ArticleDOI
01 Oct 2014
TL;DR: Empirical studies with the above and three existing proposals conducted on modern processors show that the proposals scale near-linearly with the number of hardware threads and thus are able to benefit from increasing on-chip parallelism.
Abstract: The efficient processing of workloads that interleave moving-object updates and queries is challenging. In addition to the conflicting needs for update-efficient versus query-efficient data structures, the increasing parallel capabilities of multi-core processors yield challenges. To prevent concurrency anomalies and to ensure correct system behavior, conflicting update and query operations must be serialized. In this setting, it is a key concern to avoid that operations are blocked, which leaves processing cores idle. To enable efficient processing, we first examine concurrency degrees from traditional transaction processing in the context of our target domain and propose new semantics that enable a high degree of parallelism and ensure up-to-date query results. We define the new semantics for range and k-nearest neighbor queries. Then, we present a main-memory indexing technique called parallel grid that implements the proposed semantics as well as two other variants supporting different semantics. This enables us to quantify the effects that different degrees of consistency have on performance. We also present an alternative time-partitioning approach. Empirical studies with the above and three existing proposals conducted on modern processors show that our proposals scale near-linearly with the number of hardware threads and thus are able to benefit from increasing on-chip parallelism.

Journal ArticleDOI
01 Jan 2014
TL;DR: This paper empirically studies how the energy efficiency of a map-reduce job varies with increasing parallelism and network bandwidth on an HPC cluster, and suggests strategies for configuring the degree of parallelism, network bandwidth and power management features in an HPC cluster for energy-efficient execution of map-reduce jobs.
Abstract: The Map-Reduce programming model is commonly used for efficient scientific computations, as it executes tasks in parallel and in a distributed manner on large data volumes. HPC infrastructure can effectively increase the parallelism of map-reduce tasks. However, such an execution will incur high energy and data transmission costs. Here we empirically study how the energy efficiency of a map-reduce job varies with increasing parallelism and network bandwidth on an HPC cluster. We also investigate the effectiveness of power-aware systems in managing the energy consumption of different types of map-reduce jobs. We find that for some jobs the energy efficiency degrades at a high degree of parallelism, and for some it improves at low CPU frequency. Consequently, we suggest strategies for configuring the degree of parallelism, network bandwidth and power management features in an HPC cluster for energy-efficient execution of map-reduce jobs.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: To overcome NTC barriers, Accordion is introduced, a novel, light-weight framework, which exploits weak scaling along with inherent fault tolerance of emerging R(ecognition), M(ining), S(ynthesis) applications.
Abstract: While more cores can find place in the unit chip area every technology generation, excessive growth in power density prevents simultaneous utilization of all. Due to the lower operating voltage, Near-Threshold Voltage Computing (NTC) promises to fit more cores in a given power envelope. Yet NTC prospects for energy efficiency disappear without mitigating (i) the performance degradation due to the lower operating frequency; (ii) the intensified vulnerability to parametric variation. To compensate for the first barrier, we need to raise the degree of parallelism - the number of cores engaged in computation. NTC-prompted power savings dominate the power cost of increasing the core count. Hence, limited parallelism in the application domain constitutes the critical barrier to engaging more cores in computation. To avoid the second barrier, the system should tolerate variation-induced errors. Unfortunately, engaging more cores in computation exacerbates vulnerability to variation further. To overcome NTC barriers, we introduce Accordion, a novel, light-weight framework, which exploits weak scaling along with inherent fault tolerance of emerging R(ecognition), M(ining), S(ynthesis) applications. The key observation is that the problem size not only dictates the number of cores engaged in computation, but also the application output quality. Consequently, Accordion designates the problem size as the main knob to trade off the degree of parallelism (i.e. the number of cores engaged in computation), with the degree of vulnerability to variation (i.e. the corruption in application output quality due to variation-induced errors). Parametric variation renders ample reliability differences between the cores. Since RMS applications can tolerate faults emanating from data-intensive program phases as opposed to control, variation-afflicted Accordion hardware executes fault-tolerant data-intensive phases on error-prone cores, and reserves reliable cores for control.

Proceedings ArticleDOI
05 Jan 2014
TL;DR: A novel Givens Rotation (GR) based QRD (GR-QRD) where the computational complexity of GR is reduced and the algorithm is implemented on REDEFINE which is a Coarse Grained run-time Reconfigurable Architecture (CGRA).
Abstract: QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beamforming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR-QRD) where we reduce the computational complexity of GR and exploit a higher degree of parallelism. This low complexity Column-wise GR (CGR) can annihilate multiple elements of a column of a matrix simultaneously. The algorithm is first realized on a Two-Dimensional (2D) systolic array and then implemented on REDEFINE which is a Coarse Grained run-time Reconfigurable Architecture (CGRA). We benchmark the proposed implementation against state-of-the-art implementations to report better throughput, convergence and scalability.
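
For background, a standard 2x2 Givens rotation annihilates a single element at a time; the column-wise variant proposed in the paper generalizes this so that several elements of a column are annihilated simultaneously:

```latex
G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \quad
c = \frac{a}{\sqrt{a^{2}+b^{2}}}, \; s = \frac{b}{\sqrt{a^{2}+b^{2}}}, \qquad
G \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sqrt{a^{2}+b^{2}} \\ 0 \end{pmatrix}
```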

Patent
19 Feb 2014
TL;DR: In this paper, a data placement method based on a distributed cluster is proposed, where load balancing of data placement can be achieved, and the degree of parallelism is improved when data read-write is carried out.
Abstract: The invention discloses a data placement method based on a distributed cluster. In order to solve the problem that the loading condition, the computing power of a computational node and movement of mass data can have an influence on operational performance, the three factors are effectively combined to compute an evaluation value of data placement, and then a node is selected according to the evaluation value. The data placement method based on the distributed cluster has the advantages that load balancing of data placement can be achieved, and the degree of parallelism is improved when data read-write is carried out; the computing power of the node can be well used, corresponding computation tasks are distributed according to the computing power, and the time of operation is reduced; good transmission performance is achieved, data are stored in the nearby computational node, data transmission can be minimized, and efficiency is improved.

Proceedings ArticleDOI
13 Apr 2014
TL;DR: A new 3D stochastic transformation called DOZEN, inspired by the AES cipher, and two new constructions of S-box, called 2D and 3D S-boxes respectively, are presented, offering a solution to information security problems.
Abstract: This paper describes new three-dimensional algorithms of stochastic data transformation, offering a solution to information security problems. The most important feature of these algorithms is a high degree of parallelism at the level of elementary operations. In this paper we present a new 3D stochastic transformation called DOZEN, inspired by the AES cipher, and two new constructions of S-box, called 2D and 3D S-boxes respectively.

Patent
24 Dec 2014
TL;DR: In this paper, a two-stage scheduling method of parallel test tasks facing a spacecraft automation test is presented, where the constraint relation among a plurality of test tasks is quickly established, the independence between the test tasks are analyzed, the degree of parallelism of the test task is increased, the optimal scheduling of the tasks on the equipment is realized when constraint conditions are satisfied, and the test efficiency is improved.
Abstract: The invention relates to a two-stage scheduling method of parallel test tasks facing a spacecraft automation test, which belongs to the field of parallel tests. The method comprises the following stages: in the first stage, the test tasks, task instructions and tested parameters are analyzed and determined, a constraint relation between the tasks is defined, a time sequence constraint matrix and a parameter competitive relation matrix are established, the tasks and the constraint relation between the tasks are changed into undirected graphs, a parallel task scheduling problem is changed into a minimum coloring problem in the sequence of the tops of the graphs, a method based on the combination of a particle swarm and simulated annealing is used for solving, and then a test task group with the maximal degree of parallelism is obtained; in the second stage, the obtained test task group with the maximal degree of parallelism is distributed on limited test equipment, and then an optimal scheduling scheme is obtained. According to the two-stage scheduling method, the constraint relation among a plurality of test tasks is quickly established, the independence between the test tasks is analyzed, the degree of parallelism of the test tasks is increased, the optimal scheduling of the tasks on the equipment is realized when constraint conditions are satisfied, and the test efficiency is improved.

21 May 2014
TL;DR: A comparative experimental analysis shows that the resulting procedure is competitive with other state-of-the-art conformant planners on domains with a high degree of parallelism.
Abstract: Planning as satisfiability is an efficient technique for classical planning. In previous work by the second author, this approach has been extended to conformant planning, that is, to planning domains having incomplete information about the initial state and/or the effects of actions. In this paper we present some domain independent optimizations to the basic procedure described in the previous work. A comparative experimental analysis shows that the resulting procedure is competitive with other state-of-the-art conformant planners on domains with a high degree of parallelism.

Proceedings ArticleDOI
27 Jun 2014
TL;DR: This paper presents a formal model for BPMN processes in terms of Labelled Transition Systems, which are obtained through process algebra encodings and proposes an approach for automatically computing the degree of parallelism by using model checking techniques and dichotomic search.
Abstract: A business process is a set of structured, related activities that aims at fulfilling a specific organizational goal for a customer or market. An important metric when developing a business process is its degree of parallelism, i.e., the maximum number of tasks that are executable in parallel in that process. The degree of parallelism determines the peak demand on tasks, providing a valuable guide for the problem of resource allocation in business processes. In this paper, we investigate how to automatically measure the degree of parallelism for business processes, described using the BPMN standard notation. We first present a formal model for BPMN processes in terms of Labelled Transition Systems, which are obtained through process algebra encodings. We then propose an approach for automatically computing the degree of parallelism by using model checking techniques and dichotomic search. We implemented a tool for automating this check and we applied it successfully to more than one hundred BPMN processes.
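
The dichotomic search can be sketched as follows (my own illustration of the idea, not the authors' tool): a model-checking query answers whether some reachable state of the LTS runs more than k tasks in parallel, and a binary search over k locates the maximum.

```python
# Binary (dichotomic) search for the degree of parallelism, given a
# hypothetical oracle can_exceed(k) that model-checks whether more than k
# tasks can ever run in parallel; max_tasks bounds the search (e.g. the
# number of tasks in the BPMN process).
def degree_of_parallelism(can_exceed, max_tasks):
    lo, hi = 0, max_tasks
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_exceed(mid - 1):   # some state reaches at least `mid` parallel tasks
            lo = mid
        else:
            hi = mid - 1
    return lo
```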

Patent
19 Feb 2014
TL;DR: In this article, the authors propose a distributed data stream processing method and system, which includes the steps that the degree of parallelism corresponding to a designated operation is determined through the receiving rate of target logical tasks and the processing rate in logical tasks received by working nodes from a main node.
Abstract: The invention provides a distributed data stream processing method and system The method includes the steps that the degree of parallelism corresponding to a designated operation is determined through the receiving rate of target logical tasks and the processing rate in logical tasks received by working nodes from a main node, wherein the receiving rate is used for indication of conducting the designed operation, and the designated operation is conducted on the target logical tasks at the processing rate; physical tasks are acquired by integrating the target logical tasks according to the degree of parallelism, the number of the physical tasks is the degree of parallelism, and the designated operation is executed on the physical tasks in parallel The degrees of parallelism of operations are dynamically determined according to the receiving rate of the logical tasks and the processing rate of the logical tasks, and therefore the technical problems that in the prior art, system resources are wasted or data streams are delayed due to the fact that the fixed degrees of parallelism can not adapt to the time-varying characteristics of the data streams and external load change are solved
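
Read literally, the rule amounts to running enough parallel instances for the aggregate processing rate to keep up with the arrival rate. The sketch below is my reading of the abstract, not the patent's exact formula.

```python
# Hypothetical rate-based rule: degree of parallelism = ceil(receive / process),
# optionally capped by the resources available on the working nodes.
import math

def degree_of_parallelism(receive_rate, process_rate, max_parallelism=None):
    dop = max(1, math.ceil(receive_rate / process_rate))
    if max_parallelism is not None:
        dop = min(dop, max_parallelism)
    return dop
```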

Proceedings ArticleDOI
12 Oct 2014
TL;DR: AdaPNet is introduced, a run-time system to execute streaming applications, which are modeled as process networks, efficiently on platforms with dynamic resource allocation, and it outperforms comparable run-time systems, which do not adapt the degree of parallelism.
Abstract: A widely considered strategy to prevent interference issues on multi-processor systems is to isolate the execution of the individual applications by running each of them on a dedicated virtual guest machine. The amount of computing power available to a single application, however, depends on the other applications running on the system and may change over time. A promising approach to maximize the performance under such conditions is to adapt the application's degree of parallelism when the resources allocated to the application are changed. This enables an application to exploit no more parallelism than required, thereby reducing inter-process communication and scheduling overheads. In this paper, we introduce AdaPNet, a run-time system to execute streaming applications, which are modeled as process networks, efficiently on platforms with dynamic resource allocation. AdaPNet responds to changes in the available resources by first calculating a process network that maximizes the performance of the application on the new resources. Then, AdaPNet transparently transforms the application into the alternative network without discarding the program state. Targeting two many-core systems, we demonstrate that AdaPNet outperforms comparable run-time systems, which do not adapt the degree of parallelism, in terms of speed-up and memory usage.

Proceedings ArticleDOI
28 Jan 2014
TL;DR: This paper uses an estimated MVP on GPU and the accurate MVP to refine the motion vector on CPU to overcome the constraint from MVP and presents a high quality H.265/HEVC motion estimation implementation with the cooperation of CPU and GPU.
Abstract: This paper presents a high quality H.265/HEVC motion estimation implementation with the cooperation of CPU and GPU. The data dependency from the MVP (Motion Vector Predictor) restricts the degree of parallelism on the GPU. To overcome the constraint from the MVP, we propose to use an estimated MVP on the GPU and the accurate MVP to refine the motion vector on the CPU. The GPU fully utilizes its tremendous parallel computing ability without the restriction from the MVP. The CPU makes up for the deviation from the GPU with a small-range refinement. Encoding speed benefits from the high degree of parallelism and compression performance is maintained by the CPU refinement. Experimental results show that the speedup reaches 2.39 times and 32.77 times in the whole x265 encoder with CPU SIMD (Single Instruction Multiple Data) on and off, respectively. On the other hand, the quality degradation is negligible, with only a 0.05% increase in BD-rate.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A pattern-sensitive partitioning model for data streams that is capable of achieving a high degree of parallelism for event patterns which formerly could only be consistently detected in a sequential manner or at a low parallelization degree is proposed.
Abstract: Complex Event Processing (CEP) systems enable applications to react to live-situations by detecting event patterns (complex events) in data streams. With the increasing number of data sources and the increasing volume at which data is produced, parallelization of event detection is becoming of tremendous importance to limit the time events need to be buffered before they actually can be processed by an event detector — named event processing operator. In this paper, we propose a pattern-sensitive partitioning model for data streams that is capable of achieving a high degree of parallelism for event patterns which formerly could only be consistently detected in a sequential manner or at a low parallelization degree. Moreover, we propose methods to dynamically adapt the parallelization degree to limit the buffering imposed on event detection in the presence of dynamic changes to the workload. Extensive evaluations of the system behavior show that the proposed partitioning model allows for a high degree of parallelism and that the proposed adaptation methods are able to meet the buffering level for event detection under high and dynamic workloads.

Proceedings ArticleDOI
11 Sep 2014
TL;DR: This paper proposes two different approaches to implement the filter operator for the Semantic Web database LUPOSDATE and shows that both approaches of the hardware-implemented filter operator defeat the comparable software solution written in C.
Abstract: In this paper, we investigate the use of Field Programmable Gate Arrays (FPGAs) to enhance the performance of filter expressions in Semantic Web databases. The filter operator is a central part of query evaluation. Its main objective is to reduce the amount of data as early as possible in order to reduce the calculation costs for succeeding and more complex operators such as join operators. Due to the proximity to the data source it is essential for the overall query performance that the filter operator is able to evaluate single data items as fast as possible. In this work, the advantages of using FPGAs in query evaluation are outlined and an overview of the provided degree of parallelism is given. We propose two different approaches to implement the filter operator for the Semantic Web database LUPOSDATE. The Fully-Parallel Filter evaluates all conditions by dividing the input into several sub-items which are evaluated by dedicated sub-filters in parallel. The second approach creates a pipeline of sub-filters to evaluate the filter expression step-by-step. If an item reaches the end of this pipeline, then it satisfies the whole filter expression. The final evaluation shows that both approaches of the hardware-implemented filter operator defeat the comparable software solution written in C running at 2.66 GHz. Processing 100M items per second, the hardware-accelerated filter running at 200 MHz provides a more than 5 times higher throughput than the general-purpose CPU. In contrast to the software solution, the total throughput is independent of the match rate and the structure of the filter expression, and is a valuable contribution to the hardware-accelerated query evaluation.

Journal ArticleDOI
06 Apr 2014
TL;DR: A research prototype is presented that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems, and is aptly capable of full utilization of a wide range of accelerator hardware.
Abstract: Hardware heterogeneity of HPC platforms is no longer considered unusual but has instead become the most viable way forward towards Exascale. In fact, the many heterogeneous resources available to modern computers are designed for different workloads, and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly, in order to use such GPU resources efficiently, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore CPUs. Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore CPUs and GPUs in multi-user environments, with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware.

Proceedings ArticleDOI
05 Feb 2014
TL;DR: The results hereby provided show that adaptivity is a strictly necessary requirement to reduce energy consumption in STM systems: without it, it is not possible to reach any acceptable level of energy efficiency at all.
Abstract: Energy efficiency is becoming a pressing issue, especially in large data centers where it entails, at the same time, a non-negligible management cost, an enhancement of hardware fault probability, and a significant environmental footprint. In this paper, we study how Software Transactional Memories (STM) can provide benefits on both power saving and the overall applications' execution performance. This is related to the fact that encapsulating shared-data accesses within transactions gives the freedom to the STM middleware to both ensure consistency and reduce the actual data contention, the latter having been shown to affect the overall power needed to complete the application's execution. We have selected a set of self-adaptive extensions to existing STM middlewares (namely, TinySTM and R-STM) to prove how self-adapting computation can better capture the actual degree of parallelism and/or logical contention on shared data, enhancing even more the intrinsic benefits provided by STM. Of course, this benefit comes at a cost, which is the actual execution time required by the proposed approaches to precisely tune the execution parameters for reducing power consumption and enhancing execution performance. Nevertheless, the results hereby provided show that adaptivity is a strictly necessary requirement to reduce energy consumption in STM systems: without it, it is not possible to reach any acceptable level of energy efficiency at all.

Journal ArticleDOI
TL;DR: The results based on the 250,000 Microblog items published by Sina show that the proposed parallel schema for Web services and the Benefit Ratio of Composite Services (BROCS) could effectively improve the efficiency of the composite Web service.
Abstract: The accessing, transferring and processing of data need to occur in parallel in order to tackle the problem brought on by the increasing volume of data on the Internet. In this paper, we put forward a parallel schema for data-intensive Web services and the Benefit Ratio of Composite Services (BROCS) to balance throughput and cost. Furthermore, we present a method to determine the degree of parallelism (DOP) based on the BROCS model to optimize the quality of the composite services. The experiment demonstrates how the DOP affects the benefit ratio of the composite service. Meanwhile, the results based on the 250,000 Microblog items published by Sina show that our proposed parallel schema for Web services can effectively improve the efficiency of composite Web services.