
Showing papers on "Degree of parallelism published in 2013"


Journal ArticleDOI
TL;DR: A systematic comparison between DSP and FPGA technologies, depicting the main advantages and drawbacks of each one is presented.
Abstract: Digital signal processors (DSPs) and field-programmable gate arrays (FPGAs) are predominant in the implementation of digital controllers and/or modulators for power converter applications. This paper presents a systematic comparison between these two technologies, depicting the main advantages and drawbacks of each one. Key programming and implementation aspects are addressed in order to give an overall idea of their most important features and allow the comparison between DSP and FPGA devices. A classical linear control strategy for a well-known voltage-source-converter (VSC)-based topology used as Static Compensator (STATCOM) is considered as a driving example to evaluate the performance of both approaches. A proof-of-concept laboratory prototype is separately controlled with the TMS320F2812 DSP and the Spartan-3 XCS1000 FPGA to illustrate the characteristics of both technologies. In the case of the DSP, a virtual floating-point library is used to accelerate the control routines compared to double precision arithmetic. On the other hand, two approaches are developed for the FPGA implementation, the first one reduces the hardware utilization and the second one reduces the computation time. Even though both boards can successfully control the STATCOM, results show that the FPGA achieves the best computation time thanks to the high degree of parallelism available on the device.

61 citations


Proceedings ArticleDOI
29 May 2013
TL;DR: This paper considers general-purpose multi-threaded applications with a varying degree of parallelism (DOP) that can be set at run-time, and proposes an accurate analytical model to predict the execution time of such applications on heterogeneous CMPs.
Abstract: In this paper, we propose an efficient iterative optimization-based approach for architectural synthesis of dark silicon heterogeneous chip multi-processors (CMPs). The goal is to determine the optimal number of cores of each type to provision the CMP with, such that the area and power budgets are met and the application performance is maximized. We consider general-purpose multi-threaded applications with a varying degree of parallelism (DOP) that can be set at run-time, and propose an accurate analytical model to predict the execution time of such applications on heterogeneous CMPs. Our experimental results illustrate that the synthesized heterogeneous dark silicon CMPs provide between 19% and 60% performance improvements over conventional homogeneous designs for variable and fixed DOP scenarios, respectively.

59 citations


Proceedings ArticleDOI
Myeongjae Jeon1, Yuxiong He2, Sameh Elnikety2, Alan L. Cox1, Scott Rixner1 
15 Apr 2013
TL;DR: The issues that make the parallelization of an individual query within a server challenging are described, and a parallelization approach is presented that effectively addresses these challenges and is implemented in Bing servers and evaluated experimentally with production workloads.
Abstract: A web search query made to Microsoft Bing is currently parallelized by distributing the query processing across many servers. Within each of these servers, the query is, however, processed sequentially. Although each server may be processing multiple queries concurrently, with modern multicore servers, parallelizing the processing of an individual query within the server may nonetheless improve the user's experience by reducing the response time. In this paper, we describe the issues that make the parallelization of an individual query within a server challenging, and we present a parallelization approach that effectively addresses these challenges. Since each server may be processing multiple queries concurrently, we also present an adaptive resource management algorithm that chooses the degree of parallelism at run-time for each query, taking into account system load and parallelization efficiency. As a result, the servers now execute queries with a high degree of parallelism at low loads, gracefully reduce the degree of parallelism with increased load, and choose sequential execution under high load. We have implemented our parallelization approach and adaptive resource management algorithm in Bing servers and evaluated them experimentally with production workloads. The experimental results show that the mean and 95th-percentile response times for queries are reduced by more than 50% under light or moderate load. Moreover, under high load where parallelization adversely degrades the system performance, the response times are kept the same as when queries are executed sequentially. In all cases, we observe no degradation in the relevance of the search results.
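
The load-adaptive behavior described above (full parallelism at low load, graceful reduction as load grows, sequential execution at high load) can be illustrated with a small sketch. The thresholds, the linear ramp, and the function name below are illustrative assumptions, not the algorithm used in Bing:

```python
def choose_degree_of_parallelism(system_load, max_dop=4,
                                 high_load=0.8, low_load=0.3):
    """Illustrative policy: high DOP at low load, sequential at high load.

    system_load is assumed to be a utilization estimate in [0, 1]; the real
    algorithm also weighs per-query parallelization efficiency.
    """
    if system_load >= high_load:
        return 1              # under high load, run the query sequentially
    if system_load <= low_load:
        return max_dop        # plenty of idle cores: parallelize fully
    # gracefully reduce parallelism as load grows between the two thresholds
    frac = (high_load - system_load) / (high_load - low_load)
    return max(1, round(1 + frac * (max_dop - 1)))
```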

50 citations


Proceedings ArticleDOI
10 Jun 2013
TL;DR: Experimental results show that the prototypes of ParallelismDial outperform the state-of-the-art approaches, on average, by 15% on time and 31% on energy efficiency in the dedicated environment, and by 19% and 21%, respectively, in the multiprogrammed environment.
Abstract: The ubiquity of parallel machines will necessitate time- and energy-efficient parallel execution of a program in a wide range of hardware and software environments. Prevalent parallel execution models can fail to be efficient. Unable to account for dynamic changes in operating conditions, they may create non-optimum parallelism, leading to underutilization or contention of resources. We propose ParallelismDial (PD), a model to dynamically, continuously and judiciously adapt a program's degree of parallelism to a given dynamic operating environment. PD uses a holistic metric to measure system-efficiency. The metric is used to systematically optimize the program's execution. We apply PD to two diverse parallel programming models: Intel TBB, an industry standard, and Prometheus, a recent research effort. Two prototypes of PD have been implemented. The prototypes are evaluated on two stock multicore workstations. Dedicated and multiprogrammed environments were considered. Experimental results show that the prototypes outperform the state-of-the-art approaches, on average, by 15% on time and 31% on energy efficiency, in the dedicated environment. In the multiprogrammed environment, the savings are to the tune of 19% and 21% in time and energy, respectively.
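
A minimal sketch of the "dial" idea follows; the hill-climbing loop only illustrates adapting the degree of parallelism against some efficiency score. The holistic system-efficiency metric and the actual adaptation policy are the paper's contribution and are not reproduced here; `run_with_dop` is a hypothetical callback.

```python
def tune_dop(run_with_dop, dop=4, min_dop=1, max_dop=32, steps=20):
    """Hill-climbing sketch in the spirit of ParallelismDial.

    run_with_dop(d) is assumed to execute a slice of the program with d
    threads and return a system-efficiency score (higher is better).
    """
    best = run_with_dop(dop)
    for _ in range(steps):
        # probe the two neighboring degrees of parallelism
        for candidate in (dop - 1, dop + 1):
            if min_dop <= candidate <= max_dop:
                score = run_with_dop(candidate)
                if score > best:
                    best, dop = score, candidate
    return dop
```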

47 citations


Journal ArticleDOI
TL;DR: In this article, an iterative method is presented by which a test can be dichotomized into parallel halves while ensuring maximum split-half reliability; no assumption is made regarding the form or availability of a reference test.
Abstract: The paper addresses an iterative method by which a test can be dichotomized into parallel halves that ensure maximum split-half reliability. The method assumes availability of data on scores of binary items. Since the aim was to split a test into parallel halves, no assumption was made regarding the form or availability of a reference test. Empirical verification is also provided. Other properties of the iterative method are discussed, and new measures of the degree of parallelism are given. Simultaneous testing of the single multidimensional hypothesis of equality of mean, variance and correlation of parallel tests can also be carried out by testing equality of regression lines of test scores on scores of each of the parallel halves, by ANOVA, or by Mahalanobis D2. The iterative method can be extended to find the split-half reliability of a battery of tests. The method thus provides an answer to the much-needed question of splitting a test uniquely into parallel halves while ensuring the maximum value of the split-half reliability. The method may be adopted while reporting a test.
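
For reference, the quantity such a split maximizes is usually reported after the standard Spearman-Brown step-up (a textbook psychometrics formula, not specific to this paper), where r_hh is the correlation between scores on the two parallel halves:

```latex
% Stepped-up split-half reliability (Spearman-Brown):
% r_hh is the correlation between the scores on the two parallel halves.
r_{\text{split-half}} = \frac{2\, r_{hh}}{1 + r_{hh}}
```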

44 citations


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This work argues that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge, and demonstrates these techniques on several tree traversal algorithms, achieving speedups of up to 38× over 32-thread CPU versions.
Abstract: With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy is well-understood for regular applications that operate on dense data structures such as arrays and matrices. However, there has been significantly less work in the area of irregular algorithms and even less so when pointer-based dynamic data structures are involved. Recently, irregular algorithms such as Barnes-Hut and kd-tree traversals have been implemented on GPUs, yielding significant performance gains over CPU implementations. However, the implementations often rely on exploiting application-specific semantics to get acceptable performance. We argue that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38× over 32-thread CPU versions.

42 citations


Book ChapterDOI
02 May 2013
TL;DR: A novel hybrid approach that combines model-driven performance forecasting techniques and on-line exploration in order to take the best of the two techniques, namely enhancing robustness despite model's inaccuracies, and maximizing convergence speed towards optimum solutions is introduced.
Abstract: In this paper we investigate the issue of automatically identifying the "natural" degree of parallelism of an application using software transactional memory (STM), i.e., the workload-specific multiprogramming level that maximizes the application's performance. We discuss the importance of adapting the concurrency level to the workload in two different scenarios, a shared-memory and a distributed STM infrastructure. We propose and evaluate two alternative self-tuning methodologies, explicitly tailored for the considered scenarios. In shared-memory STM, we show that lightweight, black-box approaches relying solely on on-line exploration can be extremely effective. For distributed STMs, we introduce a novel hybrid approach that combines model-driven performance forecasting techniques and on-line exploration in order to take the best of the two techniques, namely enhancing robustness despite the model's inaccuracies, and maximizing convergence speed towards optimum solutions.

35 citations


Journal ArticleDOI
TL;DR: This paper presents the parallelization on several current task- and data-parallel platforms, covering multi-core CPUs with vector units, GPUs, and hybrid systems, and analyzes the suitability of parallel programming languages for the implementation.
Abstract: Gaining knowledge out of vast datasets is a main challenge in data-driven applications nowadays. Sparse grids provide a numerical method for both classification and regression in data mining which scales only linearly in the number of data points and is thus well-suited for huge amounts of data. Due to the recursive nature of sparse grid algorithms and their classical random memory access pattern, they impose a challenge for the parallelization on modern hardware architectures such as accelerators. In this paper, we present the parallelization on several current task- and data-parallel platforms, covering multi-core CPUs with vector units, GPUs, and hybrid systems. We demonstrate that a less efficient implementation from an algorithmic point of view can be beneficial if it allows vectorization and a higher degree of parallelism instead. Furthermore, we analyze the suitability of parallel programming languages for the implementation. Considering hardware, we restrict ourselves to the x86 platform with SSE and AVX vector extensions and to NVIDIA's Fermi architecture for GPUs. We consider both multi-core CPU and GPU architectures independently, as well as hybrid systems with up to 12 cores and 2 Fermi GPUs. With respect to parallel programming, we examine both the open standard OpenCL and Intel Array Building Blocks, a recently introduced high-level programming approach, and comment on their ease of use. As the baseline, we use the best results obtained with classically parallelized sparse grid algorithms and their OpenMP-parallelized intrinsics counterpart (SSE and AVX instructions), reporting both single and double precision measurements. The huge data sets we use are a real-life dataset stemming from astrophysics and artificial ones, all of which exhibit challenging properties. In all settings, we achieve excellent results, obtaining speedups of up to 188× using single precision on a hybrid system.

29 citations


Proceedings ArticleDOI
23 Oct 2013
TL;DR: This paper considers a state-of-the-art two-dimensional LB model based on 37 populations that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equation of state of a perfect gas, and breaks the 1 double-precision Tflops barrier on a single-host system with two GPUs.
Abstract: Accelerators are an increasingly common option to boost performance of codes that require extensive number crunching. In this paper we report on our experience with NVIDIA accelerators to study fluid systems using the Lattice Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for processor architectures with a large degree of parallelism, such as recent multi- and many-core processors and GPUs; however, the challenge of exploiting a large fraction of the theoretically available performance of this new class of processors is not easily met. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a D2Q37 model), that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equation of state of a perfect gas. The computational features of this model make it a significant benchmark to analyze the performance of new computational platforms, since critical kernels in this code require both high memory-bandwidth on sparse memory addressing patterns and floating-point throughput. In this paper we consider two recent classes of GPU boards based on the Fermi and Kepler architectures; we describe in detail all steps taken to implement and optimize our LB code and analyze its performance first on single-GPU systems, and then on parallel multi-GPU systems based on one node as well as on a cluster of many nodes; in the latter case we use CUDA-aware MPI as an abstraction layer to assess the advantages of advanced GPU-to-GPU communication technologies like GPUDirect. On our implementation, aggregate sustained performance of the most compute-intensive part of the code breaks the 1 double-precision Tflops barrier on a single-host system with two GPUs.

28 citations


Journal ArticleDOI
TL;DR: Using the data parallelism paradigm, a general strategy is proposed that can be used to speed up any multiple sequence alignment method, achieving up to 151-fold improvements in execution time while losing 2.19% accuracy on average.

27 citations


Proceedings ArticleDOI
Chuntao Hong1, Dong Zhou2, Mao Yang1, Carbo Kuo2, Lintao Zhang1, Lidong Zhou1 
08 Apr 2013
TL;DR: KuaFu is proposed to close the parallelism gap on replicated database systems by enabling concurrent replay of transactions on a backup, and maintains write consistency on backups by tracking transaction dependencies.
Abstract: Database systems are nowadays increasingly deployed on multi-core commodity servers, with replication to guard against failures. A database engine is best designed to scale with the number of cores to offer a high degree of parallelism on a modern multi-core architecture. On the other hand, replication traditionally resorts to a certain form of serialization for data consistency among replicas. In the widely used primary/backup replication with log shipping, concurrent execution on the primary and the serialized log replay on a backup create a serious parallelism gap. Our experiment on MySQL with a 16-core configuration shows that the serial replay of a backup can sustain less than one third of the throughput achievable on the primary under an OLTP workload. This paper proposes KuaFu to close the parallelism gap on replicated database systems by enabling concurrent replay of transactions on a backup. KuaFu maintains write consistency on backups by tracking transaction dependencies. Concurrent replay on a backup does introduce read inconsistency between the primary and backups. KuaFu further leverages multi-version concurrency control to produce snapshots in order to restore the consistency semantics. We have implemented KuaFu on MySQL; our evaluations show that KuaFu allows a backup to keep up with the primary while preserving replication consistency.
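
The core idea — replay transactions concurrently unless they touch the same data — can be sketched as a dependency-building pass over the log. The log representation and the write-set-only conflict rule below are simplifying assumptions for illustration, not KuaFu's actual implementation:

```python
def build_replay_dependencies(log):
    """Sketch of write-set dependency tracking for concurrent log replay.

    log is assumed to be an ordered list of (txn_id, keys_written) pairs.
    A transaction depends on the most recent earlier transaction that wrote
    any of the same keys; independent transactions may replay concurrently.
    """
    last_writer = {}   # key -> txn_id of the latest writer seen so far
    deps = {}          # txn_id -> set of txn_ids it must wait for
    for txn_id, keys in log:
        deps[txn_id] = {last_writer[k] for k in keys if k in last_writer}
        for k in keys:
            last_writer[k] = txn_id
    return deps
```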

01 Jan 2013
TL;DR: In this paper, a self-configuration protocol for distributed applications in the cloud is presented, which is able to configure a whole distributed application without requiring any centralized server, but the high degree of parallelism involved in this protocol makes its design complicated and error-prone.
Abstract: Distributed applications in the cloud are composed of a set of virtual machines running a set of interconnected software components. In this context, the task of automatically configuring distributed applications is a very difficult issue. In this paper, we focus on such a self-configuration protocol, which is able to configure a whole distributed application without requiring any centralized server. The high degree of parallelism involved in this protocol makes its design complicated and error-prone. In order to check that this protocol works as expected, we specify it in LOTOS NT and verify it using the CADP toolbox. The use of these formal techniques and tools helped to detect a bug in the protocol, and served as a workbench to experiment with several possible communication models.

Journal ArticleDOI
TL;DR: This paper presents the implementation of the Lattice Boltzmann code on the Sandy Bridge processor, and assess the efficiency of several programming strategies and data-structure organizations, both in terms of memory access and computing performance.

Proceedings ArticleDOI
04 Mar 2013
TL;DR: This paper presents a self-adaptive architecture to enhance the energy efficiency of coarse-grained reconfigurable architectures (CGRAs) by exploiting the reconfiguration features of modern CGRAs; it relies on dynamically reconfigurable isolation cells and an autonomous parallelism, voltage, and frequency selection algorithm (APVFS).
Abstract: This paper presents a self-adaptive architecture to enhance the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). Today, platforms host multiple applications, with arbitrary inter-application communication and concurrency patterns. Each application itself can have multiple versions (implementations with different degrees of parallelism) and the optimal version can only be determined at runtime. For such scenarios, traditional worst-case designs and compile-time mapping decisions are neither optimal nor desirable. Existing solutions to this problem employ costly dedicated hardware to configure the operating point at runtime (using DVFS). As an alternative to dedicated hardware, we propose exploiting the reconfiguration features of modern CGRAs. Our solution relies on dynamically reconfigurable isolation cells (DRICs) and an autonomous parallelism, voltage, and frequency selection algorithm (APVFS). The DRICs reduce the overheads of DVFS circuitry by configuring the existing resources as isolation cells. APVFS ensures high efficiency by dynamically selecting the parallelism, voltage and frequency trio which consumes minimum power to meet the deadlines on available resources. Simulation results using representative applications (matrix multiplication, FIR, and FFT) showed up to 23% and 51% reduction in power and energy, respectively, compared to traditional DVFS designs. Synthesis results have confirmed a significant reduction in area overheads compared to state-of-the-art DVFS methods.
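
Conceptually, APVFS picks the lowest-power (parallelism, voltage, frequency) operating point that still meets the deadline. The sketch below assumes an idealized linear-speedup timing model and a precomputed table of operating points, which is a simplification of the runtime selection performed on a CGRA:

```python
def select_pvf(workload_cycles, deadline_s, configs):
    """Pick the lowest-power (parallelism, voltage, frequency) trio that
    meets the deadline.

    configs is a hypothetical list of (parallelism, voltage, freq_hz, power_w)
    operating points; returns (power_w, parallelism, voltage, freq_hz) or
    None if no point meets the deadline.
    """
    feasible = []
    for parallelism, voltage, freq, power in configs:
        exec_time = workload_cycles / (freq * parallelism)  # idealized speedup
        if exec_time <= deadline_s:
            feasible.append((power, parallelism, voltage, freq))
    if not feasible:
        return None            # no operating point meets the deadline
    return min(feasible)       # lowest-power feasible trio
```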

Proceedings ArticleDOI
13 May 2013
TL;DR: An interconnection network specifically for high message rates is designed, which reduces the burden on the software stack by relying on communication engines that perform a large fraction of the send and receive functionality in hardware and supports multi-core environments very efficiently through hardware-level virtualization of the communication engines.
Abstract: Computer systems continue to increase in parallelism in all areas. Stagnating single thread performance as well as power constraints prevent a reversal of this trend, on the contrary, current projections show that the trend towards parallelism will accelerate. In cluster computing, scalability, and therefore the degree of parallelism, is limited by the network interconnect and more specifically by the message rate it provides. We designed an interconnection network specifically for high message rates. Among other things, it reduces the burden on the software stack by relying on communication engines that perform a large fraction of the send and receive functionality in hardware. It also supports multi-core environments very efficiently through hardware-level virtualization of the communication engines. We provide details on the overall architecture, the thin software stack, performance results for a set of MPI-based benchmarks, and an in-depth analysis of how application performance depends on the message rate. We vary the message rate by software and hardware techniques, and measure the application-level impact of different message rates. We are also using this analysis to extrapolate performance for technologies with wider data paths and higher line rates.

Proceedings ArticleDOI
06 May 2013
TL;DR: This paper targets different encryption algorithms (TEA and XTEA) on GPU and FPGA platforms in terms of latency, throughput, gate equivalence, cost and ease of mapping on both platforms and proposes a tool called Cryptographic Hardware Acceleration and Analysis Tool (CHAAT) that selects an optimal algorithm depending on the user's constraints.
Abstract: Cryptography algorithms are ranked by their speed in encrypting/decrypting data and their robustness to withstand attacks. Real-time processing of data encryption/decryption is essential in network-based applications to keep pace with the input data inhalation rate. The encryption/decryption steps are computationally intensive and exhibit a high degree of parallelism. Field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) are being employed as cryptographic coprocessors to target different cryptography algorithms. In this paper, we target different encryption algorithms (TEA and XTEA) on GPU and FPGA platforms. We investigate the performance of the algorithms in terms of latency, throughput, gate equivalence, cost and ease of mapping on both platforms. We employ optimization techniques to realize high throughput in our custom-configured implementations for coarse-grained parallel architectures. We propose a tool called Cryptographic Hardware Acceleration and Analysis Tool (CHAAT) that selects an optimal algorithm depending on the user's constraints with respect to hardware utilization, cost and security.
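
TEA itself is public and very compact, which is one reason it maps well onto both FPGAs and GPUs. For reference, a plain software version of one 64-bit block encryption looks roughly like this (a reference sketch of the published cipher, not the paper's parallel hardware mapping):

```python
def tea_encrypt(v0, v1, key, rounds=32):
    """Encrypt one 64-bit block (two 32-bit halves v0, v1) with TEA.

    key is a tuple of four 32-bit words; all arithmetic is kept mod 2**32
    to mirror the original C reference.
    """
    delta, total, mask = 0x9E3779B9, 0, 0xFFFFFFFF
    k0, k1, k2, k3 = key
    for _ in range(rounds):
        total = (total + delta) & mask
        v0 = (v0 + ((((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1)))) & mask
        v1 = (v1 + ((((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3)))) & mask
    return v0, v1
```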

Journal ArticleDOI
01 Nov 2013
TL;DR: A GPU-based simulation kernel (gDES) to support DES is presented and three algorithms to support high efficiency are proposed to increase the degree of parallelism while retaining the number of synchronizations.
Abstract: The graphic processing unit (GPU) can perform some large-scale simulations in an economical way. However, harnessing the power of a GPU to discrete event simulation (DES) is difficult because of the mismatch between GPU's synchronous execution mode and DES's asynchronous time advance mechanism. In this paper, we present a GPU-based simulation kernel (gDES) to support DES and propose three algorithms to support high efficiency. Since both limited parallelism and redundant synchronization affect the performance of DES based on a GPU, we propose a breadth-expansion conservative time window algorithm to increase the degree of parallelism while retaining the number of synchronizations. By using the expansion method, it can import as many as possible 'safe' events. The irregular and dynamic requirement for storing the events leads to uneven and sparse memory usage, thereby causing waste of memory and unnecessary overhead. A memory management algorithm is proposed to store events in a balanced and compact way by using a lightweight stochastic method. When events processed by threads in a warp have different types, the performance of gDES decreases rapidly because of branch divergence. An event redistribution algorithm is proposed by reassigning events of the same type to neighboring threads to reduce the probability of branch divergence. We analyze the superiority of the proposed algorithms and gDES with a series of experiments. Compared to a CPU-based simulator on a multicore platform, the gDES can achieve up to 11×, 5×, and 8× speedup in PHOLD, QUEUING NETWORK, and epidemic simulation, respectively.
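
The notion of "safe" events behind a conservative time window can be sketched as follows: events whose timestamps fall within a lookahead of the earliest pending event may be processed in one parallel pass. The heap-based queue and the fixed lookahead are illustrative assumptions; the breadth-expansion step of gDES is not reproduced here:

```python
import heapq

def safe_event_window(event_queue, lookahead):
    """Pop all events whose timestamp lies within `lookahead` of the earliest
    pending event; these are 'safe' to process in parallel in one pass.

    event_queue is assumed to be a heapified list of (timestamp, event_id)
    pairs.
    """
    if not event_queue:
        return []
    t_min = event_queue[0][0]
    window = []
    while event_queue and event_queue[0][0] <= t_min + lookahead:
        window.append(heapq.heappop(event_queue))
    return window
```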

Patent
18 Sep 2013
TL;DR: In this paper, the authors describe a data mapping that defines an association between one or more fields of a data storage location of a source and one or more fields of a data storage location of a target destination, and generate a data transfer execution plan from the data mapping to transfer data from the source to the target destination.
Abstract: Methods, systems and computer program products for high performance data streaming are provided. A computer-implemented method may include receiving a data mapping describing an association between one or more fields of a data storage location of a data source and one or more fields of a data storage location of a target destination, generating a data transfer execution plan from the data mapping to transfer data from the data source to the target destination where the data transfer execution plan comprises a determined degree of parallelism to use when transferring the data, and transferring the data from the storage location of the data source to the data storage location of the target destination using the generated data transfer execution plan.
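
One common way such a plan could use a chosen degree of parallelism is to partition the source into one contiguous range per parallel transfer worker. The row-range partitioning below is purely illustrative, since the patent only specifies that the plan carries a determined degree of parallelism:

```python
def plan_transfer(num_rows, degree_of_parallelism):
    """Split a source table into contiguous row ranges, one per parallel
    transfer worker. Returns a list of half-open (start_row, end_row) ranges.
    """
    if num_rows <= 0:
        return []
    chunk = -(-num_rows // degree_of_parallelism)   # ceiling division
    return [(start, min(start + chunk, num_rows))
            for start in range(0, num_rows, chunk)]
```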

Proceedings ArticleDOI
23 Feb 2013
TL;DR: The pattern-supported parallelization approach, which is introduced here, eases the transition from sequential to parallel software and is a novel model-based approach with clear methodology and the use of parallel design patterns as known building blocks.
Abstract: In the embedded systems domain a trend towards multi- and many-core processors is evident. For the exploitation of these additional processing elements parallel software is inevitable. The pattern-supported parallelization approach, which is introduced here, eases the transition from sequential to parallel software. It is a novel model-based approach with a clear methodology and the use of parallel design patterns as known building blocks. First, the Activity and Pattern Diagram is created, revealing the maximum degree of parallelism expressed by parallel design patterns. Second, the degree of parallelism is reduced to the optimal level providing the best performance by agglomeration of activities and patterns. In this way, trade-offs caused by the target platform, e.g. the computation-communication ratio, are respected. As an implementation of the parallel design patterns, a library with algorithmic skeletons can be used. This reduces development effort and simplifies the transition from sequential to parallel code effectively.

Patent
11 Mar 2013
TL;DR: In this article, a computer system and method are provided to assess a proper degree of parallelism in executing programs to obtain efficiency objectives, including but not limited to increases in processing speed or reduction in computational resource usage.
Abstract: A computer system and method are provided to assess a proper degree of parallelism in executing programs to obtain efficiency objectives, including but not limited to increases in processing speed or reduction in computational resource usage. This assessment of proper degree of parallelism may be used to actively moderate the requests for threads by application processes to control parallelism when those efficiency objectives would be furthered by this control.

Journal ArticleDOI
TL;DR: A collection of parallel sentences extracted from the entire Wikipedia collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish is presented.
Abstract: Parallel corpora are essential resources for certain Natural Language Processing tasks such as Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to limited genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and there is a lack of such resources for the general domain. This paper addresses this issue and presents a collection of parallel sentences extracted from the entire Wikipedia collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish. Our work began with the processing of the publicly available Wikipedia static dumps for the three languages involved. The existing text was stripped of the specific mark-up, cleaned of non-textual entries like images or tables and sentence-split. Then, corresponding documents for the above mentioned pairs of languages were identified using the cross-lingual Wikipedia links embedded within the documents themselves. Considering them comparable documents, we further employed a publicly available tool named LEXACC, developed during the ACCURAT project, to extract parallel sentences from the preprocessed data. LEXACC assigns a score to each extracted pair, which is a measure of the degree of parallelism between the two sentences in the pair. These scores allow researchers to select only those sentences having a certain degree of parallelism suited for their intended purposes. This resource is publicly available at: http://ws.racai.ro:9191/repository/search/?q=Parallel+Wiki

Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper defines a novel cost model for intra-node parallel dataflow programs with user-defined functions and introduces different batching schemes to reduce the number of output buffers.
Abstract: The performance of intra-node parallel dataflow programs in the context of streaming systems depends mainly on two parameters: the degree of parallelism for each node of the dataflow program as well as the batching size for each node. In state-of-the-art systems the user has to specify those values manually. Manual tuning of both parameters is necessary in order to get good performance. However, this process is difficult and time-consuming, even for experts. In this paper we introduce an optimization algorithm that optimizes both parameters automatically. We define a novel cost model for intra-node parallel dataflow programs with user-defined functions. Furthermore, we introduce different batching schemes to reduce the number of output buffers, i.e., main memory consumption. We implemented our approach on top of the open source system Storm and ran experiments with different workloads. Our results show a throughput improvement of more than one order of magnitude while the optimization time is less than a second.
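
The interaction of the two tuning knobs can be seen even in a toy throughput model: larger batches amortize per-batch overhead, while more parallel instances multiply throughput until other resources saturate. The model below is a deliberately simplistic stand-in for the paper's cost model, with illustrative parameter names:

```python
def estimated_throughput(dop, batch_size, per_tuple_s, per_batch_overhead_s):
    """Toy model: each of `dop` parallel operator instances processes batches
    of `batch_size` tuples, paying a fixed overhead per batch.
    Returns tuples per second (ignoring contention and queueing effects).
    """
    batch_time = batch_size * per_tuple_s + per_batch_overhead_s
    return dop * batch_size / batch_time
```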

Book ChapterDOI
TL;DR: This work studies the implementation of Multi-Objective DE (MODE) on the GPU with C-CUDA, evaluating the gain in processing time against the sequential version and shows that the approach achieves an expressive speed up and a highly efficient processing power.
Abstract: In some applications, evolutionary algorithms may require high computational resources and high processing power, sometimes not producing a satisfactory solution after running for a considerable amount of time. One possible improvement is a parallel approach to reduce the response time. This work proposes to study a parallel multi-objective algorithm, the multi-objective version of Differential Evolution (DE). The generation of trial individuals can be done in parallel, greatly reducing the overall processing time of the algorithm. A novel approach to parallelize this algorithm is the implementation on Graphics Processing Units (GPUs). These units offer a high degree of parallelism and were initially developed for image rendering. However, NVIDIA has released a framework, named CUDA, which allows developers to use GPUs for general-purpose computing (GPGPU). This work studies the implementation of Multi-Objective DE (MODE) on the GPU with C-CUDA, evaluating the gain in processing time against the sequential version. Benchmark functions are used to validate the implementation and to confirm the efficiency of MODE on the GPU. The results show that the approach achieves an expressive speedup and a highly efficient processing power.

Proceedings ArticleDOI
17 Nov 2013
TL;DR: The semantics ofhyperqueues are defined and their implementation in a work-stealing scheduler is described, finding that hyperqueues provide comparable or up to 30% better performance than POSIX threads and Intel's Threading Building Blocks.
Abstract: Ubiquitous parallel computing aims to make parallel programming accessible to a wide variety of programming areas using deterministic and scale-free programming models built on a task abstraction. However, it remains hard to reconcile these attributes with pipeline parallelism, where the number of pipeline stages is typically hard-coded in the program and defines the degree of parallelism. This paper introduces hyperqueues, a programming abstraction that enables the construction of deterministic and scale-free pipeline parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects to provide thread-local views on a shared data structure. While hyperobjects are organized around private local views, hyperqueues require shared concurrent views on the underlying data structure. We define the semantics of hyperqueues and describe their implementation in a work-stealing scheduler. We demonstrate scalable performance on pipeline-parallel PARSEC benchmarks and find that hyperqueues provide comparable or up to 30% better performance than POSIX threads and Intel's Threading Building Blocks. The latter are highly tuned to the number of available processing cores, while programs using hyperqueues are scale-free.

Proceedings ArticleDOI
08 Apr 2013
TL;DR: Aeolus is a prototype implementation of a topology optimizer on top of the distributed streaming system Storm that extends Storm with a batching layer which can increase the topology's throughput by more than one order of magnitude.
Abstract: Aeolus is a prototype implementation of a topology optimizer on top of the distributed streaming system Storm. Aeolus extends Storm with a batching layer which can increase the topology's throughput by more than one order of magnitude. Furthermore, Aeolus implements an optimization algorithm that computes the optimal batch size and degree of parallelism for each node in the topology automatically. Even if Aeolus is built on top of Storm, the developed concepts are not limited to Storm and can be applied to any distributed intra-node-parallel streaming system. We propose to demo Aeolus using an interactive Web UI. One part of the Web UI is a topology builder allowing the user to interact with the system. Topologies can be created from scratch and their structure and/or parameters can be modified. Furthermore, the user is able to observe the impact of the changes on the optimization decisions and runtime behavior. Additionally, the Web UI gives a deep insight in the optimization process by visualizing it. The user can interactively step through the optimization process while the UI shows the optimizer's state, computations, and decisions. The Web UI is also able to monitor the execution of a non-optimized and optimized topology simultaneously showing the advantage of using Aeolus.

Book ChapterDOI
25 Mar 2013
TL;DR: An FPGA implementation of the k-NN search operation in a photon kd-tree is proposed, which maximizes the effective throughput of the block RAM by connecting multiple Query Modules to both ports of the RAM.
Abstract: Photon mapping is a kind of rendering technique which enables depicting complicated light concentrations for 3D graphics. Searching a kd-tree of photons with k-nearest neighbor search (k-NN) requires a large amount of computation. As k-NN search includes a high degree of parallelism, the operation can be accelerated by GPUs and recent multi-core microprocessors. However, the memory access bottleneck limits their computation speed. Here, as an alternative approach, an FPGA implementation of the k-NN search operation in a kd-tree is proposed. In the proposed design, we maximized the effective throughput of the block RAM by connecting multiple Query Modules to both ports of the RAM. Furthermore, an implementation of the discovery process of the max distance which does not depend on the number of Estimate-Photons is proposed. Through the implementation on Spartan6, Virtex6 and Virtex7, it appears that 26 fundamental modules can be mounted on Virtex7. As a result, the proposed module achieved a throughput of approximately 282 times that of software execution at maximum.

Proceedings ArticleDOI
01 Dec 2013
TL;DR: A novel framework for implementing portable and scalable data-intensive applications on reconfigurable hardware featuring Field-Programmable Gate Arrays and memory is presented, together with a new method to automatically select a task's optimal degree of parallelism on an FPGA for a given hardware platform.
Abstract: This paper presents a novel framework for implementing portable and scalable data-intensive applications on reconfigurable hardware. Instead of using expensive “reconfigurable supercomputers”, we focus our work on standard PCs and PCI-Express extension cards featuring Field-Programmable Gate Arrays (FPGAs) and memory. In our framework, we exploit task-level parallelism by manually partitioning applications into several parallel tasks using a communication API for data streams. This also allows pure software implementations on PCs without FPGA cards. If an FPGA accelerator is present, the same API calls transfer data between the PC's CPU and the FPGA. Then, the tasks implemented in hardware can exploit instruction-level and pipelining parallelism as well. Furthermore, the framework consists of hardware implementation rules which enable portable and scalable designs. Device-specific hardware wrappers hide the FPGA's and board's idiosyncrasies from the application developer. We also present a new method to automatically select a task's optimal degree of parallelism on an FPGA for a given hardware platform, i.e., to generate a hardware design which uses the available communication bandwidth between the PC and the FPGA optimally. Experimental results show the feasibility of our approach.
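
Selecting the optimal degree of parallelism for a given platform amounts to replicating a task only until either the FPGA area or the PC-to-FPGA communication bandwidth saturates. The formula below is a hedged sketch with illustrative parameters, not the framework's actual selection method:

```python
def optimal_fpga_replicas(link_bandwidth_Bps, bytes_per_item,
                          items_per_s_per_instance, max_instances_by_area):
    """Replicate a hardware task until either the area budget or the
    PC-to-FPGA bandwidth becomes the limit (idealized model)."""
    bw_limited = link_bandwidth_Bps / (bytes_per_item * items_per_s_per_instance)
    return max(1, min(max_instances_by_area, int(bw_limited)))
```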

Patent
03 Apr 2013
TL;DR: In this article, a visual processing device based on multi-layer parallel processing is presented, which comprises a high speed image sensor array, multiple layers of processor unit arrays and a reduced instruction-set computer (RISC) microprocessor subsystem.
Abstract: The invention discloses a visual processing device based on multi-layer parallel processing. The device comprises a high-speed image sensor array, multiple layers of processor unit arrays and a reduced instruction-set computer (RISC) microprocessor subsystem. The image sensor is used for acquiring images of the real world; the bottommost low-level processor unit array has the highest degree of parallelism and a relatively weak operational capability, and as the layer level increases, the degree of parallelism of the processor arrays is gradually lowered while their operational capability is gradually improved. This layered architecture facilitates a tight coupling between the hardware structure and various image processing algorithms with different degrees of parallelism and algorithmic complexity. The RISC processor is used for system control and for scheduling image processing threads. By means of the visual processing device based on multi-layer parallel processing, the system has high flexibility and a high data throughput rate, a multi-thread concurrent working mode is achieved, image processing capacity is greatly improved, and speeds are greatly increased.

Patent
21 Aug 2013
TL;DR: In this paper, a parallel gene splicing method based on the De Bruijn graph is proposed, which is based on a trunking system and a depth graph traversal method.
Abstract: The invention relates to the technical field of gene sequencing and provides a parallel gene splicing method based on the De Bruijn graph. The method comprises the following steps: S1, the distributed De Bruijn graph is built in parallel; S2, error paths are removed; S3, the De Bruijn graph is simplified on the basis of a depth graph traversal method; S4, contigs are combined and scaffolds are generated; S5, the scaffolds are output. The parallel gene splicing method is based on a trunking system and builds the De Bruijn graph in parallel, solving the problem that, in traditional single-machine serial gene splicing algorithms, the data volume of large genomes is too large for the graph to be built and processed further. Meanwhile, in the simplifying process, parallel simplification based on depth graph traversal is carried out; the graph simplifying process is simple, the degree of parallelism is high, and the splicing speed is high.

Proceedings ArticleDOI
06 Nov 2013
TL;DR: An updated survey of palmprint feature extraction that classifies state-of-the-art algorithms into four categories: structure-based, statistics-based, subspace-based, and texture & transform domain feature-based methods.
Abstract: TriMedia is widely used at present in expressway network monitoring software, and has seen large-scale application especially in video conference systems, digital video monitoring systems, DSP-based digital hard disk video recorders (DVR), video on demand (VOD), and remote multimedia databases. When TriMedia is used for statistics, system optimization is very necessary: structural optimization changes the order of operations based on the specific hardware structure, so as to improve the degree of parallelism, make full use of system resources, and improve efficiency. The paper discusses different code optimizations that make full use of system resources and improve efficiency; these optimization methods have been widely used to develop highway network monitoring software.