
Showing papers presented at "Parallel and Distributed Processing Techniques and Applications in 2009"


Proceedings Article
01 Jan 2009
TL;DR: Several three-dimensional Cahn-Hilliard simulations are presented to explore the challenges and the performance of the different memory types in three dimensions; the results show that the simulation design with the best performance in three dimensions uses a different memory type from the optimal two-dimensional simulation.
Abstract: Computational scientific simulations have long used parallel computers to increase their performance. Recently graphics cards have been utilised to provide this functionality. GPGPU APIs such as NVidia’s CUDA can be used to harness the power of GPUs for purposes other than computer graphics. GPUs are designed for processing two-dimensional data. In previous work we have presented several two-dimensional Cahn-Hilliard simulations that each utilise different CUDA memory types and compared their results. In this paper we extend these ideas to three dimensions. As GPUs are not intended for processing three-dimensional data arrays, the performance of the memory optimisations is expected to change. Here we present several three-dimensional Cahn-Hilliard simulations to explore the challenges and the performance of the different memory types in three dimensions. The results show that the simulation design with the best performance in three dimensions uses a different memory type from the optimal two-dimensional simulation.

38 citations
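
As a rough illustration of the computation being ported to the GPU (not the authors' CUDA implementation, and with no memory-type tuning shown), a minimal NumPy sketch of one explicit time step of a three-dimensional Cahn-Hilliard update on a periodic grid might look like the following; the grid size, time step, mobility and gradient coefficient are illustrative assumptions.

```python
import numpy as np

def laplacian(f):
    # 7-point periodic Laplacian on a unit-spaced 3D grid.
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
            np.roll(f, 1, 1) + np.roll(f, -1, 1) +
            np.roll(f, 1, 2) + np.roll(f, -1, 2) - 6.0 * f)

def cahn_hilliard_step(c, dt=1e-3, M=1.0, gamma=0.5):
    # One explicit Euler step of dc/dt = M * lap(c^3 - c - gamma * lap(c)).
    mu = c**3 - c - gamma * laplacian(c)          # chemical potential
    return c + dt * M * laplacian(mu)

# Tiny demo: a 32^3 periodic grid with small random fluctuations.
c = 0.01 * (np.random.rand(32, 32, 32) - 0.5)
for _ in range(100):
    c = cahn_hilliard_step(c)
```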


Proceedings Article
01 Jan 2009
TL;DR: In-core optimization techniques are applied to high-order stencil computations, including cache blocking for efficient L2 cache use; register blocking and data-level parallelism via single-instruction multiple-data (SIMD) techniques to increase L1 cache efficiency; and software prefetching techniques.
Abstract: In this paper, we apply in-core optimization techniques to high-order stencil computations, including: (1) cache blocking for efficient L2 cache use; (2) register blocking and data-level parallelism via single-instruction multiple-data (SIMD) techniques to increase L1 cache efficiency; and (3) software prefetching techniques. Our generic approach is tested with a kernel extracted from a 6th-order stencil-based seismic wave propagation code on a suite of Intel Xeon architectures. Cache blocking and prefetching techniques are found to achieve modest performance improvement, whereas register blocking and SIMD implementation reduce L1 cache line misses dramatically, accompanied by a moderate decrease in the L2 cache miss rate. Optimal register blocking sizes are determined through analysis of the cache performance of the stencil kernel for different sizes of register blocks, thereby achieving over 4.3-fold speedup on Intel Harpertown. We also examine lower-precision (3rd-, 4th-, and 5th-order) stencil computations to analyze the dependency of data-level parallel efficiency on the stencil order.

36 citations
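
The speedups in the paper come from SIMD intrinsics and hardware cache behaviour that a high-level sketch cannot reproduce, but the loop-tiling idea behind cache blocking is easy to show. Below is a hedged sketch of a tiled sweep of a simple 7-point stencil; the tile sizes and the low-order kernel are assumptions, not the authors' 6th-order seismic kernel.

```python
import numpy as np

def blocked_stencil(u, bi=32, bj=32, bk=32):
    # Apply a 7-point stencil to the interior of a cubic grid, sweeping it in
    # (bi, bj, bk) tiles so that each tile's working set stays cache-resident
    # while its points are updated -- the essence of cache blocking.
    n = u.shape[0]
    out = np.zeros_like(u)
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            for k0 in range(1, n - 1, bk):
                i1 = min(i0 + bi, n - 1)
                j1 = min(j0 + bj, n - 1)
                k1 = min(k0 + bk, n - 1)
                # Vectorised update of one tile (interior points only).
                out[i0:i1, j0:j1, k0:k1] = (
                    u[i0-1:i1-1, j0:j1, k0:k1] + u[i0+1:i1+1, j0:j1, k0:k1] +
                    u[i0:i1, j0-1:j1-1, k0:k1] + u[i0:i1, j0+1:j1+1, k0:k1] +
                    u[i0:i1, j0:j1, k0-1:k1-1] + u[i0:i1, j0:j1, k0+1:k1+1]
                    - 6.0 * u[i0:i1, j0:j1, k0:k1])
    return out

u = np.random.rand(128, 128, 128)
v = blocked_stencil(u)
```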


Proceedings Article
01 Jan 2009
TL;DR: A set of parallel algorithms to determine the hydrological flow direction and contributing area of each cell in a digital elevation model (DEM) using cluster computers in an MPI programming model are introduced.
Abstract: This paper introduces a set of parallel algorithms to determine the hydrological flow direction and contributing area of each cell in a digital elevation model (DEM) using cluster computers in an MPI programming model. DEMs are partitioned across processes relevant to the physical layout of the terrain such that processes with adjacent ranks calculate flow direction and contributing areas for physically adjacent partitions of the DEM. The contributing area algorithm makes use of a queue to order the consideration of cells such that each cell is visited only once for calculation and cross-partition calculations are handled in an efficient and order-independent manner. This algorithm replaces a serial recursive algorithm included as part of the TauDEM hydrology analysis package.

28 citations
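
As a single-process illustration of the queue-ordered, visit-each-cell-once accumulation described above (ignoring the MPI partitioning and the cross-partition exchanges; the flat `downstream` encoding of flow directions is an assumption, not the TauDEM data layout):

```python
from collections import deque
import numpy as np

def contributing_area(downstream):
    # Queue-based (Kahn-style) accumulation of contributing area.
    # downstream[i] is the flat index of the cell that cell i drains into,
    # or -1 if cell i is an outlet. Each cell contributes its own unit area
    # plus everything that drains through it; every cell is visited once.
    n = len(downstream)
    area = np.ones(n)                        # each cell contributes itself
    indeg = np.zeros(n, dtype=int)           # number of upstream neighbours
    for d in downstream:
        if d >= 0:
            indeg[d] += 1
    queue = deque(np.flatnonzero(indeg == 0))    # start from ridge cells
    while queue:
        c = queue.popleft()
        d = downstream[c]
        if d >= 0:
            area[d] += area[c]
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return area

# Tiny example: a 4-cell chain 0 -> 1 -> 2 -> 3 (outlet).
print(contributing_area([1, 2, 3, -1]))      # [1. 2. 3. 4.]
```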


Proceedings Article
01 Jan 2009
TL;DR: In this paper, a new algorithm for implementing multi-word compare-and-swap functionality supporting the Read and CASN operations is presented, which is wait-free under reasonable assumptions on execution time.
Abstract: We present a new algorithm for implementing multi-word compare-and-swap functionality supporting the Read and CASN operations. The algorithm is wait-free under reasonable assumptions on execution time.

21 citations
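
Since the abstract is cut short, the sketch below only pins down the semantics of the Read/CASN interface: an N-word compare-and-swap that installs all new values only if every addressed word still holds its expected value. A single lock makes the sketch atomic; the paper's contribution is to provide the same semantics wait-free, without any such lock.

```python
import threading

class SharedMemory:
    # Toy word-addressable memory with Read and CASN operations.
    # A lock makes CASN atomic here; the paper's algorithm provides the same
    # semantics wait-free (under its stated execution-time assumptions).
    def __init__(self, size):
        self._words = [0] * size
        self._lock = threading.Lock()

    def read(self, addr):
        with self._lock:
            return self._words[addr]

    def casn(self, addrs, expected, new):
        # Atomically: if words[a] == e for every (a, e), install all of `new`.
        with self._lock:
            if all(self._words[a] == e for a, e in zip(addrs, expected)):
                for a, v in zip(addrs, new):
                    self._words[a] = v
                return True
            return False

mem = SharedMemory(4)
print(mem.casn([0, 2], expected=[0, 0], new=[7, 9]))   # True: both matched
print(mem.casn([0, 2], expected=[0, 9], new=[1, 1]))   # False: word 0 is now 7
```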


Proceedings Article
01 Jan 2009
TL;DR: A solution from cooperative game theory based on the concept of Nash Bargaining Solution is proposed to solve the problem of mapping tasks onto a computational grid subject to the constraints of deadlines and architectural requirements.
Abstract: The problem of mapping tasks onto a computational grid with the aim of minimizing the power consumption and the makespan, subject to the constraints of deadlines and architectural requirements, is considered in this paper. To solve this problem, we propose a solution from cooperative game theory based on the concept of the Nash Bargaining Solution. The proposed game theoretical technique is compared against several traditional techniques. The experimental results show that when the deadline constraints are tight, the proposed technique achieves superior performance and reports competitive performance relative to the optimal solution.

17 citations
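
For readers unfamiliar with the concept, the classical Nash Bargaining Solution over n players with utilities u_i and disagreement point d selects the feasible outcome that maximizes the product of the utility gains; how the paper maps makespan and power consumption onto the u_i is not reproduced here.

```latex
\max_{x \in \mathcal{S}} \; \prod_{i=1}^{n} \bigl( u_i(x) - d_i \bigr)
\qquad \text{subject to } u_i(x) \ge d_i \ \text{for all } i,
```

where \mathcal{S} denotes the set of feasible schedules.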


Proceedings Article
01 Jan 2009
TL;DR: Fast Collective-Network Protocol allows bandwidths over 50% beyond the LOFAR requirements, so that the telescope can observe proportionally more sources or frequencies and becomes a much more efficient system.
Abstract: This paper describes the Fast Collective-Network Protocol (FCNP). FCNP is a low-overhead, high-bandwidth network protocol that we developed for fast communication between the Blue Gene/P compute nodes and I/O nodes. The CPU cores in this system are hardly able to keep up with the high-speed internal network, and any protocol overhead significantly slows down the achieved bandwidths. FCNP minimizes overhead and approaches the link speed for large messages. FCNP is of critical importance to the correlator of the LOFAR radio telescope, which will process hundreds of gigabits of real-time telescope data per second. Without FCNP, the correlator would not even achieve the required data rates. However, FCNP allows bandwidths over 50% beyond the LOFAR requirements, so that the telescope can observe proportionally more sources or frequencies and becomes a much more efficient system.

14 citations


Proceedings Article
01 Jan 2009
TL;DR: Task parallelism in constraint programming is explored, specifically the case of parallel consistency; the results show that parallelizing consistency can provide the programmer with robust scalability for regular problems with global constraints.
Abstract: Program parallelization becomes increasingly important as new multi-core architectures provide ways to improve performance. One of the greatest challenges of this development lies in programming parallel applications. Using declarative languages, such as constraint programming, can make the transition to parallelism easier by hiding the parallelization details in a framework. Automatic parallelization in constraint programming has previously focused on data parallelism. In this paper, we look at task parallelism, specifically the case of parallel consistency. We have developed two models of parallel consistency, one that shares intermediate results and one that does not, and we evaluate in our experiments which model is better. Our results show that parallelizing consistency can provide the programmer with robust scalability for regular problems with global constraints.

14 citations
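
A hedged sketch of the no-sharing variant of parallel consistency, in which each constraint propagator runs on its own worker and the domain reductions are intersected only between rounds; the propagator interface and the tiny example problem are assumptions, not the authors' framework.

```python
from concurrent.futures import ThreadPoolExecutor

def propagate_parallel(domains, propagators, workers=4):
    # Run all constraint propagators in parallel rounds until a fixpoint.
    # Each propagator maps the current domains to a (possibly reduced) copy;
    # reductions are intersected only between rounds, i.e. intermediate
    # results are NOT shared within a round.
    changed = True
    while changed:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda p: p(domains), propagators))
        new = {v: set.intersection(*(r[v] for r in results)) for v in domains}
        changed = new != domains
        domains = new
    return domains

# Tiny example: x, y in {1..4}, constraints x < y and x + y == 5.
def less(d):     return {"x": {a for a in d["x"] if a < max(d["y"])},
                          "y": {b for b in d["y"] if b > min(d["x"])}}
def sum_to_5(d): return {"x": {a for a in d["x"] if 5 - a in d["y"]},
                         "y": {b for b in d["y"] if 5 - b in d["x"]}}

print(propagate_parallel({"x": {1, 2, 3, 4}, "y": {1, 2, 3, 4}},
                         [less, sum_to_5]))
```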


Proceedings Article
01 Jan 2009
TL;DR: A simple technique based on multi-objective goal programming is proposed that guarantees a Pareto-optimal solution with an excellent convergence process; it achieves superior performance compared to the min-min heuristic and competitive performance relative to the optimal solution implemented in LINDO for small-scale problems.
Abstract: We model the process of a data center as a multi-objective problem of mapping independent tasks onto a set of data center machines that simultaneously minimizes the energy consumption and response time (makespan) subject to the constraints of deadlines and architectural requirements. A simple technique based on multi-objective goal programming is proposed that guarantees a Pareto-optimal solution with an excellent convergence process. The proposed technique is also compared with other traditional approaches. The simulation results show that the proposed technique achieves superior performance compared to the min-min heuristic, and competitive performance relative to the optimal solution implemented in LINDO for small-scale problems.

10 citations
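
The abstract does not spell out the formulation; as general background, a one-sided weighted goal-programming model for the two objectives penalizes only the overshoot of each goal, along the lines of the sketch below. The goals g_E, g_M and weights w_E, w_M are placeholders, and the deadline and architectural constraints on the assignment x are omitted.

```latex
\min \; w_E\,\delta_E^{+} + w_M\,\delta_M^{+}
\quad \text{s.t.} \quad
E(x) \le g_E + \delta_E^{+}, \qquad
M(x) \le g_M + \delta_M^{+}, \qquad
\delta_E^{+},\, \delta_M^{+} \ge 0 ,
```

where E(x) is the energy consumption and M(x) the makespan of assignment x.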


Proceedings Article
01 Jan 2009
TL;DR: A scalable hierarchical parallelization framework for molecular dynamics simulation on emerging multicore clusters that combines inter-node level parallelism by spatial decomposition using message passing, and intra-node (inter-core) level parallelism through a master/worker paradigm and cellular decomposition through critical section-free multithreading.
Abstract: We have developed a scalable hierarchical parallelization framework for molecular dynamics (MD) simulation on emerging multicore clusters. The framework combines: (1) inter-node level parallelism by spatial decomposition using message passing; (2) intra-node (inter-core) level parallelism through a master/worker paradigm and cellular decomposition using critical section-free multithreading; and (3) intra-core level parallelism via single-instruction multiple-data (SIMD) techniques. Our multithreading scheme takes account of cache coherency to maximize performance. For data-level parallelism via SIMD, zero padding is used to solve the alignment issue for complex data types such as arrays, and simple data-type reformatting is used to solve the alignment issue for data with irregular memory accessing. By combining a hierarchy of parallelism, the framework exposes maximal concurrency and data locality, thereby achieving: (1) inter-node weak-scaling parallel efficiency of 0.975 on 32,768 BlueGene/P nodes and 0.985 on 106,496 BlueGene/L nodes; (2) inter-node strong-scaling parallel efficiency of 0.90 on 32 dual quadcore AMD Opteron nodes and 0.94 on 32 dual quadcore Intel Xeon nodes; (3) inter-core multithread parallel efficiency of 0.65 for the whole program (0.89 for the two-body force calculation) for eight threads on a dual quadcore Xeon platform; and (4) SIMD speedup of 1.35 for the whole program (1.42 for the two-body force calculation).

8 citations


Proceedings Article
01 Jan 2009
TL;DR: The proposed pseudo-random approach based on Borel Cayley graphs yields a 2 to 100 times faster convergence than the small-world network and has the best scalability over different graph sizes and degrees.
Abstract: In this paper, we focus on the design of network topology to achieve fast information distribution. We show that the information distribution performance of Borel Cayley graphs, a family of pseudo-random graphs, is far superior to that of other well-known graph families. To demonstrate the effectiveness of this pseudo-random approach, we compare the convergence speed of the average consensus protocol on Borel Cayley graphs against that of a wide range of graph families with sizes ranging from around 100 nodes to 5,000 nodes. In the comparison study, we consider Borel Cayley graphs, regular ring lattices, Erdos-Renyi random graphs, Watts-Strogatz small-world networks, and toroidal and diagonal meshes. Our results indicate that the proposed pseudo-random approach based on Borel Cayley graphs yields a 2 to 100 times faster convergence than the small-world network (rewiring probability p = 0.01, 0.1 and 0.2) does. More importantly, Borel Cayley graphs have the best scalability over different graph sizes and degrees.

7 citations
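
The benchmark used throughout the comparison is the average consensus protocol; a minimal sketch of that protocol on an arbitrary undirected graph is given below. The step size and stopping tolerance are assumptions, and the construction of Borel Cayley graphs is not shown.

```python
import numpy as np

def consensus_rounds(adj, x0, eps=0.05, tol=1e-6, max_iter=100000):
    # Iterate x_i <- x_i + eps * sum_j a_ij (x_j - x_i) until every value is
    # within `tol` of the initial average; return the number of rounds taken.
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    target = x.mean()                  # symmetric weights preserve the mean
    deg = adj.sum(axis=1)
    for t in range(max_iter):
        if np.max(np.abs(x - target)) < tol:
            return t
        x = x + eps * (adj @ x - deg * x)
    return max_iter

# Example: a ring of 10 nodes with random initial values.
n = 10
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[i, (i - 1) % n] = 1
print(consensus_rounds(ring, np.random.rand(n)))
```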



Proceedings Article
01 Jan 2009
TL;DR: Performance measurements, using four clients to solve a number of benchmark problems, show that Picoso yields (almost) linear speedup compared to the sequential interval constraint solver iSAT, on which the clients of Picoso are based.
Abstract: This paper describes the parallel interval constraint solver Picoso, which can decide (a subclass of) boolean combinations of linear and non-linear constraints. Picoso follows a master/client model based on message passing, making it suitable for any kind of workstation cluster as well as for multi-processor machines. To run several clients in parallel, an efficient work stealing mechanism has been integrated, dividing the overall search space into disjoint parts. Additionally, to prevent the clients from running into identical conflicts, information about conflicts, in the form of conflict clauses, is exchanged among the clients. Performance measurements, using four clients to solve a number of benchmark problems, show that Picoso yields (almost) linear speedup compared to the sequential interval constraint solver iSAT, on which the clients of Picoso are based.


Proceedings Article
01 Jan 2009
TL;DR: Five scheduling policies are evaluated through simulation studies; four of these policies are known from the literature and one policy is newly proposed.
Abstract: Workflows are modeled with directed acyclic graphs in which vertices represent computational tasks, referred to as requests, and edges represent precedence constraints among requests. Associated with each workflow is a deadline that defines the time by which all computations of a workflow should be complete. Workflows are submitted by numerous clients to a centralized scheduler that assigns workflow requests to a cluster of memory-managed multicore machines for execution. The objective of the scheduler is to minimize missed workflow deadlines. The characteristics of workflows are assumed to vary along several dimensions, including: arrival rate, periodicity, degree of parallelism, and number of requests. Five scheduling policies are evaluated; four of these policies are known from the literature and one policy is newly proposed. The advantages and disadvantages of each policy are determined through simulation studies.


Proceedings Article
01 Jan 2009
TL;DR: An exhaustive search algorithm is proposed as a starting point for open-ended research on algorithmic scalability and feasibility conditions for system models whose timing relationships may not admit an optimal solution.
Abstract: An interesting optimization problem is examined where optimal load allocations depend on processor release times, but the timing of release times depends on the load allocation scenario used. This circular relationship between an optimal solution and release times can produce system models whose timing relationships do not admit an optimal solution. To obtain an optimal solution based on an assumed model and its arbitrary timing relationships, we propose an exhaustive search algorithm as a starting point for open-ended research on algorithmic scalability and feasibility conditions. Through simulation, the behavior of the exhaustive search algorithm is investigated and load scheduling trends with arbitrary release times are verified. A bus network (homogeneous single-level tree network) with arbitrary processor release times is considered. For the scheduling strategy, sequential distribution with a staggered-start scheduling scenario to minimize the total processing finish time is assumed.

Proceedings Article
01 Jan 2009
TL;DR: The iteration bound for a CSDFG is presented, which is used to find the integral static schedule and to determine whether a CSDFG is live.
Abstract: Some processes display cyclically changing but predefined behavior. Such processes can be represented using cyclo-static data flow graphs (CSDFGs). This capability results in a higher degree of parallelism. In this paper we present the iteration bound for a CSDFG, which is used to find the integral static schedule and, based on this calculation, to determine whether a CSDFG is live. We also present an algorithm that schedules cyclo-static data flow graphs without converting them to their equivalent homogeneous graphs (EHGs) and demonstrate it with a suitable example.
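
The abstract does not restate the definition; for ordinary (homogeneous) data flow graphs, the classical iteration bound is the maximum cycle ratio below, and the paper works out the corresponding bound for CSDFGs without expanding them to their EHGs.

```latex
T_{\infty} \;=\; \max_{l \in L} \; \frac{\sum_{v \in l} t_v}{\sum_{e \in l} d_e},
```

where L is the set of directed cycles of the graph, t_v is the execution time of actor v, and d_e is the number of initial tokens (delays) on edge e.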




Proceedings Article
01 Jan 2009
TL;DR: A game theoretical technique is proposed in which players continuously compete in a non-cooperative environment to improve data accessibility by replicating data objects; it outperforms four well-known techniques in both execution time and solution quality.
Abstract: In this paper, a mathematical model for data object replication in ad hoc networks is formulated. The derived model is general, flexible and adaptable to cater for various applications in ad hoc networks. We propose a game theoretical technique in which players (mobile hosts) continuously compete in a non-cooperative environment to improve data accessibility by replicating data objects. The technique incorporates the access frequency from mobile hosts to each data object, the status of the network connectivity, and communication costs. The proposed technique is extensively evaluated against four well-known ad hoc network replica allocation methods. The experimental results reveal that the proposed approach outperforms the four techniques in both the execution time and solution quality.






Proceedings Article
01 Jan 2009
TL;DR: Game theoretical auctions are used to identify a bidding mechanism that encapsulates the selfishness of the agents while keeping a controlling hand over them; experimental results reveal that this mechanism provides excellent solution quality while maintaining fast execution time.
Abstract: Fine-grained data replication over the Internet allows duplication of frequently accessed data objects, as opposed to entire sites, to certain locations so as to improve the performance of large-scale content distribution systems. In a distributed system, agents representing their sites try to maximize their own benefit since they are driven by different goals, such as minimizing their communication costs, latency, etc. In this paper, we use game theoretical techniques, and in particular auctions, to identify a bidding mechanism that encapsulates the selfishness of the agents, while having a controlling hand over them. In essence, the proposed game theory based mechanism is the study of what happens when independent agents act selfishly and how to control them to maximize the overall performance. A bidding mechanism asks how one can design systems so that agents’ selfish behavior results in the desired system-wide goals. Experimental results reveal that this mechanism provides excellent solution quality, while maintaining fast execution time. The comparisons are recorded against some well-known techniques such as greedy, branch and bound, game theoretical auctions and

Proceedings Article
01 Jan 2009
TL;DR: Different central moments used to quantify the heterogeneity of ETC matrices obtained from real-world systems and benchmark data are identified, and the effect of these moments on the performance of heuristics is shown through both simple examples and simulations.
Abstract: One type of heterogeneous computing (HC) systems consists of machines with diverse capabilities harnessed together to execute a set of tasks that vary in their computational complexity. An HC system can be characterized using an Estimated Time to Compute (ETC) matrix. Each value in this matrix represents the ETC of a specific task on a specific machine when executed exclusively. Heuristics use the values in the ETC matrix to allocate tasks to machines in the HC system. The performance of resource allocation heuristics can be affected significantly by factors such as task and machine heterogeneities. Therefore, quantifying heterogeneity will allow a system to select a heuristic appropriate for the given heterogeneous environment. In this paper, we identify different central moments used to quantify the heterogeneity of ETC matrices obtained from real world systems and benchmark data, and show the effect of these moments on the performance of heuristics both through simple examples and simulations.
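
As a hedged illustration of what quantifying ETC heterogeneity can look like (the abstract does not list which central moments the paper uses; the particular statistics and axis conventions below are assumptions):

```python
import numpy as np

def heterogeneity_stats(etc):
    # Per-axis summary statistics of an ETC matrix (tasks x machines).
    # Machine heterogeneity: spread of a task's times across machines (rows).
    # Task heterogeneity: spread of a machine's times across tasks (columns).
    etc = np.asarray(etc, dtype=float)

    def stats(values, axis):
        mean = values.mean(axis=axis)
        std = values.std(axis=axis)
        centred = values - np.expand_dims(mean, axis)
        skew = (centred**3).mean(axis=axis) / std**3   # third standardised moment
        return {"cov": (std / mean).mean(), "skew": skew.mean()}

    return {"machine": stats(etc, axis=1), "task": stats(etc, axis=0)}

# Example: 4 tasks on 3 machines with random execution-time estimates.
print(heterogeneity_stats(np.random.lognormal(mean=3.0, sigma=0.8, size=(4, 3))))
```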


Proceedings Article
George Wells
01 Jan 2009
TL;DR: A library of classes providing support for interprocess communication in Java programs, using the mechanisms present in the native operating system, is described, showing significant performance improvements over the standard Java mechanisms available for such systems.
Abstract: This paper describes a library of classes providing support for interprocess communication in Java programs, using the mechanisms present in the native operating system. This approach is particularly well-suited for use with independent Java processes running on a single multicore (or multiprocessor) computer. At this stage, a comprehensive class library has been implemented for the Linux operating system, allowing access to the rich set of interprocess communication mechanisms provided by it. Initial testing shows significant performance improvements over the standard Java mechanisms available for such systems.