
Showing papers presented at "Parallel and Distributed Processing Techniques and Applications in 2009"


Proceedings Article
01 Jan 2009
TL;DR: Several three-dimensional Cahn-Hilliard simulations are presented to explore the challenges and the performance of the different memory types in three dimensions; the results show that the simulation design with the best performance in three dimensions uses a different memory type from the optimal two-dimensional simulation.
Abstract: Computational scientific simulations have long used parallel computers to increase their performance. Recently graphics cards have been utilised to provide this functionality. GPGPU APIs such as NVidia’s CUDA can be used to harness the power of GPUs for purposes other than computer graphics. GPUs are designed for processing two-dimensional data. In previous work we have presented several two-dimensional Cahn-Hilliard simulations that each utilise different CUDA memory types and compared their results. In this paper we extend these ideas to three dimensions. As GPUs are not intended for processing three-dimensional data arrays, the performance of the memory optimisations is expected to change. Here we present several three-dimensional Cahn-Hilliard simulations to explore the challenges and the performance of the different memory types in three dimensions. The results show that the simulation design with the best performance in three dimensions uses a different memory type from the optimal two-dimensional simulation.

38 citations
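
As a rough illustration of the computation being ported to the GPU (not the authors' CUDA implementation, and with no memory-type tuning shown), a minimal NumPy sketch of one explicit time step of a three-dimensional Cahn-Hilliard update on a periodic grid might look like the following; the grid size, time step, mobility and gradient coefficient are illustrative assumptions.

```python
import numpy as np

def laplacian(f):
    # 7-point periodic Laplacian on a unit-spaced 3D grid.
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
            np.roll(f, 1, 1) + np.roll(f, -1, 1) +
            np.roll(f, 1, 2) + np.roll(f, -1, 2) - 6.0 * f)

def cahn_hilliard_step(c, dt=1e-3, M=1.0, gamma=0.5):
    # One explicit Euler step of dc/dt = M * lap(c^3 - c - gamma * lap(c)).
    mu = c**3 - c - gamma * laplacian(c)          # chemical potential
    return c + dt * M * laplacian(mu)

# Tiny demo: a 32^3 periodic grid with small random fluctuations.
c = 0.01 * (np.random.rand(32, 32, 32) - 0.5)
for _ in range(100):
    c = cahn_hilliard_step(c)
```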


Proceedings Article
01 Jan 2009
TL;DR: In-core optimization techniques are applied to high-order stencil computations, including cache blocking for efficient L2 cache use; register blocking and data-level parallelism via single-instruction multiple-data (SIMD) techniques to increase L1 cache efficiency; and software prefetching techniques.
Abstract: In this paper, we apply in-core optimization techniques to high-order stencil computations, including: (1) cache blocking for efficient L2 cache use; (2) register blocking and data-level parallelism via single-instruction multiple-data (SIMD) techniques to increase L1 cache efficiency; and (3) software prefetching techniques. Our generic approach is tested with a kernel extracted from a 6th-order stencil-based seismic wave propagation code on a suite of Intel Xeon architectures. Cache blocking and prefetching techniques are found to achieve modest performance improvement, whereas register blocking and SIMD implementation reduce L1 cache line misses dramatically, accompanied by a moderate decrease in the L2 cache miss rate. Optimal register blocking sizes are determined through analysis of the cache performance of the stencil kernel for different sizes of register blocks, thereby achieving over 4.3-fold speedup on Intel Harpertown. We also examine lower-precision (3rd-, 4th-, and 5th-order) stencil computations to analyze the dependency of data-level parallel efficiency on the stencil order.

36 citations
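
The speedups in the paper come from SIMD intrinsics and hardware cache behaviour that a high-level sketch cannot reproduce, but the loop-tiling idea behind cache blocking is easy to show. Below is a hedged sketch of a tiled sweep of a simple 7-point stencil; the tile sizes and the low-order kernel are assumptions, not the authors' 6th-order seismic kernel.

```python
import numpy as np

def blocked_stencil(u, bi=32, bj=32, bk=32):
    # Apply a 7-point stencil to the interior of a cubic grid, sweeping it in
    # (bi, bj, bk) tiles so that each tile's working set stays cache-resident
    # while its points are updated -- the essence of cache blocking.
    n = u.shape[0]
    out = np.zeros_like(u)
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            for k0 in range(1, n - 1, bk):
                i1 = min(i0 + bi, n - 1)
                j1 = min(j0 + bj, n - 1)
                k1 = min(k0 + bk, n - 1)
                # Vectorised update of one tile (interior points only).
                out[i0:i1, j0:j1, k0:k1] = (
                    u[i0-1:i1-1, j0:j1, k0:k1] + u[i0+1:i1+1, j0:j1, k0:k1] +
                    u[i0:i1, j0-1:j1-1, k0:k1] + u[i0:i1, j0+1:j1+1, k0:k1] +
                    u[i0:i1, j0:j1, k0-1:k1-1] + u[i0:i1, j0:j1, k0+1:k1+1]
                    - 6.0 * u[i0:i1, j0:j1, k0:k1])
    return out

u = np.random.rand(128, 128, 128)
v = blocked_stencil(u)
```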


Proceedings Article
01 Jan 2009
TL;DR: A set of parallel algorithms to determine the hydrological flow direction and contributing area of each cell in a digital elevation model (DEM) using cluster computers in an MPI programming model are introduced.
Abstract: This paper introduces a set of parallel algorithms to determine the hydrological flow direction and contributing area of each cell in a digital elevation model (DEM) using cluster computers in an MPI programming model. DEMs are partitioned across processes relevant to the physical layout of the terrain such that processes with adjacent ranks calculate flow direction and contributing areas for physically adjacent partitions of the DEM. The contributing area algorithm makes use of a queue to order the consideration of cells such that each cell is visited only once for calculation and cross-partition calculations are handled in an efficient and order-independent manner. This algorithm replaces a serial recursive algorithm included as part of the TauDEM hydrology analysis package.

28 citations
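
As a single-process illustration of the queue-ordered, visit-each-cell-once accumulation described above (ignoring the MPI partitioning and the cross-partition exchanges; the flat `downstream` encoding of flow directions is an assumption, not the TauDEM data layout):

```python
from collections import deque
import numpy as np

def contributing_area(downstream):
    # Queue-based (Kahn-style) accumulation of contributing area.
    # downstream[i] is the flat index of the cell that cell i drains into,
    # or -1 if cell i is an outlet. Each cell contributes its own unit area
    # plus everything that drains through it; every cell is visited once.
    n = len(downstream)
    area = np.ones(n)                        # each cell contributes itself
    indeg = np.zeros(n, dtype=int)           # number of upstream neighbours
    for d in downstream:
        if d >= 0:
            indeg[d] += 1
    queue = deque(np.flatnonzero(indeg == 0))    # start from ridge cells
    while queue:
        c = queue.popleft()
        d = downstream[c]
        if d >= 0:
            area[d] += area[c]
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return area

# Tiny example: a 4-cell chain 0 -> 1 -> 2 -> 3 (outlet).
print(contributing_area([1, 2, 3, -1]))      # [1. 2. 3. 4.]
```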


Proceedings Article
01 Jan 2009
TL;DR: In this paper, a new algorithm for implementing multi-word compare-and-swap functionality supporting the Read and CASN operations is presented, which is wait-free under reasonable assumptions on execution time.
Abstract: We present a new algorithm for implementing multi-word compare-and-swap functionality supporting the Read and CASN operations. The algorithm is wait-free under reasonable assumptions on execution time.

21 citations
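
Since the abstract is cut short, the sketch below only pins down the semantics of the Read/CASN interface: an N-word compare-and-swap that installs all new values only if every addressed word still holds its expected value. A single lock makes the sketch atomic; the paper's contribution is to provide the same semantics wait-free, without any such lock.

```python
import threading

class SharedMemory:
    # Toy word-addressable memory with Read and CASN operations.
    # A lock makes CASN atomic here; the paper's algorithm provides the same
    # semantics wait-free (under its stated execution-time assumptions).
    def __init__(self, size):
        self._words = [0] * size
        self._lock = threading.Lock()

    def read(self, addr):
        with self._lock:
            return self._words[addr]

    def casn(self, addrs, expected, new):
        # Atomically: if words[a] == e for every (a, e), install all of `new`.
        with self._lock:
            if all(self._words[a] == e for a, e in zip(addrs, expected)):
                for a, v in zip(addrs, new):
                    self._words[a] = v
                return True
            return False

mem = SharedMemory(4)
print(mem.casn([0, 2], expected=[0, 0], new=[7, 9]))   # True: both matched
print(mem.casn([0, 2], expected=[0, 9], new=[1, 1]))   # False: word 0 is now 7
```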


Proceedings Article
01 Jan 2009
TL;DR: A solution from cooperative game theory based on the concept of Nash Bargaining Solution is proposed to solve the problem of mapping tasks onto a computational grid subject to the constraints of deadlines and architectural requirements.
Abstract: The problem of mapping tasks onto a computational grid with the aim of minimizing the power consumption and the makespan, subject to the constraints of deadlines and architectural requirements, is considered in this paper. To solve this problem, we propose a solution from cooperative game theory based on the concept of the Nash Bargaining Solution. The proposed game theoretical technique is compared against several traditional techniques. The experimental results show that when the deadline constraints are tight, the proposed technique achieves superior performance and reports competitive performance relative to the optimal solution.

17 citations
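
For readers unfamiliar with the concept, the classical Nash Bargaining Solution over n players with utilities u_i and disagreement point d selects the feasible outcome that maximizes the product of the utility gains; how the paper maps makespan and power consumption onto the u_i is not reproduced here.

```latex
\max_{x \in \mathcal{S}} \; \prod_{i=1}^{n} \bigl( u_i(x) - d_i \bigr)
\qquad \text{subject to } u_i(x) \ge d_i \ \text{for all } i,
```

where \mathcal{S} denotes the set of feasible schedules.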


Proceedings Article
01 Jan 2009
TL;DR: Fast Collective-Network Protocol allows bandwidths over 50% beyond the LOFAR requirements, so that the telescope can observe proportionally more sources or frequencies and becomes a much more efficient system.
Abstract: This paper describes the Fast Collective-Network Protocol (FCNP). FCNP is a low-overhead, high-bandwidth network protocol that we developed for fast communication between the Blue Gene/P compute nodes and I/O nodes. The CPU cores in this system are hardly able to keep up with the high-speed internal network, and any protocol overhead significantly slows down the achieved bandwidths. FCNP minimizes overhead and approaches the link speed for large messages. FCNP is of critical importance to the correlator of the LOFAR radio telescope, which will process hundreds of gigabits of real-time telescope data per second. Without FCNP, the correlator would not even achieve the required data rates. However, FCNP allows bandwidths over 50% beyond the LOFAR requirements, so that the telescope can observe proportionally more sources or frequencies and becomes a much more efficient system.

14 citations


Proceedings Article
01 Jan 2009
TL;DR: Task parallelism in constraint programming is explored, specifically the case of parallel consistency; the results show that parallelizing consistency can provide the programmer with robust scalability for regular problems with global constraints.
Abstract: Program parallelization becomes increasingly important as new multi-core architectures provide ways to improve performance. One of the greatest challenges of this development lies in programming parallel applications. Using declarative languages, such as constraint programming, can make the transition to parallelism easier by hiding the parallelization details in a framework. Automatic parallelization in constraint programming has previously focused on data parallelism. In this paper, we look at task parallelism, specifically the case of parallel consistency. We have developed two models of parallel consistency, one that shares intermediate results and one that does not, and we evaluate in our experiments which model is better. Our results show that parallelizing consistency can provide the programmer with robust scalability for regular problems with global constraints.

14 citations
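
A hedged sketch of the no-sharing variant of parallel consistency, in which each constraint propagator runs on its own worker and the domain reductions are intersected only between rounds; the propagator interface and the tiny example problem are assumptions, not the authors' framework.

```python
from concurrent.futures import ThreadPoolExecutor

def propagate_parallel(domains, propagators, workers=4):
    # Run all constraint propagators in parallel rounds until a fixpoint.
    # Each propagator maps the current domains to a (possibly reduced) copy;
    # reductions are intersected only between rounds, i.e. intermediate
    # results are NOT shared within a round.
    changed = True
    while changed:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda p: p(domains), propagators))
        new = {v: set.intersection(*(r[v] for r in results)) for v in domains}
        changed = new != domains
        domains = new
    return domains

# Tiny example: x, y in {1..4}, constraints x < y and x + y == 5.
def less(d):     return {"x": {a for a in d["x"] if a < max(d["y"])},
                          "y": {b for b in d["y"] if b > min(d["x"])}}
def sum_to_5(d): return {"x": {a for a in d["x"] if 5 - a in d["y"]},
                         "y": {b for b in d["y"] if 5 - b in d["x"]}}

print(propagate_parallel({"x": {1, 2, 3, 4}, "y": {1, 2, 3, 4}},
                         [less, sum_to_5]))
```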


Proceedings Article
01 Jan 2009
TL;DR: A simple technique based on multi-objective goal programming is proposed that guarantees a Pareto-optimal solution with an excellent convergence process; it achieves superior performance compared to the min-min heuristic and competitive performance relative to the optimal solution implemented in LINDO for small-scale problems.
Abstract: We model the process of a data center as a multi-objective problem of mapping independent tasks onto a set of data center machines that simultaneously minimizes the energy consumption and response time (makespan) subject to the constraints of deadlines and architectural requirements. A simple technique based on multi-objective goal programming is proposed that guarantees a Pareto-optimal solution with an excellent convergence process. The proposed technique is also compared with other traditional approaches. The simulation results show that the proposed technique achieves superior performance compared to the min-min heuristic, and competitive performance relative to the optimal solution implemented in LINDO for small-scale problems.

10 citations
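
The abstract does not spell out the formulation; as general background, a one-sided weighted goal-programming model for the two objectives penalizes only the overshoot of each goal, along the lines of the sketch below. The goals g_E, g_M and weights w_E, w_M are placeholders, and the deadline and architectural constraints on the assignment x are omitted.

```latex
\min \; w_E\,\delta_E^{+} + w_M\,\delta_M^{+}
\quad \text{s.t.} \quad
E(x) \le g_E + \delta_E^{+}, \qquad
M(x) \le g_M + \delta_M^{+}, \qquad
\delta_E^{+},\, \delta_M^{+} \ge 0 ,
```

where E(x) is the energy consumption and M(x) the makespan of assignment x.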


Proceedings Article
01 Jan 2009
TL;DR: A scalable hierarchical parallelization framework for molecular dynamics simulation on emerging multicore clusters that combines inter-node level parallelism by spatial decomposition using message passing, and intra-node (inter-core) level parallelism through a master/worker paradigm and cellular decomposition through critical section-free multithreading.
Abstract: We have developed a scalable hierarchical parallelization framework for molecular dynamics (MD) simulation on emerging multicore clusters. The framework combines: (1) inter-node level parallelism by spatial decomposition using message passing; (2) intra-node (inter-core) level parallelism through a master/worker paradigm and cellular decomposition using critical section-free multithreading; and (3) intra-core level parallelism via single-instruction multiple-data (SIMD) techniques. Our multithreading scheme takes account of cache coherency to maximize performance. For data-level parallelism via SIMD, zero padding is used to solve the alignment issue for complex data types such as arrays, and simple data-type reformatting is used to solve the alignment issue for data with irregular memory accessing. By combining a hierarchy of parallelism, the framework exposes maximal concurrency and data locality, thereby achieving: (1) inter-node weak-scaling parallel efficiency of 0.975 on 32,768 BlueGene/P nodes and 0.985 on 106,496 BlueGene/L nodes; (2) inter-node strong-scaling parallel efficiency of 0.90 on 32 dual quadcore AMD Opteron nodes and 0.94 on 32 dual quadcore Intel Xeon nodes; (3) inter-core multithread parallel efficiency of 0.65 for the whole program (0.89 for the two-body force calculation) for eight threads on a dual quadcore Xeon platform; and (4) SIMD speedup of 1.35 for the whole program (1.42 for the two-body force calculation).

8 citations


Proceedings Article
01 Jan 2009
TL;DR: The proposed pseudo-random approach based on Borel Cayley graphs yields a 2 to 100 times faster convergence than the small-world network and has the best scalability over different graph sizes and degrees.
Abstract: In this paper, we focus on the design of network topology to achieve fast information distribution. We show that the information distribution performance of Borel Cayley graphs, a family of pseudo-random graphs, is far superior to that of other well-known graph families. To demonstrate the effectiveness of this pseudo-random approach, we compare the convergence speed of the average consensus protocol on Borel Cayley graphs against that of a wide range of graph families with sizes ranging from around 100 nodes to 5,000 nodes. In the comparison study, we consider Borel Cayley graphs, regular ring lattices, Erdos-Renyi random graphs, Watts-Strogatz small-world networks, and toroidal and diagonal meshes. Our results indicate that the proposed pseudo-random approach based on Borel Cayley graphs yields a 2 to 100 times faster convergence than the small-world network (rewiring probability p = 0.01, 0.1 and 0.2) does. More importantly, Borel Cayley graphs have the best scalability over different graph sizes and degrees.

7 citations
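
The benchmark used throughout the comparison is the average consensus protocol; a minimal sketch of that protocol on an arbitrary undirected graph is given below. The step size and stopping tolerance are assumptions, and the construction of Borel Cayley graphs is not shown.

```python
import numpy as np

def consensus_rounds(adj, x0, eps=0.05, tol=1e-6, max_iter=100000):
    # Iterate x_i <- x_i + eps * sum_j a_ij (x_j - x_i) until every value is
    # within `tol` of the initial average; return the number of rounds taken.
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    target = x.mean()                  # symmetric weights preserve the mean
    deg = adj.sum(axis=1)
    for t in range(max_iter):
        if np.max(np.abs(x - target)) < tol:
            return t
        x = x + eps * (adj @ x - deg * x)
    return max_iter

# Example: a ring of 10 nodes with random initial values.
n = 10
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[i, (i - 1) % n] = 1
print(consensus_rounds(ring, np.random.rand(n)))
```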



Proceedings Article
01 Jan 2009
TL;DR: Performance measurements, using four clients to solve a number of benchmark problems, show that Picoso yields (almost) linear speedup compared to the sequential interval constraint solver iSAT, on which the clients of Picoso are based.
Abstract: This paper describes the parallel interval constraint solver Picoso, which can decide (a subclass of) boolean combinations of linear and non-linear constraints. Picoso follows a master/client model based on message passing, making it suitable for any kind of workstation cluster as well as for multi-processor machines. To run several clients in parallel, an efficient work stealing mechanism has been integrated, dividing the overall search space into disjoint parts. Additionally, to prevent the clients from running into identical conflicts, information about conflicts, in the form of conflict clauses, is exchanged among the clients. Performance measurements, using four clients to solve a number of benchmark problems, show that Picoso yields (almost) linear speedup compared to the sequential interval constraint solver iSAT, on which the clients of Picoso are based.


Proceedings Article
01 Jan 2009
TL;DR: Five scheduling policies are evaluated through simulation studies; four of these policies are known from the literature and one policy is newly proposed.
Abstract: Workflows are modeled with directed acyclic graphs in which vertices represent computational tasks, referred to as requests, and edges represent precedence constraints among requests. Associated with each workflow is a deadline that defines the time by which all computations of a workflow should be complete. Workflows are submitted by numerous clients to a centralized scheduler that assigns workflow requests to a cluster of memory-managed multicore machines for execution. The objective of the scheduler is to minimize missed workflow deadlines. The characteristics of workflows are assumed to vary along several dimensions, including: arrival rate, periodicity, degree of parallelism, and number of requests. Five scheduling policies are evaluated; four of these policies are known from the literature and one policy is newly proposed. The advantages and disadvantages of each policy are determined through simulation studies.


Proceedings Article
01 Jan 2009
TL;DR: An exhaustive search algorithm is proposed as a starting point for open-ended research on algorithmic scalability and feasibility conditions for system models whose timing relationships may not admit an optimal solution.
Abstract: An interesting optimization problem is examined where optimal load allocations depend on processor release times, but the timing of release times depends on the load allocation scenario used. This circular relationship between an optimal solution and release times can produce system models whose timing relationships do not admit an optimal solution. To obtain an optimal solution based on an assumed model and its arbitrary timing relationships, we propose an exhaustive search algorithm as a starting point for open-ended research on algorithmic scalability and feasibility conditions. Through simulation, the behavior of the exhaustive search algorithm is investigated and load scheduling trends with arbitrary release times are verified. A bus network (homogeneous single-level tree network) with arbitrary processor release times is considered. For the scheduling strategy, sequential distribution with a staggered-start scheduling scenario to minimize the total processing finish time is assumed.

Proceedings Article
01 Jan 2009
TL;DR: The iteration bound for a CSDFG is presented, which is used to find the integral static schedule and to determine whether a CSDFG is live.
Abstract: Some processes display cyclically changing but predefined behavior. Such processes can be represented using cyclo-static data flow graphs (CSDFGs). This capability results in a higher degree of parallelism. In this paper we present the iteration bound for a CSDFG, which is used to find the integral static schedule and, based on this calculation, to determine whether a CSDFG is live. We also present an algorithm that schedules cyclo-static data flow graphs without converting them to their equivalent homogeneous graphs (EHGs) and demonstrate it with a suitable example.
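
The abstract does not restate the definition; for ordinary (homogeneous) data flow graphs, the classical iteration bound is the maximum cycle ratio below, and the paper works out the corresponding bound for CSDFGs without expanding them to their EHGs.

```latex
T_{\infty} \;=\; \max_{l \in L} \; \frac{\sum_{v \in l} t_v}{\sum_{e \in l} d_e},
```

where L is the set of directed cycles of the graph, t_v is the execution time of actor v, and d_e is the number of initial tokens (delays) on edge e.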




Proceedings Article
01 Jan 2009
TL;DR: A game theoretical technique is proposed in which players continuously compete in a non-cooperative environment to improve data accessibility by replicating data objects; it outperforms four well-known techniques in both execution time and solution quality.
Abstract: In this paper, a mathematical model for data object replication in ad hoc networks is formulated. The derived model is general, flexible and adaptable to cater for various applications in ad hoc networks. We propose a game theoretical technique in which players (mobile hosts) continuously compete in a non-cooperative environment to improve data accessibility by replicating data objects. The technique incorporates the access frequency from mobile hosts to each data object, the status of the network connectivity, and communication costs. The proposed technique is extensively evaluated against four well-known ad hoc network replica allocation methods. The experimental results reveal that the proposed approach outperforms the four techniques in both the execution time and solution quality.






Proceedings Article
01 Jan 2009
TL;DR: Game theoretical auctions are used to identify a bidding mechanism that encapsulates the selfishness of the agents while keeping a controlling hand over them; experimental results reveal that this mechanism provides excellent solution quality while maintaining fast execution time.
Abstract: Fine-grained data replication over the Internet allows duplication of frequently accessed data objects, as opposed to entire sites, to certain locations so as to improve the performance of large-scale content distribution systems. In a distributed system, agents representing their sites try to maximize their own benefit since they are driven by different goals, such as minimizing their communication costs, latency, etc. In this paper, we use game theoretical techniques, and in particular auctions, to identify a bidding mechanism that encapsulates the selfishness of the agents, while having a controlling hand over them. In essence, the proposed game theory based mechanism is the study of what happens when independent agents act selfishly and how to control them to maximize the overall performance. A bidding mechanism asks how one can design systems so that agents’ selfish behavior results in the desired system-wide goals. Experimental results reveal that this mechanism provides excellent solution quality, while maintaining fast execution time. The comparisons are recorded against some well-known techniques such as greedy, branch and bound, game theoretical auctions and

Proceedings Article
01 Jan 2009
TL;DR: Different central moments used to quantify the heterogeneity of ETC matrices obtained from real-world systems and benchmark data are identified, and the effect of these moments on the performance of heuristics is shown through both simple examples and simulations.
Abstract: One type of heterogeneous computing (HC) systems consists of machines with diverse capabilities harnessed together to execute a set of tasks that vary in their computational complexity. An HC system can be characterized using an Estimated Time to Compute (ETC) matrix. Each value in this matrix represents the ETC of a specific task on a specific machine when executed exclusively. Heuristics use the values in the ETC matrix to allocate tasks to machines in the HC system. The performance of resource allocation heuristics can be affected significantly by factors such as task and machine heterogeneities. Therefore, quantifying heterogeneity will allow a system to select a heuristic appropriate for the given heterogeneous environment. In this paper, we identify different central moments used to quantify the heterogeneity of ETC matrices obtained from real world systems and benchmark data, and show the effect of these moments on the performance of heuristics both through simple examples and simulations.
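
As a hedged illustration of what quantifying ETC heterogeneity can look like (the abstract does not list which central moments the paper uses; the particular statistics and axis conventions below are assumptions):

```python
import numpy as np

def heterogeneity_stats(etc):
    # Per-axis summary statistics of an ETC matrix (tasks x machines).
    # Machine heterogeneity: spread of a task's times across machines (rows).
    # Task heterogeneity: spread of a machine's times across tasks (columns).
    etc = np.asarray(etc, dtype=float)

    def stats(values, axis):
        mean = values.mean(axis=axis)
        std = values.std(axis=axis)
        centred = values - np.expand_dims(mean, axis)
        skew = (centred**3).mean(axis=axis) / std**3   # third standardised moment
        return {"cov": (std / mean).mean(), "skew": skew.mean()}

    return {"machine": stats(etc, axis=1), "task": stats(etc, axis=0)}

# Example: 4 tasks on 3 machines with random execution-time estimates.
print(heterogeneity_stats(np.random.lognormal(mean=3.0, sigma=0.8, size=(4, 3))))
```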


Proceedings Article
George Wells
01 Jan 2009
TL;DR: A library of classes providing support for interprocess communication in Java programs, using the mechanisms present in the native operating system, is described, showing significant performance improvements over the standard Java mechanisms available for such systems.
Abstract: This paper describes a library of classes providing support for interprocess communication in Java programs, using the mechanisms present in the native operating system. This approach is particularly well-suited for use with independent Java processes running on a single multicore (or multiprocessor) computer. At this stage, a comprehensive class library has been implemented for the Linux operating system, allowing access to the rich set of interprocess communication mechanisms provided by it. Initial testing shows significant performance improvements over the standard Java mechanisms available for such systems.