
Showing papers presented at "Parallel and Distributed Processing Techniques and Applications in 2008"


Proceedings Article
01 Dec 2008
TL;DR: This paper introduces a first step towards building an efficient GPU-based parallel implementation of a commonly used clustering algorithm called K-Means on an NVIDIA G80 PCI express graphics board using the CUDA processing extensions.
Abstract: Graphics Processing Units (GPUs) have recently been the subject of attention in research as efficient coprocessors for implementing many classes of highly parallel applications. The GPU's design is engineered for graphics applications, where many independent SIMD workloads are simultaneously dispatched to processing elements. While parallelism has been explored in the context of traditional CPU threads and SIMD processing elements, the principles involved in dividing the steps of a parallel algorithm for execution on GPU architectures remain a significant challenge. In this paper, we introduce a first step towards building an efficient GPU-based parallel implementation of a commonly used clustering algorithm called K-Means on an NVIDIA G80 PCI Express graphics board using the CUDA processing extensions. Clustering algorithms are important for search, data mining, spam and intrusion detection applications. Modern desktop machines commonly include desktop search software that can be greatly enhanced by these advances, while low-power machines such as laptops can reduce power consumption by utilizing the video chip for these clustering and indexing operations. Our preliminary results show over a 13x performance improvement compared to a baseline 3 GHz Intel Pentium(R) based PC running the same algorithm with an average spec G80 graphics card, the NVIDIA 8600GT. The low cost of these video cards (less than $100 market price as of 2008) and the high performance gains suggest that our approach is both practical and economical for common applications.
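As an illustration of the algorithm the abstract above refers to, here is a minimal sequential sketch of K-Means (Lloyd's iteration) in plain Python. It is not the paper's CUDA implementation; the function names and the empty-cluster rule are assumptions made for this example. The assignment step treats every point independently, which is the kind of data-parallel work that could be handed to one GPU thread per point.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Assignment step: each point is handled independently of the others,
    so this loop is the naturally data-parallel part of K-Means."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update_centroids(points: Sequence[Point], labels: Sequence[int],
                     centroids: Sequence[Point]) -> List[List[float]]:
    """Update step: move each centroid to the mean of its assigned points.
    Assumption for this sketch: an empty cluster keeps its old centroid."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points: Sequence[Point], initial_centroids: Sequence[Point],
           max_iters: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Alternate assignment and update until the labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iters):
        new_labels = assign_labels(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(points, labels, centroids)
    return centroids, labels
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` groups the first two points and the last two points into separate clusters.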

138 citations


Proceedings Article
01 Jan 2008
TL;DR: This paper investigates a dynamic load-balancing algorithm for heterogeneous distributed systems where half of the processors have double the speed of the others, and some simulation results are presented to show the effectiveness of genetic algorithms for dynamic load balancing.
Abstract: Load balancing is a crucial issue in parallel and distributed systems to ensure fast processing and optimum utilization of computing resources. Load balancing strategies try to ensure that every processor in the system does almost the same amount of work at any point in time. This paper investigates a dynamic load-balancing algorithm for heterogeneous distributed systems where half of the processors have double the speed of the others. Two job classes are considered for the study: jobs of the first class are dedicated to fast processors, while jobs of the second class are generic in the sense that they can be allocated to any processor. The performance of the scheduler has been verified under scalability. Some simulation results are presented to show the effectiveness of genetic algorithms for dynamic load balancing.

17 citations


Proceedings Article
01 Jan 2008
TL;DR: A parallel algorithm for prefix computation on an n × n mesh of trees (MOT) is developed, and a prefix algorithm built on it for an n × n OTIS mesh of trees is shown to handle n^4 data elements in 13 log n + O(1) electronic moves + 2 OTIS moves using n processors.
Abstract: In this paper, we first develop a parallel algorithm for prefix computation on an n × n mesh of trees (MOT). For n data elements, the algorithm requires 4 log n + O(1) time using n^2 processors. Using the MOT prefix, we next propose a prefix algorithm on an n × n OTIS mesh of trees. This algorithm for n^4 data elements is shown to map in 13 log n + O(1) electronic moves + 2 OTIS moves using n processors.
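To make "prefix computation in logarithmic time" concrete, the sketch below simulates a generic log-step inclusive scan (the Hillis-Steele pattern) in Python. It does not reproduce the MOT or OTIS mapping from the abstract; the function name and the returned step counter are assumptions for illustration. Each pass of the loop stands for one parallel step in which every position would be updated at the same time.

```python
import operator
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def parallel_prefix(values: Sequence[T],
                    op: Callable[[T, T], T] = operator.add) -> Tuple[List[T], int]:
    """Inclusive prefix computation with a doubling stride.

    Returns the prefix results and the number of simulated parallel steps,
    which is ceil(log2(n)) for n >= 1 elements.
    """
    result = list(values)
    n = len(result)
    steps = 0
    stride = 1
    while stride < n:
        previous = result  # every position reads the values of the previous step
        result = [
            previous[i] if i < stride else op(previous[i - stride], previous[i])
            for i in range(n)
        ]
        stride *= 2
        steps += 1
    return result, steps
```

With eight inputs the loop runs three times, matching log2(8) parallel steps for an associative operator such as addition.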

13 citations


Proceedings Article
01 Jan 2008
TL;DR: This work applies libAuToti, an open-source parallel library for implementing models based on the cellular automata approach, to the simulation of lava flows as defined by the SCIARA model.
Abstract: Cellular Automata, a computational science branch used for the modeling and simulation of complex systems, are parallel computing models which are continuously gaining attention from the scientific community for their potential and efficiency. This work presents the application of libAuToti, an open-source parallel library for implementing models based on the Cellular Automata approach, to the simulation of lava flows as defined by the SCIARA model. The library permits a straightforward and simple implementation of Macroscopic Cellular Automata models, which are appropriate for the simulation of spatially extended dynamical systems. Experiments have demonstrated the high computational efficiency of the library, executed both on an HPC machine and a standard multi-core PC, confirming the reliability of the library and the quality of the simulation results. Finally, future improvements and possible applications are discussed at the end of the paper.

12 citations


Proceedings Article
01 Jan 2008

12 citations


Proceedings Article
01 Jan 2008
TL;DR: This paper presents a Globus XIO based client to GridFTP that provides a simple Open/Close/Read/Write (OCRW) interface to the users and shows that the performance of this OCRW client is comparable to that of globus-url-copy.
Abstract: GridFTP is a high-performance, reliable data transfer protocol optimized for high-bandwidth wide-area networks. Based on the Internet FTP protocol, it defines extensions for high-performance operation and security. The Globus implementation of GridFTP provides a modular and extensible data transfer system architecture suitable for wide area and high-performance environments. GridFTP is the de facto standard in projects requiring secure, robust, high-speed bulk data transport. For example, the high energy physics community is basing its entire tiered data movement infrastructure for the Large Hadron Collider computing Grid on GridFTP; the Laser Interferometer Gravitational Wave Observatory routinely uses GridFTP to move 1 TB a day during production runs; and GridFTP is the recommended data transfer mechanism to maximize data transfer rates on the TeraGrid. Commonly used GridFTP clients include globus-url-copy, uberftp, and the Globus Reliable File Transfer service. In this paper, we present a Globus XIO based client to GridFTP that provides a simple Open/Close/Read/Write (OCRW) interface to the users. Such a client greatly eases the addition of GridFTP support to third-party programs, such as SRB and MPICH-G2. Further, this client provides an easier and familiar interface for applications to efficiently access remote files. We compare the performance of this client with that of globus-url-copy on multiple endpoints in the TeraGrid infrastructure. We perform both memory-to-memory and disk-to-disk transfers and show that the performance of this OCRW client is comparable to that of globus-url-copy. We also show that our GridFTP client significantly outperforms the GPFS WAN on the TeraGrid.

9 citations


Proceedings Article
01 Jan 2008

8 citations


Proceedings Article
01 Jan 2008
TL;DR: The resulting algorithm is the first CGM algorithm for this problem derived from Yao's acceleration technique and runs in S super-steps with O(n/P) as time of execution per processor.
Abstract: The matrix chain ordering problem (MCOP) is widely used in computer science, especially in combinatorial optimization. Even though there has been intensive work on the parallelization of dynamic programming on PRAM and systolic arrays, among others, its parallel version on BSP/CGM is still to be done. In our former work [10], we proposed a BSP/CGM algorithm for this problem running in O(n/p). Our approach was based on the classical sequential algorithm (running in O(n)); hence the algorithm we obtained is not optimal. In this paper our strategy is based on Yao's sequential algorithm for dynamic programming [18] running in O(n). Our resulting algorithm runs in S super-steps with O(n/P) as time of execution per processor. To our knowledge, it is the first CGM algorithm for this problem derived from Yao's acceleration technique. Keywords: Dynamic Programming, Parallel Algorithms, BSP/CGM Algorithms.
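For readers unfamiliar with the problem, the following Python sketch shows the classical sequential dynamic program for matrix chain ordering, written as three nested loops over chain length, start index, and split point. It is neither the BSP/CGM algorithm nor Yao's accelerated variant described in the abstract; the function name and the `A1, A2, ...` labels are assumptions for this example.

```python
from typing import List, Sequence, Tuple


def matrix_chain_order(dims: Sequence[int]) -> Tuple[int, str]:
    """Return the minimum scalar-multiplication cost and one optimal
    parenthesization for matrices A1..An, where Ai is dims[i-1] x dims[i]."""
    n = len(dims) - 1
    if n < 1:
        raise ValueError("dims must describe at least one matrix")

    # cost[i][j]: cheapest way to multiply Ai+1..Aj+1 (0-based indices)
    cost: List[List[int]] = [[0] * n for _ in range(n)]
    split: List[List[int]] = [[0] * n for _ in range(n)]

    for length in range(2, n + 1):           # length of the sub-chain
        for i in range(n - length + 1):      # start of the sub-chain
            j = i + length - 1               # end of the sub-chain
            best = None
            for k in range(i, j):            # last multiplication splits after k
                candidate = (cost[i][k] + cost[k + 1][j]
                             + dims[i] * dims[k + 1] * dims[j + 1])
                if best is None or candidate < best:
                    best = candidate
                    split[i][j] = k
            cost[i][j] = best

    def parenthesize(i: int, j: int) -> str:
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({parenthesize(i, k)} {parenthesize(k + 1, j)})"

    return cost[0][n - 1], parenthesize(0, n - 1)
```

For dimensions `[10, 30, 5, 60]`, multiplying the first two matrices first costs 10·30·5 + 10·5·60 = 4500 scalar multiplications, against 27000 for the other order.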

8 citations



Proceedings Article
01 Jan 2008
TL;DR: This paper discusses problems surrounding contention management, related work addressing these problems, a new dynamic contention manager algorithm yielding an Adaptive STM (ASTM) library, experimental results comparing static versus dynamic contention management, and an analysis of the results.
Abstract: Effectively managing shared memory in a multi-threaded environment is critical in order to achieve high performance in multi-core hardware platforms. Software Transactional Memory (STM) is a scheme for managing shared memory in a concurrent programming environment. STM views shared memory in a way similar to that of a database; read and write operations are handled through transactions, with changes to the shared memory becoming permanent through commit operations. Furthermore, its benefits are not attained until larger data structures are used. Currently there are varying methods for collision detection, data validation, and contention management, each of which has different situations in which they become the preferred method. This paper discusses problems surrounding contention management, related work addressing these problems, a new dynamic contention manager algorithm yielding an Adaptive STM (ASTM) library, experimental results comparing static versus dynamic contention management, and an analysis of the result.

8 citations



Proceedings Article
01 Jan 2008
TL;DR: Feed bar advance and return driving apparatus for a transfer press.
Abstract: Feed bar advance and return driving apparatus for a transfer press. An upper slider is reciprocally vertically movable in the press crown in synchronization with the operation of the press, and a pinion rotatably mounted on the upper slider is rotated alternatingly in opposite directions with the upward and downward vertical movement of the upper slider. A drive rack vertically slidably supported on the upper slider is connected to the pinion at a position on the pinion which is eccentric to the axis of rotation of the pinion and is reciprocally vertically driven by the rotation of the pinion. The lower end of the drive rack extends downwardly and engages a drive pinion for rotating the drive pinion alternatingly in opposite directions during the reciprocal vertical motion of the drive rack. A drive lever connected to the drive pinion at a position eccentric to the axis of rotation of the drive pinion and pivotally mounted on the press is driven in a swinging motion and is connected to a lower slider for supporting feed bars of the press to drive the lower slider in advancing and returning directions.

Proceedings Article
01 Jan 2008
TL;DR: This paper addresses an approach of improving the consistency in soft-state based systems by studying how to specify the refresh and timeout periods as functions of the other parameters and evaluating whether such functions are useful to improve the system consistency.
Abstract: This paper addresses an approach to improving the consistency in soft-state based systems. The consistency in such a system is affected by many parameters; some are given by the environment while others are tunable. The environment parameters are the loss probability and latency of the channel between the endpoints, and the change rate of the source state. The tunable parameters are the refresh period and timeout period. We study how to specify the refresh and timeout periods as functions of the other parameters and evaluate whether such functions are useful to improve the system consistency. To that end, we performed simulation experiments with different parameter values and we investigated the relationship between them. Our results show that the consistency of a system using the timeout period based on the proposed function increases significantly, especially when used in a lossy environment. As for the refresh period, our results show that the lowest inconsistency is achieved when the refresh period is about the same as the channel latency.
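The interplay between loss probability, refresh period, and timeout period can be illustrated with a deliberately simple model. The sketch below is not the function proposed in the paper: it assumes independent losses, ignores channel latency and the source change rate, and uses hypothetical function names. Under those assumptions, state held at the receiver expires falsely only when every refresh inside one timeout window is lost.

```python
def false_timeout_probability(loss_prob: float, refresh_period: float,
                              timeout_period: float) -> float:
    """Probability that all refreshes inside one timeout window are lost.

    Assumed model: a refresh is sent every refresh_period seconds, each is
    lost independently with probability loss_prob, and latency is ignored.
    """
    if not 0.0 <= loss_prob <= 1.0:
        raise ValueError("loss_prob must be in [0, 1]")
    if refresh_period <= 0 or timeout_period <= 0:
        raise ValueError("periods must be positive")
    refreshes_in_window = int(timeout_period // refresh_period)
    # With zero refreshes in the window the state always expires (p ** 0 == 1).
    return loss_prob ** refreshes_in_window


def smallest_timeout(loss_prob: float, refresh_period: float,
                     target: float) -> float:
    """Smallest multiple of refresh_period whose false-timeout probability
    is at most target, under the same assumed model."""
    if not 0.0 < loss_prob < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("loss_prob and target must be in (0, 1)")
    if refresh_period <= 0:
        raise ValueError("refresh_period must be positive")
    k = 1
    while loss_prob ** k > target:
        k += 1
    return k * refresh_period
```

In this toy model a lossier channel calls for a timeout spanning more refresh periods, which is consistent in spirit with the abstract's observation that a well-chosen timeout helps most in lossy environments.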

Proceedings Article
01 Jan 2008
TL;DR: This work proposes an autonomous desktop Grid computing system, Self-Gridron based on a neural overlay network that supports reliable, autonomous, and cost-effective scheduling which includes eligible resource classification and job management.
Abstract: Although desktop Grid computing has been regarded as a cost-efficient computing paradigm, the system has suffered from scalability issues caused by its centralized structure. In addition, resource volatility generates system instability and performance deterioration. However, regarding the provision of a reliable and stable execution environment, resource management becomes more intricate when the system is constructed in a fully decentralized fashion without a central server. Scaling the system numerically and geographically is necessary for autonomous network organization, facile adaptation to execution failure and dynamic self-management of volatile resources. In order to develop a fully decentralized desktop Grid computing system securely, we propose an autonomous desktop Grid computing system, Self-Gridron, based on a neural overlay network. Self-Gridron supports reliable, autonomous, and cost-effective scheduling which includes eligible resource classification and job management (i.e. allocation, replication, and reassignment). Furthermore, Self-Gridron provides sovereign learning (with error correction) and evolves adaptively by itself in response to system changes or failures on the fly while improving performance.


Proceedings Article
01 Jan 2008
TL;DR: A simulation tool to perform analysis of High Performance Applications, which perform a large number of I/O operations, on large storage networks, and the most interesting features of this simulation tool are its flexibility and scalability.
Abstract: In this paper we present a tool to perform analysis of High Performance Applications, which perform a large number of I/O operations, on large storage networks. In order to perform those analyses we have developed SIMCAN, a simulation tool to analyze High-Performance I/O Architectures. Storage subsystem performance is one of the major concerns that arise on large storage networks. Major requirements for storage networks are scalability and performance. In those kinds of networks, defining an architecture that satisfies those requirements is a very difficult task. With SIMCAN, custom environments can be configured and deployed in a flexible and easy way. In fact, the most interesting features of this simulation tool are its flexibility and scalability, so the simulation of distributed storage environments can be performed with the required detail level. Thus, in order to evaluate the benefits and the accuracy of the proposed tool, we have tested it with a typical high performance application, and compared the results of the simulated architecture with the real one.


Proceedings Article
01 Jan 2008
TL;DR: A highly-parallel methodology for solving large-scale, dense, linear systems is proposed in this thesis by means of the novel application of Cramer's Rule, yielding an overall computational complexity of O(N) with N^2 processing units.
Abstract: Solving linear systems with multiple variables is at the core of many scientific problems. Parallel processing techniques for solving such system problems have received much attention in recent years. A key theme in the literature pertains to the application of lower triangular and upper triangular (LU) matrix decomposition, which factorizes an N × N square matrix into two triangular matrices. The resulting linear system can be more easily solved in O(N^2) work. Inherently, the computational complexity of LU decomposition is O(N^3). Moreover, it is a challenging process to parallelize. A highly-parallel methodology for solving large-scale, dense, linear systems is proposed in this thesis by means of the novel application of Cramer's Rule. A numerically stable scheme is described, yielding an overall computational complexity of O(N) with N^2 processing units.
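As a reminder of what Cramer's Rule computes, the sketch below solves a small dense system exactly with Python's `fractions.Fraction`. It is a sequential textbook version, not the numerically stable parallel scheme of the thesis; the function names are assumptions for this example. Each unknown is the ratio of two determinants, and the N column-replaced determinants do not depend on one another, which is what makes the rule attractive for parallel evaluation.

```python
from fractions import Fraction
from typing import List, Sequence


def determinant(matrix: Sequence[Sequence[int]]) -> Fraction:
    """Determinant by Gaussian elimination over exact rational numbers."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def cramer_solve(a: Sequence[Sequence[int]], b: Sequence[int]) -> List[Fraction]:
    """Solve a x = b with x_i = det(a_i) / det(a), where a_i is a with
    column i replaced by b. The determinants det(a_i) are independent."""
    det_a = determinant(a)
    if det_a == 0:
        raise ValueError("matrix is singular; Cramer's Rule does not apply")
    n = len(a)
    solution = []
    for i in range(n):
        a_i = [list(row[:i]) + [b[r]] + list(row[i + 1:]) for r, row in enumerate(a)]
        solution.append(determinant(a_i) / det_a)
    return solution
```

For the system 2x + y = 3, x + 3y = 5, `cramer_solve([[2, 1], [1, 3]], [3, 5])` yields x = 4/5 and y = 7/5.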

Proceedings Article
01 Jan 2008

Proceedings Article
01 Jan 2008
TL;DR: A spring steel clip for connecting a first member to a second member is U-shaped with side tabs on one leg that is bent toward the inner surface of the other leg for limiting the amount the legs may be pressed together by a screw passing through holes in each leg and into an attached threaded nut.
Abstract: A spring steel clip for connecting a first member to a second member is U-shaped with side tabs on one leg that is bent toward the inner surface of the other leg for limiting the amount the legs may be pressed together by a screw passing through holes in each leg and into an attached threaded nut. The screw may pass through a hole in a first member for connecting it to the edge of a second member to which the clip is clamped.

Proceedings Article
01 Jan 2008
TL;DR: This paper has investigated the impact of execution of engineering applications utilizing one and two cores in an Intel Core 2 Duo based Linux cluster and found that having N processes on N computer nodes, only using one core on each node, is significantly faster than running N processes on N cores in N/2 computer nodes.
Abstract: With the advent of multi-core processors, the parallel execution of simulation applications has resulted in new problems and possibilities in resource usage in high performance computing (HPC). In this paper we have investigated the impact of execution of engineering applications utilizing one and two cores in an Intel Core 2 Duo based Linux cluster. In the engineering industry, the number of licenses puts practical and economic constraints on the maximum number of processes. Consequently, the issue of how to distribute a given number of processes over the compute nodes in an HPC resource becomes very important. When distributing the application over multiple nodes we found that having N processes on N computer nodes, only using one core on each node, is significantly faster than running N processes on N cores in N/2 computer nodes. Only in one case out of 32 was it beneficial to use both cores. The “one compute node – one simulation process” approach gave an average cost efficiency increase of 16.5%, and for several sub-cases it is actually cost-beneficial to run on more nodes rather than fewer, which decreases the overall run time.

Proceedings Article
01 Dec 2008
TL;DR: Relying on the intuition that the internal dynamics of a cache can be captured by a first-order time-dependent process, a model called SMCP is developed, based on the well-studied linear Gaussian state space model, to observe, characterize, and predict the hit rates at a Web cache.
Abstract: Accurate analytical models of Web caches are desirable as they can provide inexpensive ways to make resource provisioning decisions at a cache itself as well as at the Web servers it is servicing. Explicitly modeling a Web cache has two major shortcomings: (i) it requires several simplifying assumptions about the operation of the cache for mathematical tractability, resulting in loss of accuracy, and (ii) it requires measurements of phenomena internal to the cache that may not always be available without adding monitoring hooks within the cache. Therefore, in this paper, we turn towards statistical techniques to develop a model that is non-intrusive (that is, requires no additions to the cache) and treats the Web cache as a black box (that is, operates solely by observing readily available inputs/outputs and requires no knowledge about the internals of the cache). Relying on the intuition that the internal dynamics of a cache can be captured by a first-order time-dependent process, we develop a model called SMCP, based on the well-studied linear Gaussian state space model, to observe, characterize, and predict the hit rates at a Web cache. A comparison with time-independent models, including one based on Linear Regression (LR), validates our intuition for the need to employ a time-dependent model. A detailed evaluation shows the efficacy of our model with LRU and LFU, two representative cache replacement policies. In our experiments, SMCP predicts hit ratios within 0.1 (absolute value) of their actual values 77.5% and 65% of the time for LRU and LFU, respectively. In addition, SMCP captures the time-varying behavior more accurately than several time-independent models.
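The linear Gaussian state space model mentioned in the abstract is commonly filtered with a Kalman filter. The Python sketch below shows a scalar, first-order version that produces one-step-ahead hit-rate predictions from observed hit rates. It is not SMCP itself: the abstract does not give the model's parameterization, so the transition coefficient `a`, the noise variances `q` and `r`, the initial values, and the function name are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def predict_hit_rates(observations: Sequence[float], a: float = 1.0,
                      q: float = 1e-3, r: float = 1e-2,
                      x0: float = 0.5, p0: float = 1.0) -> List[float]:
    """One-step-ahead predictions from a scalar Kalman filter.

    Assumed model (hypothetical parameters):
        state:       x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q)
        observation: y_t = x_t + v_t,          v_t ~ N(0, r)
    predictions[t] is the forecast of observations[t] made before seeing it.
    """
    x, p = x0, p0            # current state estimate and its variance
    predictions = []
    for y in observations:
        # Predict: propagate the estimate through the first-order dynamics.
        x_pred = a * x
        p_pred = a * a * p + q
        predictions.append(x_pred)
        # Update: blend the prediction with the newly observed hit rate.
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (y - x_pred)
        p = (1.0 - gain) * p_pred
    return predictions
```

Because the filter only consumes observed hit rates, it fits the black-box setting the abstract describes: nothing inside the cache needs to be instrumented.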

Proceedings Article
01 Jan 2008
TL;DR: A novel multi-threading finite state machine (FSM) is proposed, which improves FSM clock frequency and allows multiple packets to be examined by a single FSM simultaneously.
Abstract: This paper presents string matching hardware on FPGA for network intrusion detection systems. The proposed architecture, consisting of packet classifiers and string matching verifiers, achieves superb throughput by using several mechanisms. First, based on incoming packet contents, the packet classifiers can dramatically reduce the number of strings to be matched for each packet and, accordingly, feed the packet to a proper verifier to conduct matching. Second, a novel multi-threading finite state machine (FSM) is proposed, which improves FSM clock frequency and allows multiple packets to be examined by a single FSM simultaneously. Design techniques for high-speed interconnect and interface circuits are also presented. Experimental results are presented to explore the trade-offs between system performance, string partition granularity and hardware resource cost.

Proceedings Article
01 Jan 2008
TL;DR: A wall covering is provided for supplying an electric receiver with power at variable positions on the wall, which covering is in the form of elements such as a carpeting tile with textile wearing layer and an underlying insulating foundation layer.
Abstract: PCT No. PCT/FR86/00031 Sec. 371 Date Oct. 8, 1986 Sec. 102(e) Date Oct. 8, 1986 PCT Filed Feb. 5, 1986 PCT Pub. No. WO86/04742 PCT Pub. Date Aug. 14, 1986. A wall covering is provided for supplying an electric receiver with power at variable positions on the wall, which covering is in the form of elements such as a carpeting tile with textile wearing layer (10) and an underlying insulating foundation layer (12) under which are fixed two assemblies of conductors in the form of strips (14, 14') of opposite polarities, covered on their lower face with a covering sheet (15), the tile possibly including electric connection means such as projecting ends (16, 16') of the conducting strips, the strips being designed for supplying with two phase power a current sensor having contact needles passing through the wearing layer (10) and the insulating layer (12) so as to come into contact with the conducting strips.

Proceedings Article
01 Jan 2008
TL;DR: This paper demonstrates the use of dynamic aspects in JAC to solve the problem of load balancing, where the client proxy is modified with an aspect to forward requests to a specific server, but the server is also able to shed load by altering or removing this aspect.
Abstract: Load balancing is the process of distributing client requests over a set of servers, and is a key element of obtaining good performance in a distributed application. Java RMI extends Java with distributed objects whose methods can be called from remote clients. In some Java RMI programs, there may be multiple replicas of a given object that can be the receiver of a remote method invocation. Effectively distributing these requests across these replicas requires either an extra balancer process or additional code on the client for this distribution. In this paper, we demonstrate the use of dynamic aspects in JAC to solve this problem. The client proxy is modified with an aspect to forward requests to a specific server, but the server is also able to shed load by altering or removing this aspect. The overhead of this approach is evaluated using a set of microbenchmarks.

Proceedings Article
01 Jan 2008
TL;DR: This paper introduces Parallel C#, a language whose main feature is the combination of chords and higher-order functions, describes its design, and gives examples of its use in addressing a range of concurrent programming problems.
Abstract: In this paper we introduce a new parallel programming language, Parallel C#, the main feature of which is the combination of chords and higher-order functions in one language. This language extends the standard syntax of the C# language for parallel programming needs and simplifies the task of writing complex multithreaded and distributed applications. We describe the design of the language and give examples of its use in addressing a range of concurrent programming problems. We also introduce new Distributed Runtime Systems for this language for both Windows and Linux machines.

Proceedings Article
01 Jan 2008
TL;DR: This research presents three new heuristics based on rotation scheduling, half-rotation, best span, and random rotation, and compares them with existing methods for producing compact, static schedules for iterative processes on parallel hardware.
Abstract: For an iterative process to be parallelized, the operations that comprise the process must be organized into a correct schedule that will allow the hardware to compute the task. The focus of our research is rotation scheduling, a list-scheduling-based method for producing compact, static schedules for iterative processes on parallel hardware. We present three new heuristics based on rotation scheduling, half-rotation, best span, and random rotation, and compare them with existing methods. We discuss problems with existing methods, and provide statistical evidence supporting random rotation as an effective alternative that avoids

Proceedings Article
01 Jan 2008
TL;DR: This paper proposes a local storage management system and Sensor Node File System (SENFIS), a file system for sensor nodes of Mica family motes developed over TinyOS, and provides an evaluation of the storage management system and SENFIS in terms of code size, footprint, execution time and flash energy consumption.
Abstract: In recent years, Wireless Sensor Network technology has reached maturity. The continuous data production through a wide set of versatile applications drives researchers to think about different methods of data storing and recovering, which can provide an efficient abstraction for giving persistent support to the data generated in the sensor node. This paper focuses on the problem of local storage in sensor nodes using a flash memory chip. We propose a local storage management system and Sensor Node File System (SENFIS), a file system for sensor nodes of Mica family motes developed over TinyOS. Finally, we provide an evaluation of the storage management system and SENFIS in terms of code size, footprint, execution time and flash energy consumption.

Proceedings Article
01 Jan 2008
TL;DR: To form a turbine engine component, metal airfoils are positioned in an annular array and a slip joint is provided between at least one end portion of each of the airfoils and a shroud ring to accommodate thermal expansion of the airfoils relative to the shroud rings.
Abstract: To form a turbine engine component, metal airfoils are positioned in an annular array. Outer end portions of the airfoils are embedded in a wax outer shroud ring pattern and inner end portions of the airfoils are embedded in a wax inner shroud ring pattern. A mold is formed by covering the metal airfoils and the shroud ring patterns with ceramic mold material. The wax of the shroud ring patterns is then removed from the mold to leave inner and outer shroud ring mold cavities. The shroud ring mold cavities are filled with molten metal which is solidified to form inner and outer shroud rings interconnecting the airfoils. To accommodate thermal expansion of the airfoils relative to the shroud rings, a slip joint is provided between at least one end portion of each of the airfoils and a shroud ring. To enable the slip joint to be formed, molten metal solidifies in the shroud ring cavities free of metallurgical bonds to the airfoils. The shroud rings may be formed of metal having different compositions and crystallographic structures than the metal of the airfoils.