
Showing papers presented at "Parallel and Distributed Computing: Applications and Technologies in 2008"


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper presents an effective Particle Swarm Optimization (PSO)-based Localization Scheme using the Radio Signal Strength (RSS) ranging technique, which is unique in adopting the location data of remote anchors provided by the closest neighbor anchors of an unknown node to estimate the unknown node's position.
Abstract: Wireless sensor networks (WSNs) usually employ different ranging techniques to measure the distance between an unknown node and its neighboring anchor nodes and, based on the measured distances, estimate the position of the unknown node. This paper presents an effective Particle Swarm Optimization (PSO)-based Localization Scheme using the Radio Signal Strength (RSS) ranging technique. Modified from the iterative multilateration algorithm, our scheme is unique in adopting the location data of remote anchors provided by the closest neighbor anchors of an unknown node to estimate the unknown node's position and in using the PSO algorithm to further reduce error accumulation. The new scheme also incorporates a modified DV-distance approach to raise the success ratio of locating unknown nodes. Compared with related schemes, our scheme is shown through simulations to perform consistently better in increasing localization success ratios and decreasing location errors -- at reduced cost.
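
The abstract does not give the PSO formulation itself; purely as a rough illustration (hypothetical variable names, standard global-best PSO, not the paper's exact scheme), the sketch below minimizes the sum of squared differences between RSS-derived range estimates and the distances from a candidate position to a set of anchors.

import random, math

def pso_localize(anchors, ranges, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, area=100.0):
    """Estimate an unknown node's (x, y) from anchor positions and RSS-derived
    range estimates by minimizing squared range error. Illustrative sketch only."""
    def cost(p):
        return sum((math.dist(p, a) - r) ** 2 for a, r in zip(anchors, ranges))

    # random initial positions and zero velocities inside the deployment area
    pos = [[random.uniform(0, area), random.uniform(0, area)] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=cost)

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(2):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if cost(pos[i]) < cost(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=cost)
    return gbest

# example: three anchors and noisy ranges to a node near (40, 25)
anchors = [(0.0, 0.0), (100.0, 0.0), (50.0, 90.0)]
ranges = [47.2, 65.1, 65.8]
print(pso_localize(anchors, ranges))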

63 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: A Reputation-based Secure Data Aggregation for WSNs (RSDA) that integrates aggregation functionality with the advantages provided by a reputation system to enhance the network lifetime and the accuracy of the aggregated data.
Abstract: Wireless Sensor Networks (WSNs) are a new technology that is expected to be used widely in the near future due to their low cost and data processing ability. However, securing WSNs with traditional cryptographic mechanisms is insufficient because of their limited resources and the lack of tamper-resistant hardware. In this paper, we propose a Reputation-based Secure Data Aggregation for WSNs (RSDA) that integrates aggregation functionality with the advantages provided by a reputation system to enhance the network lifetime and the accuracy of the aggregated data. We bind symmetric secret keys to geographic locations and assign these keys to sensor nodes based on their locations. RSDA can therefore resist an adversary capable of compromising up to W sensor nodes in total, with no more than t-1 compromised nodes in any cell.
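
The abstract does not spell out how reputation feeds into aggregation; purely as an illustrative sketch (not the RSDA protocol itself), an aggregator might weight each reading by its sender's reputation score and drop readings from nodes below a trust threshold.

def reputation_weighted_aggregate(readings, reputations, trust_threshold=0.3):
    """Aggregate sensor readings weighted by per-node reputation.
    readings: {node_id: value}, reputations: {node_id: score in [0, 1]}.
    Illustrative only; not the RSDA scheme from the paper."""
    trusted = {n: v for n, v in readings.items()
               if reputations.get(n, 0.0) >= trust_threshold}
    if not trusted:
        return None
    total_weight = sum(reputations[n] for n in trusted)
    return sum(reputations[n] * v for n, v in trusted.items()) / total_weight

# node 'c' falls below the trust threshold, so its outlier reading is excluded
readings = {'a': 21.0, 'b': 22.0, 'c': 80.0}
reputations = {'a': 0.9, 'b': 0.8, 'c': 0.1}
print(reputation_weighted_aggregate(readings, reputations))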

51 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper studies the programmability of CUDA and the GeForce 8 GPU and compares its performance with general-purpose processors, in order to investigate its suitability for general-purpose computation.
Abstract: In the last few years, GPUs (Graphics Processing Units) have developed rapidly. Their ever-increasing computing power and decreasing cost have attracted attention from both industry and academia. In addition to graphics applications, researchers are interested in using them for general-purpose computing. Recently, NVIDIA released a new computing architecture, CUDA (Compute Unified Device Architecture), for its GeForce 8 series, Quadro FX, and Tesla GPU products. This new architecture can fundamentally change the way in which GPUs are used. In this paper, we study the programmability of CUDA and the GeForce 8 GPU and compare its performance with general-purpose processors, in order to investigate its suitability for general-purpose computation.
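
The paper's benchmarks are not reproduced here, but the data-parallel kernel model CUDA exposes can be sketched from Python via Numba's CUDA target (an assumption of this example; the authors programmed the GeForce 8 directly in CUDA C). Each thread handles one array element.

# Minimal sketch of the CUDA programming model using Numba's CUDA target.
# Assumes a CUDA-capable GPU and the numba package; not the paper's benchmark code.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # global thread index
    if i < x.size:            # guard against excess threads in the last block
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](2.0, x, y, out)

assert np.allclose(out, 2.0 * x + y)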

39 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper analyzes the usage of an experimental grid over a one-year period and proposes a resource reservation infrastructure which takes into account the energy issue and validates the infrastructure on the large scale experimental Grid5000 platform.
Abstract: The question of energy savings has long been a matter of concern in mobile distributed systems and battery-constrained systems. However, for large-scale non-mobile distributed systems, which nowadays reach impressive sizes, the energy dimension (electrical consumption) is only starting to be taken into account. In this paper, we analyze the usage of an experimental grid over a one-year period. Based on this analysis, we propose a resource reservation infrastructure which takes the energy issue into account. We validate our infrastructure on the large-scale experimental Grid5000 platform and present the resulting gains in terms of energy.

32 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: The practical issues of agent migration and communication are considered in light of WSN constraints and a description of approaches adopted by Agent Factory Micro Edition (AFME) is illustrated.
Abstract: Intelligent agents offer a viable paradigm for enabling AmI applications and services. As WSN technologies are anticipated to provide an indispensable component in many application domains, the need for enabling the agent paradigm to encompass such technologies becomes more urgent. The resource-constrained ad-hoc nature of WSNs poses significant challenges to conventional agent frameworks. In particular, the implications for agent functionality and behaviour in a WSN context demand that issues such as unreliable message delivery and limited power resources, amongst others, be considered. In this paper, the practical issues of agent migration and communication are considered in light of WSN constraints. The discussion is illustrated through a description of approaches adopted by Agent Factory Micro Edition (AFME).

31 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This talk will focus on the architecture and system software challenges the authors face as they continue to attack ever-larger computational problems as they scale toward exascale computing.
Abstract: Recently, Argonne National Laboratory installed a half-petaflop Blue Gene/P system. It is the world's fastest open science supercomputer. With 163,840 cores, the machine is beginning to provide insight on how we might build future platforms as we scale toward exascale computing. There are many challenges, including the dramatic shift to multicore, the cost of electric power, and the need for robust fault management. In this talk I will focus on the architecture and system software challenges we face as we continue to attack ever-larger computational problems.

30 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper presents how to place a minimal number of SNs to maximize the coverage area when the communication radius of the SN is not less than the sensing radius, which leads to the application of regular topologies in WSN deployment.
Abstract: Energy constraint is a crucial problem in wireless sensor networks (WSNs). Many sensor node (SN) placement schemes and routing protocols have been proposed to address this problem. In this paper, we first present how to place a minimal number of SNs to maximize the coverage area when the communication radius of the SN is not less than the sensing radius, which leads to the application of regular topologies in WSN deployment. With nodes placed at equal distances and equipped with equal power supplies, we discuss the energy imbalance problem and then give a mathematical formulation for maximizing network lifetime in grid-based WSNs. The formulation shows that the problem of maximizing network lifetime is a non-linear programming problem and NP-hard even in the 1-D case. We discuss several heuristic solutions and show that the halving shift data collection scheme is the best among them. We also generalize the network lifetime maximization problem to randomly deployed WSNs, which shows the significance of our mathematical formulation for this crucial problem.
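
The energy imbalance the authors formalize is easy to see in the 1-D case: if every node generates one unit of data per round and forwards everything toward the sink hop by hop, the node next to the sink relays the traffic of the whole chain and dies first. A small sketch of that observation only (not the paper's formulation or its halving shift scheme):

def relay_load_1d(n_nodes):
    """Per-round transmission load in a 1-D chain where node 1 is next to the
    sink and node n is farthest away; each node generates 1 unit per round and
    forwards everything it receives toward the sink. Illustration only."""
    # node i relays data from nodes i..n, so its load is n - i + 1 units
    return {i: n_nodes - i + 1 for i in range(1, n_nodes + 1)}

loads = relay_load_1d(8)
print(loads)                       # {1: 8, 2: 7, ..., 8: 1}
lifetime = lambda energy: min(energy / load for load in loads.values())
print(lifetime(80.0))              # bounded by the most loaded node: 80 / 8 = 10 rounds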

22 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: A new dynamic replica placement algorithm for hierarchical data grids based on file "popularity" is proposed to reduce access time while using the network and storage efficiently, thereby effectively balancing storage cost and access latency.
Abstract: Data grids provide geographically distributed storage for large-scale data-intensive applications. Ensuring efficient access to such large and widely distributed datasets is hindered by high latencies. To speed up data access, data grid systems replicate data in multiple locations so a user can access the data from a nearby site. In addition to reducing data access time, replication also aims to use network and storage resources efficiently. While replication is a well-known technique, the problem of replica placement has not been widely studied for data grid environments. To obtain the best possible gains from replication, strategic placement of the replicas is critical. In a grid environment resource availability, network latency, and users' requests can vary. To address these issues a placement strategy is needed that adapts to dynamic behavior. This paper proposes a new dynamic replica placement algorithm for hierarchical data grids based on file "popularity". Our goal is to place replicas close to the clients to reduce access time while using the network and storage efficiently, thereby effectively balancing storage cost and access latency. We evaluate our algorithm using OptorSim, which shows that our approach outperforms other techniques in terms of access time and bandwidth used.
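
The algorithm's details are not in the abstract; as a hedged sketch of the general idea only, popularity-driven placement in a hierarchical (tree-shaped) grid can be approximated by counting accesses per file in each subtree and placing a replica closer to the clients once a subtree's count crosses a threshold. Names and thresholds below are hypothetical, not the paper's algorithm.

from collections import defaultdict

class GridNode:
    """A node in a hierarchical data grid (root = source site, leaves = client sites)."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = []
        self.replicas = set()
        self.access_count = defaultdict(int)   # file -> accesses seen in this subtree
        if parent:
            parent.children.append(self)

def record_access(leaf, filename, threshold=10):
    """Propagate an access up the tree; replicate where popularity crosses the threshold."""
    node = leaf
    while node is not None:
        node.access_count[filename] += 1
        if (filename not in node.replicas
                and node.access_count[filename] >= threshold):
            node.replicas.add(filename)        # place a replica closer to the clients
        node = node.parent

root = GridNode("tier0")
regional = GridNode("tier1-eu", root)
site = GridNode("site-a", regional)
for _ in range(12):
    record_access(site, "dataset.root")
print(sorted(n.name for n in (root, regional, site) if "dataset.root" in n.replicas))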

19 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: The design and evaluation of the stream processing implementation of the integral image algorithm is presented, which results in significant performance improvement when the Integral Image calculation for large input images is offloaded onto the GPU of the system.
Abstract: This paper presents the design and evaluation of a stream processing implementation of the integral image algorithm. The integral image is a key component of many image processing algorithms, in particular Haar-like feature based systems. Modern GPUs provide a large number of processors with a peak floating-point performance significantly higher than that of current general-purpose CPUs. This results in a significant performance improvement when the integral image calculation for large input images is offloaded onto the GPU of the system.
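
For reference, the integral image itself is just a 2-D prefix sum: entry (y, x) holds the sum of all pixels above and to the left, so any rectangular sum can then be read off with four lookups. A minimal CPU sketch with NumPy follows; the paper's contribution is the GPU stream-processing formulation of this, which is not reproduced here.

import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0:y+1, 0:x+1]; a 2-D prefix (cumulative) sum."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using four lookups on the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 2) == img[1:4, 1:3].sum()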

16 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper proposes a distributed sensing data propagation algorithm based on graded residual energy (GRE) of the sensor nodes, in order to achieve balanced energy consumption among sensor nodes and prolong the lifetime of the whole monitoring system.
Abstract: Wireless sensor networks have been applied to monitor pipeline structural health. In these networks, expensive multi-sinks with energy harvesting modules are deployed along the linear pipeline, and battery-powered sensor nodes are deployed between the sinks. One of the main problems in such networks is the imbalance of energy consumption among sensor nodes, which makes the whole monitoring system lose its functionality when only a small percentage of sensor nodes have depleted their energy. In this paper, we propose a distributed sensing data propagation algorithm based on graded residual energy (GRE) of the sensor nodes, in order to achieve balanced energy consumption among sensor nodes. The optimum number of energy grades for GRE is calculated through theoretical analysis in terms of maximizing network lifetime. The simulation results show that GRE can achieve balanced energy consumption among the sensor nodes and at the same time prolong the lifetime of the whole monitoring system.

15 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: The problem of scheduling the mobile gateways' paths so that the sensors' visiting frequencies are satisfied, and the sensors' data is uploaded to the sink at least at the frequency at which it is generated, is presented, and it is proved that the problem is NP-hard.
Abstract: Recently, using mobile gateway(s) as mechanical data carriers has emerged as a promising approach to prolonging sensor network lifetime and relaying information in partitioned networks. These mobile gateways, which move along pre-determined paths, visit the sensors to upload their data. As the data generation rates of different sensors may vary based on their locations, sensors need to be visited at different frequencies. In this paper, we present the problem of scheduling the mobile gateways' paths so that the sensors' visiting frequencies are satisfied and the sensors' data is uploaded to the sink at least at the frequency at which it is generated. We also prove that the problem is NP-hard. In addition to an integer linear programming formulation, a practical heuristic is proposed and its performance is compared against the optimal results.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper provides a solution for application-specific I/O for optimising a search engine and shows a 28% improvement when compared to the general-purpose I/O optimisation of Linux.
Abstract: Operating systems only provide general-purpose I/O optimisation since they have to service various types of applications. However, application-level I/O optimisation can achieve better performance, since an application has better knowledge of how to optimise disk I/O for itself. In this paper we provide a solution for application-specific I/O for optimising a search engine. It shows a 28% improvement when compared to the general-purpose I/O optimisation of Linux. Our results also show an 11% improvement when the Linux I/O optimisation is bypassed.
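
The abstract does not detail the application-specific optimisation; one standard way an application can steer or step around the kernel's general-purpose read-ahead and caching on Linux is posix_fadvise, sketched below. This is only an illustration of the mechanism, not the search-engine I/O scheme the authors implemented.

import os

def read_index_file(path, chunk_size=1 << 20):
    """Read a search-engine index file while telling the Linux page cache how
    the data will be used. Illustrative only; not the paper's I/O scheme."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # the application knows its access pattern better than the kernel:
        # declare random access to disable the default sequential read-ahead
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        data = []
        while chunk := os.read(fd, chunk_size):
            data.append(chunk)
        # the index data will not be re-read soon, so drop it from the page cache
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return b"".join(data)
    finally:
        os.close(fd)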

Proceedings ArticleDOI
01 Dec 2008
TL;DR: Tupleware, a cluster middleware which provides a distributed tuple space intended for use by computationally intensive scientific and numerical applications, uses a decentralised approach and intelligent tuple search and retrieval to provide a scalable and efficient execution environment.
Abstract: This paper presents Tupleware, a cluster middleware which provides a distributed tuple space intended for use by computationally intensive scientific and numerical applications. It aims to add no extra burden on the application programmer due to the distribution of the tuple space, and uses a decentralised approach and intelligent tuple search and retrieval to provide a scalable and efficient execution environment. Tupleware is evaluated using two applications, a modified quicksort and an ocean model, which demonstrate good scalability and low system overhead.
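
Tupleware's API is not listed in the abstract, but a Linda-style tuple space, which it distributes across the cluster, boils down to three operations: out (put a tuple), rd (read a matching tuple), and in (take a matching tuple, removing it). Below is a minimal single-process sketch of that interface with hypothetical names; Tupleware itself adds the distribution and intelligent search the abstract describes.

class TupleSpace:
    """Toy in-process Linda-style tuple space; None in a template is a wildcard."""
    def __init__(self):
        self._tuples = []

    def out(self, tup):
        self._tuples.append(tuple(tup))

    def _match(self, template, tup):
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def rd(self, template):
        """Return (without removing) the first tuple matching the template."""
        return next((t for t in self._tuples if self._match(template, t)), None)

    def take(self, template):   # Linda's 'in'; renamed because 'in' is a Python keyword
        t = self.rd(template)
        if t is not None:
            self._tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("task", 7, "pending"))
print(ts.rd(("task", None, "pending")))   # ('task', 7, 'pending')
print(ts.take(("task", 7, None)))         # removes the tuple
print(ts.rd(("task", None, None)))        # None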

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper describes a new task scheduling algorithm based on clustering that is compared in an extensive experimental evaluation to three other clustering algorithms namely, linear, single edge and dominant sequence clustering.
Abstract: This paper describes a new task scheduling algorithm based on clustering. In this new approach, clustering of the tasks is achieved by applying a force model to the task graph. From an initial configuration of the task graph, forces act upon the nodes to manoeuvre them into a low-energy, or equilibrium, state. Clusters are created from the equilibrium state and scheduled for an unlimited number of processors. This algorithm is compared in an extensive experimental evaluation to three other clustering algorithms, namely linear, single edge, and dominant sequence clustering. By keeping the mapping and scheduling phases of the algorithms identical, we compare only the difference in clustering between the algorithms. Results show that force directed clustering is very promising, especially for a limited number of processors.
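
The force model itself is only named in the abstract; the sketch below shows the general shape of such an approach under assumed forces (edges pull communicating tasks together in a 1-D embedding, all pairs repel slightly), with clusters read off from the settled positions. It is an illustration of the idea, not the paper's algorithm.

def force_directed_clusters(n_tasks, edges, iters=500, step=0.01,
                            attract=1.0, repel=0.05, gap=0.5):
    """Embed a task graph on a line with spring-like forces, then cut it into
    clusters wherever neighbouring positions are more than `gap` apart.
    edges: list of (u, v, weight). Hypothetical sketch only."""
    pos = [float(i) for i in range(n_tasks)]        # initial configuration
    for _ in range(iters):
        force = [0.0] * n_tasks
        for u, v, w in edges:                       # attraction along graph edges
            d = pos[v] - pos[u]
            force[u] += attract * w * d
            force[v] -= attract * w * d
        for i in range(n_tasks):                    # weak all-pairs repulsion
            for j in range(n_tasks):
                if i != j:
                    d = pos[i] - pos[j]
                    force[i] += repel * (1.0 if d >= 0 else -1.0) / (abs(d) + 1e-3)
        pos = [p + step * f for p, f in zip(pos, force)]

    order = sorted(range(n_tasks), key=lambda i: pos[i])
    clusters, current = [], [order[0]]
    for a, b in zip(order, order[1:]):
        if pos[b] - pos[a] > gap:                   # a large gap separates clusters
            clusters.append(current)
            current = []
        current.append(b)
    clusters.append(current)
    return clusters

edges = [(0, 1, 2.0), (1, 2, 2.0), (3, 4, 2.0)]     # two loosely connected groups
print(force_directed_clusters(5, edges))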

Proceedings ArticleDOI
01 Dec 2008
TL;DR: In this article, the authors investigated the task scheduling problem using the A* search algorithm and showed that the A* scheduling algorithm implemented can produce optimal schedules in reasonable time for small to medium-sized task graphs.
Abstract: Scheduling tasks onto the processors of a parallel system is a crucial part of program parallelisation. Due to the NP-hard nature of the task scheduling problem, scheduling algorithms are based on heuristics that try to produce good rather than optimal schedules. Nevertheless, in certain situations it is desirable to have optimal schedules, for example for time-critical systems or to evaluate scheduling heuristics. This paper investigates the task scheduling problem using the A* search algorithm. The A* scheduling algorithm implemented can produce optimal schedules in reasonable time for small to medium-sized task graphs. In comparison to a previous approach, the A* scheduling algorithm presented here has a significantly reduced search space due to a much improved cost function f(s) and additional pruning techniques. Last but not least, the experimental results show that the proposed A* scheduling algorithm significantly outperforms the previous approach.
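
The abstract names f(s) but not its form; a generic A* for task scheduling expands partial schedules s with f(s) = g(s) + h(s), where common admissible lower bounds are the total work divided by the number of processors and the computation-only bottom level measured from each already-scheduled task. The sketch below uses those textbook bounds and ignores communication costs (both assumptions of this example); it is not the paper's improved cost function or pruning.

import heapq
from itertools import count

def astar_schedule(tasks, deps, m):
    """Optimal schedule of a task DAG onto m identical processors, zero
    communication cost (a simplifying assumption of this sketch).
    tasks: {name: weight}; deps: {name: set of predecessor names}."""
    succs = {t: set() for t in tasks}
    for t, ps in deps.items():
        for p in ps:
            succs[p].add(t)

    bl = {}                                 # computation-only bottom level
    def bottom_level(t):
        if t not in bl:
            bl[t] = tasks[t] + max((bottom_level(s) for s in succs[t]), default=0)
        return bl[t]
    for t in tasks:
        bottom_level(t)
    work_bound = sum(tasks.values()) / m    # static lower bound on the makespan

    def f(sched):                           # sched: {task: (proc, start)}
        tail = max((st + bl[t] for t, (_, st) in sched.items()), default=0)
        return max(tail, work_bound)

    tie = count()
    heap = [(f({}), next(tie), {})]
    while heap:
        _, _, sched = heapq.heappop(heap)
        if len(sched) == len(tasks):
            return sched, max(st + tasks[t] for t, (_, st) in sched.items())
        proc_free = [0.0] * m
        for t, (p, st) in sched.items():
            proc_free[p] = max(proc_free[p], st + tasks[t])
        ready = [t for t in tasks if t not in sched
                 and all(d in sched for d in deps.get(t, ()))]
        for t in ready:                     # expand: place each ready task on each processor
            est = max((sched[d][1] + tasks[d] for d in deps.get(t, ())), default=0)
            for p in range(m):
                child = dict(sched)
                child[t] = (p, max(est, proc_free[p]))
                heapq.heappush(heap, (f(child), next(tie), child))

tasks = {"a": 2, "b": 3, "c": 2, "d": 4}
deps = {"c": {"a"}, "d": {"a", "b"}}
schedule, makespan = astar_schedule(tasks, deps, 2)
print(schedule, makespan)                   # optimal makespan for this graph is 7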

Proceedings ArticleDOI
01 Dec 2008
TL;DR: In this paper, a sensor network made up of MIMO-UWB radars with widely separated antennas is studied for estimating distance and detecting the presence of a human being, and the positioning accuracy of the human body and the effectiveness of MIMO-UWB radar are shown.
Abstract: Recently, UWB (ultra wide band) signals, which have a frequency bandwidth of 500 MHz or more, have been considered for high-precision radar because of their very high resolution. UWB is also considered to have less influence on human beings because of its low transmitting power. MIMO (multiple-input multiple-output) radar refers to an architecture that employs multiple, spatially distributed transmitters and receivers. In this paper, a sensor network made up of MIMO-UWB radars with widely separated antennas is studied for estimating distance and detecting the presence of a human being. Detection performance is evaluated through computer simulation using real propagation measurement results obtained with a network analyzer. The positioning accuracy of the human body and the effectiveness of MIMO-UWB radar are shown.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: By using a low cost power proxy, Jabber clients can maintain their presence in instant message and chat sessions while sleeping when the user is away from the computer.
Abstract: Considerable power is consumed by unused computers. Studies show many of these computers have their power management features disabled in order to maintain their network presence and network connections. Past research proposes to use a low power proxy to 'stand in' for a computer, allowing it to go to sleep and thus save power while still maintaining its network presence. This paper describes a method to proxy for sleeping Jabber clients. By using a low cost power proxy, Jabber clients can maintain their presence in instant message and chat sessions while sleeping when the user is away from the computer. It keeps a record of session activity for the sleeping client and forwards the record of activity to the client when it wakes up.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A parallel recovery scheme based on XOR differential logging for main memory database systems in such environments that provides system availability during recovery, which is important for large-scale main memory database systems.
Abstract: In update-intensive applications, main memory database systems produce a large volume of log records, so it is critical to write out the log records efficiently to speed up transaction processing. We propose a parallel recovery scheme based on XOR differential logging for main memory database systems in such environments. Some NVRAM is used to temporarily hold log records and decouple transaction commit from disk writes, and the inherent parallelism of differential logging is exploited to accelerate log flushing by using multiple log disks. During recovery, log records are loaded from multiple log disks and applied to data partitions without the need for reordering according to serialization order, cutting down total recovery time. The scheme employs a data-partition-based consistent checkpointing method. Log records are classified according to the IDs of the data partitions accessed. Data partitions are recovered according to loading priorities computed from update frequencies and transaction waiting times, so the data access demands of new transactions arriving after failure recovery receive attention immediately. The scheme thus provides system availability during recovery, which is important for large-scale main memory database systems.
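
The property the recovery scheme exploits is that an XOR differential log record (before-image XOR after-image) can be re-applied in any order, because XOR is commutative and associative; that is why records streamed from multiple log disks need no reordering by serialization order. A small illustration of that property only (not the paper's system):

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# a data page goes through two updates
page_v0 = b"\x00" * 8
page_v1 = b"\x01\x00\x00\x00\x00\x00\x00\x00"
page_v2 = b"\x01\x00\x00\x00\x00\x00\x00\xff"

# differential log records: before-image XOR after-image
log1 = xor_bytes(page_v0, page_v1)
log2 = xor_bytes(page_v1, page_v2)

# recovery: start from the checkpointed page and apply the XOR deltas.
# Order does not matter, so records loaded from multiple log disks can be
# applied as they arrive, without sorting by serialization order.
restored_a = xor_bytes(xor_bytes(page_v0, log1), log2)
restored_b = xor_bytes(xor_bytes(page_v0, log2), log1)
assert restored_a == restored_b == page_v2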

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper presents a formal operational semantics for a C+PUB subset language using the Coq proof assistant and a certified N-body computation as example of using this formal semantics.
Abstract: PUB (Paderborn University BSPLib) is a C library supporting the development of bulk-synchronous parallel (BSP) algorithms. The BSP model allows estimation of the execution time and avoids deadlocks and indeterminism. This paper presents a formal operational semantics for a C+PUB subset language using the Coq proof assistant, together with a certified N-body computation as an example of using this formal semantics.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper updates earlier research by moving the bodyguard paradigm, which helps security software developers move from the current serialized paradigm to a multi-core paradigm, into the new Ubiquitous Multi-Core Framework.
Abstract: Distributed Denial of Service (DDoS) attacks are one of the most challenging areas to deal with in security. Security managers not only have to deal with flood and vulnerability attacks, but also have to consider whether they come from legitimate or malicious attackers. In our previous work we developed a framework called bodyguard, which helps security software developers move from the current serialized paradigm to a multi-core paradigm. In this paper, we update our research by moving our bodyguard paradigm into our new Ubiquitous Multi-Core Framework. From this shift, we show a marked improvement over our previous result, from a 20% to a 110% speedup, with an average cost of 1.5 ms. We also conducted a second series of experiments in which we trained a neural network and tested it against actual DDoS attack traffic, achieving an average detection rate of 93.36% on this attack traffic.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A parallel associative memory-based pattern recognition algorithm known as distributed hierarchical graph neuron (DHGN) reduces computational load by efficiently disseminating recognition processes throughout the network, making it suitable for deployment in wireless sensor networks.
Abstract: Pattern recognition applications such as natural phenomena detection and structural health monitoring have been widely implemented using wireless sensor networks. These applications involve large amounts of data to be analysed, and thus incur high computational time and complexity. In this paper, we present a parallel associative memory-based pattern recognition algorithm known as distributed hierarchical graph neuron (DHGN). It is a single-cycle learning algorithm with in-network processing capability that reduces computational load by efficiently disseminating recognition processes throughout the network, and is hence suitable for deployment in wireless sensor networks. The results of the accuracy and scalability tests show that our system performs with high accuracy and remains scalable as the pattern size and the number of stored patterns increase. The response time for pattern recognition remains within milliseconds irrespective of the size of the network.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper proposes a sensor grid infrastructure that forms the key resource sharing backbone and provides secure access to valuable sensor, computational, data, and storage resources for supporting large-scale ambient intelligence.
Abstract: In this paper, we present the idea that large-scale ambient intelligence takes the vision of anytime-anywhere to anytime-anywhere-anything (A3). Based on this vision, we argue that a mix of computing, communication and interface technologies remains limited in providing seamless access to services if the data and services from various autonomously operating entities remain non-sharable. Thus, we propose a sensor grid infrastructure that forms the key resource sharing backbone and provides secure access to valuable sensor, computational, data, and storage resources for supporting large-scale ambient intelligence.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper introduces the cross-layer design of a cognitive network, the role of the blackboard architecture, which is known for coordinating multiple agents in a real-time manner, and possible applications.
Abstract: The objective of this research is to provide an appropriate cross-layer architecture for wireless cognitive networks for efficient resource allocation and improved quality of service. We propose the blackboard model, which is known for coordinating multiple agents (cognitive nodes) in a real-time manner: receiving the current state of information from these nodes, providing conclusions based on the information received, updating the nodes with current conclusions, and suggesting needed actions for them. Each cognitive node acts as an agent to the blackboard. The parameter values abstracted from these cognitive nodes to the blackboard are structured messages that are optimized with respect to an objective function. This paper introduces the cross-layer design of a cognitive network, the role of the blackboard architecture, and possible applications.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: An efficient algorithm is proposed that finds disjoint paths for set-to-set routing in a dual-cube, a hypercube-like network with about half the links per node compared with a hypercube containing an equal number of nodes.
Abstract: In this paper, we propose an efficient algorithm that finds disjoint paths for set-to-set routing in a dual-cube. A dual-cube is a hypercube-like interconnection network with about half the links per node compared with a hypercube containing an equal number of nodes. For a dual-cube Dn with n links per node, the algorithm finds n disjoint paths s_i → t_j (1 ≤ i, j ≤ n), s_i ∈ S, t_j ∈ T, in O(n^2 log n) time, and the maximum length of the paths is bounded by 3n + 3.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: The main contribution of this paper is to present a hardware connected component labeling algorithm for k-concave binary images designed and implemented on an FPGA, and to evaluate its performance.
Abstract: Connected component labeling is a task that assigns unique IDs to the connected components of a binary image. The main contribution of this paper is to present a hardware connected component labeling algorithm for k-concave binary images designed and implemented on an FPGA. Pixels of a binary image are given to the FPGA in raster order, and the resulting labels are output in the same order. The advantages of our labeling algorithm are its low latency and its effective use of the FPGA. We have implemented our hardware labeling algorithm on an Altera Stratix family FPGA and evaluated its performance. The implementation results show that for a 20-concave binary image of 2048 × 2048 pixels, our connected component labeling algorithm runs in approximately 72 ms and its latency is approximately 2.9 ms.
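
As a point of reference for what the FPGA computes, the sketch below is a plain software version of raster-order connected component labeling (two-pass, 4-connectivity, union-find); the paper's hardware algorithm is a different, pipelined design specialized for k-concave images, which this does not reproduce.

def label_components(img):
    """Two-pass connected component labeling with 4-connectivity.
    img: list of rows of 0/1 pixels processed in raster order.
    Software reference only; not the paper's FPGA design."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]                                  # union-find over provisional labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]         # path compression
            x = parent[x]
        return x

    next_label = 1
    for y in range(h):                            # first pass: provisional labels
        for x in range(w):
            if not img[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(next_label)
                labels[y][x] = next_label
                next_label += 1
            else:
                labels[y][x] = min(l for l in (up, left) if l)
                if up and left and find(up) != find(left):
                    parent[find(up)] = find(left)  # record the equivalence
    for y in range(h):                            # second pass: resolve equivalences
        for x in range(w):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels

img = [[1, 1, 0, 1],
       [0, 1, 0, 1],
       [1, 0, 0, 1]]
for row in label_components(img):
    print(row)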

Proceedings ArticleDOI
01 Dec 2008
TL;DR: An improved approach named NDAMR (node-disjoint alternative multiple path), a routing protocol that maintains only the two shortest backup paths in the source and destination nodes, which can alleviate the redundant-frame overhead during the process of data salvation by the neighboring intermediate nodes, is proposed.
Abstract: This study elaborates the influence patterns of different backup strategies on AODV-based routing protocols. Although some backup routing strategies yield good data delivery rates, they suffer from low efficiency. To make the process of data salvation more efficient in the case of link failure, we explore the possibility of combining the AODV-BR strategy with on-demand node-disjoint multi-path routing protocols. This article proposes an improved approach named NDAMR (node-disjoint alternative multiple path), a routing protocol that maintains only the two shortest backup paths in the source and destination nodes. NDAMR can alleviate the redundant-frame overhead during the process of data salvation by the neighboring intermediate nodes. Our simulation results demonstrate that NDAMR delivers good data delivery performance while restricting the impact of transmission collision and contention.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: Hardware algorithms for a redundant radix-2^r number system on an FPGA to speed up arithmetic operations on numbers with many bits, which have applications in security systems such as RSA encryption and decryption, are presented.
Abstract: The main contribution of this paper is to present hardware algorithms for a redundant radix-2^r number system on an FPGA to speed up arithmetic operations on numbers with many bits, which have applications in security systems such as RSA encryption and decryption. Our hardware algorithms accelerate arithmetic operations including addition, multiplication, and Montgomery modulo multiplication. Quite surprisingly, our hardware algorithms for multiplication and Montgomery multiplication of two 1024-bit numbers run in only 64 clock cycles using the redundant radix-2^16 number system. The experimental results for the Xilinx Virtex-II Pro family FPGA XC2VP100-6 also show that the clock frequency of our circuit is independent of the number of bits. The speed-up factors of our hardware algorithms using the redundant number system over those using the conventional number system are 8.3 for 1024-bit addition, 3.4 for 1024-bit multiplication, and 2.5 for 1024-bit Montgomery modulo multiplication. Further, for 256-bit Montgomery modulo multiplication, our hardware algorithm runs in 0.38 μs, while a previously known implementation runs in 1.22 μs. Thus, our approach of using a redundant number system for arithmetic operations is very efficient.
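
For orientation, the operation the hardware accelerates, Montgomery modulo multiplication, computes a·b·R^-1 mod N for R = 2^k without any trial division. A word-by-word software reference is sketched below using the standard bit-serial textbook algorithm, not the paper's redundant radix-2^16 circuit.

def montgomery_multiply(a, b, n, k):
    """Return a * b * R^-1 mod n, where R = 2**k and n is odd.
    Standard bit-serial Montgomery multiplication; reference only,
    not the paper's redundant-number-system hardware."""
    assert n % 2 == 1 and 0 <= a < n and 0 <= b < n
    t = 0
    for i in range(k):
        t += ((a >> i) & 1) * b          # add b if bit i of a is set
        if t & 1:                        # make the running sum divisible by 2
            t += n
        t >>= 1                          # exact division by 2 each iteration
    return t - n if t >= n else t

def modmul(a, b, n, k):
    """Ordinary modular multiplication built from two Montgomery steps."""
    r2 = pow(1 << k, 2, n)               # R^2 mod n, precomputable
    a_mont = montgomery_multiply(a, r2, n, k)       # a * R mod n
    return montgomery_multiply(a_mont, b, n, k)     # a * b mod n

n = 0xF123456789ABCDEF1                  # odd modulus
k = n.bit_length()
a, b = 0x123456789ABCDEF, 0xFEDCBA987654321
assert modmul(a, b, n, k) == (a * b) % n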

Proceedings ArticleDOI
01 Dec 2008
TL;DR: DIMMnet-3, a practical solution to enhance the memory and I/O systems of a PC, and the Toshiba Cell Reference Set are introduced, and communication mechanisms named LHS and LHC, architectures for reducing latency for mixed messages with small controlling data and large acknowledge data, are proposed.
Abstract: The introduction of multi-core structures has kept the rapid performance improvement of COTS CPUs from declining recently. On the other hand, the performance of memory and I/O systems is insufficient to catch up with that of COTS CPUs. In this paper, with a view to realizing high-performance computer systems not only for HPC but also for Google-like servers, we propose concepts concerning memory systems and network systems with large extended memory. We introduce DIMMnet-3, a practical solution to enhance the memory and I/O systems of a PC, and the Toshiba Cell Reference Set. Examples of the killer applications of this new type of hardware are presented. Communication mechanisms named LHS and LHC are also proposed; these are architectures for reducing latency for mixed messages with small controlling data and large acknowledge data. Their latency evaluation is shown.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: The XtreemOS grid checkpointing architecture is described, along with how the gap between the abstract grid and the system-specific checkpointers is bridged and how Linux control groups can be used to address resource isolation issues during restart.
Abstract: The EU-funded XtreemOS project implements a grid operating system that transparently exploits resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart requires saving and restoring jobs executing in a distributed, heterogeneous grid environment. The latter may span millions of grid nodes (PCs, clusters, and mobile devices) using different system-specific checkpointers that save and restore application and kernel data structures for processes executing on a grid node. In this paper we briefly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid and the system-specific checkpointers. Then we discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job, and any further dependent ones, can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during restart.

Proceedings ArticleDOI
Yongqiang Zou1, Li Zha1, Xiaoning Wang1, Haojie Zhou1, Peixu Li1 
01 Dec 2008
TL;DR: The architecture provides minimal but sufficient VO functionalities while keeping decentralization, flexibility, simplicity, and effectiveness and has been implemented in Vega GOS and applied in the China National Grid and other grid platforms.
Abstract: Virtual organizations (VO) are widely accepted in the grid and other distributed computing environments. However, there are few effective VO implementations. This paper presents a layered architecture to construct Agora, an implementation of VO. Agora manages users, resources, and agora instances, provides policies to support a DAC/MAC-hybrid cross-domain access control mechanism, and maintains the context of operations. The Agora architecture consists of three layers. At the bottom is the physical layer containing external resources, then an abstraction RController is introduced to manipulate external resources. Above the physical layer, all the involved entities, including users, resources, and agoras, are abstracted as GNodes, and a naming layer is introduced to manage these GNodes. At the top, the logic layer implements all the Agora functionalities. This architecture has been implemented in Vega GOS and applied in the China National Grid and other grid platforms. The evaluation shows that the architecture provides minimal but sufficient VO functionalities while keeping decentralization, flexibility, simplicity, and effectiveness.