
Showing papers in "IEEE Transactions on Parallel and Distributed Systems in 2002"


Journal ArticleDOI
TL;DR: Two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time are presented, called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm.
Abstract: Efficient application scheduling is critical for achieving high performance in heterogeneous computing environments. The application scheduling problem has been shown to be NP-complete in general cases as well as in several restricted cases. Because of its key importance, this problem has been extensively studied, and various algorithms have been proposed in the literature, mainly for systems with homogeneous processors. Although there are a few algorithms in the literature for heterogeneous processors, they usually incur high scheduling costs and may not deliver good-quality schedules at lower costs. In this paper, we present two novel scheduling algorithms for a bounded number of heterogeneous processors, designed to simultaneously achieve high performance and fast scheduling time: the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm. The HEFT algorithm selects the task with the highest upward rank value at each step and assigns the selected task to the processor that minimizes its earliest finish time, using an insertion-based approach. The CPOP algorithm, by contrast, uses the sum of upward and downward rank values for prioritizing tasks; it also differs in the processor-selection phase, scheduling critical tasks onto the processor that minimizes their total execution time. In order to provide a robust and unbiased comparison with the related work, a parametric graph generator was designed to generate weighted directed acyclic graphs with various characteristics. The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithms significantly surpass previous approaches in terms of both quality and cost of schedules, as measured by schedule length ratio, speedup, frequency of best results, and average scheduling time.
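
To make the upward-rank prioritization concrete, here is a minimal sketch of HEFT's ranking step, assuming a task DAG with average computation costs w and average communication costs c (the standard inputs named in the abstract); the task names and numbers are illustrative, and the insertion-based processor-selection step is only summarized in a comment.

```python
# Sketch of HEFT's prioritization phase. Assumptions: the DAG is given as
# successor lists, w[t] is task t's computation cost averaged over processors,
# and c[(t, s)] is the average communication cost of edge (t, s).
from functools import lru_cache

succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}      # illustrative DAG
w = {"A": 10, "B": 8, "C": 12, "D": 6}
c = {("A", "B"): 3, ("A", "C"): 5, ("B", "D"): 2, ("C", "D"): 4}

@lru_cache(maxsize=None)
def upward_rank(t):
    # rank_u(t) = w[t] + max over successors s of (c[(t, s)] + rank_u(s)):
    # the length of the critical path from t to the exit task.
    if not succ[t]:
        return w[t]
    return w[t] + max(c[(t, s)] + upward_rank(s) for s in succ[t])

# HEFT then walks tasks in decreasing rank order and places each on the
# processor that minimizes its earliest finish time (insertion-based).
print(sorted(w, key=upward_rank, reverse=True))   # ['A', 'C', 'B', 'D']
```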

2,961 citations


Journal ArticleDOI
TL;DR: This paper presents an improved scheme, called PEGASIS (power-efficient gathering in sensor information systems), which is a near-optimal chain-based protocol that minimizes energy, and presents two new schemes that attempt to balance the energy and delay cost for data gathering from sensor networks.
Abstract: Gathering sensed information in an energy-efficient manner is critical to operating the sensor network for a long period of time. The LEACH protocol presented by Heinzelman et al. (2000) is an elegant solution where clusters are formed to fuse data before transmitting to the base station. In this paper, we present an improved scheme, called PEGASIS (power-efficient gathering in sensor information systems), which is a near-optimal chain-based protocol that minimizes energy. In PEGASIS, each node communicates only with a close neighbor, and nodes take turns transmitting to the base station, thus reducing the amount of energy spent per round. Simulation results show that PEGASIS performs better than LEACH. For many applications, in addition to minimizing energy, it is also important to consider the delay incurred in gathering sensed data. We capture this with the energy × delay metric and present schemes that attempt to balance the energy and delay cost for data gathering from sensor networks. We present two new schemes to minimize energy × delay using CDMA and non-CDMA sensor nodes. We compare the performance of direct, LEACH, and our schemes with respect to energy × delay using extensive simulations for different network sizes. Results show that our schemes perform 80 or more times better than the direct scheme and also outperform the LEACH protocol.
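
The chain construction behind PEGASIS can be sketched as a greedy nearest-neighbor pass starting from the node farthest from the base station; the coordinates below are invented for illustration, and a real deployment would add the data-fusion and leader-rotation logic the abstract describes.

```python
# Greedy chain construction in the spirit of PEGASIS (illustrative coordinates).
import math

base = (50.0, 200.0)                                   # base station position
nodes = [(10.0, 20.0), (40.0, 30.0), (25.0, 60.0), (70.0, 10.0)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

start = max(nodes, key=lambda n: dist(n, base))        # farthest from the base
chain, remaining = [start], set(nodes) - {start}
while remaining:
    nxt = min(remaining, key=lambda n: dist(chain[-1], n))   # nearest unchained
    chain.append(nxt)
    remaining.remove(nxt)
print(chain)   # nodes fuse data along this chain; a leader sends to the base
```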

1,194 citations


Journal ArticleDOI
TL;DR: This paper proposes to significantly reduce or eliminate the communication overhead of a broadcasting task by applying the concept of localized dominating sets, whose maintenance requires no communication overhead beyond maintaining the positions of neighboring nodes.
Abstract: In a multihop wireless network, each node has a transmission radius and is able to send a message to all of its neighbors that are located within the radius. In a broadcasting task, a source node sends the same message to all the nodes in the network. In this paper, we propose to significantly reduce or eliminate the communication overhead of a broadcasting task by applying the concept of localized dominating sets. Their maintenance does not require any communication overhead in addition to maintaining positions of neighboring nodes. Retransmissions by only the internal nodes of a dominating set are sufficient for reliable broadcasting. Existing dominating sets are improved by using node degrees instead of their ids as primary keys. We also propose to eliminate neighbors that already received the message and to rebroadcast only if the list of neighbors that might need the message is nonempty. A retransmission-after-negative-acknowledgements scheme is also described. The important features of the proposed algorithms are their reliability (reaching all nodes in the absence of message collisions), significant rebroadcast savings, and their localized and parameterless behavior. The reduction in communication overhead for the broadcasting task is measured experimentally. Dominating-set-based broadcasting, enhanced by a neighbor-elimination scheme and highest-degree keys, provides reliable broadcast with ≤48 percent of nodes retransmitting (on random unit graphs with 100 nodes) for all average degrees d. The critical d is around 4, with <48 percent for d ≤ 3, ≤40 percent for d ≥ 10, and ≤20 percent for d ≥ 25. The proposed methods are better than existing ones in all considered aspects: reliability, rebroadcast savings, and maintenance communication overhead. In particular, the cluster structure is inefficient for broadcasting because of the considerable communication overhead needed to maintain the structure, and it is also inferior in terms of rebroadcast savings.
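
The neighbor-elimination rule from the abstract reduces to simple set bookkeeping; the sketch below uses hypothetical node ids and assumes each overheard packet carries (or implies, via position information) the sender's neighbor list.

```python
# Neighbor-elimination sketch: rebroadcast only if some neighbor may still
# need the message (node ids are hypothetical; we assume overheard packets
# reveal the sender's neighborhood).
my_neighbors = {"b", "c", "d", "e"}
needy = set(my_neighbors)                 # neighbors possibly missing the message

def on_hear(sender, senders_neighbors):
    needy.discard(sender)                 # the sender clearly has the message
    needy.difference_update(senders_neighbors)   # its neighbors are now covered

on_hear("b", {"a", "c", "f"})             # overheard b's retransmission
if needy:                                 # {"d", "e"}: still worth retransmitting
    print("rebroadcast for", needy)
else:
    print("suppress retransmission")
```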

930 citations


Journal ArticleDOI
TL;DR: This paper uses feedback control theory to achieve overload protection, performance guarantees, and service differentiation in the presence of load unpredictability, and shows that control-theoretic techniques offer a sound way of achieving desired performance in performance-critical Internet applications.
Abstract: The Internet is undergoing substantial changes from a communication and browsing infrastructure to a medium for conducting business and marketing a myriad of services. The World Wide Web provides a uniform and widely accepted application interface used by these services to reach multitudes of clients. These changes place the Web server at the center of a gradually emerging e-service infrastructure with increasing requirements for service quality and reliability guarantees in an unpredictable and highly dynamic environment. This paper describes performance control of a Web server using classical feedback control theory. We use feedback control theory to achieve overload protection, performance guarantees, and service differentiation in the presence of load unpredictability. We show that feedback control theory offers a promising analytic foundation for providing service differentiation and performance guarantees. We demonstrate how a general Web server may be modeled for purposes of performance control, present the equivalents of sensors and actuators, formulate a simple feedback loop, describe how it can leverage real-time scheduling and feedback-control theories to achieve per-class response-time and throughput guarantees, and evaluate the efficacy of the scheme on an experimental testbed using the most popular Web server, Apache. Experimental results indicate that control-theoretic techniques offer a sound way of achieving desired performance in performance-critical Internet applications. Our QoS (Quality-of-Service) management solutions can be implemented either in middleware that is transparent to the server, or as a library called by server code.
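
As a flavor of the kind of loop the paper describes, here is a minimal proportional-integral sketch; the choice of admitted-request fraction as the actuator, the response-time target, and the gains are all assumptions for illustration, not the paper's tuned design.

```python
# Minimal PI control loop: sensor = measured response time, actuator =
# fraction of incoming requests admitted. Target and gains are hypothetical.
TARGET_MS = 200.0
KP, KI = 0.002, 0.0005
admit_fraction, integral = 1.0, 0.0

def control_step(measured_ms):
    global admit_fraction, integral
    error = measured_ms - TARGET_MS        # positive when the server lags
    integral += error
    admit_fraction -= KP * error + KI * integral
    admit_fraction = max(0.0, min(1.0, admit_fraction))   # saturate actuator
    return admit_fraction

for sample in (350.0, 300.0, 240.0, 210.0):   # sampled response times (ms)
    print(f"admit {control_step(sample):.2f} of requests")
```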

625 citations


Journal ArticleDOI
TL;DR: An efficient partitioning algorithm is proposed that addresses the scalability issue of designing a large-scale DVE system by dynamically dividing the virtual world into partitions and efficiently assigning these partitions to different servers.
Abstract: Distributed virtual environment (DVE) systems model and simulate the activities of thousands of entities interacting in a virtual world over a wide area network. Possible applications for DVE systems are multiplayer video games, military and industrial training, and collaborative engineering. In general, a DVE system is composed of many servers, and each server is responsible for managing multiple clients who want to participate in the virtual world. Each server receives updates from different clients (such as the current position and orientation of each client) and then delivers this information to other clients in the virtual world. The server also needs to perform other tasks, such as object collision detection and synchronization control. A large-scale DVE system needs to support many clients, and this imposes heavy requirements on networking and computational resources. Therefore, how to meet the growing requirement of bandwidth and computational resources is one of the major challenges in designing a scalable and cost-effective DVE system. In this paper, we propose an efficient partitioning algorithm that addresses the scalability issue of designing a large-scale DVE system. The main idea is to dynamically divide the virtual world into different partitions and then efficiently assign these partitions to different servers. This way, each server will process approximately the same amount of workload. Another objective of the partitioning algorithm is to reduce the server-to-server communication overhead. The theoretical foundation of our dynamic partitioning algorithm is based on the linear optimization principle. We also illustrate how one can parallelize the proposed partitioning algorithm so that it can efficiently partition a very large-scale DVE system. Lastly, experiments are carried out to illustrate the effectiveness of the proposed partitioning algorithm under various settings of the virtual world.

232 citations


Journal ArticleDOI
TL;DR: An efficient localized algorithm for determining a dominating and absorbant set of vertices (mobile hosts) is given and this set can be easily updated when the network topology changes dynamically, extending dominating-set-based routing to networks with unidirectional links.
Abstract: We extend dominating-set-based routing to networks with unidirectional links. Specifically, an efficient localized algorithm for determining a dominating and absorbant set of vertices (mobile hosts) is given and this set can be easily updated when the network topology changes dynamically. A host v is called a dominating neighbor (absorbant neighbor) of another host u if there is a directed edge from v to u (from u to v). A subset of vertices is dominating and absorbant if every vertex not in the subset has one dominating neighbor and one absorbant neighbor in the subset. The derived dominating and absorbant set exhibits good locality properties; that is, the change of a node status (dominating/dominated) affects only the status of nodes in the neighborhood. The notion of dominating and absorbant set can also be applied iteratively on the dominating and absorbant set itself, forming a hierarchy of dominating and absorbant sets. The effectiveness of our approach is confirmed and the locality of node status update is verified through simulation.
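
The dominating-and-absorbant property defined in the abstract is easy to state as a check over a directed graph; the toy graph below is illustrative (the paper's contribution is the localized algorithm that finds such a set, not this brute-force test).

```python
# Brute-force check of the dominating-and-absorbant property on a digraph.
# Edge (u, v) means u can transmit to v; the tiny graph is illustrative.
nodes = {"a", "b", "c"}
edges = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("a", "c")}

def is_dominating_and_absorbant(subset):
    for u in nodes - subset:
        dominated = any((v, u) in edges for v in subset)   # some edge into u
        absorbed = any((u, v) in edges for v in subset)    # some edge out of u
        if not (dominated and absorbed):
            return False
    return True

print(is_dominating_and_absorbant({"a", "b"}))   # True: b covers c both ways
```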

224 citations


Journal ArticleDOI
TL;DR: The reliability of resources is taken into account using an incremental cost function proposed in this paper, and the new algorithm is referred to as the reliable dynamic level scheduling algorithm.
Abstract: In a heterogeneous distributed computing system, machine and network failures are inevitable and can have an adverse effect on applications executing on the system. To reduce the effect of failures on an application executing on a failure-prone system, matching and scheduling algorithms which minimize not only the execution time but also the probability of failure of the application must be devised. However, because of the conflicting requirements, it is not possible to minimize both of the objectives at the same time. Thus, the goal of this paper is to develop matching and scheduling algorithms which account for both the execution time and the reliability of the application. This goal is achieved by modifying an existing matching and scheduling algorithm. The reliability of resources is taken into account using an incremental cost function proposed in this paper and the new algorithm is referred to as the reliable dynamic level scheduling algorithm. The incremental cost function can be defined based on one of the three cost functions developed here. These cost functions are unique in the sense that they are not restricted to tree-based networks and a specific matching and scheduling algorithm. The simulation results confirm that the proposed incremental cost function can be incorporated into matching and scheduling algorithms to produce schedules where the effect of failures of machines and network resources on the execution of the application is reduced and the execution time of the application is minimized as well.

195 citations


Journal ArticleDOI
TL;DR: This work proves that, for a given wireless network, there exists a new call arrival rate which can maximize the total utility of users while maintaining the required QoS and proposes an integrated pricing and call admission control scheme where the price is adjusted dynamically based on the current network conditions in order to alleviate the problem of congestion.
Abstract: Call admission control (CAC) plays a significant role in providing the desired quality of service (QoS) in cellular networks. We investigate the role of pricing as an additional dimension of the call admission control process in order to efficiently and effectively control the use of wireless network resources. First, we prove that, for a given wireless network, there exists a new call arrival rate which can maximize the total utility of users while maintaining the required QoS. Based on this result, we propose an integrated pricing and call admission control scheme where the price is adjusted dynamically based on the current network conditions in order to alleviate the problem of congestion. Our proposed integrated approach implicitly implements a distributed user-based prioritization mechanism by providing negative incentives according to the current network conditions and therefore shaping the aggregate traffic in the network. We compare the performance of our approach in terms of congestion prevention, achievable total user utility, and obtained revenue, with the corresponding results of conventional systems where pricing is not taken into consideration in the call admission control process. These performance results verify the considerable improvement that can be achieved by the integration of pricing in the call admission control process in cellular networks.

169 citations


Journal ArticleDOI
TL;DR: A novel, rate-based, borrowing scheme for QoS provisioning in high-speed cellular networks carrying multimedia traffic that outperforms the best previously known schemes in terms of call dropping probability, call blocking probability, and bandwidth utilization.
Abstract: Now that cellular networks are being called upon to support real-time interactive multimedia traffic such as video teleconferencing, these networks must be able to provide their users with quality-of-service (QoS) guarantees. Although the QoS provisioning problem arises in wireline networks as well, mobility of hosts, scarcity of bandwidth, and channel fading make QoS provisioning a challenging task in wireless networks. It has been noticed that multimedia applications can tolerate and gracefully adapt to transient fluctuations in the QoS that they receive from the network. The management of such adaptive multimedia applications is becoming a new research area in wireless networks. As it turns out, the additional flexibility afforded by the ability of multimedia applications to tolerate and adapt to transient changes in the QoS parameters can be exploited by protocol designers to significantly improve the overall performance of wireless systems. The main contribution of this paper is to propose a novel, rate-based, borrowing scheme for QoS provisioning in high-speed cellular networks carrying multimedia traffic. Our scheme attempts to allocate the desired bandwidth to every multimedia connection originating in a cell or being handed off to the cell. The novelty of our scheme is that, in case of insufficient bandwidth, in order not to deny service to requesting connections (new or hand-off), bandwidth will be borrowed, on a temporary basis, from existing connections. Our borrowing scheme guarantees that no connection gives up more than its fair share of bandwidth, in the sense that the amount of bandwidth borrowed from a connection is proportional to its tolerance to bandwidth loss. Importantly, our scheme ensures that the borrowed bandwidth is promptly returned to the degraded connections. Extensive simulation results show that our rate-based QoS provisioning scheme outperforms the best previously known schemes in terms of call dropping probability, call blocking probability, and bandwidth utilization.
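
The fairness guarantee in the abstract (bandwidth borrowed from a connection proportional to its tolerance to loss) amounts to a one-line proportional split; the connection names, rates, and tolerances below are invented, and a real scheme would also cap each loan at the connection's tolerance and return bandwidth as it frees up.

```python
# Proportional borrowing sketch: a 'deficit' of bandwidth for a new or
# hand-off connection is lent by existing connections in proportion to
# their declared tolerance to bandwidth loss (all numbers illustrative).
connections = {                    # name: (current kbps, tolerable loss kbps)
    "video": (1000.0, 300.0),
    "audio": (200.0, 50.0),
    "data":  (400.0, 150.0),
}
deficit = 250.0
total_tolerance = sum(tol for _, tol in connections.values())

for name, (bw, tol) in connections.items():
    lent = deficit * tol / total_tolerance     # fair share of the deficit
    print(f"{name}: lend {lent:.0f} kbps, keep {bw - lent:.0f} kbps")
```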

159 citations


Journal ArticleDOI
Jianxi Fan
TL;DR: It is shown that the n-dimensional crossed cube is n-diagnosable under a major diagnosis model, the comparison diagnosis model proposed by Malek (1980) and Maeng and Malek (1981), if n ≥ 4, so the polynomial algorithm presented by Sengupta and Dahbura (1992) may be used to diagnose it.
Abstract: Diagnosability of a multiprocessor system is an important research topic in the parallel processing area. As a hypercube variant, the crossed cube has many attractive properties. Its diameter, wide diameter, and fault diameter are all approximately half those of the hypercube, and it is more powerful than the hypercube in simulating trees and cycles. Because of these advantages, the crossed cube has attracted much attention from researchers. We show that the n-dimensional crossed cube is n-diagnosable under a major diagnosis model, the comparison diagnosis model proposed by Malek (1980) and Maeng and Malek (1981), if n ≥ 4. According to this, the polynomial algorithm presented by Sengupta and Dahbura (1992) may be used to diagnose the n-dimensional crossed cube, provided that the number of faulty nodes in the n-dimensional crossed cube does not exceed n. The conclusion of this paper also indicates that the diagnosability of the n-dimensional crossed cube is the same as that of the n-dimensional hypercube when n > 5 and better than that of the n-dimensional hypercube when n = 4.

146 citations


Journal ArticleDOI
Sunghyun Choi, Kang G. Shin
TL;DR: This work designs and evaluates predictive and adaptive schemes for bandwidth reservation for the hand-offs of ongoing sessions and the admission control of new connections, and develops a method to estimate user mobility based on an aggregate history of hand-offs observed in each cell.
Abstract: How to keep the probability of hand-off drops within a prespecified limit is a very important quality-of-service (QoS) issue in cellular networks because mobile users should be able to maintain ongoing sessions even during their hand-off from one cell to another. We design and evaluate predictive and adaptive schemes for bandwidth reservation for the hand-offs of ongoing sessions and the admission control of new connections. We first develop a method to estimate user mobility based on an aggregate history of hand-offs observed in each cell. This method is then used to probabilistically predict mobiles' directions and hand-off times in a cell. For each cell, the bandwidth to be reserved for hand-offs is calculated by estimating the total sum of fractional bandwidths of the expected hand-offs within a mobility-estimation time window. Three different admission-control schemes for new connection requests using this bandwidth reservation are proposed. We also consider variations that utilize the path/location information available from the car navigation system or global positioning system. Finally, we evaluate the performance of the proposed schemes extensively to show that they meet our design goal and outperform the static reservation scheme under various scenarios.

Journal ArticleDOI
TL;DR: It is proved that ERR is efficient, with a per-packet work complexity of O(1); the relative fairness bound of ERR (a popular metric used to measure fairness) is derived analytically, along with a bound on the start-up latency experienced by a new flow that arrives at an ERR scheduler.
Abstract: Parallel systems are increasingly being used in multiuser environments with the interconnection network shared by several users at the same time. Fairness is an intuitively desirable property in the allocation of bandwidth available on a link among traffic flows of different users that share the link. Strict fairness in traffic scheduling can improve the isolation between users, offer a more predictable performance, and improve performance by eliminating some bottlenecks. This paper presents a simple, fair, efficient, and easily implementable scheduling discipline, called Elastic Round Robin (ERR), designed to satisfy the unique needs of wormhole switching, which is popular in interconnection networks of parallel systems. Despite the design constraints imposed by wormhole switching, ERR is also suitable for use in Internet routers and has better fairness and performance characteristics than previously known scheduling algorithms of comparable efficiency, including Deficit Round Robin and Surplus Round Robin. In this paper, we prove that ERR is efficient, with a per-packet work complexity of O(1). We analytically derive the relative fairness bound of ERR, a popular metric used to measure fairness. We also derive a bound on the start-up latency experienced by a new flow that arrives at an ERR scheduler. Finally, this paper presents simulation results comparing the fairness and performance characteristics of ERR with other scheduling disciplines of comparable efficiency.
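
For intuition about the family of schedulers being compared, here is a minimal surplus-style round-robin sketch: a flow may overdraw its per-round quantum to finish a packet, and the overdraft is charged against its next round. ERR's actual allowance computation (it adapts each flow's allowance using per-round surpluses) differs, so this illustrates the family, not the paper's algorithm; flow names and packet sizes are made up.

```python
# Surplus-style round robin (the DRR/SRR/ERR family; not ERR's exact rules).
from collections import deque

QUANTUM = 500                          # bytes per flow per round (illustrative)
flows = {"f1": deque([400, 300, 200]), "f2": deque([900, 100])}
debt = {f: 0 for f in flows}           # overdraft carried into the next round

while any(flows.values()):
    for f, q in flows.items():
        credit, sent = QUANTUM - debt[f], 0
        while q and sent < credit:     # may overshoot to finish a packet
            pkt = q.popleft()
            sent += pkt
            print(f"{f} sends {pkt}B")
        debt[f] = max(0, sent - credit)
```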

Journal ArticleDOI
TL;DR: An M/G/1 model is developed to analytically determine the delay incurred in handling various types of queries using the enhanced APTEEN (Adaptive Periodic Threshold-sensitive Energy Efficient sensor Network protocol) protocol.
Abstract: Wireless sensor networks are a new class of ad hoc networks that will find increasing deployment in coming years, as they enable reliable monitoring and analysis of unfamiliar and untested environments. The advances in technology have made it possible to have extremely small, low powered sensor devices equipped with programmable computing, multiple parameter sensing, and wireless communication capability. Because of their inherent limitations, the protocols designed for such sensor networks must efficiently use both limited bandwidth and battery energy. We develop an M/G/1 model to analytically determine the delay incurred in handling various types of queries using our enhanced APTEEN (Adaptive Periodic Threshold-sensitive Energy Efficient sensor Network protocol) protocol. Our protocol uses an enhanced TDMA schedule to efficiently incorporate query handling, with a queuing mechanism for heavy loads. It also provides the additional flexibility of querying the network through any node in the network. To verify our analytical results, we have simulated a temperature sensing application with a Poisson arrival rate for queries on the network simulator ns-2. As the simulation and analytical results match perfectly well, this can be said to be the first step towards analytically determining the delay characteristics of a wireless sensor network.
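
For reference, M/G/1 delay analyses of the kind the abstract mentions rest on the standard Pollaczek-Khinchine result; the formula below is the textbook version (the paper's model plugs in APTEEN-specific arrival and service assumptions):

```latex
% Mean waiting time W and total delay T in an M/G/1 queue with Poisson
% arrivals at rate \lambda and service time S (utilization \rho < 1).
W = \frac{\lambda\,\mathbb{E}[S^{2}]}{2\,(1-\rho)}, \qquad
\rho = \lambda\,\mathbb{E}[S] < 1, \qquad
T = \mathbb{E}[S] + W .
```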

Journal ArticleDOI
TL;DR: This paper proposes a suitable addressing scheme for nodes, derives a formula for the distance between nodes, and presents a very simple and elegant routing algorithm for the hexagonal interconnection.
Abstract: Nodes in a hexagonal network are placed at the vertices of a regular triangular tessellation, so that each node has up to six neighbors. The network is proposed as an alternative interconnection network to a mesh connected computer (with nodes serving as processors) and is used also to model cellular networks where nodes are the base stations. In this paper, we propose a suitable addressing scheme for nodes (with two variants), derive a formula for distance between nodes, and present a very simple and elegant routing algorithm. This addressing scheme and corresponding routing algorithm for hexagonal interconnection are considerably simpler than previously proposed solutions. We then apply the addressing scheme for solving two problems in cellular networks. With the new scheme, the distance between the new and old cell to which a mobile phone user is connected can be easily determined and coded with three integers, one of them being zero. Further, in order to minimize the wireless cost of tracking mobile users, we propose hexagonal cell identification codes containing three, four, or six bits, respectively, to implement a distance based tracking strategy. These schemes do not have errors in determining cell distance in existing hexagonal based cellular networks. Another application is for connection rerouting in cellular networks during a path extension process.
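
For a feel of the kind of closed-form distance such addressing schemes enable, here is the standard hexagonal-grid distance in axial coordinates; this is a textbook formulation offered for illustration, since the paper defines its own addressing variants and formula.

```python
# Hexagonal distance in standard axial coordinates (q, r). Equivalent to the
# cube-coordinate distance (|x| + |y| + |z|) / 2 with x = dq, z = dr,
# y = -x - z. Illustrative; the paper's own scheme differs in detail.
def hex_distance(a, b):
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2

print(hex_distance((0, 0), (2, -1)))   # 2 hops between the two cells
```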

Journal ArticleDOI
TL;DR: This work develops a specification methodology that documents and specifies a cache coherence protocol in eight tables: the states, events, actions, and transitions of the cache and memory controllers, and demonstrates the utility of the table-based specification methodology.
Abstract: We develop a specification methodology that documents and specifies a cache coherence protocol in eight tables: the states, events, actions, and transitions of the cache and memory controllers. We then use this methodology to specify a detailed, modern three-state broadcast snooping protocol with an unordered data network and an ordered address network that allows arbitrary skew. We also present a detailed specification of a new protocol called multicast snooping (Bilir et al., 1999) and, in doing so, we better illustrate the utility of the table-based specification methodology. Finally, we demonstrate a technique for verification of the multicast snooping protocol, through the sketch of a manual proof that the specification satisfies a sequentially consistent memory model.
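
To convey the flavor of a table-based specification, here is a toy three-state (MSI-like) snooping controller written as a (state, event) → (actions, next state) table; the states, events, and actions are illustrative stand-ins, not the paper's eight tables.

```python
# Toy table-driven cache controller in the table-based specification style.
# (State/event/action names are illustrative, not the paper's protocol.)
TRANSITIONS = {
    ("I", "load"):       (["issue GetS"], "S"),
    ("I", "store"):      (["issue GetM"], "M"),
    ("S", "store"):      (["issue GetM"], "M"),
    ("S", "other_GetM"): ([], "I"),
    ("M", "other_GetS"): (["send data"], "S"),
    ("M", "other_GetM"): (["send data"], "I"),
}

def step(state, event):
    actions, nxt = TRANSITIONS[(state, event)]
    for a in actions:
        print(f"{state} --{event}--> {nxt}: {a}")
    return nxt

s = step("I", "load")    # miss: fetch a shared copy
s = step(s, "store")     # upgrade to modified
```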

Journal ArticleDOI
TL;DR: This work proposes a novel scheme for MAC address assignment whose two key features are the exploitation of spatial address reuse and an encoded representation of the addresses in data packets, and develops a purely distributed assignment algorithm that relies solely on local message exchanges.
Abstract: Sensor networks consist of autonomous wireless sensor nodes that are networked together in an ad hoc fashion. The tiny nodes are equipped with substantial processing capabilities, enabling them to combine and compress their sensor data. The aim is to limit the amount of network traffic, and as such conserve the nodes' limited battery energy. However, due to the small packet payload, the MAC header is a significant, and energy-costly, overhead. To remedy this, we propose a novel scheme for a MAC address assignment. The two key features which make our approach unique are the exploitation of spatial address reuse and an encoded representation of the addresses in data packets. To assign the addresses, we develop a purely distributed algorithm that relies solely on local message exchanges. Other salient features of our approach are the ability to handle unidirectional links and the excellent scalability of both the assignment algorithm and address representation. In typical scenarios, the MAC overhead is reduced by a factor of three compared to existing approaches.

Journal ArticleDOI
TL;DR: Unified tools for obtaining the topological properties of an arbitrary OTIS network based on the properties of the corresponding factor network are presented.
Abstract: We conduct a general study of the topological properties of optical transpose interconnection systems (OTIS). We first obtain their basic topological metrics of size, degree, shortest distance and diameter, and then we obtain results related to the recursive structure and efficient embedding of meshes, cubes, spanning trees and cycles. We also present minimal one-to-one routing and optimal broadcasting algorithms, and we show how to construct node-disjoint paths between any two nodes of an OTIS network. Recent studies have addressed only particular members of the general class of OTIS networks. In this paper, we present unified tools for obtaining the topological properties of an arbitrary OTIS network based on the properties of the corresponding factor network.

Journal ArticleDOI
TL;DR: Five recursive layouts with successively increasing complexity of address computation are evaluated, and it is shown that addressing overheads can be kept under control even for the most computationally demanding of these layouts.
Abstract: The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen (1969) and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder the computation to conserve memory space and improve performance by between 10 and 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated, and it is shown that addressing overheads can be kept under control even for the most computationally demanding of these layouts.
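
One standard recursive layout, Z-Morton order, maps a 2-D index to memory by interleaving the bits of the row and column indices; the sketch below assumes power-of-two dimensions, and note that the paper finds it better to stop such recursion at canonically ordered tiles rather than individual elements.

```python
# Z-Morton index: interleave row bits (odd positions) with column bits
# (even positions). Assumes power-of-two matrix dimensions.
def morton_index(row, col, bits=16):
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)
        idx |= ((col >> b) & 1) << (2 * b)
    return idx

# Each 2x2 quadrant of a 4x4 matrix occupies a contiguous run of 4 slots:
for r in range(4):
    print([morton_index(r, c) for c in range(4)])
# [0, 1, 4, 5] / [2, 3, 6, 7] / [8, 9, 12, 13] / [10, 11, 14, 15]
```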

Journal ArticleDOI
TL;DR: This paper shows that list scheduling with statically-computed priorities (LSSP) can be performed at a significantly lower cost than existing approaches, without sacrificing performance, and can be applied to any LSSP algorithm.
Abstract: In compile-time task scheduling for distributed-memory systems, list scheduling is generally accepted as an attractive approach, since it pairs low cost with good results. List-scheduling algorithms schedule tasks in order of their priority. This priority can be computed either (1) statically, before the scheduling, or (2) dynamically, during the scheduling. In this paper, we show that list scheduling with statically-computed priorities (LSSP) can be performed at a significantly lower cost than existing approaches, without sacrificing performance. Our approach is general, i.e. it can be applied to any LSSP algorithm. The low complexity is achieved by using low-complexity methods for the most time-consuming parts in list-scheduling algorithms, i.e. processor selection and task selection, preserving the criteria used in the original algorithms. We exemplify our method by applying it to the MCP (Modified Critical Path) algorithm. Using an extension of this method, we can also reduce the time complexity of a particular class of list scheduling with dynamic priorities (LSDP) [including algorithms such as DLS (Dynamic Level Scheduling), ETF (Earliest Task First) and ERT (Earliest Ready Task)]. Our results confirm that the modified versions of the list-scheduling algorithms obtain a performance comparable to their original versions, yet at a significantly lower cost. We also show that the modified versions of the list-scheduling algorithms consistently outperform multi-step algorithms, such as DSC-LLB (Dynamic Sequence Clustering with List Load Balancing), which also have higher complexity and clearly outperform algorithms in the same class of complexity, such as CPM (Critical Path Method).

Journal ArticleDOI
TL;DR: In view of the special characteristics of agents, a novel communication-based load-balancing algorithm that works by associating a credit value with each agent is proposed, implemented, and evaluated.
Abstract: Multiagent computing on a cluster of workstations is widely envisioned to be a powerful paradigm for building useful distributed applications. The agents of the system span across all the machines of a cluster. Just like the case of traditional distributed systems, load balancing becomes an area of concern. With different characteristics between ordinary processes and agents, it is both interesting and useful to investigate whether conventional load-balancing strategies are also applicable and sufficient to cope with the newly emerging needs, such as coping with temporally continuous agents, devising a performance metric for multiagent systems, and taking into account the vast amount of communication and interaction among agents. This paper discusses the above issues with reference to agent properties and load-balancing techniques and outlines the space of load-balancing design choices in the arena of multiagent computing. In view of the special agent characteristics, a novel communication-based load-balancing algorithm is proposed, implemented, and evaluated. The proposed algorithm works by associating a credit value with each agent. The credit of an agent depends on its affinity to a machine, its current workload, its communication behavior, and its mobility. When a load imbalance occurs, the credits of all agents are examined, and an agent with a lower credit value is migrated to a relatively lightly loaded machine in the system. Quasi-simulated experiments with this algorithm show load-balancing improvements compared with conventional workload-oriented load-balancing schemes.

Journal ArticleDOI
TL;DR: The new RAID-x design is experimentally compared with RAID-5, RAID-10, and chained-declustering RAID through benchmarking on a research Linux cluster at USC; the strength of RAID-x is found in three areas: improved aggregate I/O bandwidth (especially for parallel writes), orthogonal mirroring with low software overhead, and enhanced scalability in cluster I/O processing.
Abstract: This paper presents a new distributed disk-array architecture for achieving high I/O performance in scalable cluster computing. In a serverless cluster of computers, all distributed local disks can be integrated as a distributed-software redundant array of independent disks (ds-RAID) with a single I/O space. We report the new RAID-x design and its benchmark performance results. The advantage of RAID-x comes mainly from its orthogonal striping and mirroring (OSM) architecture. The bandwidth is enhanced with distributed striping across local and remote disks, while the reliability comes from orthogonal mirroring on local disks in the background. Our RAID-x design is experimentally compared with the RAID-5, RAID-10, and chained-declustering RAID through benchmarking on a research Linux cluster at USC. Andrew and Bonnie benchmark results are reported on all four disk-array architectures. Cooperative disk drivers and Linux extensions are developed to enable not only the single I/O space, but also the shared virtual memory and global file hierarchy. We reveal the effects of traffic rate and stripe unit size on I/O performance. Through scalability and overhead analysis, we find the strength of RAID-x in three areas: 1) improved aggregate I/O bandwidth, especially for parallel writes, 2) orthogonal mirroring with low software overhead, and 3) enhanced scalability in cluster I/O processing. Architectural strengths and weaknesses of all four ds-RAID architectures are evaluated comparatively. The optimal choice among them depends on the parallel read/write performance desired, the level of fault tolerance required, and the cost-effectiveness in specific I/O processing applications.

Journal ArticleDOI
TL;DR: This paper considers a technique for composing global (barrier-style) and local (channel scanning) synchronization protocols within a single parallel discrete-event simulation and demonstrates an implementation which finds an optimal solution at runtime and considers its performance on network topologies.
Abstract: This paper considers a technique for composing global (barrier-style) and local (channel scanning) synchronization protocols within a single parallel discrete-event simulation. Composition is attractive because it allows one to tailor the synchronization mechanism to the model being simulated. We first motivate the problem by showing the large performance gap that can be introduced by a mismatch of model and synchronization method. Our solution calls for each channel between submodels to be classified as synchronous or asynchronous. We mathematically formulate the problem of optimally classifying channels and show that, in principle, the optimal classification can be obtained in time proportional to max{C × log C, V × N}, where C is the number of channels, V the number of unique minimal delays on those channels, and N is the number of submodels. We then demonstrate an implementation which finds an optimal solution at runtime and consider its performance on network topologies, including one of the global Internet at the autonomous system level. We find that the automated method effectively determines channel assignments that maximize performance.

Journal ArticleDOI
TL;DR: A novel optional and responsive fine-grain locking scheme is proposed for consistency maintenance in Internet-based collaborative editors that is made optional in the sense that a user may update any part of the document without necessarily requesting a lock.
Abstract: Locking is a standard technique used in distributed computing and database systems to ensure data integrity by prohibiting concurrent conflicting updates on shared data objects. Internet-based collaborative systems are a special class of distributed applications which support human-to-human interaction and collaboration over the Internet. In this paper, a novel optional and responsive fine-grain locking scheme is proposed for consistency maintenance in Internet-based collaborative editors. In the proposed scheme, locking is made optional in the sense that a user may update any part of the document without necessarily requesting a lock. In the face of high communication latency in the Internet environment, responsive locking is achieved by granting the permit to the user for updating the data region immediately after issuing a locking request. Moreover, multiple fine-grain locks can be placed on different regions inside a document to allow concurrent and mutually exclusive editing on the same document. Protocols and algorithms for locking conflict resolution and consistency maintenance are devised to address special technical issues involved in optional and responsive fine-grain locking. The proposed locking scheme and supporting techniques were implemented in an Internet-based collaborative editor to demonstrate its feasibility and usability.

Journal ArticleDOI
TL;DR: It is shown that the proposed policies can effectively improve overall job execution performance by making good use of both CPU and memory resources under known and unknown memory demands.
Abstract: The cluster system we consider for load sharing is a compute farm, a pool of networked server nodes providing high-performance computing for CPU-intensive, memory-intensive, and I/O-active jobs in a batch mode. Existing resource management systems mainly target balancing CPU load among server nodes. While CPU chips advance rapidly, improvements in memory and disk access speed significantly lag behind, increasing the penalty for data movement, such as page faults and I/O operations, relative to normal CPU operations. Aiming at reducing the memory resource contention caused by page faults and I/O activities, we have developed and examined load-sharing policies that consider effective usage of global memory in addition to CPU load balancing in clusters. We study two types of application workloads: 1) memory demands are known in advance or are predictable and 2) memory demands are unknown and change dynamically during execution. Besides using workload traces with known memory demands, we have also instrumented the kernel to collect different types of workload execution traces to capture dynamic memory access patterns. Conducting different groups of trace-driven simulations, we show that our proposed policies can effectively improve overall job execution performance by making good use of both CPU and memory resources under known and unknown memory demands.

Journal ArticleDOI
TL;DR: This paper examines and compares two novel input/output access pattern classification methods based on learning algorithms, and proposes a method for forming global classifications from local classifications for parallel file system performance.
Abstract: Input/output performance on current parallel file systems is sensitive to a good match of application access patterns to file system capabilities. Automatic input/output access pattern classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper, we examine and compare two novel input/output access pattern classification methods based on learning algorithms. The first approach uses a feedforward neural network previously trained on access pattern benchmarks to generate qualitative classifications. The second approach uses hidden Markov models trained on access patterns from previous executions to create a probabilistic model of input/output accesses. In a parallel application, access patterns can be recognized at the level of each local thread or as the global interleaving of all application threads. Classification of patterns at both levels is important for parallel file system performance; we propose a method for forming global classifications from local classifications. We present results from parallel and sequential benchmarks and applications that demonstrate the viability of this approach.

Journal ArticleDOI
TL;DR: This paper evaluates the silicon overhead of SMT by performing a transistor/interconnect-level analysis of the layout and shows how the Instruction Set Architecture (ISA) and microarchitecture can have a large effect on the SMT overhead and performance.
Abstract: Simultaneous Multi-Threading (SMT) is a hardware technique that increases processor throughput by issuing instructions simultaneously from multiple threads. However, while SMT can be added to an existing microarchitecture with relatively low overhead, this additional chip area could be used for other resources such as more functional units, larger caches, or better branch predictors. How large is the SMT overhead and at what point does SMT no longer pay off for maximum throughput compared to adding other architecture features? This paper evaluates the silicon overhead of SMT by performing a transistor/interconnect-level analysis of the layout. We discuss microarchitecture issues that impact SMT implementations and show how the Instruction Set Architecture (ISA) and microarchitecture can have a large effect on the SMT overhead and performance. Results show that SMT yields large performance gains with small to moderate area overhead.

Journal ArticleDOI
TL;DR: A general algorithm is developed that guarantees relaxed mutual exclusion for a single resource; necessary and sufficient conditions for the information structure are proved, and the results can be used to design more efficient distributed channel allocation algorithms.
Abstract: Distributed dynamic channel allocation (DDCA) is a fundamental resource management problem in mobile cellular networks. It has a flavor of distributed mutual exclusion but is not exactly a mutual exclusion problem. We establish the exact relationship between the two problems. Specifically, we introduce the problem of relaxed mutual exclusion to model one important aspect of the DDCA problem. We develop a general algorithm that guarantees relaxed mutual exclusion for a single resource and prove necessary and sufficient conditions for the information structure. Considering distributed dynamic channel allocation as a special case of relaxed mutual exclusion, we apply and extend the algorithm to further address the issues that arise in distributed channel allocation such as deadlock resolution, dealing with multiple channels, design of efficient information structures, and channel selection strategies. Based on these results, we propose an example distributed channel allocation scheme using one of the information structures proposed. Analysis and simulation results are provided and show that the results of this research can be used to design more efficient distributed channel allocation algorithms.

Journal ArticleDOI
TL;DR: The overall goal of this research is to formulate policies required to drive a dynamically adaptive metapartitioner for SAMR grid hierarchies capable of selecting the most appropriate partitioning strategy at runtime based on current application and system state.
Abstract: Structured adaptive mesh refinement (SAMR) methods for the numerical solution of partial differential equations yield highly advantageous ratios for cost/accuracy as compared to methods based on static uniform approximations. These techniques are being effectively used in many domains including computational fluid dynamics, numerical relativity, astrophysics, subsurface modeling, and oil reservoir simulation. Distributed implementations of these methods, however, lead to significant challenges in dynamic data-distribution, load-balancing, and runtime management. This paper presents an application-centric characterization of a suite of dynamic domain-based inverse space-filling curve partitioning techniques for the distributed adaptive grid hierarchies that underlie SAMR applications. The overall goal of this research is to formulate policies required to drive a dynamically adaptive metapartitioner for SAMR grid hierarchies capable of selecting the most appropriate partitioning strategy at runtime based on current application and system state. Such a metapartitioner can significantly reduce the execution time of SAMR applications.
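
In a nutshell, inverse space-filling curve partitioning orders grid blocks along a curve and then cuts the resulting 1-D sequence into contiguous, roughly equal-load chunks; the sketch below uses a Morton ordering and invented per-block loads purely for illustration (the paper characterizes a whole suite of such techniques).

```python
# Inverse space-filling curve partitioning sketch: order blocks along a
# Morton curve, then cut into contiguous chunks of roughly equal load.
def morton(r, c, bits=8):
    idx = 0
    for b in range(bits):
        idx |= ((r >> b) & 1) << (2 * b + 1) | ((c >> b) & 1) << (2 * b)
    return idx

# Hypothetical per-block workloads on a 4x4 grid (a refined corner costs more).
load = {(r, c): 4 if (r >= 2 and c >= 2) else 1
        for r in range(4) for c in range(4)}
blocks = sorted(load, key=lambda rc: morton(*rc))      # 1-D curve order

NPROCS = 4
target = sum(load.values()) / NPROCS                   # ideal load per processor
parts, acc, p = {}, 0.0, 0
for blk in blocks:
    if acc >= target and p < NPROCS - 1:               # close off this chunk
        p, acc = p + 1, 0.0
    parts[blk] = p
    acc += load[blk]
print(parts)   # block -> processor; curve locality keeps neighbors together
```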

Journal ArticleDOI
TL;DR: An approach to designing cellular automata-based multiprocessor scheduling algorithms in which extracting knowledge about the scheduling process occurs is presented, and a generic definition of program graph neighborhood is proposed, transparent to the various kinds, sizes, and shapes of program graphs.
Abstract: We present an approach to designing cellular automata-based multiprocessor scheduling algorithms in which extracting knowledge about the scheduling process occurs. We consider the simplest case when a multiprocessor system is limited to two-processors. To design cellular automata corresponding to a given program graph, we propose a generic definition of program graph neighborhood, transparent to the various kinds, sizes, and shapes of program graphs. The cellular automata-based scheduler works in two modes: learning mode and operation mode. Discovered rules are typically suitable for sequential cellular automata working as a scheduler, while the most interesting and promising feature of cellular automata are their massive parallelism. To overcome difficulties in evolving parallel cellular automata rules, we propose using coevolutionary genetic algorithm. Discovered this way, rules enable us to design effective parallel schedulers. We present a number of experimental results for both sequential and parallel scheduling algorithms discovered in the context of a cellular automata-based scheduling system.

Journal ArticleDOI
TL;DR: This study investigates rebuild algorithms for automatically rebuilding data stored in a failed disk into a spare disk, and a novel pipelined rebuild algorithm is proposed to take advantage of the sequential property of track retrievals to pipeline the reading and writing processes.
Abstract: Continuous-media (CM) servers have been around for some years. Apart from server capacity, another important issue in the deployment of CM servers is reliability. This study investigates rebuild algorithms for automatically rebuilding data stored in a failed disk into a spare disk. Specifically, a block-based rebuild algorithm is studied with the rebuild time and buffer requirement modeled. A buffer-sharing scheme is then proposed to eliminate the additional buffers needed by the rebuild process. To further improve rebuild performance, a track-based rebuild algorithm that rebuilds lost data in tracks is proposed and analyzed. Results show that track-based rebuild, while it substantially outperforms block-based rebuild, requires significantly more buffers (17-135 percent more) even with buffer sharing. To tackle this problem, a novel pipelined rebuild algorithm is proposed to take advantage of the sequential property of track retrievals to pipeline the reading and writing processes. This pipelined rebuild algorithm achieves the same rebuild performance as track-based rebuild, but reduces the extra buffer requirement to insignificant levels (0.7-1.9 percent). Numerical results computed using models of five commercial disk drives demonstrate that automatic rebuild of a failed disk can be done in a reasonable amount of time, even at relatively high server utilization (e.g., less than 1.5 hours at 90 percent utilization).