Workload and Failure Characterization on a Large-Scale Federated Testbed

Open Access

TL;DR:

A detailed characterization of the actual use of the PlanetLab network testbed is presented, based on a three-month study that used a variety of measurement tools to examine the network, CPU, memory, and disk usage of individual PlanetLab nodes and sites.

Abstract:

Recently, a number of federated distributed computational and communication infrastructures have emerged, including the Grid, PlanetLab, and Content Distribution Networks. In these environments, mutually distrustful autonomous domains pool resources for their mutual benefit, for instance to gain access to unique computational resources, multiple vantage points on the network, or more computation than is available locally. Key challenges for such federated infrastructures include resource allocation, scheduling, and constructing highly available services in the face of faulty end hosts and unpredictable network behavior. Developing appropriate mechanisms and policies requires an understanding of the usage characteristics and operating conditions of the target environment. In this paper, we present a detailed characterization of the actual use of the PlanetLab network testbed. PlanetLab consists of 240 nodes spread across 100 autonomous domains with over 500 active users. Using a variety of measurement tools, we present a three-month study of the network, CPU, memory, and disk usage of individual PlanetLab nodes and sites. On the consumer side, we further characterize the consumption of individual users. Next, we present results on the availability and reliability of system nodes and the network interconnecting them. Finally, we discuss the implications of our measurements for emerging federated environments.

Citations

Proceedings Article

Exploring event correlation for failure prediction in coalitions of clusters

Song Fu, +1 more

TL;DR: A spherical covariance model with an adjustable timescale parameter to quantify temporal correlation and a stochastic model to describe spatial correlation are developed to cluster failure events based on their correlations and predict their future occurrences.

Proceedings Article

Mirage: a microeconomic resource allocation system for sensornet testbeds

Brent N. Chun, +7 more

TL;DR: It is argued that a microeconomic resource allocation scheme, specifically the combinatorial auction, is well suited to testbed resource management; to demonstrate this, the Mirage resource allocation system is presented.

Proceedings Article

Subtleties in tolerating correlated failures in wide-area storage systems

Suman Nath, +3 more

TL;DR: This paper systematically revisits previously proposed techniques for addressing correlated failures and identifies a set of design principles that system builders can use to tolerate correlated failures.

Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems

Praveen Yalagandula, +1 more

TL;DR: This paper analyzes traces from three large distributed systems to answer several subtle questions regarding machine failure characteristics and derives a set of fundamental principles for designing highly available distributed systems.

Proceedings Article

Multi-state grid resource availability characterization

Brent Rood, +1 more

TL;DR: This paper introduces five availability states, and characterizes a Condor pool trace that uncovers when, how, and why its resources reside in, and transition between, these states, which suggests resource categories that schedulers can use to make better mapping decisions.

References

Proceedings Article

Chord: A scalable peer-to-peer lookup service for internet applications

Ion Stoica, +4 more

TL;DR: Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

Journal Article

Xen and the art of virtualization

Paul Barham, +8 more

TL;DR: Xen is an x86 virtual machine monitor that allows multiple commodity operating systems to share conventional hardware in a safe and resource-managed fashion without sacrificing either performance or functionality, and it considerably outperforms competing commercial and freely available solutions.

Journal Article

Free riding on Gnutella

Eytan Adar, +2 more

First Monday, 02 Oct 2000

TL;DR: It is argued that free riding leads to degradation of the system performance and adds vulnerability to the system, and copyright issues might become moot compared to the possible collapse of such systems.

Journal Article

An integrated experimental environment for distributed systems and networks

Brian S. White, +8 more

TL;DR: The overall design and implementation of Netbed is presented and its ability to improve experimental automation and efficiency is demonstrated, leading to new methods of experimentation, including automated parameter-space studies within emulation and straightforward comparisons of simulated, emulated, and wide-area scenarios.

Journal Article

Measurement, modeling, and analysis of a peer-to-peer file-sharing workload

Krishna P. Gummadi, +5 more

TL;DR: It is demonstrated that, unlike the Web, whose workload is driven by document change, multimedia workloads such as Kazaa are driven primarily by clients' fetch-at-most-once behavior, the creation of new objects, and the addition of new clients to the system.

Related Papers (5)

Failure data analysis of a large-scale heterogeneous server environment

A large-scale study of failures in high-performance computing systems

Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs

SHARP: an architecture for secure resource peering

Modeling machine availability in enterprise and wide-area distributed computing environments