Author

Derrick Kondo

Other affiliations: Teradata, University of Paris-Sud, Stanford University
Bio: Derrick Kondo is an academic researcher from the French Institute for Research in Computer Science and Automation. The author has contributed to research in topics: Grid computing & Cloud computing. The author has an h-index of 29 and has co-authored 63 publications receiving 3454 citations. Previous affiliations of Derrick Kondo include Teradata & University of Paris-Sud.


Papers
Proceedings ArticleDOI
23 May 2009
TL;DR: This work compares and contrasts the performance and monetary cost-benefits of clouds for desktop grid applications ranging in computational size and storage, and examines performance measurements and monetary expenses of real desktop grids and the Amazon Elastic Compute Cloud.
Abstract: Cloud Computing has taken commercial computing by storm. However, adoption of cloud computing platforms and services by the scientific community is in its infancy as the performance and monetary cost-benefits for scientific applications are not perfectly clear. This is especially true for desktop grids (aka volunteer computing) applications. We compare and contrast the performance and monetary cost-benefits of clouds for desktop grid applications, ranging in computational size and storage. We address the following questions: (i) What are the performance tradeoffs in using one platform over the other? (ii) What are the specific resource requirements and monetary costs of creating and deploying applications on each platform? (iii) In light of those monetary and performance cost-benefits, how do these platforms compare? (iv) Can cloud computing platforms be used in combination with desktop grids to improve cost-effectiveness even further? We examine those questions using performance measurements and monetary expenses of real desktop grids and the Amazon elastic compute cloud.

383 citations

Proceedings ArticleDOI
05 Jul 2010
TL;DR: Based on the real price history of EC2 spot instances, this work compares several adaptive checkpointing schemes in terms of monetary costs and improvement of job completion times, and shows that its approach can significantly reduce both price and task completion times.
Abstract: Recently introduced spot instances in the Amazon Elastic Compute Cloud (EC2) offer lower resource costs in exchange for reduced reliability; these instances can be revoked abruptly due to price and demand fluctuations. Mechanisms and tools that deal with the cost-reliability trade-offs under this schema are of great value for users seeking to lessen their costs while maintaining high reliability. We study how one such mechanism, namely checkpointing, can be used to minimize the cost and volatility of resource provisioning. Based on the real price history of EC2 spot instances, we compare several adaptive checkpointing schemes in terms of monetary costs and improvement of job completion times. Trace-based simulations show that our approach can significantly reduce both price and task completion times.

270 citations

Proceedings ArticleDOI
17 May 2010
TL;DR: The Failure Trace Archive is created as an online public repository of availability traces taken from diverse parallel and distributed systems to facilitate the design, validation, and comparison of fault-tolerant models and algorithms.
Abstract: With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA) as an online public repository of availability traces taken from diverse parallel and distributed systems. Our main contributions in this study are the following. First, we describe the design of the archive, in particular the rationale of the standard FTA format, and the design of a toolbox that facilitates automated analysis of trace data sets. Second, applying the toolbox, we present a uniform comparative analysis with statistics and models of failures in nine distributed systems. Third, we show how different interpretations of these data sets can result in different conclusions. This emphasizes the critical need for the public availability of trace data and methods for their analysis.

203 citations

Proceedings ArticleDOI
26 Apr 2004
TL;DR: This work utilizes measurements of an enterprise desktop grid with over 220 hosts running the Entropia commercial desktop grid software to characterize CPU availability and develops a performance model for desktop grid applications for various task granularities, showing that there is an optimal task size.
Abstract: Summary form only given. Desktop resources are attractive for running compute-intensive distributed applications. Several systems that aggregate these resources in desktop grids have been developed. While these systems have been successfully used for many high throughput applications there has been little insight into the detailed temporal structure of CPU availability of desktop grid resources. Yet, this structure is critical to characterize the utility of desktop grid platforms for both task parallel and even data parallel applications. We address the following questions: (i) What are the temporal characteristics of desktop CPU availability in an enterprise setting? (ii) How do these characteristics affect the utility of desktop grids? (iii) Based on these characteristics, can we construct a model of server "equivalents" for the desktop grids, which can be used to predict application performance? We present measurements of an enterprise desktop grid with over 220 hosts running the Entropia commercial desktop grid software. We utilize these measurements to characterize CPU availability and develop a performance model for desktop grid applications for various task granularities, showing that there is an optimal task size. We then use a cluster equivalence metric to quantify the utility of the desktop grid relative to that of a dedicated cluster.

202 citations

Proceedings ArticleDOI
17 Aug 2010
TL;DR: This work proposes a probabilistic model for the optimization of monetary costs, performance, and reliability, given user and application requirements and dynamic conditions, and demonstrates how users should bid optimally on Spot Instances to reach different objectives with desired levels of confidence.
Abstract: With the recent introduction of Spot Instances in the Amazon Elastic Compute Cloud (EC2), users can bid for resources and thus control the balance of reliability versus monetary costs. A critical challenge is to determine bid prices that minimize monetary costs for a user while meeting Service Level Agreement (SLA) constraints (for example, sufficient resource availability to complete a computation within a desired deadline). We propose a probabilistic model for the optimization of monetary costs, performance, and reliability, given user and application requirements and dynamic conditions. Using real instance price traces and workload models, we evaluate our model and demonstrate how users should bid optimally on Spot Instances to reach different objectives with desired levels of confidence.

194 citations


Cited by
Proceedings Article
01 Jan 2003

1,212 citations

Proceedings ArticleDOI
17 Apr 2015
TL;DR: A summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it are presented.
Abstract: Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

1,185 citations

Proceedings ArticleDOI
14 Oct 2012
TL;DR: Analysis of the first publicly available trace data from a sizable multi-purpose cluster finds that many longer-running jobs have relatively stable resource utilizations, which can help adaptive resource schedulers.
Abstract: To better understand the challenges in developing effective cloud-based resource schedulers, we analyze the first publicly available trace data from a sizable multi-purpose cluster. The most notable workload characteristic is heterogeneity: in resource types (e.g., cores:RAM per machine) and their usage (e.g., duration and resources needed). Such heterogeneity reduces the effectiveness of traditional slot- and core-based scheduling. Furthermore, some tasks are constrained as to the kind of machine types they can use, increasing the complexity of resource assignment and complicating task migration. The workload is also highly dynamic, varying over time and most workload features, and is driven by many short jobs that demand quick scheduling decisions. While few simplifying assumptions apply, we find that many longer-running jobs have relatively stable resource utilizations, which can help adaptive resource schedulers.

1,051 citations

01 Jan 2011
TL;DR: It is shown that energy consumption in transport and switching can be a significant percentage of total energy consumption in cloud computing. The analysis considers both public and private clouds, and includes the energy consumption of the transmission and switching networks.
Abstract: Network-based cloud computing is rapidly expanding as an alternative to conventional office-based computing. As cloud computing becomes more widespread, the energy consumption of the network and computing resources that underpin the cloud will grow. This is happening at a time when there is increasing attention being paid to the need to manage energy consumption across the entire information and communications technology (ICT) sector. While data center energy use has received much attention recently, there has been less attention paid to the energy consumption of the transmission and switching networks that are key to connecting users to the cloud. In this paper, we present an analysis of energy consumption in cloud computing. The analysis considers both public and private clouds, and includes energy consumption in switching and transmission as well as data processing and data storage. We show that energy consumption in transport and switching can be a significant percentage of total energy consumption in cloud computing. Cloud computing can enable more energy-efficient use of computing power, especially when the computing tasks are of low intensity or infrequent. However, under some circumstances cloud computing can consume more energy than conventional computing where each user performs all computing on their own personal computer (PC).

748 citations