Showing papers on "Scalability published in 2010"

PDF

Open Access

Proceedings Article•

Spark: cluster computing with working sets

[...]

Matei Zaharia¹, Mosharaf Chowdhury¹, Michael J. Franklin¹, Scott Shenker¹, Ion Stoica¹ - Show less +1 more•Institutions (1)

University of California, Berkeley¹

22 Jun 2010

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

...read moreread less

Abstract: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

...read moreread less

4,959 citations

Journal Article•DOI•

Cassandra: a decentralized structured storage system

[...]

Avinash Lakshman¹, Prashant Malik¹•Institutions (1)

Facebook¹

14 Apr 2010-Operating Systems Review

TL;DR: Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.

...read moreread less

Abstract: Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

...read moreread less

2,870 citations

Journal Article•DOI•

CD-HIT Suite

[...]

Ying Huang¹, Beifang Niu¹, Ying Gao¹, Limin Fu¹, Weizhong Li¹ - Show less +1 more•Institutions (1)

University of California, San Diego¹

01 Mar 2010-Bioinformatics

TL;DR: A new web server, CD-HIT Suite, is developed for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels and users can now interactively explore the clusters within web browsers.

...read moreread less

Abstract: Summary: CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. In order to further assist the CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels. Availability: Free access at http://cd-hit.org Contact: [email protected] Supplementary information:Supplementary data are available at Bioinformatics online.

...read moreread less

2,084 citations

Proceedings Article•DOI•

Onix: a distributed control platform for large-scale production networks

[...]

Teemu Koponen, Martin Casado, Natasha Gude, Jeremy Stribling, Leon Poutievski¹, Min Zhu¹, Rajiv Ramanathan¹, Yuichiro Iwata², Hiroaki Inoue², Takayuki Hama², Scott Shenker³ - Show less +7 more•Institutions (3)

Google¹, NEC², International Computer Science Institute³

04 Oct 2010

TL;DR: Onix provides a general API for control plane implementations, while allowing them to make their own trade-offs among consistency, durability, and scalability.

...read moreread less

Abstract: Computer networks lack a general control paradigm, as traditional networks do not provide any network-wide management abstractions. As a result, each new function (such as routing) must provide its own state distribution, element discovery, and failure recovery mechanisms. We believe this lack of a common control platform has significantly hindered the development of flexible, reliable and feature-rich network control planes.To address this, we present Onix, a platform on top of which a network control plane can be implemented as a distributed system. Control planes written within Onix operate on a global view of the network, and use basic state distribution primitives provided by the platform. Thus Onix provides a general API for control plane implementations, while allowing them to make their own trade-offs among consistency, durability, and scalability.

...read moreread less

1,463 citations

Book Chapter•DOI•

The ns-3 Network Simulator

[...]

George F. Riley¹, Thomas R. Henderson²•Institutions (2)

Georgia Institute of Technology¹, University of Washington²

01 Jan 2010

TL;DR: With simulation based studies, the approach can be studied in detail at varying scales, with varying data applications, varying field conditions, and will result in reproducible and analyzable results.

...read moreread less

Abstract: As networks of computing devices grow larger and more complex, the need for highly accurate and scalable network simulation technologies becomes critical. Despite the emergence of large-scale testbeds for network research, simulation still plays a vital role in terms of scalability (both in size and in experimental speed), reproducibility, rapid prototyping, and education. With simulation based studies, the approach can be studied in detail at varying scales, with varying data applications, varying field conditions, and will result in reproducible and analyzable results.

...read moreread less

1,462 citations

Journal Article•DOI•

Wireless sensor networks for healthcare: A survey

[...]

Hande Alemdar¹, Cem Ersoy¹•Institutions (1)

Boğaziçi University¹

01 Oct 2010-Computer Networks

TL;DR: This paper provides several state of the art examples together with the design considerations like unobtrusiveness, scalability, energy efficiency, security and also provides a comprehensive analysis of the benefits and challenges of these systems.

...read moreread less

1,331 citations

Journal Article•DOI•

Diffusion LMS Strategies for Distributed Estimation

[...]

Federico S. Cattivelli¹, Ali H. Sayed¹•Institutions (1)

University of California, Los Angeles¹

01 Mar 2010-IEEE Transactions on Signal Processing

TL;DR: This work motivates and proposes new versions of the diffusion LMS algorithm that outperform previous solutions, and provides performance and convergence analysis of the proposed algorithms, together with simulation results comparing with existing techniques.

...read moreread less

Abstract: We consider the problem of distributed estimation, where a set of nodes is required to collectively estimate some parameter of interest from noisy measurements. The problem is useful in several contexts including wireless and sensor networks, where scalability, robustness, and low power consumption are desirable features. Diffusion cooperation schemes have been shown to provide good performance, robustness to node and link failure, and are amenable to distributed implementations. In this work we focus on diffusion-based adaptive solutions of the LMS type. We motivate and propose new versions of the diffusion LMS algorithm that outperform previous solutions. We provide performance and convergence analysis of the proposed algorithms, together with simulation results comparing with existing techniques. We also discuss optimization schemes to design the diffusion LMS weights.

...read moreread less

1,116 citations

Proceedings Article•DOI•

Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement

[...]

Xiaoqiao Meng¹, Vasileios Pappas¹, Li Zhang¹•Institutions (1)

IBM¹

14 Mar 2010

TL;DR: This paper designs a two-tier approximate algorithm that efficiently solves the VM placement problem for very large problem sizes and shows a significant performance improvement compared to existing general methods that do not take advantage of traffic patterns and data center network characteristics.

...read moreread less

Abstract: The scalability of modern data centers has become a practical concern and has attracted significant attention in recent years. In contrast to existing solutions that require changes in the network architecture and the routing protocols, this paper proposes using traffic-aware virtual machine (VM) placement to improve the network scalability. By optimizing the placement of VMs on host machines, traffic patterns among VMs can be better aligned with the communication distance between them, e.g. VMs with large mutual bandwidth usage are assigned to host machines in close proximity. We formulate the VM placement as an optimization problem and prove its hardness. We design a two-tier approximate algorithm that efficiently solves the VM placement problem for very large problem sizes. Given the significant difference in the traffic patterns seen in current data centers and the structural differences of the recently proposed data center architectures, we further conduct a comparative analysis on the impact of the traffic patterns and the network architectures on the potential performance gain of traffic-aware VM placement. We use traffic traces collected from production data centers to evaluate our proposed VM placement algorithm, and we show a significant performance improvement compared to existing general methods that do not take advantage of traffic patterns and data center network characteristics.

...read moreread less

1,078 citations

Proceedings Article•

HyperFlow: a distributed control plane for OpenFlow

[...]

Amin Tootoonchian¹, Yashar Ganjali¹•Institutions (1)

University of Toronto¹

27 Apr 2010

TL;DR: HyperFlow is logically centralized but physically distributed: it provides scalability while keeping the benefits of network control centralization, and enables interconnecting independently managed OpenFlow networks, an essential feature missing in current OpenFlow deployments.

...read moreread less

Abstract: OpenFlow assumes a logically centralized controller, which ideally can be physically distributed. However, current deployments rely on a single controller which has major drawbacks including lack of scalability. We present HyperFlow, a distributed event-based control plane for OpenFlow. HyperFlow is logically centralized but physically distributed: it provides scalability while keeping the benefits of network control centralization. By passively synchronizing network-wide views of OpenFlow controllers, HyperFlow localizes decision making to individual controllers, thus minimizing the control plane response time to data plane requests. HyperFlow is resilient to network partitioning and component failures. It also enables interconnecting independently managed OpenFlow networks, an essential feature missing in current OpenFlow deployments. We have implemented HyperFlow as an application for NOX. Our implementation requires minimal changes to NOX, and allows reuse of existing NOX applications with minor modifications. Our preliminary evaluation shows that, assuming sufficient control bandwidth, to bound the window of inconsistency among controllers by a factor of the delay between the farthest controllers, the network changes must occur at a rate lower than 1000 events per second across the network.

...read moreread less

974 citations

Proceedings Article•DOI•

S4: Distributed Stream Computing Platform

[...]

Leonardo Neumeyer¹, Bruce Robbins¹, Anish Nair¹, Anand Kesari¹•Institutions (1)

Yahoo!¹

13 Dec 2010

TL;DR: The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers.

...read moreread less

Abstract: S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.

...read moreread less

972 citations

Patent•

Performing data storage operations with a cloud storage environment, including automatically selecting among multiple cloud storage sites

[...]

Anand Prahlad, Marcus S. Muller, Rajiv Kottomtharayil, Srinivas Kavuri, Parag Gokhale - Show less +1 more

31 Mar 2010

TL;DR: In this article, a variety of data storage operations, including content-indexing, containerized deduplication, and policy-driven storage, within a cloud environment are described.

...read moreread less

Abstract: Systems and methods are disclosed for performing data storage operations, including content-indexing, containerized deduplication, and policy-driven storage, within a cloud environment The systems support a variety of clients and cloud storage sites that may connect to the system in a cloud environment that requires data transfer over wide area networks, such as the Internet, which may have appreciable latency and/or packet loss, using various network protocols, including HTTP and FTP Methods are disclosed for content indexing data stored within a cloud environment to facilitate later searching, including collaborative searching Methods are also disclosed for performing containerized deduplication to reduce the strain on a system namespace, effectuate cost savings, etc Methods are disclosed for identifying suitable storage locations, including suitable cloud storage sites, for data files subject to a storage policy Further, systems and methods for providing a cloud gateway and a scalable data object store within a cloud environment are disclosed, along with other features

...read moreread less

Proceedings Article•DOI•

Scalable flow-based networking with DIFANE

[...]

Minlan Yu¹, Jennifer Rexford¹, Michael J. Freedman¹, Jia Wang²•Institutions (2)

Princeton University¹, AT&T Labs²

30 Aug 2010

TL;DR: DIFANE is proposed, a scalable and efficient solution that keeps all traffic in the data plane by selectively directing packets through intermediate switches that store the necessary rules.

...read moreread less

Abstract: Ideally, enterprise administrators could specify fine-grain policies that drive how the underlying switches forward, drop, and measure traffic. However, existing techniques for flow-based networking rely too heavily on centralized controller software that installs rules reactively, based on the first packet of each flow. In this paper, we propose DIFANE, a scalable and efficient solution that keeps all traffic in the data plane by selectively directing packets through intermediate switches that store the necessary rules. DIFANE relegates the controller to the simpler task of partitioning these rules over the switches. DIFANE can be readily implemented with commodity switch hardware, since all data-plane functions can be expressed in terms of wildcard rules that perform simple actions on matching packets. Experiments with our prototype on Click-based OpenFlow switches show that DIFANE scales to larger networks with richer policies.

...read moreread less

Journal Article•DOI•

Dremel: interactive analysis of web-scale datasets

[...]

Sergey Melnik¹, Andrey Gubarev¹, Jing Jing Long¹, Geoffrey M. Romer¹, Shiva Shivakumar¹, Matthew B. Tolton¹, Theodore Vassilakis¹ - Show less +3 more•Institutions (1)

Google¹

01 Sep 2010

TL;DR: The architecture and implementation of Dremel are described, and how it complements MapReduce-based computing is explained, and a novel columnar storage representation for nested records is presented.

...read moreread less

Abstract: Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

...read moreread less

Proceedings Article•DOI•

Hierarchical attribute-based encryption for fine-grained access control in cloud storage services

[...]

Guojun Wang¹, Qin Liu¹, Jie Wu²•Institutions (2)

Central South University¹, Temple University²

04 Oct 2010

TL;DR: This paper proposes a scheme to help enterprises to efficiently share confidential data on cloud servers by first combining the HIBE system and the ciphertext-policy attribute-based encryption (CP-ABE) system, and then making a performance-expressivity tradeoff.

...read moreread less

Abstract: Cloud computing, as an emerging computing paradigm, enables users to remotely store their data into a cloud so as to enjoy scalable services on-demand. Especially for small and medium-sized enterprises with limited budgets, they can achieve cost savings and productivity enhancements by using cloud-based services to manage projects, to make collaborations, and the like. However, allowing cloud service providers (CSPs), which are not in the same trusted domains as enterprise users, to take care of confidential data, may raise potential security and privacy issues. To keep the sensitive user data confidential against untrusted CSPs, a natural way is to apply cryptographic approaches, by disclosing decryption keys only to authorized users. However, when enterprise users outsource confidential data for sharing on cloud servers, the adopted encryption system should not only support fine-grained access control, but also provide high performance, full delegation, and scalability, so as to best serve the needs of accessing data anytime and anywhere, delegating within enterprises, and achieving a dynamic set of users. In this paper, we propose a scheme to help enterprises to efficiently share confidential data on cloud servers. We achieve this goal by first combining the hierarchical identity-based encryption (HIBE) system and the ciphertext-policy attribute-based encryption (CP-ABE) system, and then making a performance-expressivity tradeoff, finally applying proxy re-encryption and lazy re-encryption to our scheme.

...read moreread less

Journal Article•DOI•

Schism: a workload-driven approach to database replication and partitioning

[...]

Carlo Curino¹, Evan P. C. Jones¹, Yang Zhang¹, Samuel Madden¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Sep 2010

TL;DR: Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.

...read moreread less

Abstract: We present Schism, a novel workload-aware approach for database partitioning and replication designed to improve scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions, while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner).The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relations, typical in social network databases, iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E), to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.

...read moreread less

Journal Article•DOI•

The case for RAMClouds: scalable high-performance storage entirely in DRAM

[...]

John Ousterhout¹, Parag Agrawal¹, David Erickson¹, Christos Kozyrakis¹, Jacob Leverich¹, David Mazières¹, Subhasish Mitra¹, Aravind Narayanan¹, Guru Parulkar¹, Mendel Rosenblum¹, Stephen M. Rumble¹, Eric Stratmann¹, Ryan Stutsman¹ - Show less +9 more•Institutions (1)

Stanford University¹

27 Jan 2010-Operating Systems Review

TL;DR: This paper argues for a new approach to datacenter storage called RAMCloud, where information is kept entirely in DRAM and large-scale systems are created by aggregating the main memories of thousands of commodity servers.

...read moreread less

Abstract: Disk-oriented approaches to online storage are becoming increasingly problematic: they do not scale gracefully to meet the needs of large-scale Web applications, and improvements in disk capacity have far outstripped improvements in access latency and bandwidth. This paper argues for a new approach to datacenter storage called RAMCloud, where information is kept entirely in DRAM and large-scale systems are created by aggregating the main memories of thousands of commodity servers. We believe that RAMClouds can provide durable and available storage with 100-1000x the throughput of disk-based systems and 100-1000x lower access latency. The combination of low latency and large scale will enable a new breed of dataintensive applications.

...read moreread less

Proceedings Article•DOI•

Multi-Objective Virtual Machine Placement in Virtualized Data Center Environments

[...]

Jing Xu¹, José A. B. Fortes¹•Institutions (1)

University of Florida¹

18 Dec 2010

TL;DR: A two-level control system to manage the mappings of workloads to VMs and VMs to physical resources and an improved genetic algorithm with fuzzy multi-objective evaluation is proposed for efficiently searching the large solution space and conveniently combining possibly conflicting objectives.

...read moreread less

Abstract: Server consolidation using virtualization technology has become increasingly important for improving data center efficiency It enables one physical server to host multiple independent virtual machines (VMs), and the transparent movement of workloads from one server to another Fine-grained virtual machine resource allocation and reallocation are possible in order to meet the performance targets of applications running on virtual machines On the other hand, these capabilities create demands on system management, especially for large-scale data centers In this paper, a two-level control system is proposed to manage the mappings of workloads to VMs and VMs to physical resources The focus is on the VM placement problem which is posed as a multi-objective optimization problem of simultaneously minimizing total resource wastage, power consumption and thermal dissipation costs An improved genetic algorithm with fuzzy multi-objective evaluation is proposed for efficiently searching the large solution space and conveniently combining possibly conflicting objectives The simulation-based evaluation using power-consumption and thermal-dissipation models based on profiling of a Blade Center, demonstrates the good performance, scalability and robustness of our proposed approach Compared with four well-known bin-packing algorithms and two single-objective approaches, the solutions obtained from our approach seek good balance among the conflicting objectives while others cannot

...read moreread less

Proceedings Article•DOI•

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

[...]

Adam Moody¹, Greg Bronevetsky¹, Kathryn Mohror¹, Bronis R. de Supinski¹•Institutions (1)

Lawrence Livermore National Laboratory¹

13 Nov 2010

TL;DR: The Scalable Checkpoint/Restart (SCR) library is designed, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system that improves efficiency on existing large-scale systems and that this benefit increases as the system size grows.

...read moreread less

Abstract: High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem through multiple types of checkpoints with different costs and different levels of resiliency in a single run. This solution employs lightweight checkpoints to handle the most common failure modes and relies on more expensive checkpoints for less common, but more severe failures. This theoretically promising approach has not been fully evaluated in a large- scale, production system context. We have designed the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the the load on the parallel file system by a factor of two on current and future systems.

...read moreread less

Proceedings Article•DOI•

A Comparative Study into Distributed Load Balancing Algorithms for Cloud Computing

[...]

Martin Randles¹, David Lamb¹, Azzelarabe Taleb-Bendiab¹•Institutions (1)

Liverpool John Moores University¹

20 Apr 2010

TL;DR: This paper investigates three possible distributed solutions proposed for load balancing; approaches inspired by Honeybee Foraging Behaviour, Biased Random Sampling and Active Clustering.

...read moreread less

Abstract: The anticipated uptake of Cloud computing, built on well-established research in Web Services, networks, utility computing, distributed computing and virtualisation, will bring many advantages in cost, flexibility and availability for service users. These benefits are expected to further drive the demand for Cloud services, increasing both the Cloud's customer base and the scale of Cloud installations. This has implications for many technical issues in Service Oriented Architectures and Internet of Services (IoS)-type applications; including fault tolerance, high availability and scalability. Central to these issues is the establishment of effective load balancing techniques. It is clear the scale and complexity of these systems makes centralized assignment of jobs to specific servers infeasible; requiring an effective distributed solution. This paper investigates three possible distributed solutions proposed for load balancing; approaches inspired by Honeybee Foraging Behaviour, Biased Random Sampling and Active Clustering.

...read moreread less

Book•

MongoDB: The Definitive Guide

[...]

Kristina Chodorow, Michael Dirolf

24 Sep 2010

TL;DR: This authoritative introduction to MongoDB will learn the many advantages of using document-oriented databases, and discover why MongoDB is a reliable, high-performance system that allows for almost infinite horizontal scalability.

...read moreread less

Abstract: How does MongoDB help you manage a huMONGOus amount of data collected through your web application? With this authoritative introduction, you'll learn the many advantages of using document-oriented databases, and discover why MongoDB is a reliable, high-performance system that allows for almost infinite horizontal scalability. Written by engineers from 10gen, the company that develops and supports this open source database, MongoDB: The Definitive Guide provides guidance for database developers, advanced configuration for system administrators, and an overview of the concepts and use cases for other people on your project. Learn how easy it is to handle data as self-contained JSON-style documents, rather than as records in a relational database. Explore ways that document-oriented storage will work for your project Learn how MongoDBs schema-free data model handles documents, collections, and multiple databases Execute basic write operations, and create complex queries to find data with any criteria Use indexes, aggregation tools, and other advanced query techniques Learn about monitoring, security and authentication, backup and repair, and more Set up master-slave and automatic failover replication in MongoDB Use sharding to scale MongoDB horizontally, and learn how it impacts applications Get example applications written in Java, PHP, Python, and Ruby

...read moreread less

Journal Article•DOI•

The performance of MapReduce: an in-depth study

[...]

Dawei Jiang¹, Beng Chin Ooi¹, Lei Shi¹, Sai Wu¹•Institutions (1)

National University of Singapore¹

01 Sep 2010

TL;DR: By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.

...read moreread less

Abstract: MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.5. MapReduce can achieve better performance with the allocation of more compute nodes from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users desire an economical elastically scalable data processing system, and therefore, are interested in whether MapReduce can offer both elastic scalability and efficiency.In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop, and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.

...read moreread less

Journal Article•DOI•

Consensus-Based Distributed Support Vector Machines

[...]

Pedro A. Forero¹, Alfonso Cano¹, Georgios B. Giannakis•Institutions (1)

University of Minnesota¹

01 Mar 2010-Journal of Machine Learning Research

TL;DR: Algorithm to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons is developed.

...read moreread less

Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization sub-problems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.

...read moreread less

Proceedings Article•DOI•

An analysis of Linux scalability to many cores

[...]

Silas Boyd-Wickizer¹, Austin T. Clements¹, Yandong Mao¹, Aleksey Pesterev¹, M. Frans Kaashoek¹, Robert Morris¹, Nickolai Zeldovich¹ - Show less +3 more•Institutions (1)

Massachusetts Institute of Technology¹

04 Oct 2010

TL;DR: There is no scalability reason to give up on traditional operating system organizations just yet, according to this analysis of seven system applications running on Linux on a 48- core computer.

...read moreread less

Abstract: This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48- core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques-- this paper introduces one new technique, sloppy counters-- these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.

...read moreread less

Proceedings Article•DOI•

Volley: automated data placement for geo-distributed cloud services

[...]

Sharad Agarwal¹, John Dunagan¹, Navendu Jain¹, Stefan Saroiu¹, Alec Wolman¹, Harbinder Bhogan² - Show less +2 more•Institutions (2)

Microsoft¹, University of Toronto²

28 Apr 2010

TL;DR: Volley is evaluated on the month-long Live Mesh trace, and it is found that, compared to a state-of-the-art heuristic, Volley simultaneously reduces datacenter capacity skew, reduces inter-datacenter traffic by over 1.8× and reduces 75th percentile user-latency by over 30%.

...read moreread less

Abstract: As cloud services grow to span more and more globally distributed datacenters, there is an increasingly urgent need for automated mechanisms to place application data across these datacenters. This placement must deal with business constraints such as WAN bandwidth costs and datacenter capacity limits, while also minimizing user-perceived latency. The task of placement is further complicated by the issues of shared data, data inter-dependencies, application changes and user mobility. We document these challenges by analyzing month-long traces from Microsoft's Live Messenger and Live Mesh, two large-scale commercial cloud services.We present Volley, a system that addresses these challenges. Cloud services make use of Volley by submitting logs of datacenter requests. Volley analyzes the logs using an iterative optimization algorithm based on data access patterns and client locations, and outputs migration recommendations back to the cloud service.To scale to the data volumes of cloud service logs, Volley is designed to work in SCOPE [5], a scalable MapReduce-style platform; this allows Volley to perform over 400 machine-hours worth of computation in less than a day. We evaluate Volley on the month-long Live Mesh trace, and we find that, compared to a state-of-the-art heuristic that places data closest to the primary IP address that accesses it, Volley simultaneously reduces datacenter capacity skew by over 2×, reduces inter-datacenter traffic by over 1.8× and reduces 75th percentile user-latency by over 30%.

...read moreread less

Proceedings Article•DOI•

DAvinCi: A cloud computing framework for service robots

[...]

Rajesh Vellore Arumugam¹, Vikas Reddy Enti¹, Liu Bingbing¹, Wu Xiaojun¹, Krishnamoorthy Baskaran¹, Foong Foo Kong¹, A. Senthil Kumar¹, Kang Dee Meng¹, Goh Wai Kit¹ - Show less +5 more•Institutions (1)

Data Storage Institute¹

03 May 2010

TL;DR: DAvinCi, a software framework that provides the scalability and parallelism advantages of cloud computing for service robots in large environments, is proposed and the possibilities of parallelizing some of the robotics algorithms as Map/Reduce tasks in Hadoop are explored.

...read moreread less

Abstract: We propose DAvinCi, a software framework that provides the scalability and parallelism advantages of cloud computing for service robots in large environments. We have implemented such a system around the Hadoop cluster with ROS (Robotic Operating system) as the messaging framework for our robotic ecosystem. We explore the possibilities of parallelizing some of the robotics algorithms as Map/Reduce tasks in Hadoop. We implemented the FastSLAM algorithm in Map/Reduce and show how significant performance gains in execution times to build a map of a large area can be achieved with even a very small eight-node Hadoop cluster. The global map can later be shared with other robots introduced in the environment via a Software as a Service (SaaS) Model. This reduces the burden of exploration and map building for the new robot and minimizes it's need for additional sensors. Our primary goal is to develop a cloud computing environment which provides a compute cluster built with commodity hardware exposing a suite of robotic algorithms as a SaaS and share data co-operatively across the robotic ecosystem.

...read moreread less

Journal Article•DOI•

DV-CAST: A distributed vehicular broadcast protocol for vehicular ad hoc networks

[...]

Ozan K. Tonguz¹, Nawaporn Wisitpongphan¹, Fan Bai²•Institutions (2)

Carnegie Mellon University¹, General Motors²

01 Apr 2010-IEEE Wireless Communications

TL;DR: This article presents the design and implementation of a new distributed vehicular multihop broadcast protocol, DV-CAST, that can operate in all traffic regimes, including extreme scenarios such as dense and sparse traffic regimes.

...read moreread less

Abstract: The potential of infrastructureless vehicular ad hoc networks for providing safety and nonsafety applications is quite significant. The topology of VANETs in urban, suburban, and rural areas can exhibit fully connected, fully disconnected, or sparsely connected behavior, depending on the time of day or the market penetration rate of wireless communication devices. In this article we focus on highway scenarios, and present the design and implementation of a new distributed vehicular multihop broadcast protocol, that can operate in all traffic regimes, including extreme scenarios such as dense and sparse traffic regimes. DV-CAST is a distributed broadcast protocol that relies only on local topology information for handling broadcast messages in VANETs. It is shown that the performance of the proposed DV-CAST protocol in terms of reliability, efficiency, and scalability is excellent.

...read moreread less

Journal Issue•DOI•

The Scalasca performance toolset architecture

[...]

Markus Geimer¹, Felix Wolf², Brian J. N. Wylie¹, Erika Ábrahám², Daniel Becker², Bernd Mohr¹ - Show less +2 more•Institutions (2)

Forschungszentrum Jülich¹, RWTH Aachen University²

01 Apr 2010-Concurrency and Computation: Practice and Experience

TL;DR: The current toolset architecture is reviewed, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior.

...read moreread less

Abstract: Scalasca is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems with many thousands of processors. It offers an incremental performance-analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very large numbers of processes and to combine these with efficiently summarized local measurements. In this article, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior. The scalability and effectiveness of Scalasca are then surveyed from experience measuring and analyzing real-world applications on a range of computer systems. Copyright © 2010 John Wiley & Sons, Ltd.

...read moreread less

Patent•

Cloud gateway system for managing data storage to cloud storage sites

[...]

Anand Prahlad, Marcus S. Muller, Rajiv Kottomtharayil, Srinivas Kavuri, Parag Gokhale, Manoj Kumar Vijayan - Show less +2 more

31 Mar 2010

TL;DR: In this article, the authors present a set of systems and methods for performing data storage operations, including content indexing, containerized deduplication, and policy-driven storage, within a cloud environment.

...read moreread less

Abstract: Systems and methods are disclosed for performing data storage operations, including content-indexing, containerized deduplication, and policy-driven storage, within a cloud environment. The systems support a variety of clients and cloud storage sites that may connect to the system in a cloud environment that requires data transfer over wide area networks, such as the Internet, which may have appreciable latency and/or packet loss, using various network protocols, including HTTP and FTP. Methods are disclosed for content indexing data stored within a cloud environment to facilitate later searching, including collaborative searching. Methods are also disclosed for performing containerized deduplication to reduce the strain on a system namespace, effectuate cost savings, etc. Methods are disclosed for identifying suitable storage locations, including suitable cloud storage sites, for data files subject to a storage policy. Further, systems and methods for providing a cloud gateway and a scalable data object store within a cloud environment are disclosed, along with other features.

...read moreread less

Proceedings Article•DOI•

User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop

[...]

Zhi-Dan Zhao, Mingsheng Shang

09 Jan 2010

TL;DR: This paper implements user-based CF algorithm on a cloud computing platform, namely Hadoop, to solve the scalability problem of CF.

...read moreread less

Abstract: Collaborative Filtering(CF) algorithms are widely used in a lot of recommender systems, however, the computational complexity of CF is high thus hinder their use in large scale systems. In this paper, we implement user-based CF algorithm on a cloud computing platform, namely Hadoop, to solve the scalability problem of CF. Experimental results show that a simple method that partition users into groups according to two basic principles, i.e., tidy arrangement of mapper number to overcome the initiation of mapper and partition task equally such that all processors finish task at the same time, can achieve linear speedup.

...read moreread less

Proceedings Article•DOI•

Cloud auto-scaling with deadline and budget constraints

[...]

Ming Mao¹, J. Li¹, Marty Humphrey¹•Institutions (1)

University of Virginia¹

01 Oct 2010

TL;DR: This paper presents a cloud auto-scaling mechanism to automatically scale computing instances based on workload information and performance desire, and demonstrates that it can meet user specified performance goal with less cost.

...read moreread less

Abstract: Clouds have become an attractive computing platform which offers on-demand computing power and storage capacity. Its dynamic scalability enables users to quickly scale up and scale down underlying infrastructure in response to business volume, performance desire and other dynamic behaviors. However, challenges arise when considering computing instance non-deterministic acquisition time, multiple VM instance types, unique cloud billing models and user budget constraints. Planning enough computing resources for user desired performance with less cost, which can also automatically adapt to workload changes, is not a trivial problem. In this paper, we present a cloud auto-scaling mechanism to automatically scale computing instances based on workload information and performance desire. Our mechanism schedules VM instance startup and shut-down activities. It enables cloud applications to finish submitted jobs within the deadline by controlling underlying instance numbers and reduces user cost by choosing appropriate instance types. We have implemented our mechanism in Windows Azure platform, and evaluated it using both simulations and a real scientific cloud application. Results show that our cloud auto-scaling mechanism can meet user specified performance goal with less cost.

...read moreread less

Collapse