Showing papers on "Scalability" published in 2007


Proceedings ArticleDOI
C. Ranger, R. Raghuraman, A. Penmetsa, Gary Bradski, Christos Kozyrakis
10 Feb 2007
TL;DR: It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
Abstract: This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multi-core and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
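
Phoenix itself exposes a C API, which the abstract does not reproduce; as a minimal sketch of the MapReduce programming model it implements (the function names and the process pool below are illustrative stand-ins for the Phoenix runtime):

```python
# Word count in the MapReduce style: the runtime parallelizes map tasks,
# groups intermediate pairs by key, and applies reduce per key.
from collections import defaultdict
from multiprocessing import Pool

def map_fn(chunk):
    """Emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in chunk.split()]

def reduce_fn(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

def mapreduce(chunks, workers=4):
    with Pool(workers) as pool:          # stand-in for the managed thread pool
        emitted = pool.map(map_fn, chunks)
    groups = defaultdict(list)           # shuffle: group pairs by key
    for pairs in emitted:
        for k, v in pairs:
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

if __name__ == "__main__":
    print(mapreduce(["a rose is a rose", "is a rose"]))
```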

1,058 citations


Proceedings ArticleDOI
11 Jun 2007
TL;DR: A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
Abstract: Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins. We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
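
The merge interface itself is not given in the abstract; under simple assumptions (unique, key-sorted outputs from two map/reduce pipelines), the merge step can be sketched as a plain sort-merge equi-join:

```python
# Sketch of a Merge phase: equi-join two reduce outputs that are already
# sorted by key. Assumes unique keys per side; the real model also
# handles hashed partitions and other join algorithms.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:                          # keys match: emit the joined tuple
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

emp  = [(1, "alice"), (2, "bob")]      # reduce output of pipeline 1
dept = [(1, "sales"), (2, "eng")]      # reduce output of pipeline 2
print(merge_join(emp, dept))           # [(1, 'alice', 'sales'), (2, 'bob', 'eng')]
```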

821 citations


Journal ArticleDOI
Y. Hoskote, Sriram R. Vangal, A. Singh, Nitin Borkar, S. Borkar
TL;DR: A multicore processor in 65-nm technology with 80 single-precision, floating-point cores delivers performance in excess of a teraflops while consuming less than 100 W.
Abstract: A multicore processor in 65-nm technology with 80 single-precision, floating-point cores delivers performance in excess of a teraflops while consuming less than 100 W. A 2D on-die mesh interconnection network operating at 5 GHz provides the high-performance communication fabric to connect the cores. The network delivers a bisection bandwidth of 2.56 terabits per second and a per-hop fall-through latency of 1 nanosecond.

658 citations


Journal ArticleDOI
TL;DR: This paper discusses an advanced approach for a 3DTV service, which is based on the concept of video-plus-depth data representations, and provides a modular and flexible system architecture supporting a wide range of multi-view structures.
Abstract: Due to enormous progress in the areas of auto-stereoscopic 3D displays, digital video broadcast and computer vision algorithms, 3D television (3DTV) has reached a high technical maturity and many people now believe in its readiness for marketing. Experimental prototypes of entire 3DTV processing chains have been demonstrated successfully during the last few years, and the Moving Picture Experts Group (MPEG) of ISO/IEC has launched related ad hoc groups and standardization efforts envisaging the emerging market segment of 3DTV. In this context, the paper discusses an advanced approach for a 3DTV service, which is based on the concept of video-plus-depth data representations. It particularly considers aspects of interoperability and multi-view adaptation for the case that different multi-baseline geometries are used for multi-view capturing and 3D display. Furthermore, it presents algorithmic solutions for the creation of depth maps and depth image-based rendering related to this framework of multi-view adaptation. In contrast to other proposals, which are more focused on specialized configurations, the underlying approach provides a modular and flexible system architecture supporting a wide range of multi-view structures.

434 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper proposes express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks.
Abstract: Due to wire delay scalability and bandwidth limitations inherent in shared buses and dedicated links, packet-switched on-chip interconnection networks are fast emerging as the pervasive communication fabric to connect different processing elements in many-core chips. However, current state-of-the-art packet-switched networks rely on complex routers, which increase the communication overhead and energy consumption as compared to the ideal interconnection fabric. In this paper, we try to close the gap between the state-of-the-art packet-switched network and the ideal interconnect by proposing express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks. Our evaluation results using a detailed cycle-accurate simulator on a range of synthetic traffic and SPLASH benchmark traces show up to 84% reduction in packet latency and up to 23% improvement in throughput while reducing the average router energy consumption by up to 38% over an existing state-of-the-art packet-switched design. When compared to the ideal interconnect, EVCs add just two cycles to the no-load latency, and are within 14% of the ideal throughput. Moreover, we show that the proposed design incurs a minimal hardware overhead while exhibiting excellent scalability with increasing network sizes.
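
As a back-of-the-envelope illustration of why bypassing intermediate routers helps (the cycle counts below are assumptions for the sketch, not figures from the paper):

```python
# Rough no-load latency model for a packet crossing H routers.
# The cycle counts are illustrative assumptions, not the paper's numbers.
ROUTER_CYCLES = 3    # assumed pipeline depth of a conventional router
LINK_CYCLES = 1      # assumed wire traversal time per hop

def baseline(hops):
    return hops * (ROUTER_CYCLES + LINK_CYCLES)

def with_evcs(hops):
    # only the first and last routers are fully traversed; express
    # virtual channels let the packet bypass the ones in between
    return 2 * (ROUTER_CYCLES + LINK_CYCLES) + max(hops - 2, 0) * LINK_CYCLES

for h in (4, 8, 12):
    print(f"{h} hops: {baseline(h)} cycles baseline, {with_evcs(h)} with EVCs")
```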

388 citations


Proceedings ArticleDOI
10 Nov 2007
TL;DR: Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system, and large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to 90% reduction in end-to-end run time.
Abstract: To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. We describe the Falkon architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that Falkon throughput (487 tasks/sec) and scalability (to 54,000 executors and 2,000,000 tasks processed in just 112 minutes) are one to two orders of magnitude better than other systems used in production Grids. Large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to 90% reduction in end-to-end run time, relative to versions that execute tasks via separate scheduler submissions.
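
Falkon's own code is not shown here; the sketch below only illustrates the architectural split the abstract describes, with resource acquisition (starting executors) decoupled from a lightweight dispatch loop. All names are hypothetical:

```python
# Illustration of multi-level scheduling: executors are acquired once
# (threads standing in for nodes obtained from a batch scheduler), then
# many short tasks flow through a lightweight queue.
import queue
import threading

tasks = queue.Queue()

def executor():
    """Stands in for an executor acquired once from a batch scheduler."""
    while True:
        task = tasks.get()
        if task is None:            # shutdown signal: release the resource
            return
        task()                      # dispatch is just a queue pop; no
        tasks.task_done()           # per-task batch-scheduler request

workers = [threading.Thread(target=executor) for _ in range(4)]
for w in workers:                   # slow step, paid once
    w.start()
for n in range(1000):               # fast step, paid per task
    tasks.put(lambda n=n: None)
tasks.join()
for _ in workers:
    tasks.put(None)
```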

350 citations


Proceedings ArticleDOI
08 May 2007
TL;DR: This paper proposes a new algorithm that can produce within 30 seconds high-quality solutions for hard placement problems with thousands of machines and thousands of applications, and has been implemented and adopted in a leading commercial middleware product for managing the performance of Web applications.
Abstract: Given a set of machines and a set of Web applications with dynamically changing demands, an online application placement controller decides how many instances to run for each application and where to put them, while observing all kinds of resource constraints. This NP hard problem has real usage in commercial middleware products. Existing approximation algorithms for this problem can scale to at most a few hundred machines, and may produce placement solutions that are far from optimal when system resources are tight. In this paper, we propose a new algorithm that can produce within 30 seconds high-quality solutions for hard placement problems with thousands of machines and thousands of applications. This scalability is crucial for dynamic resource provisioning in large-scale enterprise data centers. Our algorithm allows multiple applications to share a single machine, and strives to maximize the total satisfied application demand, to minimize the number of application starts and stops, and to balance the load across machines. Compared with existing state-of-the-art algorithms, for systems with 100 machines or less, our algorithm is up to 134 times faster, reduces application starts and stops by up to 97%, and produces placement solutions that satisfy up to 25% more application demands. Our algorithm has been implemented and adopted in a leading commercial middleware product for managing the performance of Web applications.
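
The paper's algorithm is not detailed in the abstract; as a naive baseline for the same problem, a greedy heuristic might assign each application's demand to the machines with the most spare capacity (ignoring starts/stops and load balance, which the real controller also optimizes):

```python
# Naive greedy baseline for online application placement; this is an
# illustration of the problem, not the paper's algorithm.
def greedy_place(machine_cap, app_demand):
    spare = dict(machine_cap)
    placement = {}                      # (app, machine) -> allocated demand
    for app, need in sorted(app_demand.items(), key=lambda kv: -kv[1]):
        for m in sorted(spare, key=spare.get, reverse=True):
            if need == 0:
                break
            give = min(need, spare[m])
            if give > 0:
                placement[(app, m)] = give
                spare[m] -= give
                need -= give
    return placement

print(greedy_place({"m1": 10, "m2": 6}, {"web": 9, "api": 5}))
# {('web', 'm1'): 9, ('api', 'm2'): 5}
```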

345 citations


Proceedings ArticleDOI
14 Oct 2007
TL;DR: At the core of Sinfonia is a novel minitransaction primitive that enables efficient and consistent access to data, while hiding the complexities that arise from concurrency and failures.
Abstract: We propose a new paradigm for building scalable distributed systems. Our approach does not require dealing with message-passing protocols -- a major complication in existing distributed systems. Instead, developers just design and manipulate data structures within our service called Sinfonia. Sinfonia keeps data for applications on a set of memory nodes, each exporting a linear address space. At the core of Sinfonia is a novel minitransaction primitive that enables efficient and consistent access to data, while hiding the complexities that arise from concurrency and failures. Using Sinfonia, we implemented two very different and complex applications in a few months: a cluster file system and a group communication service. Our implementations perform well and scale to hundreds of machines.
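
A toy, single-node model of the minitransaction idea (per the paper's design, a minitransaction carries compare, read, and write items applied atomically; real minitransactions span multiple memory nodes and commit with a two-phase protocol, which this sketch replaces with a single lock):

```python
# Toy model of a Sinfonia-style minitransaction: if every compare item
# matches, the reads are returned and the writes applied, atomically.
import threading

class MemoryNode:
    def __init__(self):
        self.mem = {}
        self.lock = threading.Lock()     # stand-in for two-phase commit

    def minitransaction(self, compare=(), read=(), write=()):
        with self.lock:                  # atomicity within this one node
            for addr, expected in compare:
                if self.mem.get(addr) != expected:
                    return False, {}     # abort: a compare item mismatched
            result = {addr: self.mem.get(addr) for addr in read}
            for addr, value in write:
                self.mem[addr] = value
            return True, result

node = MemoryNode()
node.minitransaction(write=[("x", 1)])
print(node.minitransaction(compare=[("x", 1)], write=[("x", 2)]))  # (True, {})
print(node.minitransaction(compare=[("x", 1)], read=["x"]))        # (False, {})
```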

335 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: This paper presents a distributed, multi-agent, hybrid system for which it is shown that, under certain secondary objectives on the agents and the assumption that the initial network is connected, the resulting motion always satisfies connectivity of the network.
Abstract: Control of mobile networks raises fundamental and novel problems in controlling the structure of the resulting dynamic graphs. In particular, in applications involving mobile sensor networks and multi-agent systems, a great new challenge is the development of distributed motion algorithms that guarantee connectivity of the overall network. In this paper, we address this challenge using a novel control decomposition. First, motion control is performed in the continuous state space, where nearest neighbor potential fields are used to maintain existing links in the network. Second, distributed coordination protocols in the discrete graph space ensure connectivity of the switching network topology. Coordination is based on locally updated estimates of the abstract network topology by every agent as well as distributed auctions that enable tie breaking whenever simultaneous link deletions may violate connectivity. Integration of the overall system results in a distributed, multi-agent, hybrid system for which we show that, under certain secondary objectives on the agents and the assumption that the initial network is connected, the resulting motion always satisfies connectivity of the network. Our approach can also account for communication time delays in the network as well as collision avoidance, while its efficiency and scalability properties are illustrated in nontrivial computer simulations.
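
The abstract leaves the potential fields implicit; one common form of link-maintenance potential grows without bound as a maintained link approaches the communication radius R, for instance (an illustration of the technique, not necessarily the paper's exact choice):

$$V_{ij} = \frac{1}{R^2 - \lVert x_i - x_j \rVert^2}, \qquad \lVert x_i - x_j \rVert < R,$$

so that the gradient control law $\dot{x}_i = -\sum_{j \in N_i} \nabla_{x_i} V_{ij}$ pulls neighbors back together before any existing link can stretch out of range.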

291 citations


Proceedings ArticleDOI
11 Jun 2007
TL;DR: This work applies a regression-based approximation of the CPU demand of client transactions on given hardware to an analytic model of a simple network of queues, each queue representing a tier, and shows the approximation's effectiveness for modeling diverse workloads with a changing transaction mix over time.
Abstract: The multi-tier implementation has become the industry standard for developing scalable client-server enterprise applications. Since these applications are performance sensitive, effective models for dynamic resource provisioning and for delivering quality of service to these applications become critical. Workloads in such environments are characterized by client sessions of interdependent requests with changing transaction mix and load over time, making model adaptivity to the observed workload changes a critical requirement for model effectiveness. In this work, we apply a regression-based approximation of the CPU demand of client transactions on given hardware. Then we use this approximation in an analytic model of a simple network of queues, each queue representing a tier, and show the approximation's effectiveness for modeling diverse workloads with a changing transaction mix over time. Using the TPC-W benchmark and its three different transaction mixes, we investigate factors that impact the efficiency and accuracy of the proposed performance prediction models. Experimental results show that this regression-based approach provides a simple and powerful solution for efficient capacity planning and resource provisioning of multi-tier applications under changing workload conditions.
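
A minimal sketch of the regression step (the numbers are invented, not the paper's data): each monitoring window records per-type transaction counts and the total CPU consumed, and least squares recovers the per-type CPU cost that then parameterizes the queueing model:

```python
# Regression-based CPU demand estimation: solve N @ c ~= U for the
# per-transaction-type CPU cost c. All numbers are made up.
import numpy as np

# rows: monitoring windows; cols: transaction types (browse, buy, search)
N = np.array([[100.0, 10.0, 30.0],
              [ 80.0, 25.0, 10.0],
              [120.0,  5.0, 50.0],
              [ 60.0, 40.0, 20.0]])
U = np.array([3.1, 3.9, 3.6, 4.7])    # CPU seconds observed per window

c, *_ = np.linalg.lstsq(N, U, rcond=None)
print("estimated CPU cost per transaction type:", c)
print("predicted CPU demand for a new mix:", np.array([90, 20, 25]) @ c)
```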

289 citations


Proceedings ArticleDOI
11 Jun 2007
TL;DR: This work proposes and motivates JouleSort, an external sort benchmark, for evaluating the energy efficiency of a wide range of computer systems from clusters to handhelds, and demonstrates a JouleSort system that is over 3.5x as energy-efficient as last year's estimated winner.
Abstract: The energy efficiency of computer systems is an important concern in a variety of contexts. In data centers, reducing energy use improves operating cost, scalability, reliability, and other factors. For mobile devices, energy consumption directly affects functionality and usability. We propose and motivate JouleSort, an external sort benchmark, for evaluating the energy efficiency of a wide range of computer systems from clusters to handhelds. We list the criteria, challenges, and pitfalls from our experience in creating a fair energy-efficiency benchmark. Using a commercial sort, we demonstrate a JouleSort system that is over 3.5x as energy-efficient as last year's estimated winner. This system is quite different from those currently used in data centers. It consists of a commodity mobile CPU and 13 laptop drives connected by server-style I/O interfaces.
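
The benchmark's figure of merit is records sorted per joule, which is simple arithmetic (the numbers below are invented purely for illustration):

```python
# JouleSort's figure of merit: records sorted per joule of energy used.
def records_per_joule(records, watts, seconds):
    return records / (watts * seconds)

# e.g. a 60 W system sorting 1e8 records in 1000 s: ~1667 records/J
print(records_per_joule(1e8, watts=60.0, seconds=1000.0))
```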

Journal ArticleDOI
TL;DR: This work proposes to mitigate device shortcomings and exploit their dynamical character by building self-organizing, self-healing networks that implement massively parallel computations, useful for complex pattern recognition problems.
Abstract: Nanodevices have terrible properties for building Boolean logic systems: high defect rates, high variability, high death rates, drift, and (for the most part) only two terminals. Economical assembly requires that they be dynamical. We argue that strategies aimed at mitigating these limitations, such as defect avoidance/reconfiguration, or applying coding theory to circuit design, present severe scalability and reliability challenges. We instead propose to mitigate device shortcomings and exploit their dynamical character by building self-organizing, self-healing networks that implement massively parallel computations. The key idea is to exploit memristive nanodevice behavior to cheaply implement adaptive, recurrent networks, useful for complex pattern recognition problems. Pulse-based communication allows the designer to make trade-offs between power consumption and processing speed. Self-organization sidesteps the scalability issues of characterization, compilation and configuration. Network dynamics supplies a graceful response to device death. We present simulation results of such a network—a self-organized spatial filter array—that demonstrate its performance as a function of defects and device variation.

Journal ArticleDOI
TL;DR: Key extensions to the coherence protocol enable POWER6 microprocessor-based systems to achieve better SMP scalability while enabling reductions in system packaging complexity and cost.
Abstract: This paper describes the implementation of the IBM POWER6™ microprocessor, a two-way simultaneous multithreaded (SMT) dual-core chip whose key features include binary compatibility with IBM POWER5™ microprocessor-based systems; increased functional capabilities, such as decimal floating-point and vector multimedia extensions; significant reliability, availability, and serviceability enhancements; and robust scalability with up to 64 physical processors. Based on a new industry-leading high-frequency core architecture with enhanced SMT and driven by a high-throughput symmetric multiprocessing (SMP) cache and memory subsystem, the POWER6 chip achieves a significant performance boost compared with its predecessor, the POWER5 chip. Key extensions to the coherence protocol enable POWER6 microprocessor-based systems to achieve better SMP scalability while enabling reductions in system packaging complexity and cost.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: This paper proposes techniques to generate annotated, Internet router graphs of different sizes based on existing observations of Internet characteristics and finds that the generated graphs match a variety of graph properties of observed topologies for a range of target graph sizes.
Abstract: Researchers involved in designing network services and protocols rely on results from simulation and emulation environments to understand their application performance and scalability. To better understand the behavior of these applications and predict their performance when deployed on the actual Internet, the generated topologies must closely match real network characteristics, not just in terms of graph structure (node interconnectivity) but also with respect to various node and link annotations. Relevant annotations include link latencies, AS membership and whether a router is a peering or internal router. Finally, it should be possible to rescale a given topology to a variety of sizes while still maintaining its essential characteristics. In this paper, we propose techniques to generate annotated, Internet router graphs of different sizes based on existing observations of Internet characteristics. We find that our generated graphs match a variety of graph properties of observed topologies for a range of target graph sizes. While the best available data of Internet topology currently remains imperfect, the quality of our generated topologies will improve with the fidelity of available measurement techniques or next generation architectures that make Internet structure more transparent.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: The design and implementation of distributed rate limiters are presented, which work together to enforce a global rate limit across traffic aggregates at multiple sites, enabling the coordinated policing of a cloud-based service's network traffic.
Abstract: Today's cloud-based services integrate globally distributed resources into seamless computing platforms. Provisioning and accounting for the resource usage of these Internet-scale applications presents a challenging technical problem. This paper presents the design and implementation of distributed rate limiters, which work together to enforce a global rate limit across traffic aggregates at multiple sites, enabling the coordinated policing of a cloud-based service's network traffic. Our abstraction not only enforces a global limit, but also ensures that congestion-responsive transport-layer flows behave as if they traversed a single, shared limiter. We present two designs - one general purpose, and one optimized for TCP - that allow service operators to explicitly trade off between communication costs and system accuracy, efficiency, and scalability. Both designs are capable of rate limiting thousands of flows with negligible overhead (less than 3% in the tested configuration). We demonstrate that our TCP-centric design is scalable to hundreds of nodes while robust to both loss and communication delay, making it practical for deployment in nationwide service providers.
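
Neither design is detailed in the abstract; the general idea can be sketched as limiters exchanging demand estimates and each enforcing a share of the global limit proportional to its local demand (the exchange is assumed perfect here, which the real designs cannot assume):

```python
# Sketch of a demand-proportional split of a global rate limit across
# sites. Real designs, including the paper's, must tolerate loss and
# delay in the exchange of demand estimates; this toy does not.
GLOBAL_LIMIT = 10_000   # packets/sec enforced across all sites combined

def local_limits(demands, global_limit=GLOBAL_LIMIT):
    """demands: {site: recently observed local arrival rate}."""
    total = sum(demands.values())
    if total <= global_limit:
        return dict(demands)            # no site needs throttling
    return {site: global_limit * d / total for site, d in demands.items()}

print(local_limits({"us": 9000, "eu": 4000, "asia": 2000}))
# each site then enforces its share with an ordinary local token bucket
```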

Book
01 Feb 2007
TL;DR: FastSLAM as discussed by the authors is a family of algorithms for the simultaneous localization and mapping (SLAM) problem in robotics, which has been successfully applied in different dynamic environments, including a solution to the problem of people tracking.
Abstract: This monograph describes a new family of algorithms for the simultaneous localization and mapping (SLAM) problem in robotics, called FastSLAM. The FastSLAM-type algorithms have enabled robots to acquire maps of unprecedented size and accuracy, in a number of robot application domains and have been successfully applied in different dynamic environments, including a solution to the problem of people tracking.

01 Jan 2007
TL;DR: This study describes meta-learning and presents the JAM system (Java Agents for Meta-learning), an agent-based meta-learning system for large-scale data mining applications and identifies and addresses several important desiderata for distributed data mining systems that stem from their additional complexity compared to centralized or host-based systems.
Abstract: Data mining systems aim to discover patterns and extract useful information from facts recorded in databases. A widely adopted approach to this objective is to apply various machine learning algorithms to compute descriptive models of the available data. Here, we explore one of the main challenges in this research area, the development of techniques that scale up to large and possibly physically distributed databases. Meta-learning is a technique that seeks to compute higher-level classifiers (or classification models), called meta-classifiers, that integrate in some principled fashion multiple classifiers computed separately over different databases. This study describes meta-learning and presents the JAM system (Java Agents for Meta-learning), an agent-based meta-learning system for large-scale data mining applications. Specifically, it identifies and addresses several important desiderata for distributed data mining systems that stem from their additional complexity compared to centralized or host-based systems. Distributed systems may need to deal with heterogeneous platforms, with multiple databases and (possibly) different schemas, with the design and implementation of scalable and effective protocols for communicating among the data sites, and the selective and efficient use of the information that is gathered from other peer data sites. Other important problems, intrinsic within data mining systems, that must not be ignored include, first, the ability to take advantage of newly acquired information that was not previously available when models were computed and to combine it with existing models, and second, the flexibility to incorporate new machine learning methods and data mining technologies. We explore these issues within the context of JAM and evaluate various proposed solutions through extensive empirical studies.
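
A minimal, stacking-style sketch of the meta-learning idea (JAM's agents, protocols, and combining strategies are not modeled; scikit-learn classifiers stand in for the base and meta learners):

```python
# Meta-learning sketch: base classifiers trained on disjoint "sites"
# are combined by a meta-classifier trained on their predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
site_a, site_b = np.arange(0, 200), np.arange(200, 400)   # two "databases"
meta_idx, test_idx = np.arange(400, 500), np.arange(500, 600)

# learn a base classifier locally at each site
bases = [DecisionTreeClassifier(random_state=0).fit(X[i], y[i])
         for i in (site_a, site_b)]

def meta_features(idx):
    # each base classifier's prediction becomes one meta-level feature
    return np.column_stack([b.predict(X[idx]) for b in bases])

meta = LogisticRegression().fit(meta_features(meta_idx), y[meta_idx])
print("meta-classifier accuracy:",
      meta.score(meta_features(test_idx), y[test_idx]))
```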

Journal ArticleDOI
TL;DR: This paper uses a set of real traces and attempts to develop some theoretical basis to demonstrate that a random peer partnership selection with a hybrid pull-push scheme has the potential to scale.
Abstract: Peer-to-peer (P2P) technology has found much success in applications like file distribution and VoIP; yet its adoption in live video streaming remains an elusive goal. Our recent success with the Coolstreaming system brings promise in this direction; however, it also reveals that there exist many practical engineering problems in real live streaming systems over the Internet. Our focus in this paper is on a nonoptimal real working system, in which we illustrate a set of existing practical problems and how they could be handled. We believe this is essential in providing the basic understanding of P2P streaming systems. This paper uses a set of real traces and attempts to develop some theoretical basis to demonstrate that a random peer partnership selection with a hybrid pull-push scheme has the potential to scale. Specifically, first, we describe the fundamental system design tradeoffs and key changes in the design of the Coolstreaming system, including substreaming, buffer management, scheduling, and the adoption of a hybrid pull-push mechanism over the original pull-based content delivery approach; second, we examine the overlay topology and its convergence; third, using a combination of real traces and analysis, we quantitatively provide insights on how the buffering technique resolves the problems associated with dynamics and heterogeneity; fourth, we show how substream and path diversity can help to alleviate the impact from congestion and churns; fifth, we discuss the system scalability and limitations.

Book ChapterDOI
17 Sep 2007
TL;DR: A new metric that measures the informativeness of objects to be classified, which can be applied as a query-based distance metric to measure the closeness between objects, and two novel KNN procedures are proposed.
Abstract: The K-nearest neighbor (KNN) decision rule has been a ubiquitous classification tool with good scalability. Past experience has shown that the optimal choice of K depends upon the data, making it laborious to tune the parameter for different applications. We introduce a new metric that measures the informativeness of objects to be classified. When applied as a query-based distance metric to measure the closeness between objects, two novel KNN procedures, Locally Informative-KNN (LI-KNN) and Globally Informative-KNN (GI-KNN), are proposed. By selecting a subset of most informative objects from neighborhoods, our methods exhibit stability to the change of input parameters, the number of neighbors (K) and informative points (I). Experiments on UCI benchmark data and diverse real-world data sets indicate that our approaches are application-independent and can generally outperform several popular KNN extensions, as well as SVM and Boosting methods.
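
The informativeness metric itself is not given in the abstract; the sketch below keeps the KNN skeleton and marks where a query-dependent weight would replace the plain majority vote. The inverse-distance weight is a stand-in, not LI-KNN or GI-KNN:

```python
# KNN skeleton with a pluggable, query-dependent weight. The
# inverse-distance weight is a placeholder for the paper's
# informativeness metric, which the abstract does not specify.
from collections import defaultdict
import numpy as np

def knn_predict(X, y, query, k=5, weight=lambda d: 1.0 / (d + 1e-9)):
    d = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y[i]] += weight(d[i])     # more informative points count more
    return max(votes, key=votes.get)

X = np.array([[0.0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))   # -> 1
```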

Proceedings ArticleDOI
14 Mar 2007
TL;DR: New language constructs to support open nesting in Java are described, and it is demonstrated how these constructs can be mapped efficiently to existing STM data structures, demonstrating how open nesting can enhance application scalability.
Abstract: Transactional memory (TM) promises to simplify concurrent programming while providing scalability competitive to fine-grained locking. Language-based constructs allow programmers to denote atomic regions declaratively and to rely on the underlying system to provide transactional guarantees along with concurrency. In contrast with fine-grained locking, TM allows programmers to write simpler programs that are composable and deadlock-free. TM implementations operate by tracking loads and stores to memory and by detecting concurrent conflicting accesses by different transactions. By automating this process, they greatly reduce the programmer's burden, but they also are forced to be conservative. In certain cases, conflicting memory accesses may not actually violate the higher-level semantics of a program, and a programmer may wish to allow seemingly conflicting transactions to execute concurrently. Open nested transactions enable expert programmers to differentiate between physical conflicts, at the level of memory, and logical conflicts that actually violate application semantics. A TM system with open nesting can permit physical conflicts that are not logical conflicts, and thus increase concurrency among application threads. Here we present an implementation of open nested transactions in a Java-based software transactional memory (STM) system. We describe new language constructs to support open nesting in Java, and we discuss new abstract locking mechanisms that a programmer can use to prevent logical conflicts. We demonstrate how these constructs can be mapped efficiently to existing STM data structures. Finally, we evaluate our system on a set of Java applications and data structures, demonstrating how open nesting can enhance application scalability.

Journal ArticleDOI
TL;DR: This paper aims to serve as a review of the most promising Grid systems that use P2P techniques to facilitate resource discovery in order to perform a qualitative comparison of the existing approaches and to draw conclusions about their advantages and weaknesses.

Proceedings ArticleDOI
05 Nov 2007
TL;DR: A simple stochastic model is described that can be used to compare different data-driven downloading strategies based on two performance metrics: continuity (probability of continuous playback), and startup latency (expected time to start playback).
Abstract: P2P streaming tries to achieve scalability (like P2P file distribution) and at the same time meet real-time playback requirements. It is a challenging problem still not well understood. In this paper, we describe a simple stochastic model that can be used to compare different data-driven downloading strategies based on two performance metrics: continuity (probability of continuous playback), and startup latency (expected time to start playback). We first study two simple strategies: rarest first and greedy. The former is a well-known strategy for P2P file sharing that gives good scalability, whereas the latter is an intuitively reasonable strategy to optimize continuity and startup latency from a single peer's viewpoint. Greedy, while achieving low startup latency, fares poorly in continuity by failing to maximize P2P sharing; whereas rarest first is the opposite. This highlights the trade-off between startup latency and continuity, and how system scalability improves continuity. Based on this insight, we propose a mixed strategy that can be used to achieve the best of both worlds. Our algorithm dynamically adapts to the peer population size to ensure scalability; at the same time, it reserves part of a peer's effort for the immediate playback requirements to ensure low startup latency.
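
The strategies are described qualitatively above; a toy sketch of the per-request choice has a peer spend a fraction of its requests on the urgent window near the playback point (greedy) and the rest on rare pieces (rarest first). The fixed mixing fraction here is an assumption; the paper's algorithm adapts the split to the peer population:

```python
# Toy piece selection mixing greedy (urgent window near the playback
# point) with rarest first (global sharing). The fixed mix ratio is an
# assumption, not the paper's adaptive policy.
import random

def pick_piece(missing, rarity, playhead, urgent_window=5, greedy_frac=0.4):
    """missing: piece ids this peer lacks; rarity[p]: copies nearby."""
    urgent = [p for p in missing if playhead <= p < playhead + urgent_window]
    if urgent and random.random() < greedy_frac:
        return min(urgent)                       # greedy: nearest deadline
    return min(missing, key=lambda p: rarity[p]) # rarest first

missing = [3, 4, 9, 17]
rarity = {3: 6, 4: 5, 9: 1, 17: 2}
print(pick_piece(missing, rarity, playhead=3))   # piece 3 or (rare) piece 9
```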

Patent
18 Aug 2007
TL;DR: A method, system, and apparatus for identifying, describing, integrating, and discovering information events with the unique feature of applicability to and extensibility across event states irrespective of function and or embodiment in one system with one approach, one infrastructure, one architecture, one method, and one principled basis, comprising: a self-mint method for self-service identity; an information architecture; a method to organize everything; a scalable business process for integrating data from different tables and/or from different systems into one combined system; a programming process and language, operating system architecture, and
Abstract: A method, system, and apparatus for identifying, describing, integrating, and discovering information events with the unique feature of applicability to and extensibility across event states irrespective of function and or embodiment in one system with one approach, one infrastructure, one architecture, one method, and one principled basis, comprising: a self-mint method for self-service identity; an information architecture; a method to organize everything; a scalable business process for integrating data from different tables and/or from different systems into one combined system; a programming process and language, operating system architecture, and modeling medium; and a search engine and directory to make it all accessible; altogether comprising an infrastructure for a network system.

Proceedings ArticleDOI
11 Nov 2007
TL;DR: The design and implementation of RADOS is presented, a reliable object storage service that can scale to many thousands of devices by leveraging the intelligence present in individual storage nodes, allowing them to act semi-autonomously to self-manage replication, failure detection, and failure recovery through the use of a small cluster map.
Abstract: Brick and object-based storage architectures have emerged as a means of improving the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, despite their ability to exhibit significant intelligence and autonomy. We present the design and implementation of RADOS, a reliable object storage service that can scale to many thousands of devices by leveraging the intelligence present in individual storage nodes. RADOS preserves consistent data access and strong safety semantics while allowing nodes to act semi-autonomously to self-manage replication, failure detection, and failure recovery through the use of a small cluster map. Our implementation offers excellent performance, reliability, and scalability while providing clients with the illusion of a single logical object store.
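
The abstract mentions only a small cluster map; one way to see how such a map enables decentralized placement is rendezvous hashing, where any party holding the same map computes the same replica set with no central directory (a simplification; RADOS's actual placement function, CRUSH, is considerably more sophisticated):

```python
# Stand-in for map-driven, decentralized placement: rendezvous hashing
# over the node list in the cluster map. RADOS's real function (CRUSH)
# also models failure domains, device weights, and map epochs.
import hashlib

def place(obj_id, cluster_map, replicas=3):
    score = lambda n: hashlib.sha1(f"{obj_id}:{n}".encode()).hexdigest()
    return sorted(cluster_map["nodes"], key=score, reverse=True)[:replicas]

cmap = {"epoch": 42, "nodes": [f"osd{i}" for i in range(8)]}
print(place("volume1/block7", cmap))    # identical answer on every node
```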

Proceedings ArticleDOI
14 May 2007
TL;DR: From the message distribution experiments, it is found that on average about 50% of messages are transferred through intra-node communication, which is much higher than intuition, and indicates that optimizing intra-node communication is as important as optimizing inter-node communication in a multi-core cluster.
Abstract: Multi-core processors are growing as a new industry trend as single-core processors rapidly reach the physical limits of possible complexity and speed. In the new Top500 supercomputer list, more than 20% of processors belong to the multi-core processor family. However, without an in-depth study of application behaviors and trends on multi-core clusters, we might not be able to understand the characteristics of multi-core clusters in a comprehensive manner and hence not be able to get optimal performance. In this paper, we take on these challenges and design a set of experiments to study the impact of multi-core architecture on cluster computing. We choose to use one of the most advanced multi-core servers, the Intel Bensley system with Woodcrest processors, as our evaluation platform, and use benchmarks including HPL, NAMD, and NAS as the applications to study. From our message distribution experiments, we find that on average about 50% of messages are transferred through intra-node communication, which is much higher than intuition suggests. This trend indicates that optimizing intra-node communication is as important as optimizing inter-node communication in a multi-core cluster. We also observe that cache and memory contention may be a potential bottleneck in multi-core clusters, and communication middleware and applications should be multi-core aware to alleviate this problem. We demonstrate that a multi-core-aware algorithm, e.g., data tiling, improves benchmark execution time by up to 70%. We also compare the scalability of a multi-core cluster with that of a single-core cluster and find that the scalability of the multi-core cluster is promising.
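
The paper's tiling transformation is not reproduced in the abstract; a generic illustration of data tiling is blocked matrix multiplication, which works on cache-sized tiles instead of streaming across whole rows:

```python
# Generic data-tiling illustration: blocked matrix multiply keeps one
# small tile of each operand hot in cache, reducing the memory and
# cache contention that cores sharing a die would otherwise suffer.
import numpy as np

def tiled_matmul(A, B, tile=32):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # one tile-sized block of work; all three tiles fit in cache
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile])
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```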

Proceedings Article
06 Jan 2007
TL;DR: This work presents the first memory-bounded dynamic programming algorithm for finite-horizon decentralized POMDPs, which can handle horizons that are multiple orders of magnitude larger than what was previously possible, while achieving the same or better solution quality.
Abstract: Decentralized decision making under uncertainty has been shown to be intractable when each agent has different partial information about the domain. Thus, improving the applicability and scalability of planning algorithms is an important challenge. We present the first memory-bounded dynamic programming algorithm for finite-horizon decentralized POMDPs. A set of heuristics is used to identify relevant points of the infinitely large belief space. Using these belief points, the algorithm successively selects the best joint policies for each horizon. The algorithm is extremely efficient, having linear time and space complexity with respect to the horizon length. Experimental results show that it can handle horizons that are multiple orders of magnitude larger than what was previously possible, while achieving the same or better solution quality. These results significantly increase the applicability of decentralized decision-making techniques.

Journal ArticleDOI
TL;DR: VMesh, a distributed peer-to-peer video-on-demand (VoD) streaming scheme that efficiently supports random seeking, is proposed; it achieves low startup and seeking latency under random user interactivity and peer join/leave, a crucial requirement in an interactive VoD system.
Abstract: Provisioning random access functions in peer-to-peer on-demand video streaming is challenging, due to not only the asynchronous user interactivity but also the unpredictability of group dynamics. In this paper, we propose VMesh, a distributed peer-to-peer video-on-demand (VoD) streaming scheme which efficiently supports random seeking functionality. In VMesh, videos are divided into segments and stored in peers' local storage in a distributed manner. An overlay mesh is built upon peers to support random forward/backward seek, pause and restart during playback. Our scheme takes advantage of the large aggregate storage capacity of peers to improve the segment supply so as to support efficient interactive commands in a scalable manner. Unlike previous work based on "cache-and-relay" mechanism, in our scheme, user interactivity such as random seeking performed by a peer does not break the connections between it and its children, and hence our scheme achieves better playback continuity. Through simulation, we show that our system achieves low startup and seeking latency under random user interactivity and peer join/leave, which is a crucial requirement in an interactive VoD system.

Journal ArticleDOI
TL;DR: In this article, the authors propose algorithms for learning Markov boundaries from data without having to learn a Bayesian network first, and evaluate their correctness, scalability and data efficiency.

01 Jan 2007
TL;DR: The main conclusion is that existing database vendors need to enhance their products to better support multi-tenancy.
Abstract: This is a position paper on multi-tenant databases. As motivation, it first describes the emerging marketplace of hosted enterprise services and the importance of using multi-tenancy to handle high traffic volumes at low cost. It then outlines the main requirements on multi-tenant databases: scale up by consolidating multiple tenants onto the same server and scale out by providing an administrative framework that manages a farm of such servers. Finally it describes three approaches to implementing multi-tenant databases and compares them based on some simple experiments. The main conclusion is that existing database vendors need to enhance their products to better support multi-tenancy.
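
The three approaches are not named in the abstract; one widely used consolidation pattern is a shared table keyed by a tenant id, sketched below with sqlite3 standing in for the database server:

```python
# One common multi-tenancy pattern (not necessarily one of the paper's
# three approaches): a shared table keyed by tenant_id, with every
# tenant-facing query carrying a tenant_id predicate.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE invoices (
    tenant_id TEXT NOT NULL, invoice_no INTEGER, amount REAL,
    PRIMARY KEY (tenant_id, invoice_no))""")
db.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
               [("acme", 1, 99.0), ("acme", 2, 12.5), ("globex", 1, 7.0)])

rows = db.execute(
    "SELECT invoice_no, amount FROM invoices WHERE tenant_id = ?",
    ("acme",)).fetchall()
print(rows)   # [(1, 99.0), (2, 12.5)] -- globex's rows stay invisible
```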

Proceedings ArticleDOI
26 Mar 2007
TL;DR: This paper investigates the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application, and shows that a scale-out strategy can be the key to good performance even on a scale-up machine.
Abstract: Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing. Scale-out solutions are particularly effective in high-throughput Web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine. Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.