Open AccessJournal ArticleDOI

Storage optimization for large-scale distributed stream-processing systems

TLDR
A novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions is provided, to keep the data of highest overall value, while simultaneously balancing the read load to the file system.
Abstract
We consider storage in an extremely large-scale distributed computer system designed for stream processing applications. In such systems, both incoming data and intermediate results may need to be stored to enable analyses at unknown future times. The quantity of data of potential use would dominate even the largest storage system. Thus, a mechanism is needed to keep the data most likely to be used. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time in a prespecified way [Douglis et al. 2004]. Storage space for data entering the system is reclaimed automatically by deleting data of the lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming stream of data presents a challenge. In this article, we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value, while simultaneously balancing the read load to the file system. The key aspects of such a scheme are quite different from those that arise in traditional file assignment problems. We further motivate this optimization problem and describe a solution, comparing its performance to other reasonable schemes via simulation experiments.
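
The reclamation idea in the abstract (each object carries a time-varying retention value, and space is freed by deleting the objects of lowest current value) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the exponential half-life decay, the names `StoredObject` and `reclaim`, and the single fixed-capacity store. It does not attempt the article's actual contribution, which is choosing among multiple file systems while balancing read load.

```python
from dataclasses import dataclass
import math


@dataclass
class StoredObject:
    name: str
    size: int              # storage units occupied
    initial_value: float   # value at creation time (hypothetical scale)
    half_life: float       # assumed decay shape: value halves every half_life
    created_at: float

    def value_at(self, now: float) -> float:
        """Retention value at time `now` under the assumed exponential decay."""
        age = max(0.0, now - self.created_at)
        return self.initial_value * math.pow(0.5, age / self.half_life)


def reclaim(objects, capacity, incoming, now):
    """Evict lowest-value objects until `incoming` fits.

    Returns (kept, evicted). The input list is not modified.
    """
    if incoming.size > capacity:
        raise ValueError("incoming object is larger than the total capacity")

    # Sort a copy so that the least valuable object is first.
    kept = sorted(objects, key=lambda o: o.value_at(now))
    evicted = []
    used = sum(o.size for o in kept)

    # Delete data of the lowest current value until there is room.
    while kept and used + incoming.size > capacity:
        victim = kept.pop(0)
        evicted.append(victim)
        used -= victim.size

    kept.append(incoming)
    return kept, evicted
```

In this sketch, an object with a high initial value but a short half-life can be evicted before a modest object whose value decays slowly, which is the kind of time-dependent ranking a retention value function is meant to capture.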


Citations
Proceedings ArticleDOI

SODA: an optimizing scheduler for large-scale stream-based distributed computer systems

TL;DR: The design and functionality of SODA, its mathematical components, and experiments showing the scheduler's performance are described; the scheduler must be able to shift resource allocation dynamically in response to changes in resource availability, job arrivals and departures, incoming data rates, and so on.
Book ChapterDOI

COLA: optimizing stream processing applications via graph partitioning

TL;DR: This paper describes an optimization scheme for fusing compile-time operators into reasonably-sized run-time software units called processing elements (PEs), and computes a hierarchical partitioning of the operator graph based on a minimum-ratio cut subroutine.
Book

File placement on distributed computer systems

TL;DR: This article examines recent developments in the integration of the query processing, file partitioning, concurrency control, and network design problems with the file placement problem.
Book ChapterDOI

Advances and Challenges for Scalable Provenance in Stream Processing Systems

TL;DR: The requirements behind the initial implementation of Century's provenance subsystem are described, its strengths and limitations are analyzed, and a new provenance architecture is proposed to address some of these limitations.
Proceedings Article

Identifying trends in enterprise data protection systems

TL;DR: A study of 40,000 enterprise data protection systems deploying Symantec NetBackup, a commercial backup product, finds that the main reason behind inefficiencies in data protection systems is misconfiguration, and the authors believe there is potential in developing automated, self-healing data protection systems that achieve higher efficiency standards.