Storage optimization for large-scale distributed stream-processing systems

doi:10.1145/1326542.1326547

Open Access, Journal Article

Kirsten W. Hildrum, +5 more

25 Feb 2008

ACM Transactions on Storage

Vol. 3, Iss. 4, pp. 5

TLDR

The article presents a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem that employs retention value functions. The scheme aims to keep the data of highest overall value while simultaneously balancing the read load to the file system.

Abstract:

We consider storage in an extremely large-scale distributed computer system designed for stream processing applications. In such systems, both incoming data and intermediate results may need to be stored to enable analyses at unknown future times. The quantity of data of potential use would dominate even the largest storage system. Thus, a mechanism is needed to keep the data most likely to be used. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time in a prespecified way [Douglis et al. 2004]. Storage space for data entering the system is reclaimed automatically by deleting data of the lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming stream of data presents a challenge. In this article we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value, while simultaneously balancing the read load to the file system. The key aspects of such a scheme are quite different from those that arise in traditional file assignment problems. We further motivate this optimization problem and describe a solution, comparing its performance to other reasonable schemes via simulation experiments.
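The reclamation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the linear decay shape, and the single-store setting are all assumptions made for illustration, and the placement optimization across multiple file systems is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StoredObject:
    name: str
    size: int
    created: float
    # Hypothetical retention value function: maps object age to a value.
    value_at: Callable[[float], float]


def linear_decay(initial: float, lifetime: float) -> Callable[[float], float]:
    """Assumed decay shape: value falls linearly to zero over `lifetime`."""
    def value(age: float) -> float:
        return max(0.0, initial * (1.0 - age / lifetime))
    return value


class ValueBasedStore:
    """Single store that reclaims space by deleting lowest-value objects."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.objects: Dict[str, StoredObject] = {}

    def current_value(self, obj: StoredObject, now: float) -> float:
        return obj.value_at(now - obj.created)

    def insert(self, obj: StoredObject, now: float) -> List[str]:
        """Store `obj`, evicting as needed; return names of evicted objects."""
        if obj.size > self.capacity:
            raise ValueError("object larger than total capacity")
        evicted: List[str] = []
        # Simplification: the incoming object is always admitted, even if
        # its own value is lower than that of the objects it displaces.
        while self.used + obj.size > self.capacity:
            victim = min(self.objects.values(),
                         key=lambda o: self.current_value(o, now))
            del self.objects[victim.name]
            self.used -= victim.size
            evicted.append(victim.name)
        self.objects[obj.name] = obj
        self.used += obj.size
        return evicted
```

In this sketch an object with a high initial value but a short lifetime can be deleted before an older object whose value decays slowly, which is the behavior that distinguishes value-based reclamation from purely age-based deletion.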

Citations

SODA: an optimizing scheduler for large-scale stream-based distributed computer systems

COLA: optimizing stream processing applications via graph partitioning

File placement on distributed computer systems

Advances and Challenges for Scalable Provenance in Stream Processing Systems

Identifying trends in enterprise data protection systems

References

The Art of Computer Programming

Human behavior and the principle of least effort

Introduction to linear optimization

Network Flows

Related Papers (5)

Position: short object lifetimes require a delete-optimized storage system

ARC: a self-tuning, low overhead replacement cache

LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance

Nitro: a capacity-optimized SSD cache for primary storage

TelegraphCQ: Continuous Dataflow Processing for an Uncertain World