
Showing papers in "ACM Transactions on Storage in 2005"


Journal ArticleDOI
TL;DR: Ext3cow provides a time-shifting interface that permits a real-time and continuous view of data in the past, and it takes advantage of the fine-grained control of on-disk and in-memory data available only to a file system, resulting in minimal degradation of performance and functionality.
Abstract: The ext3cow file system, built on the popular ext3 file system, provides an open-source file versioning and snapshot platform for compliance with the versioning and auditability requirements of recent electronic record retention legislation. Ext3cow provides a time-shifting interface that permits a real-time and continuous view of data in the past. Time-shifting does not pollute the file system namespace nor require snapshots to be mounted as a separate file system. Further, ext3cow is implemented entirely in the file system space and, therefore, does not modify kernel interfaces or change the operation of other file systems. Ext3cow takes advantage of the fine-grained control of on-disk and in-memory data available only to a file system, resulting in minimal degradation of performance and functionality. Experimental results confirm this hypothesis; ext3cow performs comparably to ext3 on many benchmarks and on trace-driven experiments.
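As a rough illustration of the copy-on-write, epoch-based lookup behind time-shifting, the sketch below keeps every version of a name stamped with its creation epoch and answers point-in-time reads with the newest version no newer than the requested epoch. The names and structures are illustrative only and do not reflect ext3cow's on-disk metadata.

```python
import bisect
from collections import defaultdict

class VersionedNamespace:
    """Toy epoch-based versioning: an append-only version list per name."""

    def __init__(self):
        self.epochs = defaultdict(list)   # name -> epochs, in write order
        self.data = defaultdict(list)     # name -> contents, parallel list

    def write(self, name, epoch, contents):
        # Copy-on-write: append a new version; older versions stay readable.
        self.epochs[name].append(epoch)
        self.data[name].append(contents)

    def read(self, name, epoch=None):
        """Newest version with epoch <= the requested epoch (None = current)."""
        if epoch is None:
            return self.data[name][-1]
        i = bisect.bisect_right(self.epochs[name], epoch)
        if i == 0:
            raise FileNotFoundError(f"{name} did not exist at epoch {epoch}")
        return self.data[name][i - 1]

fs = VersionedNamespace()
fs.write("report.txt", 100, "draft")
fs.write("report.txt", 200, "final")
assert fs.read("report.txt") == "final"
assert fs.read("report.txt", epoch=150) == "draft"   # time-shifted view
```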

177 citations


Journal Article
TL;DR: Singularity demonstrates the practicality of new technologies and architectural decisions, which should lead to the construction of more robust and dependable systems.
Abstract: Singularity is a research project in Microsoft Research that started with the question: what would a software platform look like if it were designed from scratch with the primary goal of dependability? Singularity is working to answer this question by building on advances in programming languages and tools to develop a new system architecture and operating system (named Singularity), with the aim of producing a more robust and dependable software platform. Singularity demonstrates the practicality of new technologies and architectural decisions, which should lead to the construction of more robust and dependable systems.

162 citations


Journal ArticleDOI
TL;DR: This article proposes a tree-based management scheme that adopts multiple granularities in flash-memory management, not only to reduce the run-time RAM footprint but also to manage the write workload due to housekeeping.
Abstract: Many existing approaches to flash-memory management are based on RAM-resident tables in which one single granularity size is used for both address translation and space management. As high-capacity flash memory is becoming more affordable than ever, the dilemma of how to manage the RAM space or how to improve the access performance is emerging for many vendors. In this article, we propose a tree-based management scheme that adopts multiple granularities in flash-memory management. Our objective is not only to reduce the run-time RAM footprint but also to manage the write workload due to housekeeping. The proposed method was evaluated under realistic workloads, where significant advantages over existing approaches were observed, in terms of RAM space, access performance, and flash-memory lifetime.
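To see why multiple granularities shrink the RAM-resident map, consider a lookup structure that mixes coarse extent entries with fine per-page entries; the sketch below is a simplification of that idea under assumed names, not the authors' tree layout.

```python
import bisect

class ExtentMap:
    """Logical-to-physical map mixing coarse extents and single-page entries."""

    def __init__(self):
        self.entries = []   # sorted list of (logical_start, length, physical_start)

    def insert(self, logical_start, length, physical_start):
        # One entry can cover a whole sequentially written extent (coarse
        # granularity) or a single remapped page (fine granularity).
        bisect.insort(self.entries, (logical_start, length, physical_start))

    def translate(self, logical_page):
        i = bisect.bisect_right(self.entries,
                                (logical_page, float("inf"), float("inf"))) - 1
        if i >= 0:
            start, length, phys = self.entries[i]
            if start <= logical_page < start + length:
                return phys + (logical_page - start)
        raise KeyError(f"logical page {logical_page} is unmapped")

m = ExtentMap()
m.insert(0, 1024, 8192)   # one coarse entry instead of 1024 page entries
m.insert(5000, 1, 70)     # a single hot page remapped on its own
assert m.translate(100) == 8292
assert m.translate(5000) == 70
```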

132 citations


Journal ArticleDOI
TL;DR: This work uses an online feedback loop with an adaptive controller that throttles storage access requests to ensure that the available system throughput is shared among workloads according to their performance goals and their relative importance.
Abstract: Ensuring performance isolation and differentiation among workloads that share a storage infrastructure is a basic requirement in consolidated data centers. Existing management tools rely on resource provisioning to meet performance goals; they require detailed knowledge of the system characteristics and the workloads. Provisioning is inherently slow to react to system and workload dynamics and, in the general case, it is not practical to provision for the worst case. We propose a software-only solution that ensures predictable performance for storage access. It is applicable to a wide range of storage systems and makes no assumptions about workload characteristics. We use an online feedback loop with an adaptive controller that throttles storage access requests to ensure that the available system throughput is shared among workloads according to their performance goals and their relative importance. The controller considers the system as a “black box” and adapts automatically to system and workload changes. The controller is distributed to ensure high availability under overload conditions, and it can be used for both block and file access protocols. The evaluation of Triage, our experimental prototype, demonstrates workload isolation and differentiation in an overloaded cluster file system where workloads and system components are changing.
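A minimal sketch of the feedback idea, not the Triage controller itself: in each control interval, shrink the total admitted throughput if any workload misses its latency goal, grow it cautiously otherwise, and split the resulting budget among workloads in proportion to their importance weights. All names and constants below are assumptions for illustration.

```python
def control_step(total_cap, per_workload_latency, latency_goal, weights,
                 shrink=0.9, grow=1.05):
    """One interval of a black-box throttling loop (illustrative only).

    If any workload misses its latency goal, the total admitted throughput
    is reduced; otherwise it is cautiously increased.  The budget is then
    split among workloads in proportion to their importance weights."""
    goal_missed = any(per_workload_latency[w] > latency_goal[w] for w in weights)
    total_cap = total_cap * (shrink if goal_missed else grow)
    weight_sum = sum(weights.values())
    return total_cap, {w: total_cap * weights[w] / weight_sum for w in weights}

total_cap = 1000.0                         # admitted IOPS budget
weights = {"oltp": 3.0, "backup": 1.0}     # relative importance
goals = {"oltp": 10.0, "backup": 100.0}    # latency goals (ms)
observed = {"oltp": 25.0, "backup": 40.0}  # measured latencies (ms)
total_cap, caps = control_step(total_cap, observed, goals, weights)
print(total_cap, caps)   # budget shrinks; oltp keeps the larger share
```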

131 citations


Journal ArticleDOI
TL;DR: Both graceful degradation and live-block recovery are implemented in a prototype SCSI-based storage system underneath unmodified file systems, demonstrating that powerful “file-system like” functionality can be implemented within a “semantically smart” disk system behind a narrow block-based interface.
Abstract: We present the design, implementation, and evaluation of D-GRAID, a gracefully degrading and quickly recovering RAID storage array. D-GRAID ensures that most files within the file system remain available even when an unexpectedly high number of faults occur. D-GRAID achieves high availability through aggressive replication of semantically critical data, and fault-isolated placement of logically related data. D-GRAID also recovers from failures quickly, restoring only live file system data to a hot spare. Both graceful degradation and live-block recovery are implemented in a prototype SCSI-based storage system underneath unmodified file systems, demonstrating that powerful “file-system like” functionality can be implemented within a “semantically smart” disk system behind a narrow block-based interface.
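A hedged sketch of the two placement rules behind the availability claim (helper names are assumptions): semantically critical metadata is aggressively replicated, while all blocks of any one file are confined to a single disk, so an unexpected disk failure loses whole files rather than pieces of most files.

```python
NUM_DISKS = 8

def place_metadata(block_id):
    """Semantically critical structures (directories, inode tables) are
    aggressively replicated; here, a copy goes to every disk."""
    return list(range(NUM_DISKS))

def place_file_blocks(file_id, num_blocks):
    """Fault-isolated placement: every block of one file lands on one disk,
    chosen per file, so a disk failure loses whole files, not pieces of many."""
    disk = hash(file_id) % NUM_DISKS
    return {block: disk for block in range(num_blocks)}

# Losing one disk then makes roughly 1/NUM_DISKS of the files unavailable,
# while the namespace and all remaining files stay fully readable.
```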

111 citations


Journal ArticleDOI
TL;DR: A lossy, gracefully degrading storage model is believed to be necessary and sufficient for many scientific applications, since it supports both progressive data collection for interesting events and long-term in-network storage for in-network querying and processing.
Abstract: Wireless sensor networks enable dense sensing of the environment, offering unprecedented opportunities for observing the physical world. This article addresses two key challenges in wireless sensor networks: in-network storage and distributed search. The need for these techniques arises from the inability to provide persistent, centralized storage and querying in many sensor networks. Centralized storage requires multihop transmission of sensor data to Internet gateways, which can quickly drain battery-operated nodes. Constructing a storage and search system that satisfies the requirements of data-rich scientific applications is a daunting task for many reasons: (a) the data requirements may be large compared to available storage and communication capacity of resource-constrained nodes, (b) user requirements are diverse and range from identification and collection of interesting event signatures to obtaining a deeper understanding of long-term trends and anomalies in the sensor events, and (c) many applications are in new domains where a priori information may not be available to reduce these requirements. This article describes a lossy, gracefully degrading storage model. We believe that such a model is necessary and sufficient for many scientific applications since it supports both progressive data collection for interesting events and long-term in-network storage for in-network querying and processing. Our system demonstrates the use of in-network wavelet-based summarization and progressive aging of summaries in support of long-term querying in storage and communication-constrained networks. We evaluate the performance of our Linux implementation and show that it achieves: (a) low communication overhead for multiresolution summarization, (b) highly efficient drill-down search over such summaries, and (c) efficient use of network storage capacity through load-balancing and progressive aging of summaries.
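The multiresolution idea can be sketched with simple pairwise averaging (a Haar-style low-pass summary, standing in for the paper's wavelet codec): each level halves the resolution of the level below, coarse levels are cheap to retain as data ages, and a drill-down query descends from the coarsest summary into the more promising half at every step. Everything below is illustrative.

```python
def build_levels(samples, num_levels):
    """Level 0 is raw data; each higher level stores pairwise averages
    (a Haar-like low-pass summary at half the previous resolution)."""
    levels = [list(samples)]
    for _ in range(num_levels):
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2
                       for i in range(0, len(prev) - 1, 2)])
    return levels

def drill_down(levels, predicate):
    """Find a raw sample satisfying `predicate` by descending from the
    coarsest summary, expanding only the more promising half each time."""
    level = len(levels) - 1
    idx = max(range(len(levels[level])), key=lambda i: levels[level][i])
    while level > 0:
        level -= 1
        left, right = 2 * idx, 2 * idx + 1
        candidates = [i for i in (left, right) if i < len(levels[level])]
        idx = max(candidates, key=lambda i: levels[level][i])
    return idx if predicate(levels[0][idx]) else None

data = [1, 2, 1, 3, 9, 8, 2, 1]              # a burst around indices 4-5
levels = build_levels(data, 3)
print(drill_down(levels, lambda v: v > 5))   # -> 4, found via coarse summaries
```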

111 citations


Journal ArticleDOI
TL;DR: Two algorithms are proposed that use a data mining technique called frequent sequence mining to discover block correlations in storage systems; both run reasonably fast with feasible space requirements, indicating that they are practical for dynamically inferring correlations in a storage system.
Abstract: Block correlations are common semantic patterns in storage systems. They can be exploited for improving the effectiveness of storage caching, prefetching, data layout, and disk scheduling. Unfortunately, information about block correlations is unavailable at the storage system level. Previous approaches for discovering file correlations in file systems do not scale well enough for discovering block correlations in storage systems. In this article, we propose two algorithms, C-Miner and C-Miner*, that use a data mining technique called frequent sequence mining to discover block correlations in storage systems. Both algorithms run reasonably fast with feasible space requirements, indicating that they are practical for dynamically inferring correlations in a storage system. C-Miner is a direct application of a frequent-sequence mining algorithm with a few modifications; compared with C-Miner, C-Miner* is redesigned for mining block correlations by making concessions for the specific problem of long sequences in storage system traces. Therefore, C-Miner* can discover 7--109% more correlation rules within 2--15 times shorter time than C-Miner. Moreover, we have also evaluated the benefits of block correlation-directed prefetching and data layout through experiments. Our results using real system workloads show that correlation-directed prefetching and data layout can reduce average I/O response time by 12--30% compared to the base case, and 7--25% compared to the commonly used sequential prefetching scheme for most workloads.
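The flavor of correlation mining can be conveyed with a much simplified sketch: scan the block trace with a bounded look-ahead window, count how often block b is accessed shortly after block a, and keep pairs with enough support as rules to drive prefetching or layout. The real C-Miner/C-Miner* algorithms mine longer frequent sequences from the trace; the toy version below only counts pairs.

```python
from collections import Counter

def mine_pair_rules(trace, window=8, min_support=3):
    """Count 'block b is accessed within `window` accesses after block a'
    and keep pairs seen at least `min_support` times as rules a -> b.
    A toy stand-in for frequent-sequence mining over a block trace."""
    support = Counter()
    for i, a in enumerate(trace):
        seen = set()
        for b in trace[i + 1:i + 1 + window]:
            if b != a and b not in seen:
                support[(a, b)] += 1
                seen.add(b)
    return {pair: n for pair, n in support.items() if n >= min_support}

trace = [10, 11, 57, 10, 11, 99, 10, 11, 57, 10, 11]
rules = mine_pair_rules(trace, window=2, min_support=3)
print(rules)   # (10, 11) and (11, 10) emerge as rules -> prefetch 11 after 10
```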

59 citations


Journal ArticleDOI
TL;DR: This article proposes a technique that provides a performance guarantee for control algorithms, as well as a new control algorithm, Performance-Directed Dynamic (PD), which dynamically adjusts its thresholds periodically based on available slack and recent workload characteristics.
Abstract: Much research has been conducted on energy management for memory and disks. Most studies use control algorithms that dynamically transition devices to low power modes after they are idle for a certain threshold period of time. The control algorithms used in the past have two major limitations. First, they require painstaking, application-dependent manual tuning of their thresholds to achieve energy savings without significantly degrading performance. Second, they do not provide performance guarantees. This article addresses these two limitations for both memory and disks, making memory/disk energy-saving schemes practical enough to use in real systems. Specifically, we make four main contributions. (1) We propose a technique that provides a performance guarantee for control algorithms. We show that our method works well for all tested cases, even with previously proposed algorithms that are not performance-aware. (2) We propose a new control algorithm, Performance-Directed Dynamic (PD), that dynamically adjusts its thresholds periodically, based on available slack and recent workload characteristics. For memory, PD consumes the least energy when compared to previous hand-tuned algorithms combined with a performance guarantee. However, for disks, PD is too complex and its self-tuning is unable to beat previous hand-tuned algorithms. (3) To improve on PD, we propose a simpler, optimization-based, threshold-free control algorithm, Performance-Directed Static (PS). PS periodically assigns a static configuration by solving an optimization problem that incorporates information about the available slack and recent traffic variability to different chips/disks. We find that PS is the best or close to the best across all performance-guaranteed disk algorithms, including hand-tuned versions. (4) We also explore a hybrid scheme that combines PS and PD algorithms to further improve energy savings.
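The mechanism being tuned is an idle threshold: once a device has been idle longer than the threshold it is put into a low-power state, and the next access pays a wake-up penalty. Below is a hedged sketch of that baseline plus the PD-style intuition of loosening or tightening the threshold against a slack budget; the actual PD and PS algorithms are considerably more involved.

```python
def simulate_idle_threshold(gaps, threshold, wakeup_penalty):
    """Return (low_power_time, total_delay_added) for a trace of idle-gap
    lengths under a fixed spin-down threshold.  Units are arbitrary."""
    saved, delay = 0.0, 0.0
    for gap in gaps:
        if gap > threshold:
            saved += gap - threshold        # time spent in the low-power state
            delay += wakeup_penalty         # next request waits for wake-up
    return saved, delay

def adjust_threshold(threshold, delay_so_far, slack_budget, step=0.5):
    """PD-style intuition: if accumulated delay is well under the allowed
    slack, lower the threshold (save more energy); if the slack is nearly
    used up, raise it (protect performance)."""
    if delay_so_far < 0.5 * slack_budget:
        return max(1.0, threshold - step)
    if delay_so_far > 0.9 * slack_budget:
        return threshold + step
    return threshold

gaps = [0.5, 12.0, 3.0, 40.0, 1.0, 25.0]    # idle periods between requests
saved, delay = simulate_idle_threshold(gaps, threshold=5.0, wakeup_penalty=2.0)
print(saved, delay)                          # 62.0 time units saved, 6.0 delay
```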

51 citations


Journal ArticleDOI
TL;DR: Om is a read/write, peer-to-peer, wide-area storage system that achieves high availability and manageability through online automatic regeneration while still preserving consistency guarantees, by utilizing the limited view divergence property in today's Internet and adopting the witness model.
Abstract: Reducing management costs and improving the availability of large-scale distributed systems require automatic replica regeneration, that is, creating new replicas in response to replica failures. A major challenge to regeneration is maintaining consistency when the replica group changes. Doing so is particularly difficult across the wide area where failure detection is complicated by network congestion and node overload. In this context, this article presents Om, the first read/write peer-to-peer, wide-area storage system that achieves high availability and manageability through online automatic regeneration while still preserving consistency guarantees. We achieve these properties through the following techniques. First, by utilizing the limited view divergence property in today's Internet and by adopting the witness model, Om is able to regenerate from any single replica, rather than requiring a majority quorum, at the cost of a small (10⁻⁶ in our experiments) probability of violating consistency during each regeneration. As a result, Om can deliver high availability with a small number of replicas, while traditional designs would significantly increase the number of replicas. Next, we distinguish failure-free reconfigurations from failure-induced ones, enabling common reconfigurations to proceed with a single round of communication. Finally, we use a lease graph among the replicas and a two-phase write protocol to optimize for reads, so that reads in Om can be processed by any single replica. Experiments on PlanetLab show that consistent regeneration in Om completes in approximately 20 seconds.
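The availability benefit of regenerating from any single replica, rather than a majority quorum, can be made concrete with a back-of-the-envelope calculation; the per-replica availability below is an assumed illustrative figure, not a number from the paper.

```python
from math import comb

def any_replica_available(n, p):
    """System is available if at least one of n replicas is reachable."""
    return 1 - (1 - p) ** n

def majority_available(n, p):
    """System is available only if a majority of the n replicas is reachable."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

n, p = 3, 0.9    # assumed per-replica availability
print(any_replica_available(n, p))   # 0.999
print(majority_available(n, p))      # 0.972
# Om trades a tiny per-regeneration probability of violating consistency
# (10^-6 in the paper's experiments) for the higher availability above.
```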

48 citations


Journal ArticleDOI
TL;DR: This article discusses the tradeoff between SATA drives, which have the advantage that fewer, higher-capacity drives are needed for a given system storage capacity, further reducing cost but allowing higher drive failure rates, and the use of additional storage system redundancy and drive failure prediction to maintain system data integrity using less reliable drives.
Abstract: Information storage reliability and security is addressed by using personal computer disk drives in enterprise-class nearline and archival storage systems. The low cost of these serial ATA (SATA) PC drives is a tradeoff against drive reliability design and demonstration test levels, which are higher in the more expensive SCSI and Fibre Channel drives. This article discusses the tradeoff between SATA, which has the advantage that fewer, higher-capacity drives are needed for a given system storage capacity, which further reduces cost and allows higher drive failure rates, and the use of additional storage system redundancy and drive failure prediction to maintain system data integrity using less reliable drives. RAID stripe failure probability is calculated using typical ATA and SCSI drive failure rates, for single and double parity data reconstruction failure, and failure due to drive unrecoverable block errors. Reliability improvement from drive failure prediction is also calculated, and can be significant. Today's SATA drive specifications for unrecoverable block errors appear to allow stripe reconstruction failure, and additional in-drive parity blocks are suggested as a solution. The possibility of using low-cost disks for data backup and archiving, replacing higher-cost magnetic tape, is discussed. This requires significantly better RAID stripe failure probability, and suitable drive technology alternatives are discussed. The failure rate of nonoperating drives is estimated using failure analysis results from approximately 4000 drives. Nonoperating RAID stripe failure rates are thereby estimated. User data security needs to be assured in addition to reliability, and to extend past the point where physical control of drives is lost, such as when drives are removed from systems for data vaulting, repair, sale, or discard. Today, over a third of resold drives contain unerased user data. Security is proposed via the existing SATA drive secure-erase command, or via the existing SATA drive password commands, or by data encryption. Finally, backup and archival disk storage is compared to magnetic tape, a technology with a proven reliability record over the full half-century of digital data storage. In contrast to disk archives, tape archives are not vulnerable to tape transport failure modes; only failure modes in the archived tapes and reels will make data unrecoverable.
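The concern about unrecoverable block errors during rebuild reduces to a short calculation, reproduced below with typical specification values that are assumed here rather than taken from the article: a single-parity rebuild must read every surviving drive in full, so the chance of hitting at least one unrecoverable error grows with the capacity read times the drive's unrecoverable bit error rate.

```python
def rebuild_failure_probability(drives_read, capacity_tb, uer_per_bit):
    """P(at least one unrecoverable read error while reading all surviving
    drives during a single-parity RAID rebuild)."""
    bits_read = drives_read * capacity_tb * 1e12 * 8
    return 1 - (1 - uer_per_bit) ** bits_read

# Assumed, typical spec-sheet error rates (not values from the article):
SATA_UER = 1e-14      # about 1 unrecoverable error per 10^14 bits read
SCSI_UER = 1e-15

# 7-drive RAID-5 group of 400 GB drives: 6 drives must be read to rebuild one.
print(rebuild_failure_probability(6, 0.4, SATA_UER))   # ~0.17
print(rebuild_failure_probability(6, 0.4, SCSI_UER))   # ~0.019
```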

40 citations


Journal ArticleDOI
TL;DR: This report studies the disk replacement problem (DRP): finding a sequence of disk additions and removals for a storage system, while migrating the data, keeping it balanced across the old and new configurations, and minimizing the migration cost.
Abstract: Random data placement, which is efficient and scalable for large-scale storage systems, has recently emerged as an alternative to traditional data striping. In this report, we study the disk replacement problem (DRP) to find a sequence of disk additions and removals for a storage system, while migrating the data and respecting the following constraints: (1) the data is initially balanced across the existing distributed disk configuration, (2) the data must again be balanced across the new configuration, and (3) the data migration cost must be minimized. In practice, migrating data from old disks to new devices is complicated by the fact that the total number of disks connected to the storage system is often limited by a fixed number of available slots and not all the old and new disks can be connected at the same time. This article presents solutions for both cases where the number of disk slots is either unconstrained or constrained.
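For the simple case of equal-capacity disks, the unavoidable migration cost has an easy lower bound, sketched below as an illustrative calculation rather than the article's algorithm: everything on removed disks must move, every added disk must end up with its balanced share, and the larger of those two quantities is the minimum fraction of the data that has to migrate.

```python
def minimum_migration_fraction(old_disks, added_disks, removed_disks=0):
    """Fraction of the data set that must move when the configuration changes
    from `old_disks` to `old_disks + added_disks - removed_disks` equal disks,
    assuming the data is balanced before and after."""
    new_disks = old_disks + added_disks - removed_disks
    # Data on removed disks must move, and each new disk must end up with a
    # 1/new_disks share; the larger of the two requirements dominates.
    leaving = removed_disks / old_disks
    arriving = added_disks / new_disks
    return max(leaving, arriving)

print(minimum_migration_fraction(8, 4))      # add 4 to 8 disks -> 1/3 must move
print(minimum_migration_fraction(8, 0, 2))   # remove 2 of 8    -> 1/4 must move
```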

Journal ArticleDOI
TL;DR: A description of DISP and an analysis of its fault-tolerance properties are provided; the complexity of the protocol is analyzed and several potential applications are discussed.
Abstract: DISP is a practical client-server protocol for the distributed storage of immutable data objects. Unlike most other contemporary protocols, DISP permits applications to make explicit tradeoffs between total storage space, computational overhead, and guarantees of availability, integrity, and privacy on a per-object basis. Applications specify the degree of redundancy with which each item is encoded, what level of integrity checks are computed and stored with each item, and whether items are stored in an encrypted format. At one extreme, clients willing to pay the overhead are guaranteed privacy, integrity, and availability of data stored in the system as long as fewer than half the servers are Byzantine. At the other extreme, objects that do not require privacy or integrity in the face of Byzantine servers can be stored with very low computational and storage overhead. DISP is efficient in terms of message count, message size, and storage requirements: even in the worst case, the read and write protocols require a number of messages that is linear with respect to the number of servers. In terms of message size, DISP requires transferring only marginally more than L bytes to correctly read an object of size L, even in the face of Byzantine server failures. In this article we provide a description of DISP and an analysis of its fault-tolerant properties. We also analyze the complexity of the protocol and discuss several potential applications. We conclude with a description of our prototype implementation and measurements of its performance on commodity hardware.
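The per-object tradeoff can be pictured with a small policy sketch; the field names and the (n, k) dispersal accounting below are assumptions used for illustration, not DISP's actual parameters. An object of size L dispersed into n fragments, any k of which suffice to reconstruct it, occupies roughly n/k times L bytes in total, plus whatever metadata the chosen integrity level adds.

```python
from dataclasses import dataclass

@dataclass
class ObjectPolicy:
    # Illustrative knobs mirroring the per-object tradeoffs described above.
    servers: int          # n: fragments, one per server
    threshold: int        # k: fragments needed to reconstruct
    integrity: str        # "none", "object hash", or "per-fragment hash"
    encrypted: bool

    def storage_overhead(self, size_bytes, hash_bytes=32):
        coded = size_bytes * self.servers / self.threshold
        checks = 0 if self.integrity == "none" else hash_bytes * self.servers
        return coded + checks

cheap = ObjectPolicy(servers=4, threshold=4, integrity="none", encrypted=False)
robust = ObjectPolicy(servers=8, threshold=4, integrity="per-fragment hash",
                      encrypted=True)
print(cheap.storage_overhead(1_000_000))    # ~1.0 MB: no redundancy, no checks
print(robust.storage_overhead(1_000_000))   # ~2.0 MB: 2x coded plus metadata
```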

Journal ArticleDOI
TL;DR: An optimal solution to realize the file storage scheme in tree networks with asymmetric edges between adjacent nodes is established by combining the memory-allocation algorithm with the data-interleaving algorithm.
Abstract: A file storage scheme is proposed for networks containing heterogeneous clients. In the scheme, the performance measured by file-retrieval delays degrades gracefully under increasingly serious faulty circumstances. The scheme combines coding with storage for better performance. The problem is NP-hard for general networks; and this article focuses on tree networks with asymmetric edges between adjacent nodes. A polynomial-time memory-allocation algorithm is presented, which determines how much data to store on each node, with the objective of minimizing the total amount of data stored in the network. Then a polynomial-time data-interleaving algorithm is used to determine which data to store on each node for satisfying the quality-of-service requirements in the scheme. By combining the memory-allocation algorithm with the data-interleaving algorithm, an optimal solution to realize the file storage scheme in tree networks is established.
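One way to see why combining coding with storage yields gracefully degrading retrieval delay (a general illustration, not the article's allocation or interleaving algorithm): if any k coded fragments suffice to rebuild a file, the retrieval delay is governed by the k-th fastest reachable fragment, so losing a node degrades the delay gradually instead of causing outright failure.

```python
def retrieval_delay(fragment_delays, k):
    """Delay to rebuild a file needing any k coded fragments: the k-th
    smallest delay among reachable holders (None if fewer than k remain)."""
    reachable = sorted(d for d in fragment_delays if d is not None)
    return reachable[k - 1] if len(reachable) >= k else None

delays = [3, 5, 7, 11, 20]                 # per-node fragment retrieval delays
print(retrieval_delay(delays, k=3))        # 7: healthy network
faulty = [3, None, 7, 11, 20]              # one node unreachable
print(retrieval_delay(faulty, k=3))        # 11: delay degrades, data survives
```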

Journal ArticleDOI
TL;DR: This "cheap" recovery mechanism simplifies management by lowering the cost of acting on false positives and enables one to use statistical techniques to turn hard-to-catch failures, such as node degradation, into failure, followed by recovery.
Abstract: Cluster hash tables (CHTs) are key components of many large-scale Internet services due to their highly-scalable performance and the prevalence of the type of data they store. Another advantage of CHTs is that they can be designed to be as self-managing as a cluster of stateless servers. One key to achieving this extreme manageability is reboot-based recovery that is predictably fast and has modest impact on system performance and availability. This "cheap" recovery mechanism simplifies management in two ways. First, it simplifies failure detection by lowering the cost of acting on false positives. This enables one to use statistical techniques to turn hard-to-catch failures, such as node degradation, into failure, followed by recovery. Second, cheap recovery simplifies capacity planning by recasting repartitioning as failure plus recovery to achieve zero-downtime incremental scaling. These low-cost recovery and scaling mechanisms make it possible for the system to be continuously self-adjusting, a key property of self-managing systems.

Journal ArticleDOI
TL;DR: In the proposed Postmanet, the use of digital storage media transported by the postal system as a general digital communication mechanism has several important advantages, including wide global reach, great bandwidth potential, low cost, and ease of incremental adoption.
Abstract: Making high-bandwidth Internet access pervasively available to a large worldwide audience is a difficult challenge, especially in many developing regions. As we wait for the uncertain takeoff of technologies that promise to improve the situation, we propose to explore an approach that is potentially more easily realizable: the use of digital storage media transported by the postal system as a general digital communication mechanism. We shall call such a system a Postmanet. Compared to more conventional wide-area connectivity options, the Postmanet has several important advantages, including wide global reach, great bandwidth potential, low cost, and ease of incremental adoption. While the idea of sending digital content via the postal system is not a new one, none of the existing attempts have turned the postal system into a generic and transparent communication channel that not only can cater to a wide array of applications, but also effectively manage the many idiosyncrasies associated with using the postal system. In the proposed Postmanet, we see two recurring themes at many different levels of the system. One is the simultaneous exploitation of the Internet and the postal system so we can combine their latency and bandwidth advantages. The other is the exploitation of the abundant capacity and bandwidth of the Postmanet to improve its latency, cost, and reliability.
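The "great bandwidth potential" is simple arithmetic: the effective bandwidth of a mailed medium is its capacity divided by its time in transit. The figures below are assumed for illustration and are not from the article.

```python
def effective_bandwidth_mbps(capacity_gb, delivery_days):
    """Capacity shipped divided by time in transit, in megabits per second."""
    bits = capacity_gb * 1e9 * 8
    seconds = delivery_days * 86400
    return bits / seconds / 1e6

# Assumed, illustrative figures (not from the article):
print(effective_bandwidth_mbps(100, 3))   # ~3.1 Mbit/s for a 100 GB disk in 3 days
print(effective_bandwidth_mbps(4.7, 3))   # ~0.15 Mbit/s even for a single DVD
```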

Journal ArticleDOI
TL;DR: This article presents a system for load management in shared-disk file systems built on clusters of heterogeneous computers; it continuously tunes load placement using adaptive, nonuniform (ANU) randomization, which realizes the scalability and metadata-reduction benefits of hash-based, randomized placement techniques while avoiding hashing's drawbacks.
Abstract: We develop and evaluate a system for load management in shared-disk file systems built on clusters of heterogeneous computers. It balances workload by moving file sets among cluster server nodes. It responds to changing server resources that arise from failure and recovery, and dynamically adding or removing servers. It also realizes performance consistency---nearly uniform performance across all servers. The system is adaptive and self-tuning. It operates without any a priori knowledge of workload properties, or the capabilities of the servers. Rather, it continuously tunes load placement using a technique called adaptive, nonuniform (ANU) randomization. ANU randomization realizes the scalability and metadata reduction benefits of hash-based, randomized placement techniques, while avoiding hashing's drawbacks: load skew, inability to cope with heterogeneity, and lack of tunability. ANU randomization outperforms virtual-processor approaches to load balancing, while reducing the amount of shared state among servers and the amount of load movement.
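A hedged sketch of adaptive, nonuniform randomization (not the paper's exact tuning rule): file sets are placed by weighted random choice, and the weights are periodically shifted away from servers measuring above-average load and toward lightly loaded ones, so placement stays nearly stateless yet copes with heterogeneous servers.

```python
import random

def retune_weights(weights, loads, gain=0.2):
    """Shift placement probability away from servers whose measured load is
    above the average and toward those below it, then renormalize."""
    avg = sum(loads[s] for s in weights) / len(weights)
    new = {s: max(0.01, w * (1 - gain * (loads[s] - avg) / avg))
           for s, w in weights.items()}
    total = sum(new.values())
    return {s: w / total for s, w in new.items()}

def place_file_set(weights, rng=random):
    """Weighted random choice of a server for a newly created file set."""
    servers, probs = zip(*weights.items())
    return rng.choices(servers, weights=probs, k=1)[0]

weights = {"s1": 1 / 3, "s2": 1 / 3, "s3": 1 / 3}   # start uniform
loads = {"s1": 0.9, "s2": 0.5, "s3": 0.2}           # measured utilizations
weights = retune_weights(weights, loads)
print(weights)                 # probability shifts toward the lightly loaded s3
print(place_file_set(weights))
```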

Journal ArticleDOI
TL;DR: This article demonstrates the feasibility of building network media servers that exploit the latest advances in media compression technology to reduce the cost of wide-scale streaming services for stored data.
Abstract: We describe the design and implementation of the Exedra continuous media server, and experimentally evaluate alternative resource management policies using a prototype system that we built. Exedra has been designed to provide scalable and efficient support for variable bit-rate media streams whose compression efficiency leads to reduced storage space and bandwidth requirements in comparison to constant bit-rate streams of equivalent quality. We examine alternative disk striping policies, and quantify the benefits of innovative techniques for storage space allocation, buffer management, and resource reservation, which we developed to achieve both predictability and high-performance in handling disk and network data transfers of variable size. Additionally, we investigate the differences between diverse data replication schemes over disk arrays, and compare methods for disk access time reservation that enable tolerance of disk failures at minimal cost. Overall, we demonstrate the feasibility of building network media servers that exploit the latest advances in media compression technology towards reducing the cost of wide-scale streaming services for stored data.