Open Access Proceedings Article

Proactive error prediction to improve storage system reliability

TLDR
A range of machine learning techniques is explored, and sector errors are shown to be predictable ahead of time with high accuracy, even when little training data, or only training data for a different drive model, is available.
Abstract
This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and they occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g., another drive in the same RAID) and is lost if the error is encountered while the system operates in degraded mode, e.g., during RAID reconstruction. In this paper, we explore a range of different machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust, even when only a small amount of training data, or only training data for a different drive model, is available. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: we show that the mean time to detect errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.
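
The scrubbing use case in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the two scrub rates, the threshold, and the names `ScrubPolicy`, `scrub_rate`, and `full_pass_hours` are hypothetical, and the predicted error probability is assumed to come from some already-trained classifier. The idea shown is only that a drive flagged as likely to develop sector errors is scrubbed faster, so a full pass over the drive completes sooner and a latent error is found earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrubPolicy:
    """Hypothetical policy parameters; none of these values come from the paper."""
    base_rate_mb_s: float = 10.0     # assumed default scrub throughput
    boosted_rate_mb_s: float = 50.0  # assumed throughput for at-risk drives
    threshold: float = 0.5           # assumed cut-off on the predicted probability


def scrub_rate(predicted_error_prob: float,
               policy: ScrubPolicy = ScrubPolicy()) -> float:
    """Pick a scrub rate from a predictor's output probability in [0, 1]."""
    if not 0.0 <= predicted_error_prob <= 1.0:
        raise ValueError("predicted_error_prob must be in [0, 1]")
    if predicted_error_prob >= policy.threshold:
        return policy.boosted_rate_mb_s
    return policy.base_rate_mb_s


def full_pass_hours(capacity_gb: float, rate_mb_s: float) -> float:
    """Hours needed to scrub the whole drive once at the given rate."""
    return capacity_gb * 1024 / rate_mb_s / 3600


if __name__ == "__main__":
    for prob in (0.1, 0.9):
        rate = scrub_rate(prob)
        hours = full_pass_hours(4000, rate)
        print(f"p={prob:.1f}: scrub at {rate} MB/s, full pass in {hours:.1f} h")
```

Under the simplifying assumption that an error appears at a uniformly random point within a scrub cycle, the expected time until the scrubber reaches it is about half the full-pass time, which is why shortening the pass for at-risk drives narrows the window of vulnerability.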

