Proactive error prediction to improve storage system reliability

Open Access, Proceedings Article

- pp 391-402

TLDR

The paper explores a range of machine learning techniques and shows that sector errors can be predicted ahead of time with high accuracy, even when only a small amount of training data, or only training data for a different drive model, is available.

Abstract:

This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g., another drive in the same RAID) and is lost if the error is encountered while the system operates in degraded mode, e.g., during RAID reconstruction. In this paper, we explore a range of different machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust, even when only a small amount of training data, or only training data for a different drive model, is available. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: we show that the mean time to detect errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.
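
The scrubbing use case mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's policy or evaluation: the function names, the 14-day and 2-day scrub intervals, and the assumption that a new sector error appears at a uniformly random time within a scrub pass are all hypothetical and chosen only to show why a faster scrub on flagged drives shortens the time to detection.

```python
def scrub_interval_days(error_predicted: bool,
                        base_days: float = 14.0,
                        accelerated_days: float = 2.0) -> float:
    """Pick the time allowed for one full scrub pass over a drive.

    Hypothetical policy: drives flagged by a sector error predictor are
    scrubbed on a shorter cycle. Both default intervals are made-up values.
    """
    return accelerated_days if error_predicted else base_days


def mean_time_to_detection_days(interval_days: float) -> float:
    """Expected delay until a periodic scrub reaches a new sector error.

    Assumes the error appears at a uniformly random time within a scrub
    pass, so on average half a pass elapses before the sector is read.
    """
    return interval_days / 2.0


for flagged in (False, True):
    interval = scrub_interval_days(flagged)
    print(flagged, interval, mean_time_to_detection_days(interval))
```

Under these assumed numbers, a flagged drive's expected detection delay drops from 7 days to 1 day; the cost is extra scrub I/O on the flagged drives, which is why the speed-up would be applied selectively rather than to every drive.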

Citations

Proceedings Article

Improving Service Availability of Cloud Systems by Predicting Disk Error.

Yong Xu, +11 more

TL;DR: A cost-sensitive, ranking-based machine learning model is developed that learns the characteristics of past faulty disks and ranks disks by their error-proneness in the near future; it is successfully applied to improve the service availability of Microsoft Azure.

Proceedings Article

Disk Failure Prediction in Data Centers via Online Learning

Jiang Xiao, +5 more

TL;DR: A novel disk failure prediction model using Online Random Forests (ORFs) that evolves automatically as data arrives sequentially, making it highly adaptive to changes in the SMART attribute distribution over time and giving it better prediction performance than its offline counterparts.

Proceedings Article

Making Disk Failure Predictions SMARTer

Sidi Lu, +5 more

TL;DR: This work presents analysis and findings from one of the largest disk failure prediction studies covering a total of 380,000 hard drives over a period of two months across 64 sites of a large leading data center operator.

Guest paper: Failure trends in a large disk drive population

Eduardo Pinheiro, +2 more

TL;DR: It is found that temperature and activity levels were much less correlated with drive failures than previously reported, and models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.

Proceedings Article

Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures

Erci Xu, +4 more

TL;DR: This paper studies the reliability of SSD-based storage systems deployed in Alibaba Cloud, covering nearly half a million SSDs and spanning over three years of usage under representative cloud services, and derives a number of major lessons and a set of effective methods to address the issues observed.

References

Proceedings Article

Failure trends in a large disk drive population

Eduardo Pinheiro, +2 more

TL;DR: In this article, the authors present data collected from detailed observations of a large disk drive population in a production Internet services deployment, and analyze the correlation between failures and several parameters generally believed to impact longevity.

Proceedings Article

An analysis of latent sector errors in disk drives

Lakshmi Narayanan Bairavasundaram, +3 more

TL;DR: This is the first study at such a large scale (the sample size is at least an order of magnitude larger than in previously published studies) and the first to focus specifically on latent sector errors and their implications for the design and reliability of storage systems.

Journal Article

Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application

Joseph F. Murray, +2 more

- 01 Dec 2005 -

Journal of Machine Learning Research

TL;DR: A new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) is developed which is specifically designed for the low false-alarm case, and is shown to have promising performance.

Proceedings Article

Flash reliability in production: the expected and the unexpected

Bianca Schroeder, +2 more

TL;DR: A large-scale field study covering many millions of drive days, ten different drive models, and different flash technologies; it finds no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes.

Journal Article

Improved disk-drive failure warnings

G.F. Hughes, +3 more

- 07 Nov 2002 -

IEEE Transactions on Reliability

TL;DR: Improved methods are proposed for disk-drive failure prediction using the SMART internal drive attribute measurements in present drives, and the present warning-algorithm based on maximum error thresholds is replaced by distribution-free statistical hypothesis tests.
