Open Access Proceedings Article
Proactive error prediction to improve storage system reliability
Farzaneh Mahdisoltani, Ioan Stefanovici, Bianca Schroeder
- pp 391-402
TL;DR: A range of different machine learning techniques are explored and it is shown that sector errors can be predicted ahead of time with high accuracy, even when only little training data or only training data for a different drive model is available.
Abstract:
This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g. another drive in the same RAID) and is lost if the error is encountered while the system operates in degraded mode, e.g. during RAID reconstruction. In this paper, we explore a range of different machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust, even when only little training data or only training data for a different drive model is available. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: We show that the mean time to detecting errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.
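The scrubbing use case in the abstract can be illustrated with a minimal sketch. It assumes a predictor that outputs, per drive, a probability of an upcoming sector error, and it speeds up scrubbing for drives above a cutoff. The function names, the threshold, and the rate multipliers are hypothetical choices for illustration and are not taken from the paper.

```python
def scrub_rate(p_error, base_rate=1.0, boosted_rate=4.0, threshold=0.5):
    """Pick a scrub-rate multiplier for one drive.

    p_error is the (assumed) predicted probability of a sector error.
    base_rate, boosted_rate and threshold are illustrative defaults,
    not values reported in the paper.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability in [0, 1]")
    # Scrub faster when an error is predicted, so it is detected sooner.
    return boosted_rate if p_error >= threshold else base_rate


def schedule(predictions, **kwargs):
    """Map each drive id to its scrub-rate multiplier."""
    return {drive: scrub_rate(p, **kwargs) for drive, p in predictions.items()}


if __name__ == "__main__":
    # Hypothetical predictor output for two drives.
    print(schedule({"drive-a": 0.7, "drive-b": 0.2}))
```

A two-level policy like this keeps the extra scrubbing load confined to drives flagged by the predictor. A real deployment would likely tune the threshold against the predictor's false-positive rate.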
Citations
Proceedings Article
Improving Service Availability of Cloud Systems by Predicting Disk Error.
Yong Xu, Kaixin Sui, Randolph Yao, Hongyu Zhang, Qingwei Lin, Yingnong Dang, Peng Li, Keceng Jiang, Wenchi Zhang, Jian-Guang Lou, Murali Chintalapati, Dongmei Zhang
TL;DR: A cost-sensitive ranking-based machine learning model that can learn the characteristics of faulty disks in the past and rank the disks based on their error-proneness in the near future is developed and successfully applied to improve service availability of Microsoft Azure.
Proceedings Article
Disk Failure Prediction in Data Centers via Online Learning
TL;DR: A novel disk failure prediction model using Online Random Forests (ORFs) that evolves automatically as data arrives sequentially and is thus highly adaptive to changes in the SMART distribution over time, giving it better prediction performance than its offline counterparts.
Proceedings Article
Making Disk Failure Predictions SMARTer
TL;DR: This work presents analysis and findings from one of the largest disk failure prediction studies covering a total of 380,000 hard drives over a period of two months across 64 sites of a large leading data center operator.
Guest paper: Failure trends in a large disk drive population
TL;DR: It is found that temperature and activity levels were much less correlated with drive failures than previously reported, and models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.
Proceedings Article
Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures
TL;DR: This paper studies the reliability of SSD-based storage systems deployed in Alibaba Cloud, which cover nearly half a million SSDs and span over three years of usage under representative cloud services, and derives a number of major lessons and a set of effective methods to address the issues observed.
References
Proceedings Article
Failure trends in a large disk drive population
TL;DR: In this article, the authors present data collected from detailed observations of a large disk drive population in a production Internet services deployment, and analyze the correlation between failures and several parameters generally believed to impact longevity.
Proceedings Article
An analysis of latent sector errors in disk drives
TL;DR: This is the first study of such large scale (the sample size is at least an order of magnitude larger than previously published studies) and the first one to focus specifically on latent sector errors and their implications on the design and reliability of storage systems.
Journal Article
Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
TL;DR: A new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) is developed which is specifically designed for the low false-alarm case, and is shown to have promising performance.
Proceedings Article
Flash reliability in production: the expected and the unexpected
TL;DR: A large-scale field study covering many millions of drive days, ten different drive models, and different flash technologies is provided; it finds no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes.
Journal Article
Improved disk-drive failure warnings
TL;DR: Improved methods are proposed for disk-drive failure prediction using the SMART internal drive attribute measurements in present drives, and the present warning-algorithm based on maximum error thresholds is replaced by distribution-free statistical hypothesis tests.