Data Corruption

About: Data Corruption is a research topic. Over its lifetime, 435 publications have appeared within this topic, receiving 6784 citations.


Papers
Proceedings Article
04 May 2015
TL;DR: This paper presents SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults, prevents data corruption from propagating across a distributed system, and scales in three dimensions: memory, number of processing threads, and development effort.
Abstract: In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

10 citations

Proceedings Article
05 Apr 2020
TL;DR: A new on-line error-correcting scheme based on partial and selective checksums is proposed. It can correct errors in the field and protects against errors in a single column with low decoding latency and comparatively small memory and area overhead.
Abstract: Resistive RAM technology, with its in-memory computation and matrix-vector multiplication capabilities, has paved the way for efficient hardware implementations of neural networks. The ability to store the trained weights and perform a direct matrix-vector multiplication with the applied inputs, thus producing the outputs directly, removes much of the memory transfer overhead. But such schemes are prone to various soft errors and hard errors due to immature fabrication processes creating marginal cells, read disturbance errors, etc. Soft errors are of concern in this case since they can potentially cause misclassification of objects, leading to catastrophic consequences for safety-critical applications. Since the locations of soft errors are not known in advance, they can potentially manifest in the field, leading to data corruption. In this paper, a new on-line error-correcting scheme is proposed based on partial and selective checksums which can correct errors in the field. The proposed scheme can correct any number of errors in a single column of a given RRAM matrix. Two different checksum computation schemes are proposed: a majority voting-based scheme and a Hamming code-based scheme. The memory overhead, decoding area, latency, and dynamic power consumption of both proposed schemes are presented. The proposed solutions achieve low decoding latency and comparatively small memory and area overhead while guaranteeing protection against errors in a single column. Lastly, an extension of the proposed scheme to multiple-column errors is also discussed.

10 citations
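
The abstract above does not spell out its majority voting or Hamming code constructions, so the following is only a generic illustration of how checksums can repair a matrix-vector product when every fault sits in one column. It uses a classic plain-plus-weighted checksum pair rather than the paper's scheme. The function names are hypothetical, the weights are assumed to be integers, and the checksums are assumed to be stored fault-free.

```python
def make_checksums(W):
    """Return (plain, weighted) row checksums of an integer matrix W."""
    plain = [sum(row) for row in W]
    weighted = [sum((j + 1) * w for j, w in enumerate(row)) for row in W]
    return plain, weighted


def multiply(x, W):
    """Crossbar-style product: y[j] = sum over i of x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def check_and_correct(x, y, plain, weighted):
    """Return a corrected copy of y, assuming all faults lie in one column."""
    # Net error summed over all outputs, and the same error weighted by
    # (column index + 1). Faults in a single column k change only y[k].
    delta = sum(y) - dot(x, plain)
    wdelta = sum((j + 1) * v for j, v in enumerate(y)) - dot(x, weighted)
    if delta == 0 and wdelta == 0:
        return list(y)  # checksums agree: no error visible for this input
    if delta == 0:
        raise ValueError("error pattern is not confined to one column")
    k, rem = divmod(wdelta, delta)  # wdelta == k * delta for a single column
    if rem != 0 or not 1 <= k <= len(y):
        raise ValueError("error pattern is not confined to one column")
    fixed = list(y)
    fixed[k - 1] -= delta
    return fixed
```

Because faults in one column perturb a single output, the ratio of the weighted to the plain discrepancy reveals the column and the plain discrepancy is the amount to subtract. A fault whose net effect on the output is zero for a particular input goes unnoticed by this sketch.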

Patent
01 Jul 2003
TL;DR: In this patent, parity and checksum data are stored in the RAID data storage system for each stripe that stores data. The parity data is used to determine whether data in the corresponding stripe is corrupt, and the checksum data is used to correct the corruption.
Abstract: The present invention relates to an apparatus or computer-executable method of detecting and repairing corrupt data in a RAID data storage system. In one embodiment, parity and checksum data are stored in the RAID data storage system for each stripe that stores data. The parity data is used to determine whether data in the corresponding stripe is corrupt. If stripe data is determined to be corrupt, the checksum data is used to correct the corruption.

10 citations
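
The detect-then-repair flow in the patent abstract above can be sketched as follows. The abstract does not say how the checksum data drives the correction, so this sketch makes an assumption: per-block CRC32 checksums identify the corrupt block, which is then rebuilt from the XOR parity and the remaining blocks. The stripe layout and all function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Store data together with its parity block and per-block checksums."""
    return {
        "data": list(data_blocks),
        "parity": xor_blocks(data_blocks),
        "checksums": [zlib.crc32(block) for block in data_blocks],
    }


def stripe_is_corrupt(stripe):
    """Parity check: recomputed parity must match the stored parity."""
    return xor_blocks(stripe["data"]) != stripe["parity"]


def repair_stripe(stripe):
    """Repair one corrupt data block in place; return its index or None."""
    if not stripe_is_corrupt(stripe):
        return None
    # Assumption: the stored checksums tell us which block went bad.
    bad = [
        i
        for i, (block, crc) in enumerate(zip(stripe["data"], stripe["checksums"]))
        if zlib.crc32(block) != crc
    ]
    if len(bad) != 1:
        raise ValueError("cannot isolate a single corrupt data block")
    index = bad[0]
    others = [block for j, block in enumerate(stripe["data"]) if j != index]
    stripe["data"][index] = xor_blocks(others + [stripe["parity"]])
    return index
```

In this sketch the parity check is the cheap first test, and the checksums are consulted only when it fails. A stripe whose parity block itself is damaged, or with more than one damaged data block, is reported as unrepairable rather than guessed at.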

Proceedings Article
09 Jan 2014
TL;DR: The performance of a failure predictor when used to forecast failures in a web-serving system subject to successive updates is studied and it is suggested that re-training is indeed necessary.
Abstract: Failure prediction is a promising technique to improve dependability of computer systems, in particular when it is important to foresee incoming failures and take corrective actions to avoid downtime or data corruption. Failure prediction is especially suitable for long-running systems where internal errors accumulate and eventually lead to failures. The problem is that such systems do evolve. The workload and even the system itself change over time, and this may affect the performance of the failure predictor. However, training failure prediction algorithms is a complex and time-consuming task and should be performed only when needed. Thus, it is important to understand whether a system change affects prediction performance, to avoid running the target system with an ineffective predictor and to prevent unnecessary retraining efforts. In this work we study the performance of a failure predictor when used to forecast failures in a web-serving system subject to successive updates. We observe and analyze the variation of performance in terms of ROC-AUC, using fault injection and virtualization to generate the data needed for the assessment. Our results suggest that re-training is indeed necessary.

10 citations
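
The abstract above measures predictor quality as ROC-AUC. As a minimal sketch of that metric (not the authors' tooling), the function below computes it from its rank interpretation: the probability that a randomly chosen failure sample receives a higher score than a randomly chosen non-failure sample, with ties counted as one half. The function name and the label convention (1 = failure, 0 = no failure) are assumptions.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one non-failure sample")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Evaluating the same trained predictor on data collected before and after a system update, and comparing the two values, is one way to see whether the update degraded it. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but not for large datasets.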


Network Information
Related Topics (5)
Network packet: 159.7K papers, 2.2M citations, 82% related
Software: 130.5K papers, 2M citations, 81% related
Wireless sensor network: 142K papers, 2.4M citations, 78% related
Wireless network: 122.5K papers, 2.1M citations, 77% related
Cluster analysis: 146.5K papers, 2.9M citations, 76% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2022  1
2021  21
2020  25
2019  27
2018  27
2017  27