Topic

Data Corruption

About: Data Corruption is a research topic. Over its lifetime, 435 publications have been published within this topic, receiving 6784 citations.

Papers

Proceedings Article

Scalable error isolation for distributed systems

Diogo Behrens¹, Marco Serafini², Sergei Arnautov¹, Flavio Junqueira³, Christof Fetzer¹

Dresden University of Technology¹, Qatar Computing Research Institute², Microsoft³

04 May 2015

TL;DR: The paper presents SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults, prevents data corruption from propagating across a distributed system, and scales in three dimensions: memory, number of processing threads, and development effort.

Abstract: In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

10 citations

Proceedings Article

Selective Checksum based On-line Error Correction for RRAM based Matrix Operations

Abhishek Das¹, Nur A. Touba¹

University of Texas at Austin¹

05 Apr 2020

TL;DR: A new on-line error-correcting scheme based on partial and selective checksums is proposed. It corrects errors in the field and protects against errors in a single column with low decoding latency and comparatively small memory and area overhead.

Abstract: Resistive RAM technology, with its in-memory computation and matrix-vector multiplication capabilities, has paved the way for efficient hardware implementations of neural networks. The ability to store the training weights and perform a direct matrix-vector multiplication with the applied inputs, thus producing the outputs directly, reduces a lot of memory transfer overhead. But such schemes are prone to various soft errors and hard errors due to immature fabrication processes creating marginal cells, read disturbance errors, etc. Soft errors are of concern in this case since they can potentially cause misclassification of objects, leading to catastrophic consequences for safety-critical applications. Since the locations of soft errors are not known in advance, they can potentially manifest in the field, leading to data corruption. In this paper, a new on-line error-correcting scheme is proposed based on partial and selective checksums which can correct errors in the field. The proposed scheme can correct any number of errors in a single column of a given RRAM matrix. Two different checksum computation schemes are proposed: a majority voting-based scheme and a Hamming code-based scheme. The memory overhead and the decoding area, latency, and dynamic power consumption for both proposed schemes are presented. It is seen that the proposed solutions can achieve low decoding latency and comparatively smaller memory and area overhead in order to guarantee protection against errors in a single column. Lastly, a way to extend the proposed scheme to multiple-column errors is also discussed.
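
To make the single-column guarantee concrete, here is a minimal sketch of checksum-based correction on an integer weight matrix. It uses classic full row and column sum checksums, not the partial and selective checksums or the majority-voting and Hamming-code decoders proposed in the paper, whose details the abstract does not give. The function names `make_checksums` and `correct_single_column` are hypothetical, and the stored checksums are assumed to be error-free.

```python
def make_checksums(matrix):
    """Compute reference row sums and column sums for an integer matrix."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    return row_sums, col_sums


def correct_single_column(matrix, row_sums, col_sums):
    """Repair any number of errors confined to one column, in place.

    Returns the index of the repaired column, or None if no mismatch is seen.
    Assumption: the stored checksums themselves are intact.
    """
    # Per-row and per-column deviation from the stored checksums.
    row_diff = [sum(row) - ref for row, ref in zip(matrix, row_sums)]
    col_diff = [sum(col) - ref for col, ref in zip(zip(*matrix), col_sums)]
    bad_cols = [j for j, d in enumerate(col_diff) if d != 0]

    if not any(row_diff) and not bad_cols:
        return None  # nothing detectable
    if len(bad_cols) != 1:
        # Either several columns are affected, or the errors in one column
        # cancel out in its column sum, so the column cannot be located.
        raise ValueError("errors are not locatable to a single column")

    # With one faulty column, each row deviation is exactly the error
    # in that row's entry of the faulty column.
    j = bad_cols[0]
    for row, diff in zip(matrix, row_diff):
        row[j] -= diff
    return j
```

The column checksums locate the faulty column and the row checksums supply the per-row correction, which is why any number of errors in that one column can be undone. The cost is one extra stored value per row and per column, the kind of memory overhead the paper's selective checksums aim to reduce.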

10 citations

Patent

Automated recovery from data corruption of data volumes in RAID storage

Oleg Kiselev, John A. Colgrove

01 Jul 2003

TL;DR: Parity and checksum data are stored in the RAID data storage system for each stripe that stores data; the parity data is used to determine whether data in the corresponding stripe is corrupt, and the checksum data is used to correct the corruption.

Abstract: The present invention relates to an apparatus or computer executable method of detecting and repairing corrupt data in a RAID data storage system. In one embodiment, parity and checksum data are stored in the RAID data storage system for each stripe that stores data. The parity data is used to determine whether data in the corresponding stripe is corrupt. If stripe data is determined to be corrupt, the checksum data is used to correct the corruption.
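
The detect-then-repair flow described above can be sketched as follows. This is one possible reading of the abstract, not the patented method: XOR parity over equal-sized blocks flags a corrupt stripe, per-block CRC32 checksums identify which block is bad, and that block is rebuilt from the parity and the remaining blocks. The helper names (`xor_blocks`, `write_stripe`, `is_corrupt`, `repair`) and the dictionary layout are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(blocks):
    """Store data blocks together with their XOR parity and per-block CRC32."""
    return {
        "data": list(blocks),
        "parity": xor_blocks(blocks),
        "crc": [zlib.crc32(b) for b in blocks],
    }


def is_corrupt(stripe):
    """Parity check: the stripe is suspect if data no longer XORs to the parity."""
    return xor_blocks(stripe["data"]) != stripe["parity"]


def repair(stripe):
    """Rebuild a single corrupt block in place; return its index or None."""
    if not is_corrupt(stripe):
        return None
    # Checksums pinpoint which block disagrees with what was written.
    bad = [i for i, (block, crc) in enumerate(zip(stripe["data"], stripe["crc"]))
           if zlib.crc32(block) != crc]
    if len(bad) != 1:
        raise ValueError("cannot repair: need exactly one block with a bad checksum")
    i = bad[0]
    others = stripe["data"][:i] + stripe["data"][i + 1:]
    rebuilt = xor_blocks(others + [stripe["parity"]])
    # Confirm the reconstruction against the stored checksum before accepting it.
    if zlib.crc32(rebuilt) != stripe["crc"][i]:
        raise ValueError("reconstruction does not match the stored checksum")
    stripe["data"][i] = rebuilt
    return i
```

Parity alone can tell that a stripe is inconsistent but not which block is wrong; the per-block checksum supplies that missing piece, which is why both are stored for each stripe.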

10 citations

Data corruption and information retrieval

Elke Mittendorf

01 Jan 1998

10 citations

Proceedings Article

On the Need for Training Failure Prediction Algorithms in Evolving Software Systems

Ivano Irrera¹, Joao Duraes¹, Marco Vieira¹

University of Coimbra¹

09 Jan 2014

TL;DR: The paper studies the performance of a failure predictor used to forecast failures in a web-serving system subject to successive updates; the results suggest that re-training is indeed necessary.

Abstract: Failure prediction is a promising technique to improve dependability of computer systems, in particular when it is important to foresee incoming failures and take corrective actions to avoid downtime or data corruption. Failure prediction is especially adequate in long-running systems where internal errors accumulate and eventually lead to failures. The problem is that such systems do evolve. The workload and even the system itself change over time, and this may affect the performance of the failure predictor. However, training failure prediction algorithms is a complex and time-consuming task and should be performed only when needed. Thus, it is important to understand whether a system change affects prediction performance, to avoid running the target system with an ineffective predictor and to prevent unnecessary retraining efforts. In this work we study the performance of a failure predictor when used to forecast failures in a web-serving system subject to successive updates. We observe and analyze the variation of performance in terms of ROC-AUC using fault injection and virtualization for the generation of the data needed for the assessment. Our results suggest that re-training is indeed necessary.
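
For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen failure case receives a higher predictor score than a randomly chosen non-failure case, with ties counted as one half. The sketch below computes it directly from that definition; it is not the authors' evaluation code, and the function name `roc_auc` and the label convention (1 = failure, 0 = no failure) are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC from its pairwise-ranking definition.

    labels: iterable of 0/1 (assumed: 1 = failure observed, 0 = no failure)
    scores: predictor outputs, higher meaning "failure more likely"
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Comparing this value before and after a system update, on data collected from each version, is one simple way to see whether a previously trained predictor has degraded: a score near 0.5 means it ranks failures no better than chance.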

10 citations

Metrics

Papers: 435

Citations: 7,411

No. of papers in the topic in previous years
Year	Papers
2022	1
2021	21
2020	25
2019	27
2018	27
2017	27