
Showing papers on "Data Corruption published in 2019"


Proceedings Article
01 Apr 2019
TL;DR: Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs, and it achieves performance comparable to the current state-of-the-art fault-tolerant persistent object library.
Abstract: Non-volatile main memory (NVMM) allows programmers to build complex, persistent, pointer-based data structures that can offer substantial performance gains over conventional approaches to managing persistent state. This programming model removes the file system from the critical path which improves performance, but it also places these data structures out of reach of file system-based fault tolerance mechanisms (e.g., block-based checksums or erasure coding). Without fault-tolerance, using NVMM to hold critical data will be much less attractive. This paper presents Pangolin, a fault-tolerant persistent object library designed for NVMM. Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs. It provides these protections for objects of any size and supports automatic, online detection of data corruption and recovery. The required storage overhead is small (1% for gigabyte-sized pools of NVMM). Pangolin provides stronger protection, requires orders of magnitude less storage overhead, and achieves comparable performance relative to the current state-of-the-art fault-tolerant persistent object library.
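
The checksum-plus-parity idea can be illustrated at toy scale. The sketch below is not Pangolin's implementation: it only shows how per-page CRC32 checksums detect a corrupted page and how a single XOR parity page rebuilds it (ProtectedObject, PAGE, and scrub are invented names, and the page size is arbitrary).

```python
from functools import reduce
import zlib

PAGE = 64  # illustrative page size in bytes, not Pangolin's actual granularity

def xor_pages(pages):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), pages)

class ProtectedObject:
    """Toy object protected by per-page checksums plus one XOR parity page."""
    def __init__(self, data: bytes):
        self.pages = [data[i:i + PAGE].ljust(PAGE, b"\0")
                      for i in range(0, len(data), PAGE)]
        self.checksums = [zlib.crc32(p) for p in self.pages]
        self.parity = xor_pages(self.pages)

    def scrub(self):
        """Detect corrupted pages via checksums; rebuild a single bad page from parity."""
        bad = [i for i, p in enumerate(self.pages)
               if zlib.crc32(p) != self.checksums[i]]
        if len(bad) == 1:
            i = bad[0]
            others = [p for j, p in enumerate(self.pages) if j != i]
            self.pages[i] = xor_pages(others + [self.parity])
        return bad

obj = ProtectedObject(b"persistent object payload " * 20)
obj.pages[2] = b"\xff" * PAGE                    # simulate a media error / stray write
assert obj.scrub() == [2]                        # corruption detected and repaired
assert zlib.crc32(obj.pages[2]) == obj.checksums[2]
```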

47 citations


Journal ArticleDOI
TL;DR: A principal component pursuit (PCP)-based interface is proposed between raw synchrophasor data and the algorithms used for wide-area monitoring applications, providing resilience against malicious data corruption.
Abstract: A principal component pursuit (PCP)-based interface is proposed between raw synchrophasor data and the algorithms used for wide-area monitoring application to provide resilience against malicious data corruption. The PCP method-based preprocessor recovers a low rank matrix from the data matrix despite gross sparse errors originating from cyber-attacks by solving a convex program. The low-rank matrix consists of the basis vectors obtained from the system response and the sparse matrix represents corruption in each position of the data matrix. An augmented Lagrangian multiplier-based algorithm is applied to solve the PCP problem. The low rank matrix obtained after solving PCP represents the reconstructed data and can be used for estimation of poorly damped modes. A recursive oscillation monitoring algorithm is tested to validate the effectiveness of the proposed approach under both ambient and transient conditions.
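
For readers unfamiliar with principal component pursuit, the sketch below (not the authors' code) performs the low-rank plus sparse split using the standard inexact augmented Lagrangian iteration; the synthetic matrix standing in for synchrophasor data is invented for illustration.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(M, max_iter=500, tol=1e-7):
    """Split M into L (low rank) + S (sparse) via the inexact ALM iteration."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    spectral = np.linalg.norm(M, 2)
    mu = 1.25 / spectral
    Y = M / max(spectral, np.abs(M).max() / lam)   # dual variable
    S = np.zeros_like(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        resid = M - L - S
        Y = Y + mu * resid
        mu *= 1.2
        if np.linalg.norm(resid) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic stand-in for a PMU data matrix: a low-rank response plus sparse gross errors.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 200))
S0 = np.zeros_like(L0)
mask = rng.random(L0.shape) < 0.05                 # ~5% of entries hit by bad data
S0[mask] = 10.0 * rng.standard_normal(mask.sum())
L_hat, S_hat = pcp(L0 + S0)
print("relative error in recovered L:", np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))
```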

21 citations


Proceedings ArticleDOI
24 Jun 2019
TL;DR: SANS-SOUCI is able to repair both software (function) and sensor-produced data corruption with very low overhead; it is evaluated using a portable, open source, distributed IoT platform, example applications, and microbenchmarks.
Abstract: Motivated by the growth of Internet of Things (IoT) technologies and the volumes and velocity of data that they can and will produce, we investigate automated data repair for event-driven, IoT applications. IoT devices are heterogeneous in their hardware architectures, software, size, cost, capacity, network capabilities, power requirements, etc. They must execute in a wide range of operating environments where failures and degradations of service due to hardware malfunction, software bugs, network partitions, etc. cannot be immediately remediated. Further, many of these failure modes cause corruption in the data that these devices produce and in the computations "downstream" that depend on this data. To "repair" corrupted data from its origin through its computational dependencies in a distributed IoT setting, we explore SANS-SOUCI--a system for automatically tracking causal data dependencies and re-initiating dependent computations in event-driven IoT deployment frameworks. SANS-SOUCI presupposes an event-driven programming model based on cloud functions, which we extend for portable execution across IoT tiers (device, edge, cloud). We add fast, persistent, append-only storage and versioning for efficient data robustness and durability. SANS-SOUCI records events and their causal dependencies using a distributed event log and repairs applications dynamically, across tiers via replay. We evaluate SANS-SOUCI using a portable, open source, distributed IoT platform, example applications, and microbenchmarks. We find that SANS-SOUCI is able to perform repair for both software (function) and sensor-produced data corruption with very low overhead.
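
The core mechanism, an append-only event log with causal dependencies plus replay, can be sketched as follows; this is not SANS-SOUCI's code, and the EventLog API and event names are invented.

```python
# Toy append-only event log with causal dependencies and replay-based repair.
class EventLog:
    def __init__(self):
        self.events = {}          # id -> {"fn": callable or None, "deps": [ids], "value": any}
        self.order = []           # append-only ordering of event ids

    def record(self, eid, value, fn=None, deps=()):
        self.events[eid] = {"fn": fn, "deps": list(deps), "value": value}
        self.order.append(eid)
        return value

    def _dependents(self, eid):
        """All events that (transitively) consumed eid, in log order."""
        tainted = {eid}
        for e in self.order:
            if any(d in tainted for d in self.events[e]["deps"]):
                tainted.add(e)
        return [e for e in self.order if e in tainted and e != eid]

    def repair(self, eid, corrected_value):
        """Fix a corrupted source event, then re-execute everything downstream."""
        self.events[eid]["value"] = corrected_value
        for e in self._dependents(eid):
            ev = self.events[e]
            inputs = [self.events[d]["value"] for d in ev["deps"]]
            ev["value"] = ev["fn"](*inputs)

log = EventLog()
log.record("temp_reading", 400.0)                           # corrupted sensor sample
log.record("celsius", 400.0 - 273.15, fn=lambda t: t - 273.15, deps=["temp_reading"])
log.record("alarm", True, fn=lambda c: c > 60, deps=["celsius"])
log.repair("temp_reading", 300.0)                           # sensor value corrected
assert abs(log.events["celsius"]["value"] - 26.85) < 1e-9   # downstream values replayed
assert log.events["alarm"]["value"] is False
```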

17 citations


Proceedings ArticleDOI
28 Jul 2019
TL;DR: In this article, the authors explore an approach of assuring data integrity - considering either malicious or accidental corruption - for workflow executions orchestrated by the Pegasus Workflow Management System, and introduce Chaos Jungle, a toolkit providing an environment for validating integrity verification mechanisms by allowing researchers to introduce a variety of integrity errors during data transfers and storage.
Abstract: With the continued rise of scientific computing and the enormous increases in the size of data being processed, scientists must consider whether the processes for transmitting and storing data sufficiently assure the integrity of the scientific data. When integrity is not preserved, computations can fail and result in increased computational cost due to reruns, or worse, results can be corrupted in a manner not apparent to the scientist and produce invalid science results. Technologies such as TCP checksums, encrypted transfers, checksum validation, RAID and erasure coding provide integrity assurances at different levels, but they may not scale to large data sizes and may not cover a workflow from end-to-end, leaving gaps in which data corruption can occur undetected. In this paper we explore an approach of assuring data integrity - considering either malicious or accidental corruption - for workflow executions orchestrated by the Pegasus Workflow Management System. To validate our approach, we introduce Chaos Jungle - a toolkit providing an environment for validating integrity verification mechanisms by allowing researchers to introduce a variety of integrity errors during data transfers and storage. In addition to controlled experiments with Chaos Jungle, we provide analysis of integrity errors that we encountered when running production workflows.
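
The end-to-end gap the authors describe can be illustrated with a checksum-verified transfer plus a byte-flipping fault injector in the spirit of Chaos Jungle; this is a generic sketch, not Pegasus or Chaos Jungle code, and the function names are invented.

```python
import hashlib, random

def checksummed(data: bytes):
    """Producer side: pair the data with its end-to-end checksum."""
    return data, hashlib.sha256(data).hexdigest()

def chaotic_transfer(data: bytes, flip_prob=0.5, rng=random.Random(42)):
    """Fault injector: occasionally flips one byte in flight."""
    if rng.random() < flip_prob and data:
        i = rng.randrange(len(data))
        data = data[:i] + bytes([data[i] ^ 0x01]) + data[i + 1:]
    return data

def verify(data: bytes, expected_digest: str) -> bool:
    """Consumer side: recompute the checksum before using the input."""
    return hashlib.sha256(data).hexdigest() == expected_digest

payload, digest = checksummed(b"workflow intermediate product")
detected = sum(not verify(chaotic_transfer(payload), digest) for _ in range(1000))
print(f"{detected} of 1000 transfers were corrupted and caught by the checksum")
```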

10 citations


Journal ArticleDOI
TL;DR: This paper proposes several mechanisms for ensuring data consistency in drop computing, ranging from a rating system to careful analysis of the data received, and shows that the proposed solution is able to maximize the amount of correct data exchanged in the network.
Abstract: Drop computing is a network paradigm that aims to address the issues of the mobile cloud computing model, which has started to show limitations especially since the advent of the Internet of Things and the increase in the number of connected devices. In drop computing, nodes are able to offload data and computations to the cloud, to edge devices, or to the social-based opportunistic network composed of other nodes located nearby. In this paper, we focus on the lowest layer of drop computing, where mobile nodes offload tasks and data to and from each other through close-range protocols, based on their social connections. In such a scenario, where the data can circulate in the mobile network on multiple paths (and through multiple other devices), consistency issues may appear due to data corruption or malicious intent. Since there is no central entity that can control the way information is spread and its correctness, alternative methods need to be employed. In this paper, we propose several mechanisms for ensuring data consistency in drop computing, ranging from a rating system to careful analysis of the data received. Through thorough experimentation, we show that our proposed solution is able to maximize the amount of correct (i.e., uncorrupted) data exchanged in the network, with percentages as high as 100%.
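
One of the mechanisms mentioned, a rating system, can be sketched as trust-weighted voting over the copies of an item received along different opportunistic paths; the sketch below is illustrative and not the paper's algorithm (names and constants are invented).

```python
from collections import defaultdict

class RatingStore:
    """Toy rating-based consistency check: accept the version of a data item with
    the highest total rating from the peers that delivered it, then adjust ratings."""
    def __init__(self):
        self.rating = defaultdict(lambda: 1.0)   # peer id -> trust rating

    def resolve(self, copies):
        """copies: list of (peer_id, value). Return the trust-weighted winner."""
        score = defaultdict(float)
        for peer, value in copies:
            score[value] += self.rating[peer]
        winner = max(score, key=score.get)
        for peer, value in copies:               # reward or penalize the raters
            self.rating[peer] *= 1.1 if value == winner else 0.5
        return winner

store = RatingStore()
# The same item arrives over three paths; one relay corrupted it.
assert store.resolve([("A", "42"), ("B", "42"), ("mallory", "999")]) == "42"
assert store.rating["mallory"] < store.rating["A"]
```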

9 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate the effectiveness of the proposed stochastic diffusion search (SDS) algorithm for data replication and recovery, and show that it minimizes the data replication cost.
Abstract: Cloud computing provides scalable computing and storage resources on which increasingly data-intensive applications are developed. Owing to security threats in the cloud, several mechanisms have been proposed that allow users to audit the integrity of data against the data owner's public key before making use of the cloud data. Replicating data in cloud servers across multiple data centers offers better availability, scalability, and durability. In previous mechanisms, the correct choice of public key rests on the security of the public key infrastructure (PKI); although traditional PKI has been widely used in the construction of public key cryptography, it still faces many security risks, especially in managing certificates. Different applications also have different quality of service (QoS) needs. To support these QoS requirements continuously in the presence of data corruption, this work proposes an efficient data replication integrity scheme based on a stochastic diffusion search (SDS) algorithm. SDS is a multi-agent global optimisation technique, rooted in the behaviour of ants, that combines partial evaluation of an objective function with direct communication among agents. The proposed SDS algorithm minimizes the data replication cost. Experimental results demonstrate the effectiveness of the proposed algorithm for data replication and recovery. Compared with the cost-effective dynamic data replication scheme of Li et al., the average recovery time is lower by 18.18% for 250 requested nodes, 14.28% for 500 requested nodes, 11.11% for 750 requested nodes, and 8.69% for 1000 requested nodes.
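
Stochastic diffusion search itself is easy to sketch: each agent holds a candidate hypothesis (here, a data center for a replica), partially evaluates it against one randomly chosen cost criterion, and, if inactive, copies the hypothesis of a randomly picked active agent. The toy below runs SDS on a made-up replication-cost table and is not the chapter's algorithm.

```python
import random

rng = random.Random(7)

# Hypothetical per-data-center replication costs (bandwidth, storage, latency).
COSTS = {
    "dc1": (4, 2, 9), "dc2": (1, 1, 2), "dc3": (6, 5, 7), "dc4": (2, 3, 3),
}

def partial_test(dc):
    """Partial evaluation: check only one randomly chosen cost component."""
    component = rng.randrange(3)
    threshold = 2                      # illustrative acceptability threshold
    return COSTS[dc][component] <= threshold

def sds(n_agents=50, iterations=100):
    hypotheses = [rng.choice(list(COSTS)) for _ in range(n_agents)]
    for _ in range(iterations):
        active = [partial_test(h) for h in hypotheses]          # test phase
        for i in range(n_agents):                               # diffusion phase
            if not active[i]:
                j = rng.randrange(n_agents)
                hypotheses[i] = (hypotheses[j] if active[j]
                                 else rng.choice(list(COSTS)))
    # The largest cluster of agents points at the best placement found.
    return max(set(hypotheses), key=hypotheses.count)

print(sds())   # typically converges on "dc2", the only candidate passing every partial test
```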

9 citations


Journal ArticleDOI
TL;DR: GFCache caches corrupted data for the dual purposes of sharing failure information and eliminating unnecessary data recovery processes; it achieves a good hit ratio with its caching algorithm and significantly boosts system performance by reducing unnecessary recoveries of vulnerable data held in the cache.
Abstract: In the big data era, data unavailability, either temporary or permanent, has become a normal occurrence on a daily basis. Unlike permanent data failures, which are fixed through a background job, temporarily unavailable data is recovered on-the-fly to serve the ongoing read request. However, the newly revived data is discarded after serving the request, under the assumption that data experiencing temporary failures could come back alive later. Such disposal of failure data prevents the sharing of failure information among clients and leads to many unnecessary data recovery processes (e.g., caused by either recurring unavailability of a data item or multiple data failures in one stripe), thereby straining system performance. To this end, this paper proposes GFCache, which caches corrupted data for the dual purposes of sharing failure information and eliminating unnecessary data recovery processes. GFCache employs a greedy, opportunistic caching approach that promotes not only the failed data, but also sequential failure-likely data in the same stripe. Additionally, GFCache includes FARC (Failure ARC), a cache replacement algorithm that balances failure recency and frequency to accommodate data corruption with a good hit ratio. The data stored in GFCache supports fast reads on the normal data access path. Furthermore, since GFCache is a generic failure cache, it can be used anywhere erasure coding is deployed, with any specific coding schemes and parameters. Evaluations show that GFCache achieves a good hit ratio with our sophisticated caching algorithm and manages to significantly boost system performance by reducing unnecessary recoveries of vulnerable data in the cache.
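
FARC itself is not reproduced here, but the recency/frequency balance it aims for can be illustrated with a toy failure cache whose eviction score mixes both signals; this simplified cache (invented names and weights) is not FARC.

```python
import time

class FailureCache:
    """Toy cache for recovered-but-vulnerable data, keyed by (stripe, block).

    Eviction mixes recency (time since last failure) and frequency (failure count),
    loosely in the spirit of recency/frequency-balanced policies such as ARC.
    """
    def __init__(self, capacity=4, frequency_weight=10.0):
        self.capacity = capacity
        self.w = frequency_weight
        self.entries = {}                 # key -> [data, failure_count, last_seen]

    def on_failure(self, key, recovered_data):
        """Called after an on-the-fly recovery; keeps the revived data around."""
        if key in self.entries:
            e = self.entries[key]
            e[0], e[1], e[2] = recovered_data, e[1] + 1, time.monotonic()
        else:
            if len(self.entries) >= self.capacity:
                now = time.monotonic()
                def score(k):
                    _, count, last_seen = self.entries[k]
                    return self.w * count - (now - last_seen)   # frequency vs. staleness
                del self.entries[min(self.entries, key=score)]  # evict the weakest entry
            self.entries[key] = [recovered_data, 1, time.monotonic()]

    def read(self, key):
        """Serve a read without re-running the recovery if the data is cached."""
        e = self.entries.get(key)
        return e[0] if e else None

cache = FailureCache(capacity=2)
cache.on_failure(("stripe7", 3), b"recovered block")
assert cache.read(("stripe7", 3)) == b"recovered block"   # no second recovery needed
```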

8 citations


Journal ArticleDOI
TL;DR: A new PoS scheme is constructed that is publicly verifiable and only requires simple cryptographic computations, and it is proved that the scheme is secure under the discrete logarithm assumption, in the random oracle model.
Abstract: With the rapid development of cloud computing platforms, cloud storage services are becoming widespread in recent years. Based on these services, clients are able to store data on remote cloud servers and thereby saving their local storage. This greatly reduces the burden of clients, while it also brings certain security risks to the outsourced data. Among the risks, a critical one is data corruption, for example cloud servers may delete some rarely used outsourced data for cost saving. To prevent this risk, proof of storage (PoS) schemes are invented, which can validate the integrity of cloud data without downloading the entire data. The existing PoS schemes, however, mostly either involve complex operations e.g. bilinear pairings, or don't support public verifiability. To fill this gap, in this paper we construct a new PoS scheme that is publicly verifiable and only requires simple cryptographic computations. We prove that our scheme is secure under the discrete logarithm assumption, in the random oracle model. Furthermore, we also show how to extend the scheme to support data updates. Finally, we implement our scheme. The simulation results demonstrate that our scheme is more computationally-efficient than the publicly-verifiable PoS schemes of Shacham and Waters (Journal of Cryptology 2013).
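
The paper's discrete-logarithm construction is not reproduced here, but the general shape of a publicly verifiable storage spot check can be sketched with a Merkle tree: the verifier keeps only the root, challenges a random block index, and checks the returned block against its authentication path. This is a generic sketch, not the authors' scheme.

```python
import hashlib, os, random

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build_tree(blocks):
    """Merkle tree over block hashes; tree[-1][0] is the public root."""
    level, tree = [h(b) for b in blocks], []
    tree.append(level)
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]          # duplicate last node on odd levels
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def prove(tree, idx):
    """Authentication path for block idx: (sibling hash, am-I-the-right-child)."""
    path = []
    for level in tree[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[idx ^ 1], idx % 2 == 1))
        idx //= 2
    return path

def verify(root, block, path):
    node = h(block)
    for sibling, is_right in path:
        node = h(sibling, node) if is_right else h(node, sibling)
    return node == root

blocks = [os.urandom(256) for _ in range(16)]    # the outsourced file, in blocks
tree = build_tree(blocks)
root = tree[-1][0]                               # all a public verifier needs to keep
i = random.randrange(len(blocks))                # random spot-check challenge
assert verify(root, blocks[i], prove(tree, i))             # honest server passes
assert not verify(root, b"corrupted block", prove(tree, i))  # corruption is caught
```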

7 citations


Posted Content
TL;DR: Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs.
Abstract: Non-volatile main memory (NVMM) allows programmers to build complex, persistent, pointer-based data structures that can offer substantial performance gains over conventional approaches to managing persistent state. This programming model removes the file system from the critical path which improves performance, but it also places these data structures out of reach of file system-based fault tolerance mechanisms (e.g., block-based checksums or erasure coding). Without fault-tolerance, using NVMM to hold critical data will be much less attractive. This paper presents Pangolin, a fault-tolerant persistent object library designed for NVMM. Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs. It provides these protections for objects of any size and supports automatic, online detection of data corruption and recovery. The required storage overhead is small (1% for gigabyte-sized pools of NVMM). Pangolin provides stronger protection, requires orders of magnitude less storage overhead, and achieves comparable performance relative to the current state-of-the-art fault-tolerant persistent object library.

6 citations


Proceedings ArticleDOI
30 Mar 2019
TL;DR: This paper presents the design and working model of integrating blockchain into cloud storage: blockchain technology is used for storing data in the cloud, and the third-party auditor is removed by enabling miners to take control over the blocks in the blockchain.
Abstract: Data has become an invaluable asset in recent times. Although many technologies exist for storing and processing data, cloud computing is regarded as the best across most parameters. Cloud computing allows its users to store and process huge amounts of data on remote servers, thus reducing the burden on the user side. Since users must hand over control of their data to an unknown authority, this model gives rise to a new challenge of data protection in terms of confidentiality, integrity, and availability. There has been considerable research on data integrity protection using cryptographic tools and data replication strategies. Yet there remains a need to trust a third-party auditor to perform data integrity verifications, which creates the threat of being cheated if the cloud authority colludes with the third-party verifier. Moreover, the verification strategies proposed and executed so far fail to identify data corruption at the time it occurs. To overcome this problem, we propose a data integrity verification scheme that uses blockchain technology for storing data in the cloud. Blockchain properties such as immutability and decentralization are leveraged to identify tampering of data immediately. The third-party auditor is thereby removed by enabling miners to take control over the blocks in the blockchain. This paper presents the design and working model of this integration of blockchain into cloud storage.
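
The immutability argument rests on hash chaining: each block commits to the hash of its predecessor, so tampering with a stored record breaks every later link and any party can detect it by re-walking the chain. A minimal sketch (not the paper's design; class and field names are invented) follows.

```python
import hashlib, json, time

def block_hash(block: dict) -> str:
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

class StorageChain:
    """Toy blockchain whose blocks record digests of data stored in the cloud."""
    def __init__(self):
        genesis = {"index": 0, "prev": "0" * 64, "data_digest": "", "ts": 0}
        genesis["hash"] = block_hash(genesis)
        self.blocks = [genesis]

    def append(self, data: bytes):
        block = {
            "index": len(self.blocks),
            "prev": self.blocks[-1]["hash"],
            "data_digest": hashlib.sha256(data).hexdigest(),
            "ts": time.time(),
        }
        block["hash"] = block_hash(block)
        self.blocks.append(block)

    def audit(self):
        """Anyone (no third-party auditor needed) can re-walk the chain."""
        for prev, cur in zip(self.blocks, self.blocks[1:]):
            if cur["prev"] != prev["hash"] or cur["hash"] != block_hash(cur):
                return cur["index"]            # first tampered block
        return None

chain = StorageChain()
chain.append(b"user file v1")
chain.append(b"user file v2")
chain.blocks[1]["data_digest"] = "forged"      # tampering is detected immediately
assert chain.audit() == 1
```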

5 citations


Journal ArticleDOI
TL;DR: This paper presents some schemes to increase the instruction TLB resilience to soft errors without requiring any extra storage space, by taking advantage of the spatial locality principle that takes place when executing a program.
Abstract: A translation lookaside buffer (TLB) is a type of cache used to speed up the virtual to physical memory translation process. Instruction TLBs store virtual page numbers and their related physical page numbers for the last accessed pages of instruction memory. TLBs like other memories suffer soft errors that can corrupt their contents. A false positive due to an error produced in the virtual page number stored in the TLB may lead to a wrong translation and, consequently, the execution of a wrong instruction that can lead to a program hard fault or to data corruption. Parity or error correction codes have been proposed to provide protection for the TLB, but they require additional storage space. This paper presents some schemes to increase the instruction TLB resilience to this type of errors without requiring any extra storage space, by taking advantage of the spatial locality principle that takes place when executing a program.

Posted Content
TL;DR: Simulation-based evaluation with seven data-intensive applications shows Tvarak's performance and energy efficiency; for example, it reduces Redis set-only performance by only 3%, compared to a 50% reduction for a state-of-the-art software-only approach.
Abstract: Tvarak efficiently implements system-level redundancy for direct-access (DAX) NVM storage. Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables detection of and recovery from data corruption due to device firmware bugs (e.g., reading data from the wrong physical location). Direct access to NVM penalizes software-only implementations of system-level redundancy, forcing a choice between lack of data protection or significant performance penalties. Offloading the update and verification of system-level redundancy to Tvarak, a hardware controller co-located with the last-level cache, enables efficient protection of data from such bugs in memory controller and NVM DIMM firmware. Simulation-based evaluation with seven data-intensive applications shows Tvarak's performance and energy efficiency. For example, Tvarak reduces Redis set-only performance by only 3%, compared to 50% reduction for a state-of-the-art software-only approach.


Patent
21 Mar 2019
TL;DR: In this article, data targeted for storage into a drive array is divided into codewords with data and parity symbols, and the symbols of the codeword are randomly distributed across a stripe of the drive array.
Abstract: Data targeted for storage into a drive array is divided into codewords with data and parity symbols. The symbols of the codewords are randomly distributed across a stripe of the drive array. One or more drives affected by data corruption are found based on a probability that a subset of inconsistent codewords intersects the one or more drives.
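
A small simulation conveys the idea: codeword symbols are scattered pseudo-randomly across the drives of a stripe, and the drive that intersects an improbably large share of the inconsistent codewords is the likely culprit. This is an illustrative reading of the claim, not the patented method; all parameters are invented.

```python
import random
from collections import Counter

rng = random.Random(1)
N_DRIVES, N_CODEWORDS, SYMBOLS_PER_CW = 12, 2000, 6
BAD_DRIVE = 5                                   # the drive silently corrupting data

# Randomly place each codeword's symbols on distinct drives of the stripe.
placement = [rng.sample(range(N_DRIVES), SYMBOLS_PER_CW) for _ in range(N_CODEWORDS)]

# A codeword is detected as inconsistent if one of its symbols sits on the bad
# drive and that symbol happens to be corrupted (here: 60% of the time).
inconsistent = [cw for cw, drives in enumerate(placement)
                if BAD_DRIVE in drives and rng.random() < 0.6]

# Count, per drive, how many inconsistent codewords intersect it.
hits = Counter(d for cw in inconsistent for d in placement[cw])
expected = len(inconsistent) * SYMBOLS_PER_CW / N_DRIVES   # chance level per drive
suspects = [d for d, c in hits.items() if c > 1.5 * expected]
print(suspects)                                            # -> [5]
```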

Book ChapterDOI
03 Jan 2019
TL;DR: The main objective of this chapter is to develop an auditing mechanism based on a homomorphic token key that can easily locate errors as well as their root cause.
Abstract: Cloud computing is a major advance in the computing field and the long-envisioned vision of computing as a utility, allowing users to enjoy on-demand, high-quality applications. Cloud security is a critical factor that plays an imperative role in maintaining secure and reliable data services. In large-scale cloud computing, a large pool of easily usable and accessible virtualized resources is used as hardware development platforms and/or sources. These resources can be dynamically reconfigured to adjust to a variable load, allowing for optimum resource utilization. The pool of resources is typically exploited through a pay-per-use model in which guarantees are offered by the infrastructure provider by means of customized service-level agreements (SLAs). A hierarchical structure has proven effective for solving data storage and data integrity issues by protecting data throughout its full life span. Cloud computing is related to numerous technologies, and it is the convergence of these diverse technologies that has come to be called cloud computing. Cloud storage offers attractive cost and high-quality applications for large volumes of data. Security offerings and capabilities continue to increase and vary between cloud providers. The cloud offers greater convenience to users because they need not concern themselves with direct hardware management. For security, a secret key is generated; a key consideration is to efficiently detect any unauthorized data corruption and modification arising from Byzantine failures. Cloud service providers (CSPs) are separate administrative entities, so outsourcing data effectively relinquishes the user's ultimate control over the fate of their data. As a result, the correctness of data in the cloud is put at high risk. Across distributed cloud servers, all of these inconsistencies must be detected and the data guaranteed. The main objective of this chapter is to develop an auditing mechanism based on a homomorphic token key. Using this secret token, errors and their root cause can be located easily, and corrupted files are recovered and errors located by means of an error recovery algorithm.
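
The flavor of audit the chapter builds on can be sketched as a keyed linear-combination token: the owner precomputes a pseudorandom linear combination of the data blocks, and a server later proves it still holds the blocks by returning the same combination, so any corruption is exposed and attributed to the responding server. This is a generic sketch under assumed parameters, not the chapter's exact homomorphic token construction.

```python
import hashlib

P = (1 << 127) - 1                      # a Mersenne prime used as the working field

def prf(key: bytes, i: int) -> int:
    """Keyed pseudorandom coefficient for block i."""
    return int.from_bytes(hashlib.sha256(key + i.to_bytes(8, "big")).digest(), "big") % P

def block_to_int(block: bytes) -> int:
    return int.from_bytes(hashlib.sha256(block).digest(), "big") % P

def make_token(key: bytes, blocks) -> int:
    """Owner-side precomputation before outsourcing the blocks."""
    return sum(prf(key, i) * block_to_int(b) for i, b in enumerate(blocks)) % P

def server_response(key: bytes, stored_blocks) -> int:
    # In a real protocol the owner keeps the key and sends per-round coefficients;
    # the key is passed here only to keep the sketch short.
    return sum(prf(key, i) * block_to_int(b) for i, b in enumerate(stored_blocks)) % P

blocks = [b"block-%d" % i for i in range(8)]
key = b"audit-key-001"                  # per-audit secret token key (illustrative)
token = make_token(key, blocks)

honest = list(blocks)
cheater = list(blocks); cheater[3] = b"silently corrupted"
assert server_response(key, honest) == token        # audit passes
assert server_response(key, cheater) != token       # corruption detected
```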


Proceedings ArticleDOI
01 Jun 2019
TL;DR: It is demonstrated that IVP belongs to the recently proposed class of cryptographic constructions called ‘Random Oracles according to Observer functions’ (RO2), supports implicit data integrity, and is secure in input-perturbing and oracle-replacing adversary models.
Abstract: We present a cryptographic construction called IVP and study its security properties. IVP is a three-level confusion-diffusion network that supports confidentiality and data integrity without requiring any message expansion of the content, such as that needed for the computation of a MAC. We demonstrate that IVP is in the recently proposed class of cryptographic constructions called ‘Random Oracles according to Observer functions’ (RO2). These constructions support a new notion of data integrity called ‘implicit’ data integrity, which is based on the fact that user data usually demonstrate some patterns. If some ciphertext becomes corrupted, then the resulting plaintext no longer demonstrates such patterns. Thus, defense against data corruption attacks becomes possible by hardening the computation of ciphertext values whose plaintext demonstrates patterns. The encryption key is considered unknown. We show that IVP supports implicit data integrity and is secure in input-perturbing and oracle-replacing adversary models. The security of IVP is associated with a pattern which is frequently encountered among client and server data: the pattern of encountering 4 or more 16-bit words being equal to each other in a set of 32 words. The cryptographic strength of IVP is 30.215 bits, which is sufficient for defending against on-line data corruption and content replay attacks. Computationally, IVP is much lighter than other authenticated encryption approaches, requiring only two additional rounds of AES beyond the standard AES encryption rounds in the critical path. These correspond to minimal computation overhead.
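
The pattern test behind implicit integrity is easy to make concrete: in a 64-byte block viewed as 32 16-bit words, typical user data often contains 4 or more equal words, whereas the pseudorandom output obtained by decrypting a corrupted ciphertext almost never does. The snippet below implements only that pattern detector, not the IVP construction itself.

```python
import os
from collections import Counter

def has_implicit_pattern(block64: bytes, threshold: int = 4) -> bool:
    """True if at least `threshold` of the 32 16-bit words in a 64-byte block are equal."""
    words = [block64[i:i + 2] for i in range(0, 64, 2)]
    return max(Counter(words).values()) >= threshold

typical = b"GET /index.html HTTP/1.1".ljust(64, b"\x00")   # zero padding repeats a word
garbled = os.urandom(64)                                   # models decrypting a corrupted ciphertext

assert has_implicit_pattern(typical)        # pattern present: accept as intact
assert not has_implicit_pattern(garbled)    # pattern absent: flag as corrupted

# For uniformly random blocks the pattern is very rare, which is what makes the
# check usable as an integrity signal (the paper quantifies the related strength
# as roughly 30 bits).
hits = sum(has_implicit_pattern(os.urandom(64)) for _ in range(100_000))
print(hits, "random blocks out of 100000 happened to show the pattern")
```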

Proceedings ArticleDOI
Michael E. Kounavis
01 Jun 2019
TL;DR: It is demonstrated that the class of cryptographic constructions known as ‘random oracles according to observer functions’, which has been proposed for mitigating data corruption attacks, is actually simultaneously secure under two different adversary models: an input-perturbing adversary performing content corruption attacks, and an oracle-replacing adversary performing content replay attacks.
Abstract: We study the security of the recently proposed implicit integrity methodology. Implicit integrity is a novel methodology that supports corruption detection without producing, storing or verifying mathematical summaries of the content such as MACs or ICVs, as typically done today. The main idea behind implicit integrity is that, whereas typical user data demonstrate patterns such as repeated bytes or words, decrypted data resulting from corrupted ciphertexts no longer demonstrate such patterns. Thus, by checking the entropy of decrypted ciphertexts, corruption can be possibly detected. Past contributions to the implicit integrity methodology have focused on observed patterns on client and server data that motivate the methodology, entropy definitions for arbitrarily small messages, and constructions that mitigate data corruption attacks. In this paper, we extend the known analytical results concerning implicit integrity addressing content replay attacks as well. We demonstrate that the class of cryptographic constructions known as ‘random oracles according to observer functions’, which has been proposed for mitigating data corruption attacks, is actually simultaneously secure under two different adversary models: an input perturbing adversary performing content corruption attacks, and an oracle replacing adversary performing content replay attacks.

Book ChapterDOI
01 Jan 2019
TL;DR: Data quality is a term that can be broadly defined and used; here it is broken down into six categories (loss, corruption, inaccurate representation, lack of precision, incorrect measurement identification, and excessive latency), which are analyzed for the principal categories of applications: off-line (analysis), near real-time (operations), and real-time (controls).
Abstract: Data quality is a term that can be broadly defined and used. Here, it is broken down into six categories: loss, corruption, inaccurate representation, lack of precision, incorrect measurement identification, and excessive latency. These are discussed along with their causes and impacts. The overall problem of assuring high data quality starts with the measurement system itself. High quality can be built into the measurement system starting with planning and carried through the installation. A good maintenance program coupled with an error detection system can keep the quality high. Some data quality problems affect all applications, like lost data, incorrect values, and misidentified quantities. Other problems, like excessive latency, may have an impact on operational uses, but not off-line analysis. High resolution is important for small-signal analysis, but not displays. These types of impairments and their impacts are analyzed for the principal categories of applications which are off-line (analysis), near real-time (operations), and real-time (controls). The use of an LSE is discussed for both measurement assurance and extension of the measurement set.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper presents a computational model, referred to as mimic replication, that provides resilience against SDC errors through dynamic re-execution of processes that are vulnerable to having their data tainted by a detected latent error, and provides an analytical model that allows resource and energy consumption to be traded off against resilience.
Abstract: The largest computing systems routinely run into silent data corruption (SDC) as part of their normal operation. The number of SDCs will increase drastically as computing systems approach the exascale mark, forcing a need to reconsider the resilience approach taken to counteract the effects of unmitigated data corruption errors. Yet any resilience method must be sensitive to both resource and energy requirements. In this paper we explore the propagation of data corruption errors caused in stencil computation, an iterative kernel with a structured communication pattern that is found in a wide variety of scientific and engineering problems. We present a computational model, referred to as mimic replication, that provides resilience against SDC errors through dynamic re-execution of processes that are vulnerable to having their data tainted by a detected latent error. We then provide an analytical model that allows resource and energy consumption to be traded off against resilience.
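
The set of processes that must be re-executed follows from how corruption spreads through halo exchanges: an error detected d iterations after it occurred can have tainted only ranks within d halo hops of the faulty rank. The helper below illustrates that vulnerability set for a 1-D decomposition; it is not the paper's model and assumes each rank's subdomain is at least one stencil radius wide.

```python
def vulnerable_ranks(error_rank: int, detection_latency: int, n_ranks: int) -> set:
    """Ranks possibly tainted by an SDC detected `detection_latency` iterations
    after it occurred, for a 1-D stencil exchanging one halo per iteration."""
    return {r for r in range(error_rank - detection_latency,
                             error_rank + detection_latency + 1)
            if 0 <= r < n_ranks}

# An error on rank 17 of 64 detected 3 iterations late: only 7 ranks need to be
# re-executed from the last consistent state, not the whole job.
assert vulnerable_ranks(17, 3, 64) == {14, 15, 16, 17, 18, 19, 20}
```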

Patent
15 Jan 2019
TL;DR: In this article, the authors propose an apparatus for implementing a method of creating a data chain which can be cryptographically proven to contain valid data; the apparatus provides the technical effect of making a data processing system robust against data corruption, data loss, failures in data communication synchronization, and similar practical operational issues.
Abstract: Disclosed is an apparatus for implementing a method of creating a data chain, which can be cryptographically proven to contain valid data. The method comprises steps of: (a) creating a data chain with no elements; (b) validating the data chain for nodes before accepting the data chain; (c) verifying the size of close group to add the data chain; (d) adding a data block to the data chain; (e) removing old copies of entries from the data chain only if a chained consensus would not be broken, else maintaining the entry and marking it as deleted; (f) validating a majority of pre-existing nodes; and (g) validating a signature of the data chain via the data chain of signed elements. The apparatus is operable to support a data communication system, wherein the apparatus is operable to ensure that a given data structure has cryptographically valid data while relocating the data from a switching off node to a live node during a churn event. Such an apparatus is capable of providing a technical effect of making a data processing system robust against data corruption, data loss, failure in data communication synchronization and similar practical operational issues.

Patent
10 Oct 2019
TL;DR: In this paper, a system and method for identifying data corruption in a data transfer over an error-proof communication link, wherein additional structure checksums are formed to secure a data structure during transfer of the data structure, where representatives are associated with the data types, and the structure checksum is formed via the representatives to provide identification of data corruption.
Abstract: A system and method for identifying data corruption in a data transfer over an error-proof communication link between a first automation component and a second automation component in industrial control engineering. Additional structure checksums are formed to secure a data structure during its transfer: representatives are associated with the data types, and the structure checksum is formed over the representatives to provide identification of data corruption.
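
One way to read the claim: each data type in the transferred structure maps to a fixed representative code, and a checksum over the sequence of representatives travels with the data, so a receiver whose structure layout disagrees (or whose type information was corrupted in transit) detects the mismatch. The sketch below is an illustrative reading with invented codes, not the patented mechanism.

```python
import zlib

# Fixed representative codes per data type (illustrative values).
REPRESENTATIVES = {"BOOL": b"\x01", "INT16": b"\x02", "INT32": b"\x03",
                   "REAL": b"\x04", "STRING": b"\x05"}

def structure_checksum(field_types):
    """CRC over the representatives of the structure's field types."""
    return zlib.crc32(b"".join(REPRESENTATIVES[t] for t in field_types))

# Sender side: the first automation component transmits payload plus structure checksum.
sender_layout = ["INT32", "REAL", "BOOL", "STRING"]
frame = {"payload": b"...", "structure_crc": structure_checksum(sender_layout)}

# Receiver side: the second automation component checks its own layout first.
receiver_layout = ["INT32", "INT16", "BOOL", "STRING"]        # mismatched field!
if structure_checksum(receiver_layout) != frame["structure_crc"]:
    print("structure checksum mismatch: reject frame, report data corruption")
```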

Patent
15 Jan 2019
TL;DR: In this article, a data processing method and a storage device which is used for improving the security of data stored on the storage device is described. But the authors do not specify the data processing instructions themselves.
Abstract: The embodiment of the invention discloses a data processing method and a storage device, which are used for improving the security of data stored on the storage device. The data processing method of the embodiment of the invention comprises the following steps: a storage device acquires a data processing instruction sent by a host computer; the data processing instruction is used for operating the data stored on the storage device; the storage device judges whether the data processing instruction conforms to a preset data destruction rule; if the data processing instruction complies with the preset data destruction rule, the storage device executes a preset processing policy to protect the data stored on the storage device. In this way, by identifying and judging the data processing instruction from the host computer on the storage device, if it is recognized that the data processing instruction conforms to a preset data destruction rule, the storage device executes a preset processing strategy to protect the data stored on the storage device, thereby improving the safety of the data stored on the storage device.

Posted Content
TL;DR: The proposed RObust regression algorithm via Online Feature Selection (RoOFS) is superior to existing methods in recovering both the selected features and the regression coefficients, with very competitive efficiency.
Abstract: The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem: learning reliable regression coefficients when features are not accessible entirely at one time. Until now, several important challenges could not be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. We also prove that our algorithm has a restricted error bound compared to the optimal solution. Extensive empirical experiments on both synthetic and real-world datasets demonstrate that our new method outperforms existing methods in the recovery of both the selected features and the regression coefficients, with very competitive efficiency.
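
RoOFS itself is not reproduced here, but the alternating idea the abstract describes, update the coefficients and then re-estimate the uncorrupted set, can be illustrated with a trimmed least-squares loop that refits on the samples with the smallest residuals. This sketch covers only the batch, fully observed case, not online feature selection.

```python
import numpy as np

def trimmed_least_squares(X, y, n_clean, n_iter=20):
    """Alternate between fitting on an estimated clean set and re-picking the
    n_clean samples with the smallest residuals (batch illustration only)."""
    clean = np.arange(n_clean)                        # arbitrary initial guess
    for _ in range(n_iter):
        w, *_ = np.linalg.lstsq(X[clean], y[clean], rcond=None)
        residuals = np.abs(y - X @ w)
        clean = np.argsort(residuals)[:n_clean]       # re-estimate the uncorrupted set
    return w, clean

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)
y[:50] += 20.0                                        # 10% adversarially corrupted responses

w_hat, _ = trimmed_least_squares(X, y, n_clean=int(0.8 * n))
print(np.linalg.norm(w_hat - w_true))                 # close to zero despite the corruption
```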

Patent
22 Oct 2019
TL;DR: In this paper, an object-level metadata structure corresponding to a stored object, wherein the stored object comprises a plurality of blocks, is disclosed, and for a block included in the plurality, two or more locations in the metadata structure at which to store a value computed based at least in part on data comprising the block.
Abstract: Detecting and pinpointing data corruption is disclosed, including: storing an object-level metadata structure corresponding to a stored object, wherein the stored object comprises a plurality of blocks; and determining for a block included in the plurality of blocks, based at least in part on a piece of identifying information of the block, two or more locations in the object-level metadata structure at which to store a value computed based at least in part on data comprising the block.
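
One plausible reading, sketched below with invented names and layout (not the patented method): the block's checksum is written to two metadata slots derived from the block's identifying information, so a later mismatch can be attributed either to the data or to a damaged metadata copy, which helps pinpoint where the corruption occurred.

```python
import hashlib, zlib

SLOTS = 64   # size of the object-level metadata structure (illustrative)

def slots_for(object_id: str, block_index: int, k: int = 2):
    """k metadata slots derived from the block's identifying information."""
    seed = f"{object_id}/{block_index}".encode()
    return [int.from_bytes(hashlib.sha256(seed + bytes([i])).digest()[:4], "big") % SLOTS
            for i in range(k)]

metadata = [None] * SLOTS

def write_block(object_id, block_index, data: bytes):
    value = zlib.crc32(data)
    for s in slots_for(object_id, block_index):
        metadata[s] = value                      # same value stored at two or more locations

def check_block(object_id, block_index, data: bytes) -> str:
    stored = [metadata[s] for s in slots_for(object_id, block_index)]
    actual = zlib.crc32(data)
    if all(v == actual for v in stored):
        return "ok"
    if len(set(stored)) > 1:
        return "metadata corrupted"              # the stored copies disagree with each other
    return "data corrupted"                      # copies agree, the data does not match

write_block("obj-9", 0, b"hello")
assert check_block("obj-9", 0, b"hello") == "ok"
assert check_block("obj-9", 0, b"hellO") == "data corrupted"
```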

Patent
Mathur Rohitashva
19 Dec 2019
TL;DR: In this paper, a data transaction thread may be allowed to continue to execute statements that modify data tables, or the data transaction threads may be terminated based on the contextual data returning an indication of a data corruption in one or more supporting data structures.
Abstract: In a multitenant data platform architecture, one or more supporting data tables are used to write and store tenant data responsive to data write requests. Based on the contextual data returning an indication of a data corruption in one or more supporting data structures, an action associated with the data transaction thread is performed. A log of data corruptions and corresponding call stack trace data may be generated. The data transaction thread may be allowed to continue to execute statements that modify data tables, or the data transaction thread may be terminated. Data corruptions may be compensated for by nullifying data changes caused by corruption causing call sites. Verification methods may be used to ensure correctness of data within a transaction thread.