
Showing papers on "Distributed database published in 2021"


Journal ArticleDOI
Moming Duan, Duo Liu, Xianzhang Chen, Renping Liu, Yujuan Tan, Liang Liang
TL;DR: A self-balancing FL framework named Astraea is built, which relieves global imbalance by adaptive data augmentation and downsampling; to address local imbalance, it creates a mediator that reschedules the training of clients based on the Kullback–Leibler divergence (KLD) of their data distributions.
Abstract: Federated learning (FL) is a distributed deep learning method that enables multiple participants, such as mobile and IoT devices, to collaboratively train a neural network while their private training data remains on local devices. This distributed approach is promising in mobile systems, which have a large corpus of decentralized data and require high privacy. However, unlike common datasets, the data distribution of mobile systems is imbalanced, which increases the bias of the model. In this article, we demonstrate that imbalanced distributed training data cause an accuracy degradation of FL applications. To counter this problem, we build a self-balancing FL framework named Astraea, which alleviates the imbalances by 1) Z-score-based data augmentation and 2) mediator-based multi-client rescheduling. The proposed framework relieves global imbalance by adaptive data augmentation and downsampling; to address local imbalance, it creates a mediator that reschedules the training of clients based on the Kullback–Leibler divergence (KLD) of their data distributions. Compared with FedAvg, the vanilla FL algorithm, Astraea shows +4.39 and +6.51 percent improvements in top-1 accuracy on the imbalanced EMNIST and imbalanced CINIC-10 datasets, respectively. Meanwhile, the communication traffic of Astraea is reduced by 75 percent compared to FedAvg.
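The mediator's KLD-based rescheduling lends itself to a short sketch. The helper names below are illustrative, not Astraea's actual API; under that assumption, the sketch greedily groups clients so that each mediator's combined label distribution stays close to uniform (low KLD):

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence D(p || q) over discrete label distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def label_distribution(counts):
    """Normalize raw per-label sample counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def schedule_clients(client_label_counts, group_size):
    """Greedily group clients so each mediator's combined label
    distribution stays close to uniform (hypothetical helper)."""
    num_labels = len(next(iter(client_label_counts.values())))
    uniform = [1.0 / num_labels] * num_labels
    unassigned = set(client_label_counts)
    groups = []
    while unassigned:
        group, combined = [], [0] * num_labels
        while unassigned and len(group) < group_size:
            # pick the client whose addition minimizes KLD to uniform
            best = min(
                unassigned,
                key=lambda c: kld(
                    label_distribution(
                        [a + b for a, b in zip(combined, client_label_counts[c])]
                    ),
                    uniform,
                ),
            )
            group.append(best)
            combined = [a + b for a, b in zip(combined, client_label_counts[best])]
            unassigned.remove(best)
        groups.append(group)
    return groups
```

A mediator would then train its group's clients in turn, so the data it effectively sees is the near-uniform combined distribution rather than any single client's skewed one.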

199 citations


Journal ArticleDOI
TL;DR: In this article, a two-layer federated learning model is proposed to take advantage of the distributed end-edge-cloud architecture typical of 6G environments and to achieve more efficient and more accurate learning while ensuring data privacy protection and reducing communication overheads.
Abstract: The vision of the upcoming 6G technologies, with fast data rates, low latency, and ultra-dense networks, draws great attention to the Internet of Vehicles (IoV) and Vehicle-to-Everything (V2X) communication for intelligent transportation systems. There is an urgent need for distributed machine learning techniques that can take advantage of massive interconnected networks with explosive amounts of heterogeneous data generated at the network edge. In this study, a two-layer federated learning model is proposed to take advantage of the distributed end-edge-cloud architecture typical of 6G environments and to achieve more efficient and more accurate learning while ensuring data privacy protection and reducing communication overheads. A novel multi-layer heterogeneous model selection and aggregation scheme is designed as part of the federated learning process to better utilize the local and global contexts of individual vehicles and roadside units (RSUs) in 6G-supported vehicular networks. This context-aware distributed learning mechanism is then developed and applied to intelligent object detection, one of the most critical challenges in modern intelligent transportation systems with autonomous vehicles. Evaluation results show that the proposed method outperforms other state-of-the-art methods under a 6G network configuration: it demonstrates higher learning accuracy with better precision, recall, and F1 score, achieves faster convergence, and scales better with larger numbers of RSUs involved in the learning process.
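The two-layer aggregation pattern itself (edge servers average their clients' models, then the cloud averages the edge models) can be sketched as a generic hierarchical FedAvg. This is a simplification under stated assumptions and omits the paper's heterogeneous model selection scheme:

```python
def weighted_average(models, weights):
    """Element-wise weighted average of model parameter vectors."""
    total = sum(weights)
    size = len(models[0])
    return [sum(w * m[k] for m, w in zip(models, weights)) / total
            for k in range(size)]

def hierarchical_aggregate(edge_groups):
    """edge_groups: one list per edge server, each holding
    (client_model, num_samples) pairs. Returns the global model."""
    edge_models, edge_weights = [], []
    for clients in edge_groups:
        models = [m for m, _ in clients]
        weights = [n for _, n in clients]
        edge_models.append(weighted_average(models, weights))  # edge layer
        edge_weights.append(sum(weights))
    return weighted_average(edge_models, edge_weights)         # cloud layer
```

Because the inner averages absorb many client updates per round, the cloud layer can aggregate far less frequently, which is where the communication savings come from.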

126 citations


Journal ArticleDOI
TL;DR: This article proposes a lightweight sampling-based probabilistic approach, namely EDI-V, to help app vendors audit the integrity of their data cached on a large scale of edge servers, and proposes a new data structure named variable Merkle hash tree (VMHT) for generating the integrity proofs of those data replicas during the audit.
Abstract: Edge computing allows app vendors to deploy their applications and relevant data on distributed edge servers to serve nearby users. Caching data on edge servers can minimize users' data retrieval latency. However, such cached data are subject to both intentional and accidental corruption in the highly distributed, dynamic, and volatile edge computing environment. Given the large number of edge servers and their limited computing resources, how to effectively and efficiently audit the integrity of app vendors' cached data is a critical and challenging problem. This article makes the first attempt to tackle this Edge Data Integrity (EDI) problem. We first analyze the threat model and the audit objectives, then propose a lightweight sampling-based probabilistic approach, namely EDI-V, to help app vendors audit the integrity of their data cached on a large number of edge servers. We propose a new data structure named variable Merkle hash tree (VMHT) for generating the integrity proofs of those data replicas during the audit. VMHT can ensure the audit accuracy of EDI-V by maintaining sampling uniformity. EDI-V allows app vendors to inspect their cached data and locate the corrupted ones efficiently and effectively. Both theoretical analysis and comprehensive experimental evaluation demonstrate the efficiency and effectiveness of EDI-V.
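A plain binary Merkle hash tree illustrates the proof mechanism that VMHT builds on; the variable-arity and sampling logic of VMHT itself is omitted, and all function names here are illustrative:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _next_level(level):
    """Hash adjacent pairs; duplicate the last node on odd-sized levels."""
    if len(level) % 2:
        level = level + [level[-1]]
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    """Root hash of a binary Merkle tree over the leaf data blocks."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes (with left/right position) from leaf up to root."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # True = sibling on left
        level = _next_level(level)
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Recompute the root from one leaf and its proof path."""
    node = h(leaf)
    for sibling, is_left in proof:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root
```

An auditor holding only the root can thus check any sampled block with a proof of logarithmic size, which is what makes per-replica auditing lightweight on resource-limited edge servers.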

85 citations


Journal ArticleDOI
TL;DR: A new decentralized health architecture, called BEdgeHealth, is proposed that integrates MEC and blockchain for data offloading and data sharing in distributed hospital networks, together with a data-sharing scheme that enables data exchanges among healthcare users by leveraging blockchain and the InterPlanetary File System (IPFS).
Abstract: The healthcare industry has witnessed significant transformations in e-health services through the use of mobile-edge computing (MEC) and blockchain to facilitate healthcare operations. Many MEC-blockchain-based schemes have been proposed, but some critical technical challenges remain, such as low Quality of Service (QoS), data privacy, and system security vulnerabilities. In this article, we propose a new decentralized health architecture, called BEdgeHealth, that integrates MEC and blockchain for data offloading and data sharing in distributed hospital networks. First, a data offloading scheme is proposed where mobile devices can offload health data to a nearby MEC server for efficient computation with privacy awareness. Moreover, we design a data-sharing scheme that enables data exchanges among healthcare users by leveraging blockchain and the InterPlanetary File System (IPFS). Particularly, a smart contract-based authentication mechanism is integrated with MEC to perform decentralized user access verification at the network edge without requiring any central authority. Real-world experimental results and evaluations demonstrate the effectiveness of the proposed BEdgeHealth architecture in terms of improved QoS with data privacy and security guarantees, compared to existing schemes.

82 citations


Journal ArticleDOI
TL;DR: This work proposes a novel coded computing framework, CodedFedL, that injects structured coding redundancy into federated learning for mitigating stragglers and speeding up the training procedure.
Abstract: Federated learning enables training a global model from data located at the client nodes, without sharing or moving client data to a centralized server. The performance of federated learning in a multi-access edge computing (MEC) network suffers from slow convergence due to heterogeneity and stochastic fluctuations in compute power and communication link quality across clients. We propose a novel coded computing framework, CodedFedL, that injects structured coding redundancy into federated learning to mitigate stragglers and speed up the training procedure. CodedFedL enables coded computing for non-linear federated learning by efficiently exploiting a distributed kernel embedding via random Fourier features that transforms the training task into computationally favourable distributed linear regression. Furthermore, clients generate local parity datasets by coding over their local datasets, while the server combines them to obtain the global parity dataset. The gradient from the global parity dataset compensates for straggling gradients during training and thereby speeds up convergence. For minimizing the epoch deadline time at the MEC server, we provide a tractable approach for finding the amount of coding redundancy and the number of local data points that a client processes during training, by exploiting the statistical properties of compute as well as communication delays. We also characterize the leakage in data privacy when clients share their local parity datasets with the server. Additionally, we analyze the convergence rate and iteration complexity of CodedFedL under simplifying assumptions, by treating CodedFedL as a stochastic gradient descent algorithm. Finally, to demonstrate the gains that CodedFedL can achieve in practice, we conduct numerical experiments using practical network parameters and benchmark datasets, in which CodedFedL speeds up the overall training time by up to $15\times$ in comparison to the benchmark schemes.
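The kernel embedding step, which maps raw features through random Fourier features (RFF) so that training reduces to linear (ridge) regression, can be sketched as follows. This is a generic RFF approximation of the Gaussian kernel, not CodedFedL's exact construction, and the parameter names are illustrative:

```python
import numpy as np

def rff_transform(X, num_features=256, gamma=1.0, seed=0):
    """Approximate feature map for the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2):
    z(x) = sqrt(2/D) * cos(x W + b), W ~ N(0, 2*gamma), b ~ U[0, 2*pi)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def ridge_fit(Z, y, lam=1e-3):
    """With the embedding fixed and shared, each client solves an
    ordinary ridge regression in the transformed feature space."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```

Because every client applies the same fixed map (same seed), their transformed data can be linearly coded into parity datasets, which is what makes the straggler-compensating gradient trick possible for a non-linear model.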

80 citations


Journal ArticleDOI
TL;DR: A query processing problem in an EDMS, which aims to derive a distributed query plan with minimum query response latency, is defined; this problem is proved to be NP-hard and a corresponding approximation algorithm is proposed.
Abstract: The massive amount of data generated by Internet-of-Things (IoT) devices places enormous pressure on sensory data query processing. Due to the limitations of computation and data transmission capabilities in traditional wireless sensor networks (WSNs), current query processing methods are no longer effective. Furthermore, processing vast amounts of sensory data also overloads the cloud. To address these problems, we investigate query processing in an edge-assisted IoT data monitoring system (EDMS). Multiaccess edge computing (MEC) is an emerging topic in IoT. Unlike in WSNs, the edge servers in an EDMS can deploy computation and storage resources near IoT devices and offer data processing services. Therefore, queries over massive sensory data can be processed in an EDMS in a distributed manner, with the edge servers handling the sensory data locally, which reduces the workload of the cloud. In this article, we define a query processing problem in an EDMS that aims to derive a distributed query plan with minimum query response latency. We prove that this problem is NP-hard and propose a corresponding approximation algorithm with a bounded performance guarantee. Furthermore, we evaluate the performance of the proposed algorithm through extensive simulations.

57 citations


Journal ArticleDOI
TL;DR: In this article, a primal-dual optimization strategy is proposed for designing federated learning algorithms that are provably fast and require as few assumptions as possible; the resulting algorithms can deal with non-convex objective functions and achieve the best possible optimization and communication complexity.
Abstract: Federated Learning (FL) is popular for communication-efficient learning from distributed data. To utilize data at different clients without moving them to the cloud, algorithms such as Federated Averaging (FedAvg) have adopted a computation-then-aggregation model, in which multiple local updates are performed using local data before aggregation. These algorithms fail to work when faced with practical challenges, e.g., local data that are not independent and identically distributed (non-IID). In this paper, we first characterize the behavior of the FedAvg algorithm and show that, without strong and unrealistic assumptions on the problem structure, it can behave erratically. Aiming to design FL algorithms that are provably fast and require as few assumptions as possible, we propose a new algorithm design strategy from the primal-dual optimization perspective. Our strategy yields algorithms that can deal with non-convex objective functions, achieve the best possible optimization and communication complexity (in a well-defined sense), and accommodate full-batch and mini-batch local computation models. Importantly, the proposed algorithms are communication-efficient, in that the communication effort can be reduced as the level of heterogeneity among the local data decreases. In the extreme case where the local data become homogeneous, only $\mathcal{O}(1)$ communication is required among the agents. To the best of our knowledge, this is the first algorithmic framework for FL that achieves all the above properties.

57 citations


Journal ArticleDOI
TL;DR: Cpds, a compressed and private data sharing framework, is proposed that provides efficient and private data management for product data stored on the blockchain; it devises two new mechanisms to store compressed and policy-enforced product data on the blockchain.
Abstract: Internet of Things (IoT) is a promising technology to provide product traceability for industrial systems. By using sensing and networking techniques, an IoT-enabled industrial system enables its participants to efficiently track products and record their status during the production process. Current industrial IoT systems lack a unified product data sharing service, which prevents the participants from acquiring trusted traceability of products. Using emerging blockchain technology to build such a service is a promising direction. However, directly storing product data on the blockchain incurs efficiency and privacy issues in data management due to its distributed infrastructure. In response, we propose Cpds, a compressed and private data sharing framework that provides efficient and private data management for product data stored on the blockchain. Cpds devises two new mechanisms to store compressed and policy-enforced product data on the blockchain. As a result, multiple industrial participants can efficiently share product data with fine-grained access control in a distributed environment without relying on a trusted intermediary. We conduct extensive empirical studies and demonstrate the feasibility of Cpds in improving the efficiency and security protection of product data storage on the blockchain.

48 citations


Journal ArticleDOI
TL;DR: Zhang et al. utilize data as a tuning knob and propose two efficient polynomial-time algorithms to schedule different workloads on various mobile devices, for identically and non-identically distributed data.
Abstract: Originating from distributed learning, federated learning enables privacy-preserved collaboration at a new abstraction level by sharing only the model parameters. While current research mainly focuses on optimizing learning algorithms and minimizing the communication overhead inherited from distributed learning, there is still a considerable gap when it comes to real implementations on mobile devices. In this article, we start with an empirical experiment to demonstrate that computation heterogeneity is a more pronounced bottleneck than communication on the current generation of battery-powered mobile devices, and that existing methods are hampered by mobile stragglers. Further, non-identically distributed data across the mobile users makes the selection of participants critical to the accuracy and convergence. To tackle the computational and statistical heterogeneity, we utilize data as a tuning knob and propose two efficient polynomial-time algorithms to schedule different workloads on various mobile devices, for identically and non-identically distributed data. For identically distributed data, we combine partitioning and linear bottleneck assignment to achieve near-optimal training time without accuracy loss. For non-identically distributed data, we convert it into an average cost minimization problem and propose a greedy algorithm to find a reasonable balance between computation time and accuracy. We also establish an offline profiler to quantify the runtime behavior of different devices, which serves as the input to the scheduling algorithms. We conduct extensive experiments on a mobile testbed with two datasets and up to 20 devices. Compared with common benchmarks, the proposed algorithms achieve a 2–100× epoch-wise speedup, a 2–7 percent accuracy gain, and boost the convergence rate by more than 100 percent on CIFAR-10.
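For the identically distributed case, the core idea of the scheduling (size each device's local workload so that all devices finish an epoch at about the same time) can be illustrated with a proportional partition. This simplifies the paper's partition-plus-bottleneck-assignment scheme, and the profiler throughputs below are made-up numbers:

```python
def partition_workload(total_samples, throughputs):
    """Split total_samples across devices in proportion to their
    profiled throughput (samples/second), so per-device epoch time
    is roughly equalized and the straggler effect shrinks."""
    total_tp = sum(throughputs)
    shares = [int(total_samples * tp / total_tp) for tp in throughputs]
    # hand rounding leftovers to the fastest devices
    leftover = total_samples - sum(shares)
    for i in sorted(range(len(throughputs)), key=lambda i: -throughputs[i])[:leftover]:
        shares[i] += 1
    return shares

def epoch_time(shares, throughputs):
    """Wall-clock epoch time = the slowest device's finish time."""
    return max(s / tp for s, tp in zip(shares, throughputs))
```

With heterogeneous devices, the balanced partition's epoch time is bounded by the fastest device's share, whereas an equal split is bounded by the slowest device, which is exactly the straggler effect the paper measures.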

48 citations


Journal ArticleDOI
18 Feb 2021
TL;DR: In this article, the authors provide a holistic overview of relevant communication and ML principles and, thereby, present communication-efficient and distributed learning frameworks with selected use cases, with the aim of improving the communication efficiency of distributed learning by optimizing communication payload types, transmission techniques, and scheduling.
Abstract: Machine learning (ML) is a promising enabler for the fifth-generation (5G) communication systems and beyond. By imbuing intelligence into the network edge, edge nodes can proactively carry out decision-making and, thereby, react to local environmental changes and disturbances while experiencing zero communication latency. To achieve this goal, it is essential to cater for high ML inference accuracy at scale under the time-varying channel and network dynamics, by continuously exchanging fresh data and ML model updates in a distributed way. Taming this new kind of data traffic boils down to improving the communication efficiency of distributed learning by optimizing communication payload types, transmission techniques, and scheduling, as well as ML architectures, algorithms, and data processing methods. To this end, this article aims to provide a holistic overview of relevant communication and ML principles and, thereby, present communication-efficient and distributed learning frameworks with selected use cases.

39 citations


Journal ArticleDOI
TL;DR: A distributed memetic algorithm (DMA) is proposed for enhancing database privacy and utility and a balanced best random distributed framework is designed to achieve high optimization efficiency.
Abstract: Data privacy and utility are two essential requirements in outsourced data storage. Traditional techniques for sensitive data protection, such as data encryption, affect the efficiency of data query and evaluation. By splitting attributes of sensitive associations, database fragmentation techniques can help protect data privacy and improve data utility. In this article, a distributed memetic algorithm (DMA) is proposed for enhancing database privacy and utility. A balanced best random distributed framework is designed to achieve high optimization efficiency. To enhance global search, a dynamic grouping recombination operator is proposed to aggregate and utilize evolutionary elements; two mutation operators, namely merge and split, are designed to help arrange and create evolutionary elements; and a two-dimensional selection approach is designed based on the priority of privacy and utility. Furthermore, a splicing-driven local search strategy is embedded to introduce rare utility elements without violating constraints. Extensive experiments are carried out to verify the performance of the proposed DMA, as well as the effectiveness of the proposed distributed framework and novel operators.

Journal ArticleDOI
TL;DR: The proposed ZeKoC approach, a Zero Knowledge Clustering approach to mitigating adversarial attacks, successfully mitigates general attacks while outperforming state-of-the-art schemes.
Abstract: The simultaneous development of deep learning techniques and Internet of Things (IoT)/Cyber-Physical Systems (CPS) technologies has afforded untold possibilities for improving distributed computing, sensing, and data analysis. Among these technologies, federated learning has received increased attention as a privacy-preserving collaborative learning paradigm and has shown significant potential in IoT/CPS-driven large-scale smart-world systems. At the same time, the vulnerabilities of deep neural networks, especially to adversarial attacks, cannot be overstated and should not be minimized. Moreover, the distributed nature of federated learning makes defense against such adversarial attacks a more challenging problem due to the unavailability of local data and resource heterogeneity. To tackle these challenges, in this paper, we propose ZeKoC, a Zero Knowledge Clustering approach to mitigating adversarial attacks. Particularly, we first formulate the problem of resource-constrained adversarial mitigation. Specifically, noting that the global server has no access to training samples, we reformulate the unsupervised weight clustering problem. Our proposed ZeKoC approach allows the server to automatically split and merge weight clusters for weight selection and aggregation. Theoretical analysis demonstrates that convergence is guaranteed. Further, our experimental results illustrate that, in a non-IID (i.e., not independent and identically distributed) data setting, the proposed ZeKoC approach successfully mitigates general attacks while outperforming state-of-the-art schemes.

Journal ArticleDOI
TL;DR: An intelligent storage scheme is adopted to store data dynamically with reinforcement learning based on trustworthiness and popularity, improving resource scheduling and storage space allocation, and a trapdoor-hashing-based identity authentication protocol is proposed to secure transportation network access.
Abstract: The large-scale, fast-growing data generated in intelligent transportation systems (ITS) have become a ponderous burden on the coordination of heterogeneous transportation networks, making the traditional cloud-centric storage architecture unable to satisfy new data analytics requirements. Meanwhile, the lack of storage trust between ITS devices and edge servers could lead to security risks in the data storage process. However, a unified distributed data storage architecture for ITS with intelligent management and trustworthiness is absent in previous works. To address these challenges, this paper proposes a distributed trustworthy storage architecture with reinforcement learning in ITS, which also promotes edge services. We adopt an intelligent storage scheme that stores data dynamically with reinforcement learning based on trustworthiness and popularity, which improves resource scheduling and storage space allocation. Besides, a trapdoor-hashing-based identity authentication protocol is proposed to secure transportation network access. Owing to the interaction between cooperative devices, our proposed trust evaluation mechanism is extensible to various ITS scenarios. Simulation results demonstrate that our proposed distributed trustworthy storage architecture outperforms the compared ones in terms of trustworthiness and efficiency.

Journal ArticleDOI
17 Jun 2021
TL;DR: This article surveys state-of-the-art methods for processing remotely sensed big data and thoroughly investigates existing parallel implementations on diverse popular high-performance computing platforms, discussing their pros and cons in terms of capability, scalability, reliability, and ease of use.
Abstract: This article gives a survey of state-of-the-art methods for processing remotely sensed big data and thoroughly investigates existing parallel implementations on diverse popular high-performance computing platforms. The pros/cons of these approaches are discussed in terms of capability, scalability, reliability, and ease of use. Among existing distributed computing platforms, cloud computing is currently the most promising solution to efficient and scalable processing of remotely sensed big data due to its advanced capabilities for high-performance and service-oriented computing. We further provide an in-depth analysis of state-of-the-art cloud implementations that seek to exploit the parallelism of distributed processing of remotely sensed big data. In particular, we study a series of scheduling algorithms (GSs) aimed at distributing the computation load across multiple cloud computing resources in an optimized manner. We conduct a thorough review of different GSs and reveal the significance of employing scheduling strategies to fully exploit parallelism during the remotely sensed big data processing flow. We present a case study on large-scale remote sensing datasets to evaluate the parallel and distributed approaches and algorithms. Evaluation results demonstrate the advanced capabilities of cloud computing in processing remotely sensed big data and the improvements in computational efficiency obtained by employing scheduling strategies.

Journal ArticleDOI
TL;DR: NEAL, a NEtwork-Aware Locality scheduling approach, is proposed to reduce communication time for distributed big data operators (e.g., join and aggregation) in large data centers.
Abstract: Large data centers are currently the mainstream infrastructures for big data processing. As one of the most fundamental tasks in these environments, the efficient execution of distributed data operators (e.g., join and aggregation) are still challenging current data systems, and one of the key performance issues is network communication time. State-of-the-art methods trying to improve that problem focus on either application-layer data locality optimization to reduce network traffic or on network-layer data flow optimization to increase bandwidth utilization. However, the techniques in the two layers are totally independent from each other, and performance gains from a joint optimization perspective have not yet been explored. In this article, we propose a novel approach called NEAL (NEtwork-Aware Locality scheduling) to bridge this gap, and consequently to further reduce communication time for distributed big data operators. We present the detailed design and implementation of NEAL, and our experimental results demonstrate that NEAL always performs better than current approaches for different workloads and network bandwidth configurations.

Journal ArticleDOI
TL;DR: A pre-large weighted high-utility pattern (PWHUP) fusion framework is proposed for integrating HUPs from different sensed data sources; it outperforms existing non-integration solutions in precision, recall, and runtime.
Abstract: Within the current transportation infrastructure, we have seen a steady increase in the use of sensor technologies. These sensors individually produce large amounts of data that then need to be fused and understood. Data commingling and data integration are difficult tasks when such data are processed centrally, which can require costly hardware and software techniques. Over the past few years, high-utility pattern mining (HUPM) has gained popularity due to its growing capability in identifying useful information and knowledge from stored database data, as compared to traditional frequent pattern mining. Existing works on HUPM mostly focus on mining the set of HUPs from one data source, which cannot be applied in real-world scenarios. In this paper, we present a pre-large weighted high-utility pattern (PWHUP) fusion framework for integrating HUPs from different sensed data sources. The proposed PWHUP algorithm considers the size of the data source to discover more relevant HUPs for integration, which makes it more applicable to real-life applications and scenarios in transportation as well as other data fusion settings. Moreover, the pre-large concept is applied to maintain the suggested patterns for later integration, which greatly improves the effectiveness of the proposed algorithm. Our in-depth experiments show that the designed approach performs well for knowledge integration and outperforms existing non-integration solutions in precision, recall, and runtime.

Journal ArticleDOI
TL;DR: A distributed data-driven intrusion detection approach is proposed to reveal the existence of sparse stealthy FDI attacks in a multi-area interconnected power system; it not only reduces the risk of over-fitting but can also locate the areas that have been attacked.
Abstract: Stealthy false data injection (FDI) attacks in smart grids can bypass bad data detection and thus produce an incorrect state estimate in the control center. In this brief, a distributed data-driven intrusion detection approach is proposed to reveal the existence of sparse stealthy FDI attacks in a multi-area interconnected power system. The proposed distributed intrusion detection approach avoids the over-fitting issue that is extensively seen when implementing machine learning algorithms for large-scale systems. First, each area estimates the entire system state based on a distributed state estimation algorithm. Then, the state of each local area is used as the input to a trained neural network to detect the stealthy FDI attacks. Simulation results on the IEEE 118-bus system verify that the proposed method not only reduces the risk of over-fitting but can also locate the areas that have been attacked.

Posted Content
TL;DR: Comprehensive data partitioning strategies are proposed to cover the typical non-IID data cases, and extensive experiments are conducted to evaluate state-of-the-art federated learning algorithms.
Abstract: Due to increasing privacy concerns and data regulations, training data have been increasingly fragmented, forming distributed databases of multiple "data silos" (e.g., within different organizations and countries). To develop effective machine learning services, it is necessary to exploit data from such distributed databases without exchanging the raw data. Recently, federated learning (FL) has been a solution with growing interest, which enables multiple parties to collaboratively train a machine learning model without exchanging their local data. A key and common challenge in distributed databases is the heterogeneity of the data distribution (i.e., non-IID) among the parties. There have been many FL algorithms to address learning effectiveness under non-IID data settings. However, there lacks an experimental study that systematically explores their advantages and disadvantages, as previous studies have very rigid data partitioning strategies among parties, which are hardly representative and thorough. In this paper, to help researchers better understand and study the non-IID data setting in federated learning, we propose comprehensive data partitioning strategies to cover the typical non-IID data cases. Moreover, we conduct extensive experiments to evaluate state-of-the-art FL algorithms. We find that non-IID does bring significant challenges in the learning accuracy of FL algorithms, and none of the existing state-of-the-art FL algorithms outperforms others in all cases. Our experiments provide insights for future studies of addressing the challenges in "data silos".
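Label-distribution skew drawn from a Dirichlet prior is one of the typical non-IID partitioning strategies such studies use; a minimal sketch (parameter names are illustrative):

```python
import numpy as np

def dirichlet_partition(labels, num_parties, alpha, seed=0):
    """Split sample indices across parties so each class's samples are
    spread according to Dirichlet(alpha) proportions.
    Small alpha -> highly skewed (non-IID); large alpha -> near-IID."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parties = [[] for _ in range(num_parties)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        props = rng.dirichlet([alpha] * num_parties)
        # convert proportions into cut points and deal out the chunks
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for party, chunk in zip(parties, np.split(idx, cuts)):
            party.extend(chunk.tolist())
    return parties
```

Sweeping `alpha` (e.g., 100 down to 0.1) then yields a controlled spectrum from near-IID to extreme skew, which is how one can compare FL algorithms across heterogeneity levels.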

Journal ArticleDOI
TL;DR: Results indicate that the designed model handles large-scale databases with low memory usage and that the designed MapReduce framework can speed up the mining of closed high-utility patterns in the developed fusion system.

Journal ArticleDOI
TL;DR: An edge-blockchain-empowered secure data access control scheme with fair accountability is proposed for the smart grid, in which the computation workloads of end-user devices are outsourced to edge nodes in a consortium blockchain system.
Abstract: Nowadays, the advance of smart grid technology has fostered the development of microgrids, which can efficiently control and manage distributed energy resources (DERs). In the smart grid, IoT devices generate huge amounts of data, which are collected and shared among DERs, microgrids, and the main grid. To protect the shared data, it is necessary to implement secure and efficient data access control. Ciphertext-policy attribute-based encryption (CP-ABE) is a promising solution for distributed systems. However, lightweight IoT devices with limited computing capability cannot handle the computationally intensive ABE algorithms. To overcome this constraint, the decryption phase of CP-ABE is usually outsourced to the cloud, but this is inefficient and not safe enough in a distributed environment. In this article, we propose an edge-blockchain-empowered secure data access control scheme with fair accountability for the smart grid. The computation workloads of end-user devices are outsourced to the edge nodes in a consortium blockchain system. We adopt an on-chain/off-chain approach to ensure flexible data sharing. Additionally, we adopt a threshold secret sharing scheme to establish a distributed authority. Security analysis and performance evaluation are conducted to prove the security and efficiency of our scheme. We use the Raspberry Pi to simulate lightweight IoT devices on the Hyperledger Fabric platform to prove the usability of our scheme.
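The threshold secret sharing used to establish a distributed authority is classically Shamir's scheme; a minimal sketch over a prime field follows (the field size and parameters are illustrative, not the paper's concrete choices):

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime, large enough for demo secrets

def split_secret(secret, n, t, seed=None):
    """Split `secret` into n shares; any t of them reconstruct it.
    Shares are points (x, f(x)) of a random degree-(t-1) polynomial
    whose constant term is the secret."""
    rng = random.Random(seed)
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With fewer than t shares the polynomial is underdetermined, so no coalition below the threshold learns anything about the master key, which is what removes the single point of trust.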

Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this article, the authors perform a twin study of blockchains and distributed database systems as two types of transactional systems and propose a taxonomy that illustrates the dichotomy across four dimensions, namely replication, concurrency, storage, and sharding.
Abstract: Blockchain has come a long way - a system that was initially proposed specifically for cryptocurrencies is now being adapted and adopted as a general-purpose transactional system. As blockchain evolves into another data management system, the natural question is how it compares against distributed database systems. Existing works on this comparison focus on high-level properties, such as security and throughput. They stop short of showing how the underlying design choices contribute to the overall differences. Our work fills this important gap. We perform a twin study of blockchains and distributed database systems as two types of transactional systems. We propose a taxonomy that illustrates the dichotomy across four dimensions, namely replication, concurrency, storage, and sharding. Within each dimension, we discuss how the design choices are driven by two goals: security for blockchains, and performance for distributed databases. We conduct an extensive and in-depth performance analysis of two blockchains, namely Quorum and Hyperledger Fabric, and three distributed databases, namely CockroachDB, TiDB, and etcd. Our analysis exposes the impact of different design choices on the overall performance. Concisely, our work provides a principled framework for analyzing the emerging trend of blockchain-database fusion.

Proceedings ArticleDOI
01 Jul 2021
TL;DR: In this article, a communication cost minimization (CCM) problem is formulated to minimize the communication cost incurred by edge/cloud aggregations by making decisions on edge aggregator selection and distributed node association.
Abstract: Federated learning (FL) can enable distributed model training over mobile nodes without sharing privacy-sensitive raw data. However, to achieve efficient FL, one significant challenge is the prohibitive communication overhead to commit model updates since frequent cloud model aggregations are usually required to reach a target accuracy, especially when the data distributions at mobile nodes are imbalanced. With pilot experiments, it is verified that frequent cloud model aggregations can be avoided without performance degradation if model aggregations can be conducted at edge. To this end, we shed light on the hierarchical federated learning (HFL) framework, where a subset of distributed nodes are selected as edge aggregators to conduct edge aggregations. Particularly, under the HFL framework, we formulate a communication cost minimization (CCM) problem to minimize the communication cost incurred by edge/cloud aggregations by making decisions on edge aggregator selection and distributed node association. Inspired by the insight that the potential of HFL lies in the data distribution at edge aggregators, we propose SHARE, i.e., SHaping dAta distRibution at Edge, to transform and solve the CCM problem. In SHARE, we divide the original problem into two sub-problems to minimize the per-round communication cost and mean Kullback-Leibler divergence of edge aggregator data, and devise two light-weight algorithms to solve them, respectively. Extensive experiments under various settings are carried out to corroborate the efficacy of SHARE.
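The second sub-problem scores candidate node-to-aggregator assignments by how far each edge aggregator's pooled label distribution drifts from the global one, measured by KL divergence. A minimal sketch of that measurement (toy data; not SHARE's actual algorithm):

```python
import math
from collections import Counter

def label_distribution(labels):
    """Empirical label distribution of the data pooled at one edge aggregator."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {l: c / total for l, c in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between discrete distributions given as dicts label -> prob."""
    return sum(pv * math.log(pv / (q.get(l, 0.0) + eps))
               for l, pv in p.items() if pv > 0)

uniform = {0: 0.5, 1: 0.5}                    # assumed global distribution
skewed = label_distribution([0, 0, 0, 1])     # imbalanced edge aggregator
balanced = label_distribution([0, 1, 0, 1])   # well-mixed edge aggregator
assert kl_divergence(balanced, uniform) < kl_divergence(skewed, uniform)
```

A node association that lowers the mean of this divergence across aggregators lets edge aggregation stand in for cloud aggregation with less accuracy loss.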

Journal ArticleDOI
TL;DR: A novel secure energy policy and load-sharing approach is proposed for renewable microgrids (MGs), enabling independent utilization of off-grid MGs with power electronic jointing (PEJ) on a master-slave (M-S) basis, formed in the IIoT environment.

Journal ArticleDOI
TL;DR: This work analytically characterize the optimal data transfer solution under different assumptions on the fog network scenario, showing for example that the value of offloading is approximately linear in the range of computing costs in the network when the cost of discarding is modeled as decreasing linearly in the amount of data processed at each node.
Abstract: Fog computing promises to enable machine learning tasks to scale to large amounts of data by distributing processing across connected devices. Two key challenges to achieving this goal are (i) heterogeneity in devices’ compute resources and (ii) topology constraints on which devices communicate with each other. We address these challenges by developing a novel network-aware distributed learning methodology where devices optimally share local data processing and send their learnt parameters to a server for periodic aggregation. Unlike traditional federated learning, our method enables devices to offload their data processing tasks to each other, with these decisions optimized to trade off costs associated with data processing, offloading, and discarding. We analytically characterize the optimal data transfer solution under different assumptions on the fog network scenario, showing for example that the value of offloading is approximately linear in the range of computing costs in the network when the cost of discarding is modeled as decreasing linearly in the amount of data processed at each node. Our experiments on real-world data traces from our testbed confirm that our algorithms improve network resource utilization substantially without sacrificing the accuracy of the learned model, for varying distributions of data across devices. We also investigate the effect of network dynamics on model learning and resource costs.
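At the heart of the optimization is a per-unit-of-data decision each device faces: process locally, offload to a neighbor, or discard. A toy sketch of that trade-off (illustrative cost figures; the paper solves the network-wide optimization analytically, not greedily per unit):

```python
def best_action(local_cost, offload_costs, discard_cost):
    """Pick the cheapest disposition for one unit of data: process it locally,
    offload it to a neighbor (transfer cost + remote processing), or discard it."""
    options = {"process": local_cost, "discard": discard_cost}
    for node, (transfer, remote) in offload_costs.items():
        options["offload->" + node] = transfer + remote
    action = min(options, key=options.get)
    return action, options[action]

# A weak device with expensive local compute offloads to a cheaper neighbor.
action, cost = best_action(local_cost=5.0,
                           offload_costs={"n1": (1.0, 2.0)},
                           discard_cost=10.0)
assert action == "offload->n1" and cost == 3.0
```

The wider the spread between the cheapest and most expensive `local_cost` across the network, the more such offloads pay off, which is the intuition behind the approximately linear value of offloading reported above.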

Journal ArticleDOI
TL;DR: Pangeo as discussed by the authors is a cloud-native data repository for scientific research, which offers several advantages over traditional data repositories, such as performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion.
Abstract: Scientific data have traditionally been distributed via downloads from data server to local computer. This way of working suffers from limitations as scientific datasets grow toward the petabyte scale. A “cloud-native data repository,” as defined in this article, offers several advantages over traditional data repositories—performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready data, cloud-optimized (ARCO) formats, and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles by using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved in order to realize cloud computing’s full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.

Journal ArticleDOI
TL;DR: This work proposes a federated Markov chain Monte Carlo with a delayed rejection (FMCMC-DR) method to estimate the representative parameters of the global distribution, and formulates a problem on digital twin-assisted federated distribution discovery.
Abstract: We are now in an information era and the volume of data is growing explosively. However, due to privacy issues, it is very common that data cannot be freely shared among the data-generating devices. Federated analytics was recently proposed aiming at deriving analytical insights among data-generating devices without exposing the raw data, sharing only the intermediate analytics results. Note that the computing resources at the data-generating devices are limited, thus making on-device execution of computing-intensive tasks challenging. We thus propose to apply the digital twin technique, which emulates the resource-limited physical/end side, while utilizing the rich resource at the virtual/computing side. Nevertheless, how to use the digital twin technique to assist federated analytics while preserving distributed data privacy is challenging. To address such a challenge, this work first formulates a problem on digital twin-assisted federated distribution discovery. Then, we propose a federated Markov chain Monte Carlo with a delayed rejection (FMCMC-DR) method to estimate the representative parameters of the global distribution. We combine a rejection–acceptance sampling technique and a delayed rejection technique, allowing our method to explore the full state space. Finally, we evaluate FMCMC-DR against the Metropolis–Hastings (MH) algorithm and random walk Markov chain Monte Carlo method (RW-MCMC) using numerical experiments. The results show our algorithm outperforms the other two methods by 50% and 95% contour accuracy, respectively, and has a better convergence rate.
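The delayed-rejection idea is that when a Metropolis–Hastings proposal is rejected, the chain gets a second, typically bolder, proposal with a corrected acceptance probability instead of staying put. A single-variable sketch under standard assumptions (symmetric Gaussian proposals, Tierney–Mira second-stage acceptance); this is a generic DR-MH step, not the paper's federated FMCMC-DR:

```python
import math
import random

def gauss_pdf(d, s):
    """Density of a zero-mean Gaussian with std `s` at distance `d`."""
    return math.exp(-d * d / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def mh_delayed_rejection(log_target, x, scale=1.0):
    """One Metropolis-Hastings step with a single delayed-rejection stage."""
    y1 = x + random.gauss(0, scale)
    a1 = min(1.0, math.exp(log_target(y1) - log_target(x)))
    if random.random() < a1:
        return y1
    # First proposal rejected: make a bolder second proposal.
    y2 = x + random.gauss(0, 2 * scale)
    a1_rev = min(1.0, math.exp(log_target(y1) - log_target(y2)))
    # Second-stage acceptance ratio corrects for the failed first stage.
    num = math.exp(log_target(y2)) * gauss_pdf(y1 - y2, scale) * (1 - a1_rev)
    den = math.exp(log_target(x)) * gauss_pdf(y1 - x, scale) * (1 - a1)
    if den > 0 and random.random() < min(1.0, num / den):
        return y2
    return x
```

Run on a standard normal target, the chain's samples converge to the correct mean and variance while wasting fewer iterations on rejected moves than plain MH.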

Journal ArticleDOI
TL;DR: This paper proposes CRANE, an effiCient Replica migrAtion scheme for distributed cloud Storage, and shows that it provides a sub-optimal solution for the replica migration problem with lower computational complexity than its integer linear program formulation.
Abstract: With the wide adoption of large-scale internet services and big data, the cloud has become the ideal environment to satisfy the ever-growing storage demand. In this context, data replication has been touted as the ultimate solution to improve data availability and reduce access time. However, replica management systems usually need to migrate and create a large number of data replicas over time between and within data centers, incurring a large overhead in terms of network load and availability. In this paper, we propose CRANE, an effiCient Replica migrAtion scheme for distributed cloud Storage systEms. CRANE complements any replica placement algorithm by efficiently managing replica creation in geo-distributed infrastructures in order to (1) minimize the time needed to copy the data to the new replica location, (2) avoid network congestion, and (3) ensure the minimum desired availability for the data. Through simulation and experimental results, we show that CRANE provides a sub-optimal solution for the replica migration problem with lower computational complexity than its integer linear program formulation. We also show that, compared to OpenStack Swift, CRANE is able to reduce by up to 60 percent the replica creation and migration time and by up to 50 percent the inter-data center network traffic while ensuring the minimum required data availability.
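The core tension CRANE manages is that copying every new replica from the cheapest source would congest that source's links. As a hedged illustration only (a naive greedy heuristic with made-up names, not CRANE's sub-optimal algorithm or its ILP formulation):

```python
def plan_migrations(new_locations, sources, link_cost, load_penalty=1.0):
    """Greedy source selection for each new replica: pick the source that
    minimizes transfer cost plus a penalty for links already scheduled,
    spreading copies across sources to avoid congesting one of them."""
    load = {s: 0 for s in sources}
    plan = {}
    for dst in new_locations:
        best = min(sources, key=lambda s: link_cost[(s, dst)] + load_penalty * load[s])
        plan[dst] = best
        load[best] += 1
    return plan

costs = {("A", "d1"): 1.0, ("B", "d1"): 2.0,
         ("A", "d2"): 1.0, ("B", "d2"): 1.5}
plan = plan_migrations(["d1", "d2"], ["A", "B"], costs)
assert plan == {"d1": "A", "d2": "B"}  # d2 avoids the already-loaded source A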

Journal ArticleDOI
TL;DR: In this article, the Byzantine fault tolerance consensus is used to build a distributed network of processing CSPs based on the client requirements, and the master hash values are preserved in Bitcoin or Ethereum blockchain networks.
Abstract: Due to its wide accessibility, cloud services are susceptible to attacks. Data manipulation is a serious threat to data integrity which can occur in cloud computing – a relatively new offering under the umbrella of cloud services. Data can be tampered with, and malicious actors could use this to their advantage. Cloud computing clients in various application domains want to be assured that their data is accurate and trustworthy. On another spectrum, blockchain is a tamper-proof digital ledger that can be used alongside cloud technology to provide a tamper-proof cloud computing environment. This paper proposes a scheme that combines cloud computing with blockchain that assures data integrity for all homomorphic encryption schemes. To overcome the cloud service provider’s (CSP) ultimate authority over the data, the proposed scheme relies on the Byzantine Fault Tolerance consensus to build a distributed network of processing CSPs based on the client requirements. After performing the required computations, all CSPs produce a master hash value for their database. To ensure immutability, the master hash values are preserved in Bitcoin or Ethereum blockchain networks. The master hash values can be obtained by tracking the block header address for verification purposes. A theoretical analysis of the overhead costs associated with creating master hash values for each of the cryptocurrencies is presented. We found that Ethereum leads to lower client financial costs and better online performance than Bitcoin. We also specify the data security requirements the proposed scheme provides, the ground-level implementation, and future work. The proposed verification scheme is based on public cryptocurrency as a back-end service and does not require additional setup actions by the client other than a wallet for the chosen cryptocurrency.
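The master hash condenses an entire database into one digest, so independent CSPs can be cross-checked and a single value anchored on-chain. A minimal sketch of the idea (record format and function name are assumptions; the actual scheme operates over homomorphically encrypted data):

```python
import hashlib

def master_hash(records):
    """Chain per-record hashes into one master digest over a canonical key
    order; any tampered record changes the final value."""
    h = hashlib.sha256()
    for key in sorted(records):  # canonical order: digest is independent of insertion order
        row = hashlib.sha256(f"{key}={records[key]}".encode()).digest()
        h.update(row)
    return h.hexdigest()

db = {"meter1": 42, "meter2": 17}
tampered = {"meter1": 42, "meter2": 18}
assert master_hash(db) != master_hash(tampered)
# Insertion order does not matter thanks to the canonical key order.
assert master_hash(db) == master_hash({"meter2": 17, "meter1": 42})
```

Storing only this digest on Bitcoin or Ethereum keeps on-chain costs constant regardless of database size, which is what makes the overhead analysis above tractable.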

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this article, the authors propose a method for distributedly selecting relevant data, where they use a benchmark model trained on a small benchmark dataset that is task-specific, to evaluate the relevance of individual data samples at each client and select the data with sufficiently high relevance.
Abstract: Many image and vision applications require a large amount of data for model training. Collecting all such data at a central location can be challenging due to data privacy and communication bandwidth restrictions. Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices, which does not require exchanging the raw data among clients. A challenge is that among the large variety of data collected at each client, it is likely that only a subset is relevant for a learning task while the rest of data has a negative impact on model training. Therefore, before starting the learning process, it is important to select the subset of data that is relevant to the given federated learning task. In this paper, we propose a method for distributedly selecting relevant data, where we use a benchmark model trained on a small benchmark dataset that is task-specific, to evaluate the relevance of individual data samples at each client and select the data with sufficiently high relevance. Then, each client only uses the selected subset of its data in the federated learning process. The effectiveness of our proposed approach is evaluated on multiple real-world image datasets in a simulated system with a large number of clients, showing up to 25% improvement in model accuracy compared to training with all data.
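The selection rule reduces to thresholding each local sample's score under the benchmark model. A toy sketch of that filter (the scalar "loss" below is a stand-in for the benchmark model's output; names and threshold are illustrative):

```python
def benchmark_loss(sample, prototype=5.0):
    """Stand-in for the benchmark model's loss: here, distance from a
    hypothetical task prototype (the real scheme evaluates a trained model)."""
    return abs(sample - prototype)

def select_relevant(samples, threshold=1.0):
    """Keep only the samples the benchmark model deems relevant, i.e. those
    with sufficiently low loss; only these enter federated training."""
    return [s for s in samples if benchmark_loss(s) <= threshold]

local_data = [4.8, 5.1, 9.7, 0.2, 5.3]
assert select_relevant(local_data) == [4.8, 5.1, 5.3]
```

Each client runs this filter locally before training, so irrelevant data never influences the global model and never leaves the device.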

Journal ArticleDOI
TL;DR: This work proposes a distributed deep learning optimized system comprising a cloud server and multiple smartphone devices with computation capabilities, where each device serves as a personal mobile data hub, enabling mobile computing while preserving data privacy.
Abstract: Deep learning has become a promising focus in data mining research. With deep learning techniques, researchers can discover deep properties and features of events from quantitative mobile sensor data. However, many data sources are geographically separated and have strict privacy, security, and regulatory constraints. Upon releasing the privacy-sensitive data, these data sources generally no longer physically possess their data and cannot interfere with the way their personal data is used. Therefore, it is necessary to explore distributed data mining architecture which is able to conduct consensus learning based on needs. Accordingly, we propose a distributed deep learning optimized system which contains a cloud server and multiple smartphone devices with computation capabilities, where each device serves as a personal mobile data hub for enabling mobile computing while preserving data privacy. The proposed system keeps the private data locally in smartphones, shares trained parameters, and builds a global consensus model. The feasibility and usability of the proposed system are evaluated by three experiments and related discussion. The experimental results show that the proposed distributed deep learning system can reconstruct the behavior of centralized training. We also measure the cumulative network traffic in different scenarios and show that the partial parameter sharing strategy not only preserves the performance of the trained model but also reduces network traffic. User data privacy is protected on two levels. First, local private training data do not need to be shared with other people and the user has full control of their personal training data all the time. Second, only a small fraction of trained gradients of the local model are selected for sharing, which further reduces the risk of information leaking.
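One common way to share only "a small fraction of trained gradients" is magnitude-based top-k selection: each device uploads only its largest-magnitude gradient entries. A minimal sketch of that idea (the paper does not specify its exact selection rule, so this is an assumption):

```python
def top_k_gradients(grads, fraction=0.1):
    """Select the largest-magnitude fraction of gradient entries for sharing;
    the rest stay local, reducing network traffic and information leakage."""
    k = max(1, int(len(grads) * fraction))
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return {i: grads[i] for i in sorted(ranked[:k])}  # index -> value, sparse upload

local_grads = [0.01, -0.5, 0.03, 0.2, -0.02]
shared = top_k_gradients(local_grads, fraction=0.4)
assert shared == {1: -0.5, 3: 0.2}
```

Uploading the sparse `{index: value}` map instead of the full gradient vector is what cuts the cumulative network traffic while keeping the updates that matter most.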