scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Distributed, Parallel, and Cluster Computing in 2020"


Posted Content
TL;DR: A formal context model based on ontology using OWL is proposed to address issues including semantic context representation, context reasoning and knowledge sharing, context classification, context dependency and quality of context.
Abstract: Computing becomes increasingly mobile and pervasive today; these changes imply that applications and services must be aware of and adapt to their changing contexts in highly dynamic environments. Today, building context-aware systems is a complex task due to lack of an appropriate infrastructure support in intelligent environments. A context-aware infrastructure requires an appropriate context model to represent, manipulate and access context information. In this paper, we propose a formal context model based on ontology using OWL to address issues including semantic context representation, context reasoning and knowledge sharing, context classification, context dependency and quality of context. The main benefit of this model is the ability to reason about various contexts. Based on our context model, we also present a Service-Oriented Context-Aware Middleware (SOCAM) architecture for building of context-aware services.

438 citations


Posted Content
TL;DR: A literature review on blockchain interoperability is conducted, by collecting 262 papers, and 70 grey literature documents, constituting a corpus of 332 documents, showing that Blockchain interoperability has a much broader spectrum than cryptocurrencies.
Abstract: Blockchain interoperability is emerging as one of the crucial features of blockchain technology, but the knowledge necessary for achieving it is fragmented. This fact makes it challenging for academics and the industry to seamlessly achieve interoperability among blockchains. Given the novelty and potential of this new domain, we conduct a literature review on blockchain interoperability, by collecting 262 papers, and 70 grey literature documents, constituting a corpus of 332 documents. From those 332 documents, we systematically analyzed and discussed 80 documents, including both peer-reviewed papers and grey literature. Our review classifies studies in three categories: Cryptocurrency-directed interoperability approaches, Blockchain Engines, and Blockchain Connectors. Each category is further divided into sub-categories based on defined criteria. We discuss not only studies within each category and subcategory but also across categories, providing a holistic overview of blockchain interoperability, paving the way for systematic research in this domain. Our findings show that blockchain interoperability has a much broader spectrum than cryptocurrencies. The present survey leverages an interesting approach: we systematically contacted the authors of grey literature papers and industry solutions to obtain an updated view of their work. Finally, this paper discusses supporting technologies, standards, use cases, open challenges, and provides several future research directions.

178 citations


Journal ArticleDOI
TL;DR: This work proposes a task offloading method based on meta reinforcement learning, which can adapt fast to new environments with a small number of gradient updates and samples, and can reduce the latency by up to 25 percent compared to three baselines.
Abstract: Multi-access edge computing (MEC) aims to extend cloud service to the network edge to reduce network traffic and service latency. A fundamental problem in MEC is how to efficiently offload heterogeneous tasks of mobile applications from user equipment (UE) to MEC hosts. Recently, many deep reinforcement learning (DRL) based methods have been proposed to learn offloading policies through interacting with the MEC environment that consists of UE, wireless channels, and MEC hosts. However, these methods have weak adaptability to new environments because they have low sample efficiency and need full retraining to learn updated policies for new environments. To overcome this weakness, we propose a task offloading method based on meta reinforcement learning, which can adapt fast to new environments with a small number of gradient updates and samples. We model mobile applications as Directed Acyclic Graphs (DAGs) and the offloading policy by a custom sequence-to-sequence (seq2seq) neural network. To efficiently train the seq2seq network, we propose a method that synergizes the first order approximation and clipped surrogate objective. The experimental results demonstrate that this new offloading method can reduce the latency by up to 25% compared to three baselines while being able to adapt fast to new environments.

147 citations


Journal ArticleDOI
TL;DR: Empirical results show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.
Abstract: Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.

128 citations


Posted Content
Siqi Luo1, Xu Chen1, Qiong Wu1, Zhi Zhou1, Shuai Yu1 
TL;DR: In this article, the authors proposed a hierarchical federated edge learning (HFEL) framework in which model aggregation is partially migrated to edge servers from the cloud and formulated a joint computation and communication resource allocation and edge association problem for device users under HFEL framework to achieve global cost minimization.
Abstract: Federated Learning (FL) has been proposed as an appealing approach to handle data privacy issue of mobile devices compared to conventional machine learning at the remote cloud with raw user data uploading. By leveraging edge servers as intermediaries to perform partial model aggregation in proximity and relieve core network transmission overhead, it enables great potentials in low-latency and energy-efficient FL. Hence we introduce a novel Hierarchical Federated Edge Learning (HFEL) framework in which model aggregation is partially migrated to edge servers from the cloud. We further formulate a joint computation and communication resource allocation and edge association problem for device users under HFEL framework to achieve global cost minimization. To solve the problem, we propose an efficient resource scheduling algorithm in the HFEL framework. It can be decomposed into two subproblems: \emph{resource allocation} given a scheduled set of devices for each edge server and \emph{edge association} of device users across all the edge servers. With the optimal policy of the convex resource allocation subproblem for a set of devices under a single edge server, an efficient edge association strategy can be achieved through iterative global cost reduction adjustment process, which is shown to converge to a stable system point. Extensive performance evaluations demonstrate that our HFEL framework outperforms the proposed benchmarks in global cost saving and achieves better training performance compared to conventional federated learning.

112 citations


Posted Content
TL;DR: A stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints is formulated, and a new algorithm is developed that utilizes only currently available wireless channel information but can achieve long- term performance guarantee.
Abstract: This paper studies federated learning (FL) in a classic wireless network, where learning clients share a common wireless link to a coordinating server to perform federated model training using their local data. In such wireless federated learning networks (WFLNs), optimizing the learning performance depends crucially on how clients are selected and how bandwidth is allocated among the selected clients in every learning round, as both radio and client energy resources are limited. While existing works have made some attempts to allocate the limited wireless resources to optimize FL, they focus on the problem in individual learning rounds, overlooking an inherent yet critical feature of federated learning. This paper brings a new long-term perspective to resource allocation in WFLNs, realizing that learning rounds are not only temporally interdependent but also have varying significance towards the final learning outcome. To this end, we first design data-driven experiments to show that different temporal client selection patterns lead to considerably different learning performance. With the obtained insights, we formulate a stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints, and develop a new algorithm that utilizes only currently available wireless channel information but can achieve long-term performance guarantee. Further experiments show that our algorithm results in the desired temporal client selection pattern, is adaptive to changing network environments and far outperforms benchmarks that ignore the long-term effect of FL.

109 citations


Journal ArticleDOI
TL;DR: Phone2Cloud offloads computation of an application running on smartphones to the cloud to improve energy efficiency of smartphones and at the same time, enhance the application’s performance through reducing its execution time.
Abstract: With prosperity of applications on smartphones, energy saving for smartphones has drawn increasing attention. In this paper we devise Phone2Cloud, a computation offloading-based system for energy saving on smartphones in the context of mobile cloud computing. Phone2Cloud offloads computation of an application running on smartphones to the cloud. The objective is to improve energy efficiency of smartphones and at the same time, enhance the application's performance through reducing its execution time. In this way, the user's experience can be improved. We implement the prototype of Phone2Cloud on Android and Hadoop environment. Two sets of experiments, including application experiments and scenario experiments, are conducted to evaluate the system. The experimental results show that Phone2Cloud can effectively save energy for smartphones and reduce the application's execution time.

106 citations


Posted ContentDOI
TL;DR: This work presents an FPGA-based monitoring approach by compiling an RTLola specification into synthesizable VHDL code, a stream-based specification language capable of expressing complex real-time properties while providing an upper bound on the execution time and memory requirements.
Abstract: An essential part of cyber-physical systems is the online evaluation of real-time data streams. Especially in systems that are intrinsically safety-critical, a dedicated monitoring component inspecting data streams to detect problems at runtime greatly increases the confidence in a safe execution. Such a monitor needs to be based on a specification language capable of expressing complex, high-level properties using only the accessible low-level signals. Moreover, tight constraints on computational resources exacerbate the requirements on the monitor. Thus, several existing approaches to monitoring are not applicable due to their dependence on an operating system. We present an FPGA-based monitoring approach by compiling an RTLola specification into synthesizable VHDL code. RTLola is a stream-based specification language capable of expressing complex real-time properties while providing an upper bound on the execution time and memory requirements. The statically determined memory bound allows for a compilation to an FPGA with a fixed size. An advantage of FPGAs is a simple integration process in existing systems and superb executing time. The compilation results in a highly parallel implementation thanks to the modular nature of RTLola specifications. This further increases the maximal event rate the monitor can handle.

106 citations


Proceedings ArticleDOI
TL;DR: This paper collects and summarizes the current accelerators that have been publicly announced with performance and power consumption numbers and highlights interesting trends regarding power consumption, numerical precision, and inference versus training.
Abstract: New machine learning accelerators are being announced and released each month for a variety of applications from speech recognition, video object detection, assisted driving, and many data center applications. This paper updates the survey of of AI accelerators and processors from last year's IEEE-HPEC paper. This paper collects and summarizes the current accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. This year, there are many more announced accelerators that are implemented with many more architectures and technologies from vector engines, dataflow engines, neuromorphic designs, flash-based analog memory processing, and photonic-based processing.

93 citations


Posted Content
TL;DR: Faaslets, a new isolation abstraction for serverless big data computing, is introduced and it is shown that, when training a machine learning model, it achieves a 2x speed-up with 10x less memory; for serving machine learning inference, Faasm doubles the throughput and reduces tail latency by 90%.
Abstract: Serverless computing is an excellent fit for big data processing because it can scale quickly and cheaply to thousands of parallel functions. Existing serverless platforms isolate functions in ephemeral, stateless containers, preventing them from directly sharing memory. This forces users to duplicate and serialise data repeatedly, adding unnecessary performance and resource costs. We believe that a new lightweight isolation approach is needed, which supports sharing memory directly between functions and reduces resource overheads. We introduce Faaslets, a new isolation abstraction for high-performance serverless computing. Faaslets isolate the memory of executed functions using software-fault isolation (SFI), as provided by WebAssembly, while allowing memory regions to be shared between functions in the same address space. Faaslets can thus avoid expensive data movement when functions are co-located on the same machine. Our runtime for Faaslets, Faasm, isolates other resources, e.g. CPU and network, using standard Linux cgroups, and provides a low-level POSIX host interface for networking, file system access and dynamic loading. To reduce initialisation times, Faasm restores Faaslets from already-initialised snapshots. We compare Faasm to a standard container-based platform and show that, when training a machine learning model, it achieves a 2x speed-up with 10x less memory; for serving machine learning inference, Faasm doubles the throughput and reduces tail latency by 90%.

89 citations


Posted Content
TL;DR: In this article, a dynamic fusion-based federated learning approach for medical diagnostic image analysis to detect COVID-19 infections is proposed, which can be used by the machine learning community for image analysis.
Abstract: Medical diagnostic image analysis (e.g., CT scan or X-Ray) using machine learning is an efficient and accurate way to detect COVID-19 infections. However, sharing diagnostic images across medical institutions is usually not allowed due to the concern of patients' privacy. This causes the issue of insufficient datasets for training the image classification model. Federated learning is an emerging privacy-preserving machine learning paradigm that produces an unbiased global model based on the received updates of local models trained by clients without exchanging clients' local data. Nevertheless, the default setting of federated learning introduces huge communication cost of transferring model updates and can hardly ensure model performance when data heterogeneity of clients heavily exists. To improve communication efficiency and model performance, in this paper, we propose a novel dynamic fusion-based federated learning approach for medical diagnostic image analysis to detect COVID-19 infections. First, we design an architecture for dynamic fusion-based federated learning systems to analyse medical diagnostic images. Further, we present a dynamic fusion method to dynamically decide the participating clients according to their local model performance and schedule the model fusion-based on participating clients' training time. In addition, we summarise a category of medical diagnostic image datasets for COVID-19 detection, which can be used by the machine learning community for image analysis. The evaluation results show that the proposed approach is feasible and performs better than the default setting of federated learning in terms of model performance, communication efficiency and fault tolerance.

Posted Content
Weijie Zhao1, Deping Xie1, Ronglai Jia1, Yulei Qian1, Ruiquan Ding1, Mingming Sun1, Ping Li1 
TL;DR: In this paper, a distributed GPU hierarchical parameter server for massive scale deep learning ad systems is proposed, which utilizes GPU High Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage.
Abstract: Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that do not fit in the GPU memory nor the CPU main memory on a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.

Journal ArticleDOI
TL;DR: This survey mainly focuses on approaches and technologies to manage the big NTMA data, additionally briefly discussing big data analytics (e.g., machine learning) for the sake of NTMA.
Abstract: Network Traffic Monitoring and Analysis (NTMA) represents a key component for network management, especially to guarantee the correct operation of large-scale networks such as the Internet. As the complexity of Internet services and the volume of traffic continue to increase, it becomes difficult to design scalable NTMA applications. Applications such as traffic classification and policing require real-time and scalable approaches. Anomaly detection and security mechanisms require to quickly identify and react to unpredictable events while processing millions of heterogeneous events. At last, the system has to collect, store, and process massive sets of historical data for post-mortem analysis. Those are precisely the challenges faced by general big data approaches: Volume, Velocity, Variety, and Veracity. This survey brings together NTMA and big data. We catalog previous work on NTMA that adopt big data approaches to understand to what extent the potential of big data is being explored in NTMA. This survey mainly focuses on approaches and technologies to manage the big NTMA data, additionally briefly discussing big data analytics (e.g., machine learning) for the sake of NTMA. Finally, we provide guidelines for future work, discussing lessons learned, and research directions.

Journal ArticleDOI
TL;DR: This article performs a comprehensive survey of existing DL compilers by dissecting the commonly adopted design in details, with emphasis on the DL oriented multi-level IRs, and frontend/backend optimizations.
Abstract: The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed from both industry and academia such as Tensorflow XLA and TVM. Similarly, the DL compilers take the DL models described in different DL frameworks as input, and then generate optimized codes for diverse DL hardware as output. However, none of the existing survey has analyzed the unique design architecture of the DL compilers comprehensively. In this paper, we perform a comprehensive survey of existing DL compilers by dissecting the commonly adopted design in details, with emphasis on the DL oriented multi-level IRs, and frontend/backend optimizations. Specifically, we provide a comprehensive comparison among existing DL compilers from various aspects. In addition, we present detailed analysis on the design of multi-level IRs and illustrate the commonly adopted optimization techniques. Finally, several insights are highlighted as the potential research directions of DL compiler. This is the first survey paper focusing on the design architecture of DL compilers, which we hope can pave the road for future research towards DL compiler.

Proceedings ArticleDOI
TL;DR: funcX as discussed by the authors is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution, which can transform existing clouds, clusters and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints.
Abstract: Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX---a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than more than 130000 concurrent workers.

Proceedings ArticleDOI
TL;DR: This work develops scalable bufferless protocols that implement the MPI-3.0 specification that are comparable to, or better than UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate and demonstrates application performance improvements with comparable programming complexity.
Abstract: Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly, however, it's scalability and practicability has to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.

Posted Content
TL;DR: This article presents state-of-the-art computing systems for autonomous driving, including seven performance metrics and nine key technologies, followed by 12 challenges to realize autonomous driving.
Abstract: The recent proliferation of computing technologies (e.g., sensors, computer vision, machine learning, and hardware acceleration), and the broad deployment of communication mechanisms (e.g., DSRC, C-V2X, 5G) have pushed the horizon of autonomous driving, which automates the decision and control of vehicles by leveraging the perception results based on multiple sensors. The key to the success of these autonomous systems is making a reliable decision in real-time fashion. However, accidents and fatalities caused by early deployed autonomous vehicles arise from time to time. The real traffic environment is too complicated for current autonomous driving computing systems to understand and handle. In this paper, we present state-of-the-art computing systems for autonomous driving, including seven performance metrics and nine key technologies, followed by twelve challenges to realize autonomous driving. We hope this paper will gain attention from both the computing and automotive communities and inspire more research in this direction.

Posted Content
TL;DR: Gavel is proposed, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies that allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as average job completion time and makespan by up to 3.5x compared to heterogeneity-agnostic policies.
Abstract: Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems, making it easy to optimize for objectives in a heterogeneity-aware way, while also being cognizant of performance optimizations like space sharing. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy. Gavel's heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as average job completion time and makespan by up to 3.5x compared to heterogeneity-agnostic policies.

Posted Content
TL;DR: In this paper, the authors characterize the entire production FaaS workload of Azure Functions and propose a resource management policy that significantly reduces the number of function coldstarts, while spending fewer resources than state-of-the-practice policies.
Abstract: Function as a Service (FaaS) has been gaining popularity as a way to deploy computations to serverless backends in the cloud. This paradigm shifts the complexity of allocating and provisioning resources to the cloud provider, which has to provide the illusion of always-available resources (i.e., fast function invocations without cold starts) at the lowest possible resource cost. Doing so requires the provider to deeply understand the characteristics of the FaaS workload. Unfortunately, there has been little to no public information on these characteristics. Thus, in this paper, we first characterize the entire production FaaS workload of Azure Functions. We show for example that most functions are invoked very infrequently, but there is an 8-order-of-magnitude range of invocation frequencies. Using observations from our characterization, we then propose a practical resource management policy that significantly reduces the number of function coldstarts,while spending fewerresources than state-of-the-practice policies.

Posted Content
TL;DR: A comprehensive survey of the communication-efficient distributed training algorithms in both system-level and algorithmic-level optimizations is provided, which provides the readers to understand what algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimizations.
Abstract: Distributed deep learning becomes very common to reduce the overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the size of deep models and data sets increases. However, data communication between computing devices could be a potential bottleneck to limit the system scalability. How to address the communication problem in distributed deep learning is becoming a hot research topic recently. In this paper, we provide a comprehensive survey of the communication-efficient distributed training algorithms in both system-level and algorithmic-level optimizations. In the system-level, we demystify the system design and implementation to reduce the communication cost. In algorithmic-level, we compare different algorithms with theoretical convergence bounds and communication complexity. Specifically, we first propose the taxonomy of data-parallel distributed training algorithms, which contains four main dimensions: communication synchronization, system architectures, compression techniques, and parallelism of communication and computing. Then we discuss the studies in addressing the problems of the four dimensions to compare the communication cost. We further compare the convergence rates of different algorithms, which enable us to know how fast the algorithms can converge to the solution in terms of iterations. According to the system-level communication cost analysis and theoretical convergence speed comparison, we provide the readers to understand what algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimizations.

Posted Content
TL;DR: A wide range of consensus algorithms are analysed using a comprehensive taxonomy of properties and by examining the implications of different issues still prevalent in consensus algorithms in detail to fill the gap in comprehensive analysis of these algorithms.
Abstract: In recent years, blockchain technology has received unparalleled attention from academia, industry, and governments all around the world. It is considered a technological breakthrough anticipated to disrupt several application domains. This has resulted in a plethora of blockchain systems for various purposes. However, many of these blockchain systems suffer from serious shortcomings related to their performance and security, which need to be addressed before any wide-scale adoption can be achieved. A crucial component of any blockchain system is its underlying consensus algorithm, which in many ways, determines its performance and security. Therefore, to address the limitations of different blockchain systems, several existing as well novel consensus algorithms have been introduced. A systematic analysis of these algorithms will help to understand how and why any particular blockchain performs the way it functions. However, the existing studies of consensus algorithms are not comprehensive. Those studies have incomplete discussions on the properties of the algorithms and fail to analyse several major blockchain consensus algorithms in terms of their scopes. This article fills this gap by analysing a wide range of consensus algorithms using a comprehensive taxonomy of properties and by examining the implications of different issues still prevalent in consensus algorithms in detail. The result of the analysis is presented in tabular formats, which provides a visual illustration of these algorithms in a meaningful way. We have also analysed more than hundred top crypto-currencies belonging to different categories of consensus algorithms to understand their properties and to implicate different trends in these crypto-currencies. Finally, we have presented a decision tree of algorithms to be used as a tool to test the suitability of consensus algorithms under different criteria.

Posted Content
TL;DR: DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models, is proposed, which features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline Parallelism.
Abstract: It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategy of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to re-computation approach and does not come at the expense of training throughput. Experiments show that DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23x under synchronous training scenarios, and DAPPLE runtime outperforms GPipe by 1.6x speedup of training throughput and reduces the memory consumption of 12% at the same time.

Posted Content
TL;DR: Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators is designed and the design requirements of future scale-out training systems are discussed.
Abstract: Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

Journal ArticleDOI
TL;DR: Ensuring the success of big graph processing for the next decade and beyond is at the heart of this research.
Abstract: Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?

Posted Content
TL;DR: Fog learning enhances federated learning along three major dimensions: network, heterogeneity, and proximity, which will intelligently distribute ML model training across the continuum of nodes from edge devices to cloud servers.
Abstract: Machine learning (ML) tasks are becoming ubiquitous in today's network applications. Federated learning has emerged recently as a technique for training ML models at the network edge by leveraging processing capabilities across the nodes that collect the data. There are several challenges with employing conventional federated learning in contemporary networks, due to the significant heterogeneity in compute and communication capabilities that exist across devices. To address this, we advocate a new learning paradigm called fog learning which will intelligently distribute ML model training across the continuum of nodes from edge devices to cloud servers. Fog learning enhances federated learning along three major dimensions: network, heterogeneity, and proximity. It considers a multi-layer hybrid learning framework consisting of heterogeneous devices with various proximities. It accounts for the topology structures of the local networks among the heterogeneous nodes at each network layer, orchestrating them for collaborative/cooperative learning through device-to-device (D2D) communications. This migrates from star network topologies used for parameter transfers in federated learning to more distributed topologies at scale. We discuss several open research directions to realizing fog learning.

Proceedings ArticleDOI
TL;DR: The proposed AutoDNNchip is a DNN chip generator that can automatically produce both FPGA- and ASIC-based DNNChip implementation from DNNs developed by machine learning frameworks without humans in the loop and can achieve better performance than that of expert-crafted state-of-the-art FPGAs and ASICs.
Abstract: Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for DNN chips. However, designing DNN chips is non-trivial because: (1) mainstream DNNs have millions of parameters and operations; (2) the large design space due to the numerous design choices of dataflows, processing elements, memory hierarchy, etc.; and (3) an algorithm/hardware co-design is needed to allow the same DNN functionality to have a different decomposition, which would require different hardware IPs to meet the application specifications. Therefore, DNN chips take a long time to design and require cross-disciplinary experts. To enable fast and effective DNN chip design, we propose AutoDNNchip - a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementation given DNNs from machine learning frameworks (e.g., PyTorch) for a designated application and dataset. Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balancing, etc.), optimize chip design via the Chip Predictor, and then generate optimized synthesizable RTL to achieve the target design metrics. Experimental results show that our Chip Predictor's predicted performance differs from real-measured ones by < 10% when validated using 15 DNN models and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, accelerators generated by our AutoDNNchip can achieve better (up to 3.86X improvement) performance than that of expert-crafted state-of-the-art accelerators.

Journal ArticleDOI
TL;DR: The results show that HybridFL improves the FL training process significantly in terms of shortening the federated round length, speeding up the global model’s convergence and reducing end device energy consumption.
Abstract: Mobile Edge Computing (MEC), which incorporates the Cloud, edge nodes and end devices, has shown great potential in bringing data processing closer to the data sources. Meanwhile, Federated learning (FL) has emerged as a promising privacy-preserving approach to facilitating AI applications. However, it remains a big challenge to optimize the efficiency and effectiveness of FL when it is integrated with the MEC architecture. Moreover, the unreliable nature (e.g., stragglers and intermittent drop-out) of end devices significantly slows down the FL process and affects the global model's quality Xin such circumstances. In this paper, a multi-layer federated learning protocol called HybridFL is designed for the MEC architecture. HybridFL adopts two levels (the edge level and the cloud level) of model aggregation enacting different aggregation strategies. Moreover, in order to mitigate stragglers and end device drop-out, we introduce regional slack factors into the stage of client selection performed at the edge nodes using a probabilistic approach without identifying or probing the state of end devices (whose reliability is agnostic). We demonstrate the effectiveness of our method in modulating the proportion of clients selected and present the convergence analysis for our protocol. We have conducted extensive experiments with machine learning tasks in different scales of MEC system. The results show that HybridFL improves the FL training process significantly in terms of shortening the federated round length, speeding up the global model's convergence (by up to 12X) and reducing end device energy consumption (by up to 58%).

Posted Content
TL;DR: Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level, and can reduce the cost of training large models in cloud environments by 25%.
Abstract: Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers will assign each job a number of resources requested by the user, which can allow jobs to use those resources inefficiently. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize those resources. Pollux simultaneously considers both aspects. By observing each job during training, Pollux models how their goodput (system throughput combined with statistical efficiency) would change by adding or removing resources. Leveraging these models, Pollux dynamically (re-)assigns resources to maximize cluster-wide goodput, while continually optimizing each DL job to better utilize those resources. In experiments with real DL training jobs and with trace-driven simulations, Pollux reduces average job completion time by 25%-50% relative to state-of-the-art DL schedulers, even when all jobs are submitted with ideal resource and training configurations. Based on the observation that the statistical efficiency of DL training can change over time, we also show that Pollux can reduce the cost of training large models in cloud environments by 25%.

Journal ArticleDOI
TL;DR: This work investigates the existing application management strategies in Fog computing and review them in terms of architecture, placement and maintenance, and proposes a comprehensive taxonomy and highlights the research gaps in Fog-based application management.
Abstract: The Internet of Things (IoT) paradigm is being rapidly adopted for the creation of smart environments in various domains. The IoT-enabled Cyber-Physical Systems (CPSs) associated with smart city, healthcare, Industry 4.0 and Agtech handle a huge volume of data and require data processing services from different types of applications in real-time. The Cloud-centric execution of IoT applications barely meets such requirements as the Cloud datacentres reside at a multi-hop distance from the IoT devices. \textit{Fog computing}, an extension of Cloud at the edge network, can execute these applications closer to data sources. Thus, Fog computing can improve application service delivery time and resist network congestion. However, the Fog nodes are highly distributed, heterogeneous and most of them are constrained in resources and spatial sharing. Therefore, efficient management of applications is necessary to fully exploit the capabilities of Fog nodes. In this work, we investigate the existing application management strategies in Fog computing and review them in terms of architecture, placement and maintenance. Additionally, we propose a comprehensive taxonomy and highlight the research gaps in Fog-based application management. We also discuss a perspective model and provide future research directions for further improvement of application management in Fog computing.

Posted Content
TL;DR: A description of federated neural architecture search that has recently been proposed is proposed, which is categorized into online and offline implementations, and single- and multi-objective search approaches.
Abstract: Federated learning is a recently proposed distributed machine learning paradigm for privacy preservation, which has found a wide range of applications where data privacy is of primary concern. Meanwhile, neural architecture search has become very popular in deep learning for automatically tuning the architecture and hyperparameters of deep neural networks. While both federated learning and neural architecture search are faced with many open challenges, searching for optimized neural architectures in the federated learning framework is particularly demanding. This survey paper starts with a brief introduction to federated learning, including both horizontal, vertical, and hybrid federated learning. Then, neural architecture search approaches based on reinforcement learning, evolutionary algorithms and gradient-based are presented. This is followed by a description of federated neural architecture search that has recently been proposed, which is categorized into online and offline implementations, and single- and multi-objective search approaches. Finally, remaining open research questions are outlined and promising research topics are suggested.