
Showing papers in "arXiv: Distributed, Parallel, and Cluster Computing in 2015"


Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

10,447 citations


Posted Content
TL;DR: The API design and the system implementation of MXNet are described, and it is explained how embedding of both symbolic expression and tensor operation is handled in a unified fashion.
Abstract: MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how embedding of both symbolic expression and tensor operation is handled in a unified fashion. Our preliminary experiments reveal promising results on large scale deep neural network applications using multiple GPU machines.

2,153 citations


Proceedings ArticleDOI
TL;DR: "Gunrock," the high-level bulk-synchronous graph-processing system targeting the GPU, takes a new approach to abstracting GPU graph analytics: rather than designing an abstraction around computation, Gunrock implements a novel data-centric abstraction centered on operations on a vertex or edge frontier.
Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library.

355 citations


Posted Content
TL;DR: A graph processing benchmark suite is presented that specifies graph kernels, input graphs, and evaluation methodologies, and also provides optimized baseline implementations; the suite can be used as a workload representative of graph processing.
Abstract: We present a graph processing benchmark suite with the goal of helping to standardize graph processing evaluations. Fewer differences between graph processing evaluations will make it easier to compare different research efforts and quantify improvements. The benchmark not only specifies graph kernels, input graphs, and evaluation methodologies, but it also provides optimized baseline implementations. These baseline implementations are representative of state-of-the-art performance, and thus new contributions should outperform them to demonstrate an improvement. The input graphs are sized appropriately for shared memory platforms, but any implementation on any platform that conforms to the benchmark's specifications could be compared. This benchmark suite can be used in a variety of settings. Graph framework developers can demonstrate the generality of their programming model by implementing all of the benchmark's kernels and delivering competitive performance on all of the benchmark's graphs. Algorithm designers can use the input graphs and the baseline implementations to demonstrate their contribution. Platform designers and performance analysts can use the suite as a workload representative of graph processing.

286 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones for matrix multiplication and data shuffling in large-scale distributed systems.
Abstract: Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes, system failures, or communication bottlenecks -- but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $\log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $\alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, coded shuffling reduces the communication cost by a factor of $(\alpha + \frac{1}{n})\gamma(n)$ compared to uncoded shuffling, where $\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to multicasting a common message (of the same size) to $n$ users. For instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experimental results, corroborating our theoretical gains of the coded algorithms.

282 citations
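
The straggler-mitigation idea in the coded computation abstract above can be illustrated with a toy example. This is a minimal sketch, not the paper's construction: it assumes three workers, a matrix with an even number of rows, and a simple parity block (a (3,2) MDS code). The helper names `encode`, `decode`, and `matvec` are hypothetical. The result of `A @ x` is recoverable from whichever two workers finish first, so one straggler never delays the job.

```python
from itertools import combinations


def matvec(rows, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def encode(A):
    """Split A into two row blocks and add a parity block A1 + A2."""
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return {0: A1, 1: A2, 2: parity}


def decode(results):
    """Recover A @ x from the outputs of any two of the three workers."""
    if 0 in results and 1 in results:
        top, bottom = results[0], results[1]
    elif 0 in results:  # have A1 x and (A1 + A2) x
        top = results[0]
        bottom = [p - t for p, t in zip(results[2], top)]
    else:  # have A2 x and (A1 + A2) x
        bottom = results[1]
        top = [p - b for p, b in zip(results[2], bottom)]
    return top + bottom


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
tasks = encode(A)
expected = matvec(A, x)

# Every pair of finished workers is enough to reconstruct the product.
for finished in combinations(tasks, 2):
    partial = {w: matvec(tasks[w], x) for w in finished}
    assert decode(partial) == expected
```

The price of this tolerance is redundancy: three half-size tasks are issued where two would suffice without stragglers.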


Journal ArticleDOI
TL;DR: This paper evaluates which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way to identify optimal compositions in terms of raw trajectory production rate, performance‐to‐price ratio, energy efficiency, and several other criteria.
Abstract: The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well exploited with a combination of SIMD, multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as accelerators to compute interactions offloaded from the CPU. Here we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Though hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed since these cards do not support ECC memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.

155 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose an offline algorithm that solves for the optimal configuration in a specific look-ahead time-window, and an online approximation algorithm with polynomial time-complexity to find the placement in real-time whenever an instance arrives.
Abstract: Mobile micro-clouds are promising for enabling performance-critical cloud applications. However, one challenge therein is the dynamics at the network edge. In this paper, we study how to place service instances to cope with these dynamics, where multiple users and service instances coexist in the system. Our goal is to find the optimal placement (configuration) of instances to minimize the average cost over time, leveraging the ability of predicting future cost parameters with known accuracy. We first propose an offline algorithm that solves for the optimal configuration in a specific look-ahead time-window. Then, we propose an online approximation algorithm with polynomial time-complexity to find the placement in real-time whenever an instance arrives. We analytically show that the online algorithm is $O(1)$-competitive for a broad family of cost functions. Afterwards, the impact of prediction errors is considered and a method for finding the optimal look-ahead window size is proposed, which minimizes an upper bound of the average actual cost. The effectiveness of the proposed approach is evaluated by simulations with both synthetic and real-world (San Francisco taxi) user-mobility traces. The theoretical methodology used in this paper can potentially be applied to a larger class of dynamic resource allocation problems.

149 citations


Posted Content
TL;DR: The evolution of big data computing, differences between traditional data warehousing and big data, taxonomy of big data computing and underpinning technologies, integrated platform of big data and clouds known as big data clouds, layered architecture and components of big data cloud, and finally open technical challenges and future directions are discussed.
Abstract: Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in an information/data explosion. Knowledge discovery and decision making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, which is an emerging trend known as Big Data Computing: a new paradigm which combines large scale compute, new data intensive techniques and mathematical models to build data analytics. Big Data computing demands huge storage and computing capacity for data curation and processing, which could be delivered from on-premise or cloud infrastructures. This paper discusses the evolution of Big Data computing, differences between traditional data warehousing and Big Data, taxonomy of Big Data computing and underpinning technologies, integrated platform of Big Data and Clouds known as Big Data Clouds, layered architecture and components of Big Data Cloud, and finally open technical challenges and future directions.

148 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors compare CPU and network benchmark performance in two models for implementing a microservices architecture with containers (master-slave and nested-container), and provide benchmark analysis guidance for system designers.
Abstract: Microservices architecture has started a new trend for application development for a number of reasons: (1) to reduce complexity by using tiny services; (2) to scale, remove and deploy parts of the system easily; (3) to improve flexibility to use different frameworks and tools; (4) to increase the overall scalability; and (5) to improve the resilience of the system. Containers have empowered the usage of microservices architectures by being lightweight, providing fast start-up times, and having a low overhead. Containers can be used to develop applications based on monolithic architectures, where the whole system runs inside a single container, or on a microservices architecture, where one or a few processes run inside each container. Two models can be used to implement a microservices architecture using containers: master-slave, or nested-container. The goal of this work is to compare the performance of CPU and network running benchmarks in the two aforementioned models of microservices architecture, and hence to provide benchmark analysis guidance for system designers.

125 citations


Proceedings ArticleDOI
TL;DR: In this article, partial key grouping (PKG) is proposed to adapt the classical power of two choices to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation.
Abstract: We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.

122 citations
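
The routing rule described in the Partial Key Grouping abstract above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the class name `PartialKeyGrouper`, the MD5-based hash helper, and the seeds are assumptions. Each key has two candidate workers (key splitting), and the source sends each message to whichever candidate it has itself sent fewer messages to (local load estimation).

```python
import hashlib


def _hash(key, seed, num_workers):
    """Deterministic seeded hash of a key onto a worker index."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % num_workers


class PartialKeyGrouper:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        # Local load estimate: messages this source has sent to each worker.
        self.local_load = [0] * num_workers

    def candidates(self, key):
        """Key splitting: every key may go to one of two workers."""
        return _hash(key, 1, self.num_workers), _hash(key, 2, self.num_workers)

    def route(self, key):
        """Send the message to the less loaded of the two candidates."""
        a, b = self.candidates(key)
        target = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[target] += 1
        return target


grouper = PartialKeyGrouper(num_workers=8)
stream = ["hot"] * 1000 + [f"key{i}" for i in range(200)]
assignments = [(key, grouper.route(key)) for key in stream]
```

Because a key can land on two workers, downstream operators have to merge the two partial aggregates per key, which is the cost of the improved balance.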


Journal ArticleDOI
TL;DR: This article presents parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on nonuniform grids, and evaluates the performance on two current petascale supercomputers.
Abstract: The lattice Boltzmann method exhibits excellent scalability on current supercomputing systems and has thus increasingly become an alternative method for large-scale non-stationary flow simulations, reaching up to a trillion grid nodes. Additionally, grid refinement can lead to substantial savings in memory and compute time. These savings, however, come at the cost of much more complex data structures and algorithms. In particular, the interface between subdomains with different grid sizes must receive special treatment. In this article, we present parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on non-uniform grids. Additionally, we evaluate the performance of our approach on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling benchmarks with refined grids are executed with almost two million threads, demonstrating not only near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell updates per second. On an Intel-based system, the strong scaling of a simulation with refined grids and a total of more than 8.5 million cells is demonstrated to reach a performance of less than one millisecond per time step. This enables simulations with complex, non-uniform grids and four million time steps per hour compute time.

Posted Content
TL;DR: This paper formulates the service migration problem as a Markov decision process (MDP), which captures general cost models and provides a mathematical framework to design optimal service migration policies, and approximates the underlying state space by the distance between the user and service locations.
Abstract: In mobile edge computing, local edge servers can host cloud-based services, which reduces network overhead and latency but requires service migrations as users move to new locations. It is challenging to make migration decisions optimally because of the uncertainty in such a dynamic cloud environment. In this paper, we formulate the service migration problem as a Markov Decision Process (MDP). Our formulation captures general cost models and provides a mathematical framework to design optimal service migration policies. In order to overcome the complexity associated with computing the optimal policy, we approximate the underlying state space by the distance between the user and service locations. We show that the resulting MDP is exact for uniform one-dimensional user mobility while it provides a close approximation for uniform two-dimensional mobility with a constant additive error. We also propose a new algorithm and a numerical technique for computing the optimal solution which is significantly faster than traditional methods based on standard value or policy iteration. We illustrate the application of our solution in practical scenarios where many theoretical assumptions are relaxed. Our evaluations based on real-world mobility traces of San Francisco taxis show superior performance of the proposed solution compared to baseline solutions.

Posted Content
TL;DR: In this article, the authors propose two effective message reduction techniques: vertex mirroring with message combining and an additional request-respond API, which not only reduce the total number of messages exchanged through the network, but also bound the number of messages sent/received by any single vertex.
Abstract: Massive graphs, such as online social networks and communication networks, have become common today. To efficiently analyze such large graphs, many distributed graph computing systems have been developed. These systems employ the "think like a vertex" programming paradigm, where a program proceeds in iterations and at each iteration, vertices exchange messages with each other. However, using Pregel's simple message passing mechanism, some vertices may send/receive significantly more messages than others due to either the high degree of these vertices or the logic of the algorithm used. This forms a communication bottleneck and leads to imbalanced workload among machines in the cluster. In this paper, we propose two effective message reduction techniques: (1) vertex mirroring with message combining, and (2) an additional request-respond API. These techniques not only reduce the total number of messages exchanged through the network, but also bound the number of messages sent/received by any single vertex. We theoretically analyze the effectiveness of our techniques, and implement them on top of our open-source Pregel implementation called Pregel+. Our experiments on various large real graphs demonstrate that our message reduction techniques significantly improve the performance of distributed graph computation.

Posted Content
TL;DR: This work proposes Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements and persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows.
Abstract: Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Those approaches suffer from two main drawbacks. First, they often stall the overall computation which impacts ingestion. Second, they eagerly persist all records in transit along with the operation states which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots.
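
The barrier handling that the ABS abstract above describes for acyclic topologies can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not Apache Flink's implementation: the `Operator` class, its running-sum state, and the channel names are hypothetical. An operator blocks each input once a barrier arrives on it, snapshots only its own state when barriers have arrived on all inputs, forwards the barrier, and then resumes the held-back records.

```python
class Operator:
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = set(inputs)
        self.blocked = set()   # inputs that already delivered the barrier
        self.buffered = []     # records held back from blocked inputs
        self.state = 0         # toy operator state: a running sum
        self.snapshots = {}    # epoch -> copy of the operator state
        self.outbox = []       # messages emitted downstream

    def on_record(self, channel, value):
        if channel in self.blocked:
            # Post-barrier record: must not affect the current snapshot.
            self.buffered.append((channel, value))
        else:
            self.state += value

    def on_barrier(self, channel, epoch):
        self.blocked.add(channel)
        if self.blocked == self.inputs:
            # Barriers aligned: persist operator state only, no in-flight records.
            self.snapshots[epoch] = self.state
            self.outbox.append(("barrier", epoch))
            self.blocked.clear()
            pending, self.buffered = self.buffered, []
            for ch, value in pending:
                self.on_record(ch, value)


op = Operator("sum", inputs=["a", "b"])
op.on_record("a", 1)
op.on_barrier("a", epoch=1)
op.on_record("a", 5)        # buffered: belongs to the next epoch
op.on_record("b", 2)        # still part of epoch 1
op.on_barrier("b", epoch=1)
```

After the second barrier the snapshot for epoch 1 holds the pre-barrier sum, while the buffered record is applied only to the live state.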

Posted Content
TL;DR: This work presents an extensive up-to-date review of the most relevant VMP literature in order to identify research opportunities.
Abstract: Cloud Computing Datacenters host millions of virtual machines (VMs) in real-world scenarios. In this context, Virtual Machine Placement (VMP) is one of the most challenging problems in cloud infrastructure management, considering also the large number of possible optimization criteria and different formulations that could be studied. VMP literature includes relevant topics such as energy-efficiency, Service Level Agreements (SLA), cloud service markets, Quality of Service (QoS) and carbon dioxide emissions, all of them with high economical and ecological impact. This work presents an extensive up-to-date review of the most relevant VMP literature in order to identify research opportunities.

Proceedings ArticleDOI
TL;DR: CNNdroid, a GPU-accelerated library for execution of trained deep CNNs on Android-based mobile devices, is presented; it achieves up to 60X speedup and 130X energy saving on current mobile devices.
Abstract: Many mobile applications running on smartphones and wearable devices would potentially benefit from the accuracy and scalability of deep CNN-based machine learning algorithms. However, performance and energy consumption limitations make the execution of such computationally intensive algorithms on mobile devices prohibitive. We present a GPU-accelerated library, dubbed CNNdroid, for execution of trained deep CNNs on Android-based mobile devices. Empirical evaluations show that CNNdroid achieves up to 60X speedup and 130X energy saving on current mobile devices. The CNNdroid open source library is available for download at this https URL

Posted Content
TL;DR: In this paper, a dynamic resource scheduler for cloud-based data stream management systems (DSMSs) is proposed, targeting applications in which the user must receive each result update within a given period after the update occurs.
Abstract: In a data stream management system (DSMS), users register continuous queries, and receive result updates as data arrive and expire. We focus on applications with real-time constraints, in which the user must receive each result update within a given period after the update occurs. To handle fast data, the DSMS is commonly placed on top of a cloud infrastructure. Because stream properties such as arrival rates can fluctuate unpredictably, cloud resources must be dynamically provisioned and scheduled accordingly to ensure real-time response. It is essential for existing systems and future developments to be able to schedule resources dynamically according to the current workload, in order to avoid wasting resources or failing to deliver correct results on time. Motivated by this, we propose DRS, a novel dynamic resource scheduler for cloud-based DSMSs. DRS overcomes three fundamental challenges: (a) how to model the relationship between the provisioned resources and query response time; (b) where to best place resources; and (c) how to measure system load with minimal overhead. In particular, DRS includes an accurate performance model based on the theory of Jackson open queueing networks and is capable of handling arbitrary operator topologies, possibly with loops, splits and joins. Extensive experiments with real data confirm that DRS achieves real-time response with close to optimal resource consumption.
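
To make the Jackson-network modelling idea concrete, here is a minimal sketch of the textbook result it builds on. It is not DRS's actual model: it assumes every operator is a single-server M/M/1 queue, and the function name `mean_sojourn_time` and its parameters are hypothetical. In an open Jackson network each operator behaves like an independent queue, so the mean number of tuples at operator $i$ is $\lambda_i / (\mu_i - \lambda_i)$, and Little's law converts the total into an end-to-end mean response time.

```python
def mean_sojourn_time(arrival_rates, service_rates, external_rate):
    """Mean end-to-end time in an open Jackson network of M/M/1 operators.

    arrival_rates[i] is the total arrival rate at operator i (after solving
    the traffic equations), service_rates[i] is its service rate, and
    external_rate is the rate at which tuples enter the network from outside.
    """
    total_in_system = 0.0
    for lam, mu in zip(arrival_rates, service_rates):
        if lam >= mu:
            raise ValueError("unstable operator: arrival rate must be below service rate")
        total_in_system += lam / (mu - lam)  # mean queue length of an M/M/1 queue
    return total_in_system / external_rate   # Little's law: T = N / lambda_0


# Two operators in series, fed at 1 tuple per second.
latency = mean_sojourn_time([1.0, 1.0], [2.0, 3.0], external_rate=1.0)
```

A scheduler could evaluate such a formula for candidate allocations and pick the cheapest one that meets the real-time constraint.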

Journal ArticleDOI
TL;DR: In this article, the authors present the first implementation of the 3D SpGEMM formulation that exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency.
Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.

Journal ArticleDOI
TL;DR: In this article, the authors use algebraic methods for studying distance computation and subgraph detection tasks in the congested clique model, obtaining an $O(n^{1-2/\omega})$ round matrix multiplication algorithm, where $\omega < 2.3728639$ is the exponent of matrix multiplication.
Abstract: In this work, we use algebraic methods for studying distance computation and subgraph detection tasks in the congested clique model. Specifically, we adapt parallel matrix multiplication implementations to the congested clique, obtaining an $O(n^{1-2/\omega})$ round matrix multiplication algorithm, where $\omega < 2.3728639$ is the exponent of matrix multiplication. In conjunction with known techniques from centralised algorithmics, this gives significant improvements over previous best upper bounds in the congested clique model. The highlight results include: -- triangle and 4-cycle counting in $O(n^{0.158})$ rounds, improving upon the $O(n^{1/3})$ triangle detection algorithm of Dolev et al. [DISC 2012], -- a $(1 + o(1))$-approximation of all-pairs shortest paths in $O(n^{0.158})$ rounds, improving upon the $\tilde{O} (n^{1/2})$-round $(2 + o(1))$-approximation algorithm of Nanongkai [STOC 2014], and -- computing the girth in $O(n^{0.158})$ rounds, which is the first non-trivial solution in this model. In addition, we present a novel constant-round combinatorial algorithm for detecting 4-cycles.
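
The $O(n^{0.158})$ round counts quoted above follow directly from the stated bound on $\omega$; the short check below only uses the numbers given in the abstract.

```latex
\[
  \omega < 2.3728639
  \;\Longrightarrow\;
  \frac{2}{\omega} > \frac{2}{2.3728639} \approx 0.84286
  \;\Longrightarrow\;
  1 - \frac{2}{\omega} < 0.15714 < 0.158,
\]
\[
  \text{so } O\!\left(n^{1 - 2/\omega}\right) \subseteq O\!\left(n^{0.158}\right).
\]
```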

Posted Content
TL;DR: A comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM is presented, which is interesting in its own right, as many cluster programming frameworks use the JVM.
Abstract: We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.

Posted Content
TL;DR: A general redundancy strategy is designed that achieves a good latency-cost trade-off for an arbitrary service time distribution and generalizes and extends some results in the analysis of fork-join queues.
Abstract: In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers, and reduce latency. But adding redundancy may result in higher cost of computing resources, as well as an increase in queueing delay due to higher traffic load. This work helps understand when and how redundancy gives a cost-efficient reduction in latency. For a general task service time distribution, we compare different redundancy strategies in terms of the number of redundant tasks, and time when they are issued and canceled. We get the insight that the log-concavity of the task service time creates a dichotomy of when adding redundancy helps. If the service time distribution is log-convex (i.e. log of the tail probability is convex) then adding maximum redundancy reduces both latency and cost. And if it is log-concave (i.e. log of the tail probability is concave), then less redundancy, and early cancellation of redundant tasks is more effective. Using these insights, we design a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution. This work also generalizes and extends some results in the analysis of fork-join queues.
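
A small worked case may help to see the dichotomy described above. It is a sketch under simplifying assumptions that are not taken from the abstract: no queueing, $r$ copies started at the same time, all remaining copies canceled the moment the first one finishes, and a shifted-exponential service time $X = \Delta + \mathrm{Exp}(\mu)$, which is log-concave for $\Delta > 0$. The symbols $T_r$ (latency) and $C_r$ (total server time spent) are introduced here for illustration.

```latex
\[
  \mathbb{E}[T_r] = \Delta + \frac{1}{r\mu},
  \qquad
  \mathbb{E}[C_r] = r\,\mathbb{E}[T_r] = r\Delta + \frac{1}{\mu}.
\]
% Latency falls with r, but for \Delta > 0 the cost grows linearly in r,
% so less redundancy is preferable on the cost side.
% For \Delta = 0 (pure exponential, the boundary case) the cost stays at
% 1/\mu for every r while the latency still drops as 1/(r\mu).
```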

Posted Content
TL;DR: In this paper, the authors describe an implementation of the well-known consensus protocol, Paxos, in the P4 programming language, a language for programming the behavior of network forwarding devices (i.e., the network data plane).
Abstract: This paper describes an implementation of the well-known consensus protocol, Paxos, in the P4 programming language. P4 is a language for programming the behavior of network forwarding devices (i.e., the network data plane). Moving consensus logic into network devices could significantly improve the performance of the core infrastructure and services in data centers. Moreover, implementing Paxos in P4 provides a critical use case and set of requirements for data plane language designers. In the long term, we imagine that consensus could someday be offered as a network service, just as point-to-point communication is provided today.

Posted Content
TL;DR: It is shown that any randomised Monte Carlo distributed algorithm for the Lovász local lemma requires $\Omega(\log \log n)$ communication rounds, assuming that it finds a correct assignment with high probability.
Abstract: We show that any randomised Monte Carlo distributed algorithm for the Lovász local lemma requires $\Omega(\log \log n)$ communication rounds, assuming that it finds a correct assignment with high probability. Our result holds even in the special case of $d = O(1)$, where $d$ is the maximum degree of the dependency graph. By prior work, there are distributed algorithms for the Lovász local lemma with a running time of $O(\log n)$ rounds in bounded-degree graphs, and the best lower bound before our work was $\Omega(\log^* n)$ rounds [Chung et al. 2014].

Posted Content
TL;DR: In this paper, it was shown that any population protocol that stably elects a leader requires $\Omega(n)$ expected "parallel time" to reach such a stable configuration.
Abstract: A population protocol *stably elects a leader* if, for all $n$, starting from an initial configuration with $n$ agents each in an identical state, with probability 1 it reaches a configuration $\mathbf{y}$ that is correct (exactly one agent is in a special leader state $\ell$) and stable (every configuration reachable from $\mathbf{y}$ also has a single agent in state $\ell$). We show that any population protocol that stably elects a leader requires $\Omega(n)$ expected "parallel time" --- $\Omega(n^2)$ expected total pairwise interactions --- to reach such a stable configuration. Our result also informs the understanding of the time complexity of chemical self-organization by showing an essential difficulty in generating exact quantities of molecular species quickly.
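To make the notions of pairwise interactions and "parallel time" concrete, here is a minimal sketch of the folklore two-state leader election protocol, which is not taken from this abstract: every agent starts as a leader, and when two leaders interact one of them becomes a follower. The uniformly random scheduler, the function name `elect_leader`, and the convention that parallel time equals interactions divided by $n$ are assumptions made for illustration.

```python
import random

def elect_leader(n, seed=0):
    """Simulate the folklore protocol: leader + leader -> leader + follower.

    Returns (total pairwise interactions, parallel time, number of leaders left).
    Assumes a scheduler that picks an ordered pair of distinct agents uniformly
    at random at each step.
    """
    rng = random.Random(seed)
    is_leader = [True] * n      # all agents start in the same (leader) state
    leaders = n
    interactions = 0
    while leaders > 1:
        i, j = rng.sample(range(n), 2)   # two distinct agents interact
        interactions += 1
        if is_leader[i] and is_leader[j]:
            is_leader[j] = False         # the responder drops out
            leaders -= 1
    return interactions, interactions / n, sum(is_leader)

if __name__ == "__main__":
    for n in (50, 100, 200):
        total, parallel_time, remaining = elect_leader(n)
        print(n, total, round(parallel_time, 1), remaining)
```

Once a single leader remains, no transition can change it, so the configuration is stable in the sense defined above. The number of interactions this simple protocol needs grows roughly quadratically in $n$, consistent with the lower bound stated in the abstract.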

Posted Content
TL;DR: It is shown that, if $k \leqslant n^{\alpha}$, where $\alpha$ is a suitable positive constant, the 3-majority dynamics converges in time polynomial in $k$ and $\log n$ with high probability even in the presence of an adversary who can affect up to $o(\sqrt{n})$ nodes at each round.
Abstract: We consider the following distributed consensus problem: Each node in a complete communication network of size $n$ initially holds an \emph{opinion}, which is chosen arbitrarily from a finite set $\Sigma$. The system must converge toward a consensus state in which all, or almost all nodes, hold the same opinion. Moreover, this opinion should be \emph{valid}, i.e., it should be one among those initially present in the system. This condition should be met even in the presence of an adaptive, malicious adversary who can modify the opinions of a bounded number of nodes in every round. We consider the \emph{3-majority dynamics}: At every round, every node pulls the opinion from three random neighbors and sets his new opinion to the majority one (ties are broken arbitrarily). Let $k$ be the number of valid opinions. We show that, if $k \leqslant n^{\alpha}$, where $\alpha$ is a suitable positive constant, the 3-majority dynamics converges in time polynomial in $k$ and $\log n$ with high probability even in the presence of an adversary who can affect up to $o(\sqrt{n})$ nodes at each round. Previously, the convergence of the 3-majority protocol was known for $|\Sigma| = 2$ only, with an argument that is robust to adversarial errors. On the other hand, no anonymous, uniform-gossip protocol that is robust to adversarial errors was known for $|\Sigma| > 2$.
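The 3-majority update rule described above can be sketched as a short simulation. This is an illustrative sketch only: it omits the adversary entirely, samples neighbors uniformly with replacement (possibly including the node itself), and breaks three-way ties by taking the first sample; the function names are hypothetical.

```python
import random

def three_majority_round(opinions, rng):
    """One synchronous round: each node pulls three random opinions and adopts the majority."""
    n = len(opinions)
    updated = []
    for _ in range(n):
        a, b, c = (opinions[rng.randrange(n)] for _ in range(3))
        if b == c:
            updated.append(b)   # b and c agree, so they form the majority
        else:
            # Either a matches b or c (a is the majority), or all three
            # differ and the tie is broken arbitrarily in favor of a.
            updated.append(a)
    return updated

def run(n, k, seed=0, max_rounds=10000):
    """Run until all nodes agree; returns (rounds, winning opinion, initial opinion set)."""
    rng = random.Random(seed)
    opinions = [rng.randrange(k) for _ in range(n)]
    initial = set(opinions)
    for t in range(max_rounds):
        if len(set(opinions)) == 1:
            return t, opinions[0], initial
        opinions = three_majority_round(opinions, rng)
    return None, None, initial

if __name__ == "__main__":
    rounds, winner, initial = run(n=1000, k=8)
    print(rounds, winner, winner in initial)
```

Without an adversary, validity holds by construction, since a node only ever copies an opinion that some node currently holds.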

Posted Content
TL;DR: C4 and ClusterWild!, two algorithms for parallel correlation clustering that provably run in a polylogarithmic number of rounds and achieve nearly linear speedups, are presented.
Abstract: Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, KwikCluster in practice requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio. We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.
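For reference, the serial KwikCluster baseline mentioned above can be sketched as follows. This is a minimal sketch of the standard pivot algorithm, not of C4 or ClusterWild!; the adjacency-dictionary input format and the function name `kwik_cluster` are assumptions made for illustration.

```python
import random

def kwik_cluster(vertices, similar, seed=0):
    """Serial pivot clustering.

    `similar` maps each vertex to the set of vertices it is similar to
    (assumed symmetric). Visiting vertices in a random permutation is
    equivalent to repeatedly choosing a random unclustered pivot.
    """
    rng = random.Random(seed)
    order = list(vertices)
    rng.shuffle(order)
    clustered = set()
    clusters = []
    for pivot in order:
        if pivot in clustered:
            continue  # already absorbed by an earlier pivot
        # The pivot takes all of its still-unclustered similar neighbors.
        cluster = {pivot} | {v for v in similar.get(pivot, ()) if v not in clustered}
        clustered |= cluster
        clusters.append(cluster)
    return clusters

if __name__ == "__main__":
    # Two disjoint triangles of mutually similar items.
    graph = {
        0: {1, 2}, 1: {0, 2}, 2: {0, 1},
        3: {4, 5}, 4: {3, 5}, 5: {3, 4},
    }
    print(kwik_cluster(range(6), graph))
```

Each loop iteration depends on which vertices earlier pivots have already claimed, which is the serial dependency that the parallel algorithms in the abstract are designed to work around.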

Posted Content
TL;DR: In this paper, MapReduce is used to speed up and make possible three large-scale medical image processing use-cases: (i) parameter optimization for lung texture segmentation using support vector machines (SVM), (ii) content-based medical image indexing, and (iii) three-dimensional directional wavelet analysis for solid texture classification.
Abstract: The growth of the amount of medical image data produced on a daily basis in modern hospitals forces the adaptation of traditional medical image analysis and indexing approaches towards scalable solutions. The number of images and their dimensionality have increased dramatically during the past 20 years. We propose solutions for large-scale medical image analysis based on parallel computing and algorithm optimization. The MapReduce framework is used to speed up and make possible three large-scale medical image processing use-cases: (i) parameter optimization for lung texture segmentation using support vector machines, (ii) content-based medical image indexing, and (iii) three-dimensional directional wavelet analysis for solid texture classification. A cluster of heterogeneous computing nodes was set up in our institution using Hadoop, allowing for a maximum of 42 concurrent map tasks. The majority of the machines used are desktop computers that are also used for regular office work. The cluster proved to be minimally invasive and stable. The runtime of each of the three use-cases was significantly reduced compared to sequential execution. Hadoop provides an easy-to-employ framework for data analysis tasks that scales well for many tasks but requires optimization for specific tasks.

Proceedings Article
TL;DR: It is shown that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance, and the inefficiencies at the micro-architecture level for various data analysis workloads are quantified.
Abstract: In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single-node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe memory-bound latency to be the major cause of work time inflation.

Posted Content
TL;DR: This paper introduces cross fault tolerance (XFT), an approach to building reliable and secure distributed systems, and applies it to the classical state-machine replication (SMR) problem.
Abstract: Despite years of intensive research, Byzantine fault-tolerant (BFT) systems have not yet been adopted in practice. This is due to additional cost of BFT in terms of resources, protocol complexity and performance, compared with crash fault-tolerance (CFT). This overhead of BFT comes from the assumption of a powerful adversary that can fully control not only the Byzantine faulty machines, but at the same time also the message delivery schedule across the entire network, effectively inducing communication asynchrony and partitioning otherwise correct machines at will. To many practitioners, however, such strong attacks appear irrelevant. In this paper, we introduce cross fault tolerance or XFT, a novel approach to building reliable and secure distributed systems and apply it to the classical state-machine replication (SMR) problem. In short, an XFT SMR protocol provides the reliability guarantees of widely used asynchronous CFT SMR protocols such as Paxos and Raft, but also tolerates Byzantine faults in combination with network asynchrony, as long as a majority of replicas are correct and communicate synchronously. This allows the development of XFT systems at the price of CFT (already paid for in practice), yet with strictly stronger resilience than CFT --- sometimes even stronger than BFT itself. As a showcase for XFT, we present XPaxos, the first XFT SMR protocol, and deploy it in a geo-replicated setting. Although it offers much stronger resilience than CFT SMR at no extra resource cost, the performance of XPaxos matches that of the state-of-the-art CFT protocols.

Posted Content
TL;DR: In this article, the authors present Kira, a distributed astronomy image processing toolkit using Apache Spark, and use it to implement a Source Extractor application for astronomy images, called Kira SE.
Abstract: Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications.